
Re: NetCDF Packing



> Organization: NCAR/ACD
> Keywords: 199503142139.AA05870

Hi John,

> > I think a simple biased exponent scheme in which the last exponent (all
> > 1 bits) is used for exceptional values such as the _FillValue might be
> > OK.  Unfortunately, it would require using one of a small number of
> > exponent values to represent only one special value, if there is a
> > _FillValue.  This is sort of wasteful of space in the case that the
> > exponent range is an exact power of 2, since then it requires an extra
> > bit that is used only for this special value.  But it has the virtue of
> > simplicity, and I'm willing to see if this yields adequate packing in
> > most cases.
> 
> Yes, it is a waste to lose a factor of 2 in range for this. There must be
> a simple way to use only one out of 2**n values for this. It seems that if
> you want to reserve, say, all 1's as the special value, you just need to
> find an alternate mapping for that "legitimate" number, or you might say
> that that particular number is simply not representable. After all, once
> you restrict the exponent range, there are plenty of numbers that are not
> representable. Possibly you could find the correct exponent bias?

Yes, it would be better to appropriate just one value out of the range for
the special value rather than an exceptional exponent, as is done in IEEE,
where there are lots of different kinds of NaNs.  I was thinking it would be
nice to be able to just use the top m bits of the existing mantissa, but
that's not portable and won't work if we don't want to waste a bit on the
special value.
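
For concreteness, here's a minimal sketch in C of reserving a single code
out of 2**16 for the fill value, so no exponent bit is wasted and only the
one colliding "legitimate" code becomes unrepresentable.  The names
PACKED_FILL, pack_code, and unpack_code are made up for illustration:

    /* Sketch only: reserve the single 16-bit code 0xFFFF for the
     * _FillValue, so just one value out of 2**16 is lost instead of
     * a whole exponent's worth of codes. */
    #define PACKED_FILL  0xFFFFu

    static unsigned short
    pack_code(unsigned long code, int is_fill)  /* code already quantized */
    {
        if (is_fill)
            return (unsigned short) PACKED_FILL;
        if (code >= PACKED_FILL)     /* the colliding "legitimate" code is */
            code = PACKED_FILL - 1;  /* clamped, i.e. not representable */
        return (unsigned short) code;
    }

    static int                    /* returns 1 if packed is the fill code */
    unpack_code(unsigned short packed, unsigned long *code)
    {
        if (packed == PACKED_FILL)
            return 1;             /* caller substitutes the _FillValue */
        *code = packed;
        return 0;
    }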

> I'm willing to take as a design goal "to access small subsets of large
> datasets efficiently", if that's your priority. I just wanted to point out
> that that's not always the application's priority, judging by the actual
> (e.g. ccm2) file layouts people are using. I also want to argue that
> exponent packing does not necessarily degrade that access unacceptably. For
> example, suppose someone was extracting a small subset that was scattered
> through the file, causing many disk reads. It's likely that, done correctly
> (i.e. not causing extra disk reads), the exponent decompression would be
> negligible compared to the disk access time, even if you had to decompress
> 100 numbers to get the one you wanted. So I'm guessing that if the size of
> the compressed blocks were optimized for disk access, you might not lose
> too much efficiency.
> 
> Anyway, I'm not particularly convinced about this, but I am trying to
> describe all the possible solutions I can think of.

You're probably right about compression being a performance win for
reasonably sized variables.  I'm mostly worried about the extreme cases
where the library would be forced to malloc() a huge chunk of memory in
which to decompress a slab of compressed data to return a small number of
values.  There's a tradeoff between how well the data can be compressed and
how much complexity will be required to handle these extreme cases well.  I
realize some applications may have different priorities that will make any
tradeoff we choose suboptimal.

> So as far as I've thought, the options are:
>       1) store mantissa as fixed # of bits, compress exponents separately
>       2) store mantissa and exponent as fixed # of bits. 
>         2a) select exponent bias for each packed segment to minimize size 
>             of exponent.
>         2b) fixed exponent bias (for each field?).
>       3) scale and offset

Right, I've mostly been considering 2b) and 3) as the leading candidates
because of the priority to access small subsets of large packed fields
efficiently and because of simplicity of implementation (which we have to
worry about, given a scarcity of resources for development and maintenance).
I had been considering some suboptions:

        3) scale and offset
          3a) a single scale and offset for each field (variable)
          3b) permitting a vector or array of scales and offsets for each 
              field

but I'm still leaning toward 2b) because it preserves more relative accuracy
in the same number of bits, it adapts well to data where the variation in
range is not along one of the dimensions, and it requires a fixed and small
number of packing parameters.  I hadn't considered 2a), and am not really
sure what a "packed segment" means.  It might take a lot of bits and a fair
amount of complexity to specify regions of data in a multidimensional space
where the exponents cluster.
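
To make 2b) concrete, here's a rough sketch in C of the sort of thing I have
in mind.  The names pack2b and unpack2b, the 5/10 split of exponent and
mantissa bits, and the bias of 16 are made up for illustration, not a
proposal.  Unlike 3), where unpacking is just unpacked = scale * packed +
offset, this keeps roughly the same relative accuracy across the whole
exponent range:

    #include <math.h>

    #define EXPBITS  5                  /* exponent bits (illustrative) */
    #define MANBITS  (15 - EXPBITS)     /* mantissa bits; 1 sign bit */
    #define EXPBIAS  16                 /* fixed bias, chosen once per field */

    static unsigned short
    pack2b(double x)
    {
        int e, be;
        unsigned sign = x < 0;
        double m = frexp(fabs(x), &e);  /* |x| = m * 2**e, 0.5 <= m < 1 */

        be = e + EXPBIAS;
        if (m == 0.0 || be < 0)         /* zero or underflow: flush to zero */
            return (unsigned short) (sign << 15);
        if (be >= (1 << EXPBITS)) {     /* overflow: clamp to largest value */
            be = (1 << EXPBITS) - 1;
            m = 1.0 - 1.0 / (1 << MANBITS);
        }
        return (unsigned short)
            (sign << 15 | be << MANBITS | (unsigned) (m * (1 << MANBITS)));
    }

    static double
    unpack2b(unsigned short p)
    {
        int be   = (p >> MANBITS) & ((1 << EXPBITS) - 1);
        double m = (p & ((1u << MANBITS) - 1)) / (double) (1 << MANBITS);
        double x = ldexp(m, be - EXPBIAS);
        return (p & 0x8000u) ? -x : x;
    }

With this particular split the relative accuracy is about 1 part in 2**9
(roughly 0.2%) for any value whose exponent falls within the biased range,
and the packing parameters reduce to the bias and the two bit widths.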

I agree with you that 1) yields better compression than the other options,
but I really want to avoid the issues of garbage collection that always seem
to arise when there are variable-length pieces that may be rewritten or
appended to.  The netCDF data model permits overwriting data with new
values, and if this either frees up space in compressed exponent storage or
requires more exponent storage because the exponents no longer compress so
well, there is a potential for considerable additional complexity.

> Here is a little test I made on one particular ccm2 history file. This
> was packing things into 16 bits. I looked at the packed values and binned
> the highest order bit:
> 
> Bin results
> Bin  0:   13034861  ===  15.21 %
> Bin  1:     402151  ===   0.47 %
> Bin  2:     411094  ===   0.48 %
> Bin  3:     483981  ===   0.56 %
> Bin  4:     578127  ===   0.67 %
> Bin  5:     687292  ===   0.80 %
> Bin  6:     834086  ===   0.97 %
> Bin  7:    1034717  ===   1.21 %
> Bin  8:    1278646  ===   1.49 %
> Bin  9:    1616691  ===   1.89 %
> Bin 10:    2121698  ===   2.48 %
> Bin 11:    2849600  ===   3.33 %
> Bin 12:    3962666  ===   4.63 %
> Bin 13:    5695290  ===   6.65 %
> Bin 14:    8553184  ===   9.98 %
> Bin 15:   22957854  ===  26.80 %
> Bin 16:   19169998  ===  22.38 %
> Bin 17:          0  ===   0.00 %
> 
> total count = 85671936
> 
> So almost half the numbers are within a factor of 4 of the maximum, after
> subtracting the offset. I don't know if the large number of zeroes is
> typical.

Although it's interesting, I'm not sure whether this data helps justify one
option over another.  Apparently this data already uses scale/offset, since
it's been converted into integers, so there is almost no precision left in
the smaller values.
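
(For reference, here's roughly how I read the binning, as a sketch in C that
assumes 16-bit, offset-subtracted packed values: bin 0 counts exact zeroes,
and bin k counts values whose highest-order set bit is bit k-1, i.e. values
in [2**(k-1), 2**k).)

    static void
    bin_high_bits(const unsigned short *vals, long n, long bins[17])
    {
        long i;
        int b;

        for (b = 0; b <= 16; b++)
            bins[b] = 0;
        for (i = 0; i < n; i++) {
            unsigned v = vals[i];
            for (b = 0; v != 0; b++)    /* b ends up 1 + highest set bit */
                v >>= 1;
            bins[b]++;                  /* bin 0 gets the zero values */
        }
    }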

> Reaction to your scale and offset design:
>
> One of the things that is occurring to me is the conflict between the
> general-purpose netcdf API, where you can read and write variables in
> arbitrary hyperslabs, and more constrained and therefore possibly more
> efficient access methods.  In this case, an "archival" file is basically
> write-once and read-many, and potentially very large.  The biggest problems
> with these files are, first of all, size and efficiency; second,
> portability.
> 
> My prejudice is therefore to design an "archival-packed" file mode, which
> sacrifices some hyperslab access performance. It should be disk-page
> oriented, storing packing parameters on the same page as the packed data.
> It should be oriented towards "write-once/extend": if the data has to be
> changed, it's OK to incur considerable expense. If the core netcdf library
> implemented variable blobs, where each blob is a packed chunk, you could
> hook in various packing routines to encode/decode the blob.
>
> (That would mean you don't have to decide once and for all what the best
> packing scheme is.)  It therefore can, and should, make explicit which
> dimensions can be most efficiently extracted: namely the dimension over
> which the data is packed.
> 
> I know this runs counter to your original design goals. But my instinct is
> to have a small extension that optimizes a very specific need, rather than a
> more general design that may be less optimal and/or more confusing.

I think that we have to continue to support netCDF as a "write-many/extend"
interface.  Adding the constraint that packed data can't be overwritten is
possible, but it would make the fact that the data is packed less
transparent, and that's a limitation I'd prefer to avoid if we can.

Variable blobs are already possible with netCDF, by just declaring a
variable to be a 1-dimensional array of type NC_BYTE.  Special-purpose
interfaces that provide optimal compression can be built on top of this,
while still permitting named attributes to be stored with the data.  One
limitation of this approach is that there is only one unlimited dimension
along which such blobs can be extended, so storing lots of variable blobs
in the same file doesn't work well.  And if variable-size blobs are
rewritable, we get back to the need for garbage collection, which I want to
avoid if possible.  So it may be that netCDF is just not appropriate for
applications that really require multiple extendible blobs stored in the
same file; HDF may be more appropriate for those requirements.
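
To illustrate the blob-variable idea above: declaring such a variable takes
only a few calls to the C interface.  This is just a sketch; the names
define_blob_var, "blob", and "blob_len" are made up, error checking is
omitted, and a real special-purpose layer would also attach its packing
parameters as attributes of the byte variable:

    #include <netcdf.h>

    int
    define_blob_var(const char *path, int *blobvar)
    {
        int cdfid   = nccreate(path, NC_CLOBBER);
        int blobdim = ncdimdef(cdfid, "blob_len", NC_UNLIMITED);

        *blobvar = ncvardef(cdfid, "blob", NC_BYTE, 1, &blobdim);
        ncendef(cdfid);         /* leave define mode; file stays open */
        return cdfid;           /* caller writes the blob, then ncclose() */
    }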

> Whatever your decision on how to go about this, feel free to run any ideas
> or designs by me.  If you find it useful to continue brainstorming, perhaps
> we should do it in person. In any case, it might be better to clarify design
> goals, then come back to implementation.

This discussion has been very helpful to me in clarifying the design goals
and constraints.  I'll take you up on your generous offer to review the
suitability of any design we come up with against your requirements.  When
that happens, I'd like to arrange a meeting.  I'm in the middle of writing
programmer performance appraisals and then will be taking off for Spring
Break, but in 3 or 4 weeks we may have a proposed interface.

Thanks again for all your feedback and good ideas.

--Russ