
Re: NetCDF Packing



John,

> I assume that bit shifting and comparison, etc. is fast.  In IEEE floating
> point, you use "illegal" exponent values to signal exceptional numbers. So
> we need to decide on an exponent scheme.

I think a simple biased exponent scheme in which the last exponent (all 1
bits) is used for exceptional values such as the _FillValue might be OK.

Unfortunately, if there is a _FillValue, this dedicates one of a small number
of exponent values to representing a single special value.  That is somewhat
wasteful of space when the exponent range needed for the data is an exact
power of 2, since the reserved exponent then costs an extra exponent bit that
is used only for this special value.  But the scheme has the virtue of
simplicity, and I'm willing to see if it yields adequate packing in most cases.
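
To make that concrete, here is a rough sketch of unpacking one such value.
The particular 16-bit layout (1 sign bit, 5 exponent bits, 10 mantissa bits,
bias 15) is only an assumption for illustration, and denormalized values are
ignored:

    #include <stdint.h>
    #include <math.h>

    #define EXP_BITS   5
    #define MANT_BITS  10
    #define EXP_BIAS   15
    #define EXP_ALL1   ((1 << EXP_BITS) - 1)   /* reserved exponent code */

    /* Unpack a 16-bit value: 1 sign bit, EXP_BITS exponent bits, and
     * MANT_BITS mantissa bits.  The all-ones exponent is reserved to
     * signal exceptional values such as the _FillValue. */
    float
    unpack16(uint16_t p, float fillvalue)
    {
        unsigned sign = (p >> (EXP_BITS + MANT_BITS)) & 1;
        unsigned exp  = (p >> MANT_BITS) & EXP_ALL1;
        unsigned mant = p & ((1 << MANT_BITS) - 1);
        float frac;

        if (exp == EXP_ALL1)            /* reserved: exceptional value */
            return fillvalue;

        /* normalized value with an implicit leading 1 bit */
        frac = 1.0f + (float) mant / (float) (1 << MANT_BITS);
        return (float) ldexp(sign ? -frac : frac, (int) exp - EXP_BIAS);
    }

With the all-ones exponent reserved, only 31 of the 32 exponent codes carry
ordinary data, which is the extra cost mentioned above when the exponent
range needed happens to be an exact power of 2.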

> > Your idea of compressing the exponents (and perhaps the sign bits)
> > separately from the mantissas is a good one, but in my opinion we can't
> > use it for netCDF data access.  As you have pointed out, it makes direct
> > access to a single unpacked value depend on first unpacking all the
> > values for a variable.  Similarly, it makes writing values in a
> > different order from the order in which they are stored or from the
> > order in which they will be read difficult.  It also imposes
> > requirements for large memory allocations on applications that may only
> > need one value for each slice of a variable with very large slices.
> 
> I'm not convinced about this yet.  I would say that the current, unpacked
> design is a reasonable way to allow "direct access to a single value" and
> "writing values in a different order from the order in which they are
> stored", etc.  Now what's motivating packing?  Basically, certain
> efficiency considerations for very large datasets.  So the design needs to
> answer those efficiency considerations, but not all of the original design
> goals necessarily need to remain intact.  Ideally, we'd like to have some
> options that allow these tradeoffs to be made by the application layer.

I guess I disagree.  A fundamental characteristic of the netCDF API is the
ability to read general array cross sections, to access small subsets of
large datasets efficiently.  This is used in data visualization applications
supported by netCDF and was the main reason we chose to implement netCDF
data access in terms of direct access (seeks) instead of sequential access
(read next).  Giving up this feature is too high a price to pay for the
addition of (semi-)transparent packing.  

Ideally, applications that read netCDF data should not have to know whether
it's packed or not.  That should be determined by the writer at variable
definition time, but an application that wants to read a small subset of a
large dataset should not pay the penalty of much slower access or having to
malloc a Gbyte of space just because the data happens to have been packed.
The access time and memory space used to read a small subset of data may
depend on the order in which the data is written (which the reader specifies
and can predict from a variable declaration), but they should be
proportional to the size of the data subset requested, not the size of the
dataset out of which the data is extracted.
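
For example, an application that wants one horizontal slice out of a large
field dimensioned (time, lev, lat, lon) ought to be able to get it with a
single hyperslab read, roughly as in the sketch below, paying only for the
NLAT*NLON values it asked for.  (The variable name, dimension order, and
sizes are made up for illustration; error checking is omitted, and the
variable is assumed to be stored as floats.)

    #include "netcdf.h"

    #define NLAT 64
    #define NLON 128

    /* Read the horizontal (lat x lon) slice of "T" at one time step and
     * one level; only NLAT*NLON values are read, however large the file. */
    void
    read_slice(const char *path, long timestep, long level,
               float slice[NLAT][NLON])
    {
        int  ncid  = ncopen(path, NC_NOWRITE);
        int  varid = ncvarid(ncid, "T");     /* T(time, lev, lat, lon) */
        long start[4], count[4];

        start[0] = timestep;  count[0] = 1;
        start[1] = level;     count[1] = 1;
        start[2] = 0;         count[2] = NLAT;
        start[3] = 0;         count[3] = NLON;

        ncvarget(ncid, varid, start, count, (void *) slice);
        ncclose(ncid);
    }

Whether or not the variable happens to be packed, the work done by a call
like this should be proportional to the one slice requested, not to the size
of the whole variable.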

> It's instructive to consider what the ccm2 people have now, and compare that
> to possible netcdf extensions.  They have certainly given up "direct access
> to a single value" and "writing values in a different order from the order
> in which they are stored".  Indeed, they have probably optimized writing
> datasets as opposed to reading them (I am guessing at the read access
> pattern), which is probably the wrong thing to do from a long-term point of
> view (assume many reads per write).
> 
> Anyway, they use "latitude slices" as their basic array, which is nlon x
> nlev.  Then they have a separate scale/offset for each level, so they pack
> nlon values at a time.  nlon is typically 128.  The read accesses that seem
> most common to me are horizontal slices (nlon x nlat) or zonal slices (nlat
> x nlev).  So both read accesses need a lot more disk accesses than optimal,
> though in the first case, you don't actually have to unpack anything you
> don't need.  Also, they have all fields for one lat slice together, and I
> assume that read accesses more often deal with one field over the entire
> volume, at least for visualization.  However, it may be a good way to do it
> for scientific processing, where you need to calculate using many of the
> fields at a certain location.
> 
> So you might imagine a design that gives ccm2 the ability to store data in
> the way they are already doing it, so that the current read and write
> (in)efficiencies are preserved, but with the advantages of the netcdf API
> and machine independence.

And that should be possible by specifying the order of dimensions and the
packing parameters (number of bits precision, range of values) for each
netCDF variable when it is defined.  But I'm not convinced that the benefits
of the extra compression you get by giving up on the ability to read small
subsets of the data efficiently are worth the cost.  If this means the CCM2
can't use netCDF because its storage scheme is not optimized enough for that
particular application, its developers will have to live with the other
trade-offs involved in developing an application-specific interface and
format that is better suited to that purpose.
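
For comparison, and to be sure we are talking about the same thing, my
reading of the per-level scale/offset packing you describe is roughly the
sketch below: each nlon-long row gets its own scale and offset derived from
its minimum and maximum, and the values are stored as 16-bit integers.  (The
16-bit size, the rounding, and leaving -32768 free for a fill value are my
assumptions, not necessarily what the CCM2 code actually does.)

    #include <math.h>

    #define NLON 128

    /* Pack one row of NLON floats into 16-bit integers, using a scale and
     * offset computed from that row's own range.  Packed values fall in
     * -32767..32767; -32768 is left unused for a possible fill value.
     * Unpacking is just  value = packed * scale + offset. */
    void
    pack_row(const float in[NLON], short out[NLON],
             float *scale, float *offset)
    {
        float min = in[0], max = in[0], s, o;
        int   i;

        for (i = 1; i < NLON; i++) {
            if (in[i] < min) min = in[i];
            if (in[i] > max) max = in[i];
        }
        o = (max + min) / 2.0f;
        s = (max > min) ? (max - min) / 65534.0f : 1.0f;

        for (i = 0; i < NLON; i++)
            out[i] = (short) floor((in[i] - o) / s + 0.5);

        *scale  = s;
        *offset = o;
    }

Unpacking a single value then needs only the packed integer plus the scale
and offset for its row.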

If, however, the problem you have described for the CCM2 is just an example
of a more general problem that occurs over and over in scientific data
access, I might be convinced of the importance of developing a solution that
favors optimal compression at the expense of read-access performance.
Currently, I'm leaning toward preserving this fundamental property of netCDF
access for the thousands of other users who are satisfied with even the
current meager packing capabilities.

> More generally, you might imagine some way to allow users to specify
> implementation (storage strategy), with no change in API (obvious change in
> efficiency of API), with high level tools to reorganize a file without the
> interface looking any different (like DB managers do).  I assume that will
> be the thrust of the HDF implementation of the "data chunking" idea.  Data
> compression might fall into this category.

I'll have to think about this.  DB managers generally base storage
reorganizations that preserve the database schema on observed usage and
access patterns, because the patterns and frequencies of database
transactions and queries can't necessarily be anticipated when a database is
created.  The same is true of scientific data, but caching and multi-level
storage hierarchies have been the main way to address it in the past.  This
may be a laudable goal, but it seems like it would be difficult to implement.

> Or perhaps netcdf should stay lean and clean, and these complexities be
> implemented in a larger system like HDF, which seems to have a lot of
> funding?  I don't know what your vision of netcdf is, or how it relates to
> other systems.  With hdf having a netcdf interface, one could argue that
> large datasets should move to hdf, and netcdf remain a small,
> understandable system.
> 
> > I'm still hoping we can work out the details of a packed floating-point
> > representation such as you have suggested, because I think it's superior to
> > my idea of using arrays of scales and offsets.  Please let me know if you
> > have any other thoughts on this.
> 
> Perhaps you could give me a thumbnail sketch of your "array of scales and
> offsets" design, so I can think about it concretely.  I remain undecided as
> to the advantages of scale and offset vs small floating point.

I'll have to send this tomorrow.

--Russ