
Re: NetCDF Packing



John,

> Or perhaps netcdf should stay lean and clean, and these complexities be
> implemented in a larger system like HDF, which seems to have a lot of
> funding?  I dont know what your vision of netcdf is, and its relation to
> other systems.  With hdf having a netcdf interface, one could argue that
> large datasets should move to hdf, and netcdf remain a small,
> understandable system.

I definitely don't want to add too many "bells and whistles" to netCDF in an
attempt to satisfy specialized needs.  I want to keep the surface area of
the interface small and the User's Guide short enough that it is not
intimidating.  I also want to make sure any new features don't impose a cost
on those who don't use them or need them.  I think packing is currently the
single feature that we're getting the most pressure to add, but almost all
of that pressure is coming from NCAR users rather than from the majority of
users in various earth science and other disciplines.  Nevertheless, I
think the
addition of (mostly) transparent packing is important and would make
netCDF more useful.

As far as vision goes, one thing I would like to be working on is
improvements to the C++ interface, perhaps even netCDF iterators that fit in
with the new Standard Template Library.  This would give netCDF
programmers powerful and easy-to-use tools for looping through, selecting,
searching, and sorting netCDF objects.

> > I'm still hoping we can work out the details of a packed floating-point
> > representation such as you have suggested, because I think it's superior to
> > my idea of using arrays of scales and offsets.  Please let me know if you
> > have any other thoughts on this.
> 
> Perhaps you could give me a thumbnail sketch of your "array of scales and
> offsets" design, so I can think about it concretely.  I remain undecided as
> to the advantages of scale and offset vs small floating point.

OK, although I haven't worked out the precise additions to the C and Fortran
interface that would be required.  Some of these details would apply to the
use of packed floating-point as well as scale and offset arrays.

My idea was that the packing parameters for a variable would be set up at
variable definition time.

Readers would not have to be aware of whether a variable had been set up as
packed or not, but could find out the packing parameters by calling a
suitable inquire function for the variable.  A writer would get an error
returned if it tried to write a value inconsistent with the packing
parameters for a variable, but otherwise wouldn't have to know that a
variable was packed.

All netCDF types would permit packed representations, so three-bit ints and
booleans could be stored efficiently even if they are declared to be of type
NC_BYTE.
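As a rough illustration of what Nbits-packing of small integers could save
(this is a sketch of the general bit-packing idea, not the actual netCDF
storage layout, and the function name is made up):

```python
def pack_bits(values, nbits):
    # Concatenate nbits-wide unsigned values into a byte string,
    # most significant bits first.  Illustrative only -- not the
    # real netCDF/XDR on-disk layout.
    bits = 0          # accumulated bit buffer
    nb = 0            # number of valid bits in the buffer
    out = bytearray()
    for v in values:
        assert 0 <= v < 2**nbits
        bits = (bits << nbits) | v
        nb += nbits
        while nb >= 8:            # emit full bytes as they accumulate
            nb -= 8
            out.append((bits >> nb) & 0xFF)
    if nb:                        # pad the final partial byte with zeros
        out.append((bits << (8 - nb)) & 0xFF)
    return bytes(out)

# Eight 3-bit values occupy 3 bytes instead of 8 bytes of NC_BYTE storage.
packed = pack_bits([0, 1, 2, 3, 4, 5, 6, 7], 3)
print(len(packed))   # 3
```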

The number of bits (Nbits) would be a scalar packing parameter for a
variable, so you couldn't use 10 bits for one cross-section and 6 bits for a
different cross-section of the same array.  The other two packing
parameters, Scale and Offset, could be multi-dimensional arrays using some
subset (including the empty subset for scalar packing constants) of the
dimensions of a variable.  For example,

   float T(time, level, lat, lon)

could have Scale(level) and Offset(level), to exploit the fact that
temperatures at a given atmospheric level may have a smaller range (and
hence be packed better) than global temperatures at all levels.  You might
also use the lat dimension as a packing dimension, in which case
Scale(level,lat) and Offset(level,lat) would be 2-d packing arrays set up
for the variable T.
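A minimal sketch of how per-level Scale(level) and Offset(level) might be
computed for a variable like T(time, level, lat, lon).  The variable names,
toy data, and loop structure are all illustrative; only the formulas come
from the scheme described here:

```python
NBITS = 10       # scalar Nbits packing parameter for the variable
NLEVELS = 3

# Toy data: T[time][level][lat][lon], here 2 x 3 x 2 x 2 synthetic values
T = [[[[290.0 + t + lv + la + lo for lo in range(2)]
       for la in range(2)]
      for lv in range(3)]
     for t in range(2)]

scale = []
offset = []
for lv in range(NLEVELS):
    # Gather every value at this level, across time, lat, and lon
    vals = [T[t][lv][la][lo]
            for t in range(2) for la in range(2) for lo in range(2)]
    vmin, vmax = min(vals), max(vals)
    offset.append(vmin)   # Offset = per-level minimum, so packed >= 0
    # One packed code (2^Nbits - 1) is reserved for _FillValue, leaving
    # 2^Nbits - 1 distinct data values, 0 .. 2^Nbits - 2
    scale.append((vmax - vmin) / (2**NBITS - 2))

print(offset)   # per-level minima
print(scale)    # per-level scale factors
```

Because each level has a narrower range than the variable as a whole, each
per-level Scale is smaller, i.e. the same Nbits buys more precision.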

The Nbits, Scale, and Offset parameters must be defined for a variable
before any values have been written for that variable (including
_FillValues) and must not be redefined with different values after any
values (including _FillValues) have been written.

0 <= Nbits <= 32.  If Nbits is 0, no data needs to be stored, and
this variable is only a handle for attributes.  In this case, the
variable's value on a read is the _FillValue.  It is not possible
to store more than 32 bits of precision, even for double values,
[because of the restrictions of XDR?].  Providing a value of Nbits
greater than 16 for a NC_SHORT variable or greater than 8 for an
NC_CHAR or NC_BYTE variable is not useful.

One value of the packed range will be used for the representation of
the packed _FillValue, so the packed values will represent 

    2^Nbits - 1 

distinct data values.  

The Offset parameter should be of the same type as the variable.  A useful
value of a scalar Offset is the minimum valid data value, so that all packed
data will be non-negative.

The Scale parameter should be of type double [or float?].  A useful value of
Scale in the case that data values map to the integers 0, 1, ..., 2^Nbits-2
and the missing value maps to 2^Nbits-1 is:

        (Max - Min)/(2^Nbits - 2)

assuming the packing formulas are:

        packed = truncate_toNbits( (value - Offset) / Scale )

        value = packed * Scale + Offset

The _FillValue will be mapped into the packed range to 2^Nbits-1.  The
_FillValue (and valid_range, valid_min, or valid_max) parameters should
always be specified in terms of the unpacked values of a variable.
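The packing formulas above, together with the _FillValue mapping, can be
sketched as follows.  The function names, the error handling, and the
particular _FillValue are illustrative, not proposed interface details:

```python
import math

NBITS = 8
FILL_PACKED = 2**NBITS - 1   # one packed code reserved for _FillValue
FILL_VALUE = -9999.0         # unpacked _FillValue (illustrative choice)

def pack(value, scale, offset):
    # packed = truncate_toNbits((value - Offset) / Scale), with the
    # reserved code 2^Nbits - 1 standing in for the _FillValue.
    if value == FILL_VALUE:
        return FILL_PACKED
    packed = math.trunc((value - offset) / scale)
    if not 0 <= packed <= 2**NBITS - 2:
        # A writer gets an error for a value inconsistent with the
        # packing parameters.
        raise ValueError("value inconsistent with packing parameters")
    return packed

def unpack(packed, scale, offset):
    # value = packed * Scale + Offset; the reserved code reads back
    # as the _FillValue.
    if packed == FILL_PACKED:
        return FILL_VALUE
    return packed * scale + offset

# Scale = (Max - Min) / (2^Nbits - 2) with Offset = Min, so the minimum
# valid value packs to 0 and all packed data values are non-negative.
vmin, vmax = 250.0, 320.0
scale = (vmax - vmin) / (2**NBITS - 2)
offset = vmin

p = pack(300.0, scale, offset)
v = unpack(p, scale, offset)   # round trip, within one Scale of the input
```

With truncation, each unpacked value lands within one Scale below the
original, so Scale bounds the quantization error.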

--Russ