
Re: netCDF packing



> Organization: NCAR / CGD
> Keywords: 199502010112.AA17676

Hi Brian,

> I work with Phil Rasch as a programmer and he has asked me to send you the
> following proposal for an extended packing facility for netCDF.  Since we
> are aware that you guys are thinking about extending the packing
> capabilities of netCDF, we wanted to provide input on the needs of a model
> like the Climate System Model currently under development at NCAR.

Thanks for sending this proposal and specifying your needs to this level of
detail.  Other netCDF users have also suggested that if we implement
transparent packing and unpacking in the library, we generalize it enough so
that efficient packing is possible when it is known that different array
cross-sections can benefit from different packing parameters.

Before designing additional netCDF interfaces for defining and accessing
packed data, I'd like to be convinced that existing mechanisms and
interfaces are insufficient.  In particular, I suspect there is a practical
way to represent the packing parameters as variable attributes, rather than
using a separate interface for defining packed variables.  A limitation of
variable attributes is that they can only be one-dimensional vectors, and
some generalized packing algorithms require multidimensional packing
parameters.  But an attribute can name other multi-dimensional variables
that contain multi-dimensional packing parameters.  Another option is a
convention for naming associated variables that contain packing parameters
for an array.  Others have suggested that one-dimensional packing parameters
stored as reserved attributes of variables would be sufficient for the great
majority of cases where packing parameters vary only along a single
dimension.

A basic and desirable feature of netCDF access is the ability to read data
from arrays in a different order than it was written, and in particular the
ability to write or read array cross-sections of any appropriate rank in any
direction and order.  I think it is important that any proposed addition of
transparent packing should preserve this feature, even for packed data.

To support "transparent" packing, readers of data should not have to know
whether or how the data is packed, but this information should be available
if needed.

It would help us a great deal if someone had the resources to implement one
or more packing proposals as sets of interfaces above the netCDF layer using
only existing netCDF interfaces, mechanisms, and conventions.  This would
make it easier to evaluate the benefits of moving the packing into the
library layer, plus many of the implementation and interface issues would
have already been dealt with.

This is a very good time to have your proposal in hand, since we are nearly
to the point of making a list of what is needed and practical for the next
netCDF release.  We will have to consider the difficulty of implementing
various proposed generalizations of simple packing while keeping the
interface simple.  When we get closer to making decisions about priorities
of what to implement, I'll solicit feedback about this and other requests.

Thanks again for your proposal.

--Russ

> =============================================================================
> 
> A Proposal for an Extended Data Packing Capability for NetCDF.
> 
> One of the important issues that has arisen from a study of the feasibility
> of using netCDF as the output format for NCAR's new climate system model
> (CSM) is that of data packing.  This note contains a brief discussion of the
> current data packing capabilities of netCDF and the current data packing
> procedure used in NCAR's community climate model (CCM2) which will be a 
> major component of the CSM.  We continue by suggesting how the data packing
> capability of netCDF could be extended to better suit the needs of the
> CSM.
> 
> The type of packing currently enabled by the netCDF package is the
> following.  Data size may be reduced by mapping the range of the data to be
> packed into the range of values of an integer of some specified length.  In
> addition to storing the integer data, an additive offset and scale factor
> must be saved in order to convert the integer data back to the original
> data range.  NetCDF enables this type of packing scheme by the following
> convention.  If a variable has the attributes add_offset and scale_factor
> defined, then an application should unpack the data by first multiplying
> the variable by scale_factor and then adding add_offset.  The type of the
> add_offset and scale_factor attributes should be the same as the desired
> type of the unpacked data.  These conventions leave it up to the user to
> pack the data, and to unpack it too if not using an application that does
> the unpacking.
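
The convention described above amounts to a few lines of C.  A minimal
sketch (the function and parameter names here are illustrative, not part of
the netCDF library):

```c
#include <stddef.h>

/* Unpack 16-bit packed values using the netCDF attribute convention:
 * unpacked = packed * scale_factor + add_offset.
 * Names are illustrative; this is not a netCDF library routine. */
void unpack_shorts(const short *packed, double *unpacked, size_t n,
                   double scale_factor, double add_offset)
{
    for (size_t i = 0; i < n; i++)
        unpacked[i] = packed[i] * scale_factor + add_offset;
}
```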
> 
> The current method of packing data for CCM2 output uses the same type of
> scaling scheme described above, but rather than applying a single mapping
> to a whole 3-D field of data, the data is packed with a separate mapping
> for each longitude circle (i.e., all longitudes on a single latitude and at
> a single vertical level).  In addition, the maps (i.e., offset and scale
> factor) are computed on the fly (i.e., just before the data is scaled and
> written to the output file).  One reason for this is that the error
> made in packing the data depends not only on the length of integer used,
> but also on the range of the data being packed.  Typically field values
> vary much more vertically than they do zonally, so packing on latitude
> circles minimizes the range of data, and computing the maps on the fly
> allows use of a map with minimum loss of precision.  (Computing the map on
> the fly implies that for each array of values to be packed, the array is
> searched for its minimum and maximum values.  This allows the minimum
> possible range to be used rather than using a prescribed range to do the
> mapping.  The trade-off is that it's expensive.)  The type of map currently
> enabled by netCDF does not allow this scheme.  It is possible, however,
> that the netCDF scheme could be used for at least some of the CCM2 output
> fields.  But the maps would need to be pre-determined because the model
> outputs data one latitude slice (i.e., a longitude-height cross section) at a
> time, so the full 3-D field, which would be required to compute the offset
> and scale factors on the fly, is not available at the time when data is
> output.
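
The on-the-fly map computation described above can be sketched in plain C;
the function name, the use of 16-bit shorts, and the mapping onto
[0, SHRT_MAX] are assumptions made for illustration:

```c
#include <stddef.h>
#include <limits.h>

/* Search one longitude circle for its min and max, derive a scale factor
 * and offset mapping [min, max] onto [0, SHRT_MAX], and pack the values.
 * Unpacking is then data[i] ~= packed[i] * scale + offset.
 * Illustrative only; not an existing netCDF interface. */
void pack_circle(const double *data, short *packed, size_t n,
                 double *scale, double *offset)
{
    double lo = data[0], hi = data[0];
    for (size_t i = 1; i < n; i++) {
        if (data[i] < lo) lo = data[i];
        if (data[i] > hi) hi = data[i];
    }
    *scale = (hi > lo) ? (hi - lo) / SHRT_MAX : 1.0;
    *offset = lo;
    for (size_t i = 0; i < n; i++)
        packed[i] = (short)((data[i] - *offset) / *scale + 0.5);
}
```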
> 
> We propose generalizing the packing so that instead of specifying a single
> map (i.e., _Scale and _Offset) for a variable, we would be able to
> specify maps for certain types of hyperslabs (defined below).  For
> example, suppose we have dimensions lon, level, and lat, and a variable
> temp declared as temp(lat,level,lon).  Now we would like to pack all
> longitudes at a given latitude and level, so we need to specify a map for
> each latitude and level.  The map parameters _Scale and _Offset could be
> stored in an array _temp(lat,level,2) which would have the same type as
> the original floating point data.  If we instead wanted to pack each
> horizontal level, then the map parameters would be stored in the array
> _temp(level,2).
> 
> What follows is a suggested interface of a "transparent" implementation
> of a generalized packing scheme.
> 
> Ideally the map parameters would be computed internally, but unless the
> user was restricted to outputting the same hyperslabs that packing was
> requested on, the data necessary for computing the map would not
> necessarily be available when data output was requested.  So it seems that
> the user would have to provide the map parameters.  Since the parameters
> can be calculated once the range of the data is known, the user should be
> required to supply this range (which can either be calculated on the fly
> or be prescribed).  One advantage of having the user specify
> the data range rather than to directly calculate the map parameters is
> that it is then possible for the code to make an adjustment to the range
> so that if zero is contained in the input data range, then zero will be
> a possible output from the packed data.
> 
> Since the user is required to specify what the packing hyperslabs will be,
> a new variable declaration function would be required.  The C interface
> might be:
> 
> int ncvardefp(int ncid, const char* name, int ndims, nc_type datatype,
>               const int dimids[], const int nbits, const long pkcount[]);
> 
> The arguments "ncid, name, ndims, and dimids" are the same as for ncvardef.
> The argument "datatype" would now be used to specify the type of the UNPACKED
> data (this is also the type of the map parameters).  The argument "nbits" is
> used to specify the number of binary bits of precision in the packed data,
> and the last argument allows the specification of a hyperslab shape in the
> same fashion that is used in the ncvarget function.  I am envisioning that
> the packing hyperslabs be restricted so that any element of the pkcount
> array that is not 1 must be equal to the size of the corresponding
> dimension.  This restriction is necessary for the map parameters to be
> easily indexed to the hyperslabs as in the example above.
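
The restriction on pkcount stated above is easy to express in code.  A
sketch (ncvardefp and these argument names belong to the proposal's
hypothetical interface, not to an existing one):

```c
/* Check the restriction proposed for the hypothetical ncvardefp:
 * each element of pkcount must be either 1 or the full size of the
 * corresponding dimension.  Returns 1 if valid, 0 otherwise. */
int pkcount_valid(const long dimsizes[], const long pkcount[], int ndims)
{
    for (int i = 0; i < ndims; i++)
        if (pkcount[i] != 1 && pkcount[i] != dimsizes[i])
            return 0;
    return 1;
}
```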
> 
> When outputting data, the user is responsible for outputting an array
> that contains the lower and upper bounds of the data for each packing
> hyperslab contained in the output hyperslab.
> Here is a possible C interface:
> 
> int ncvarputp(int ncid, int varid, const long start[], const long count[],
>               const void *values, const void *bounds);
> 
> The arguments are the same as for ncvarput except for the addition of the
> argument "bounds" which would contain the bounding data.  An example will
> show how "bounds" should be set.
> 
> Consider the example above with dimensions lat, level, and lon, which have
> the respective sizes 64, 18, and 128.  The variable is temp(lat,level,lon)
> and we want to pack all longitude values at each latitude and level.  Then
> when we declare the variable using ncvardefp, the array pkcount should be
> set to (1,1,128) which describes the shape of the desired packing
> hyperslabs.  The map parameters will be stored internally in an array, say
> _temp(64,18,2).  Now suppose the user is ready to output a hyperslab of
> data with the shape (1,18,128).  Then bounding data is required for 18
> packing hyperslabs, and "bounds" would point to an array of size (18,2).
> If the "start" array was (37,1,1), then the "bounds" data would be stored
> internally in the locations of _temp(37,*,*) where the *s are used to
> denote the hyperslabs that correspond to setting the lat index to 37.
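
Under this scheme, the number of (lower, upper) bounds pairs a call to the
hypothetical ncvarputp must supply follows from the output shape and the
packing shape: one pair per packing hyperslab intersected.  A sketch (names
illustrative):

```c
/* Count the bounds pairs needed for an output hyperslab: the product of
 * count[i] over the dimensions in which pkcount[i] == 1, since each
 * pkcount element is either 1 or the full dimension size.  Sketch only. */
long bounds_pairs(const long count[], const long pkcount[], int ndims)
{
    long n = 1;
    for (int i = 0; i < ndims; i++)
        if (pkcount[i] == 1)
            n *= count[i];
    return n;
}
```

For the example above, an output hyperslab of shape (1,18,128) with packing
shape (1,1,128) needs 18 pairs, matching the (18,2) bounds array.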
> 
> A potential problem is illustrated by the following example in which the
> packing hyperslab is not a subset of the output hyperslab.  Suppose as
> before that we have a variable temp(64,18,128), but we desire packing on
> horizontal slices, so the pkcount array is specified as (64,1,128).  In
> this case the map parameters will be stored internally in an array
> _temp(18,2).  As before, we request output of a hyperslab with the shape
> (1,18,128), and again the bounding data is required for 18 hyperslabs, so
> the bounds pointer should point to an array of size (18,2).  But now for
> each hyperslab of this type (there will be 64 of them) a bounds array of
> the size (18,2) is required as input, and if they don't match in all 64 cases,
> then the data won't be unpacked correctly.  In other words, when a packing
> hyperslab is not output with a single call to ncvarputp, then the user has
> to supply the bounds for the same packing hyperslab more than once.  This
> is clearly an error waiting to happen, and the code would need to check
> internally that redundant specifications were consistent in order for the
> code to be robust.