Comments on netCDF/HDF Draft Design

Mike,

Several of us (Mitch Baltuch, Peggy Bruehl, Glenn Davis, Steve Emmerson,
Dave Fulker, and Russ Rew) have had a chance to read and think about the
draft ``netCDF/HDF Design Document'' and we have some questions and
comments, which I've collected into the following response.  Please pass
these on to whoever else should see such comments at NCSA.

First, it is not clear from the draft what you intend with regard to data
stored in the current netCDF format.  More specifically, will it be possible
to use tools written to the planned netCDF/HDF interface on archives of
current netCDF files?  Or will such archives have to be converted to HDF
first?  We would naturally prefer that the implementation be able to
recognize whether it is dealing with HDF or netCDF files.  Symmetrically,
if a program uses the netCDF/HDF interface implementation to create a file,
our library should be able to read it (though we currently don't have the
resources to commit to this).  In fact this goal could be stated more
strongly:

  Data created by a program that uses the netCDF interface should be
  accessible to other programs that use the netCDF interface without
  relinking.

This principle covers portability of data across different platforms, but
also implies that a consolidated library must handle both netCDF/HDF and
netCDF file formats and must maintain backward compatibility for archives
stored using previous versions of the formats.  This level of
interoperability may turn out to be impractical, but we feel it is a
desirable goal.  It seems to be implied by the sentence on page 4:

  A hybrid implementation will also give HDF users the power of the netCDF
  tools while at the same time making the HDF tools available to netCDF users.

Note that one possible way to achieve this goal is to recognize the file
type when a file is opened, and use the current implementations of the HDF
and netCDF libraries as appropriate.  A new flag used when creating a file
could specify which type of file representation was desired.
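
As a rough sketch of the recognition step, the first few bytes of a file
are enough to tell the two formats apart.  This assumes that a netCDF file
begins with the bytes 'C' 'D' 'F' 0x01 and an HDF file with 0x0e 0x03 0x13
0x01; the names detect_format() and file_format are hypothetical and belong
to neither library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical format tags; not part of the netCDF or HDF libraries. */
typedef enum { FMT_UNKNOWN, FMT_NETCDF, FMT_HDF } file_format;

/* Classify a file from its first bytes.  Assumes a netCDF file starts
   with 'C' 'D' 'F' 0x01 and an HDF file with 0x0e 0x03 0x13 0x01. */
file_format detect_format(const unsigned char *header, size_t len)
{
    static const unsigned char netcdf_magic[4] = { 'C', 'D', 'F', 0x01 };
    static const unsigned char hdf_magic[4]    = { 0x0e, 0x03, 0x13, 0x01 };

    assert(header != NULL);
    if (len < 4)
        return FMT_UNKNOWN;             /* too short to hold either magic */
    if (memcmp(header, netcdf_magic, 4) == 0)
        return FMT_NETCDF;
    if (memcmp(header, hdf_magic, 4) == 0)
        return FMT_HDF;
    return FMT_UNKNOWN;
}
```

An open routine could read four bytes, call detect_format(), and hand the
file to whichever existing library matches.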

This use of two different representations for data accessed by the same
interface can be justified if each representation has clear benefits;
otherwise, we should agree on using the single superior representation and
relegating the other to read-only support as long as useful archives in that
form exist.  If a representation based on VSets is superior to the current
netCDF representation in some ways and inferior in other significant ways,
then the use of both representations is likely to continue.  For example, it
may be possible to support variable-length records with the VSet
implementation at the cost of slower hyperslab access.  In such a case,
users would benefit most if the alternative tradeoffs captured in the two
different representations were available from a single library at file
creation time.  Although it may be too early to determine the advantages or
disadvantages of one representation over the other, perhaps the draft
should make clearer how the benefits of the VSet-based implementation
compare with the implementation costs and the potential space and
performance penalties discussed in section 3.

We could not determine from the draft whether this project includes
resources for rewriting existing HDF tools to use the netCDF/HDF interface.
If it does, will these tools also use other HDF interfaces or low-level HDF
calls?  Tools that do may not be very useful to the netCDF user community.
This is a question of completeness of the interface.  If the netCDF/HDF
interface lacks functionality that the tools need, forcing them to use
other HDF interfaces, perhaps it would be better to augment the netCDF/HDF
interface until it is completely adequate for such tools.

Here are some more specific comments on the draft design document, in order
of appearance in the draft document:

On page 1, paragraph 1, you state:

  [netCDF] has a number of limiting factors.  Foremost among them are
  problems of speed, extensibility and supporting code.

If the netCDF model permitted more extensibility by allowing users to define
their own basic data types, for example, it might be impractical to write
fully general netCDF programs like the netCDF operators we have specified.
There is a tradeoff between extensibility and generality of programs that
may be written to a particular data model.  The ultimate extensibility is to
permit users to write any type of data to a file, e.g. fwrite(), but then
no useful high-level tools can be written that exploit the data model; it
becomes equivalent to a low-level data-access interface.  The lack of
extensibility may thus be viewed as a carefully chosen tradeoff rather than
a correctable disadvantage.

On page 2, paragraph 2:

  The Unidata implementation only allows for a single unlimited dimension
  per data set.  Expectations are that the HDF implementation will not have
  such a limitation.

We are somewhat skeptical about the practicality of supporting both multiple
unlimited dimensions and efficient direct-access to hyperslabs.  Consider a
single two-dimensional array with both dimensions unlimited.  Imagine
starting with a 2 by 2 array, then adding a third column (making it 2 by 3),
then adding a third row (making it 3 by 3), then adding a fourth column
(making it 3 by 4), and so on, until you have an N by N array.  Keeping the
data contiguous is impractical, because it would require about 2*N copying
operations, each moving up to O(N**2) elements, resulting in an
unacceptably slow O(N**3) algorithm for O(N**2) data elements.  The
alternative of keeping each incremental row and
column in its own VData would mean that accessing either the first row or
the first column, for example, would require O(N) reads, and there would be
no easy way of reading all the elements in the array by row or by column
that did not require multiple reads for many of the data blocks.  With the
current implementation, each row requires only 1 read and all the elements
in the array may be read efficiently from the N row records.
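
The copying estimate can be checked with a small counting sketch.  It
assumes, as the estimate above does, that every growth step rewrites all
the elements already stored; copies_to_grow() is a hypothetical helper
written only for this illustration.

```c
#include <assert.h>

/* Count the element copies needed to grow a contiguous array from 2 x 2
   to n x n by alternately adding one column and one row.  Assumes every
   growth step rewrites all elements already stored. */
unsigned long long copies_to_grow(unsigned long long n)
{
    unsigned long long rows = 2, cols = 2, copied = 0;

    assert(n >= 2);
    while (rows < n || cols < n) {
        copied += rows * cols;      /* relocate the existing elements */
        if (cols <= rows)
            cols++;                 /* add a column */
        else
            rows++;                 /* add a row */
    }
    return copied;
}
```

For example, copies_to_grow(3) is 10 (4 elements moved, then 6) and
copies_to_grow(4) is 31.  The total grows with the cube of n, so for large
n doubling the size multiplies the copying by roughly eight.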

Most netCDF programs we have seen use direct access to hyperslabs, and we
think maintaining efficient direct access to hyperslabs of multidimensional
data should be an important goal.  If you can eliminate the current netCDF
restriction of only a single unlimited dimension while preserving efficient
hyperslab access, we would be very impressed.

Page 2, paragraph 5:

  One of the primary drawbacks of the existing Unidata implementation is
  that it is based on XDR.

This is another case where a particular tradeoff can be viewed as a drawback
or a feature, depending on the requirements.  Use of a single specific
external data format is an advantage when maintaining the code, comparing
files written on different platforms, or supporting a large number of
platforms.  Use of native format and converters, as in HDF, means that the
addition of a new platform requires writing conversions to all other
existing representations, whereas netCDF requires only conversion to and
from XDR.  The performance of netCDF in some common applications relates
more to the stdio layer below XDR than to XDR: the buffering scheme of stdio
is not optimal for styles of access used by netCDF.  We have evidence that
this can be fixed without abandoning XDR or the advantages of a single
external representation.
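
To make the single external representation concrete: XDR stores a 32-bit
integer as four bytes, most significant byte first, so one pair of routines
per platform is enough.  The sketch below is written to work regardless of
the host byte order; put_xdr_int32() and get_xdr_int32() are hypothetical
names, not functions from the XDR or netCDF libraries.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit integer in XDR form: four bytes, most significant
   byte first, independent of the host byte order. */
void put_xdr_int32(unsigned char out[4], int32_t value)
{
    uint32_t u = (uint32_t)value;

    assert(out != 0);
    out[0] = (unsigned char)((u >> 24) & 0xffu);
    out[1] = (unsigned char)((u >> 16) & 0xffu);
    out[2] = (unsigned char)((u >> 8) & 0xffu);
    out[3] = (unsigned char)(u & 0xffu);
}

/* Read a 32-bit integer back from its four-byte XDR form. */
int32_t get_xdr_int32(const unsigned char in[4])
{
    uint32_t u;

    assert(in != 0);
    u = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
        ((uint32_t)in[2] << 8) | (uint32_t)in[3];
    /* Map the unsigned bit pattern back to a signed value without
       relying on implementation-defined conversions. */
    if (u & 0x80000000u)
        return (int32_t)(u - 0x80000000u) + INT32_MIN;
    return (int32_t)u;
}
```

A new platform needs only this one pair of conversions, whereas converting
directly between native formats needs a converter for every other
representation already supported.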

Page 4, paragraph 2:

  In fact, the people at Unidata are reluctant to divulge how a netCDF
  structure is actually stored on disk ...

This is a slight overstatement.  We were reluctant only to document the
netCDF structure in early versions of the netCDF User's Guide; the
structure of netCDF files has always been derivable from the code, which we
make freely available.  We have since added a chapter to the User's Guide,
``The NetCDF File Structure and Performance'', which discusses the parts of
a netCDF file and their order.

Page 4, paragraph 6:

  For instance, it will then be possible to associate a 24-bit raster image
  with a [netCDF] variable.

We're not sure how it would be possible to access such data using the
existing netCDF interface.  For example, if you used ncvarget(), would you
have to provide the address of a structure for the data to be placed in?  If
other new types are added, how can generic programs handle the data?  What
is returned by ncvarinq() for the type of such data?  Do you intend that
attributes can have new types like "24-bit raster image" also?  As for
storing 24-bit data efficiently, we have circulated a proposal for packed
netCDF data using three new reserved attributes that would support this.

Page 5, paragraph 4:

  Then if the user wants to associate any attributes with that dimension,
  they are forced to create a variable with the same name (i.e. time(time)
  in the variable section of Figure 1) and associate any attributes with the
  variable. ... Since a dimension can have any number of attributes, it is
  necessary ...

Strictly speaking, a netCDF dimension can't have attributes, only a name and
a size.  If a variable has the same name as a netCDF dimension and the
variable's shape is specified by that dimension, it is treated, by
convention only, as a coordinate variable for the dimension.  The amount of
space saved
by merging dimensions with their coordinate variables seems small, since
netCDF datasets typically have a small number of dimensions compared to the
amount of data.  It might even end up taking more space for some datasets,
since you presumably would have to generate dimension values for dimensions
that had no corresponding coordinate variable.

Page 7, paragraph 2:

  ... it is not readily clear that a distinction needs to be made between
  dimensions and variables.

Dimensions serve to interrelate variables that are defined on a common grid,
as well as specifying shapes and sizes of variables.  It seems necessary to
preserve the distinction between netCDF dimensions and variables for several
reasons.  First, some variables cannot serve in the role of dimensions, for
example multidimensional variables, or single-dimension variables with
non-monotonic values.  Second, some of the properties of variables make no
sense for dimensions, for example missing values, type, and associated
attributes.  Some characteristics of dimensions also do not make sense for
variables: for example, it is easy to define what is meant by an "unused
dimension" (one not used to define the shape of any variable), but what
would an "unused variable" mean?  We think you are right when you say
 
  Representing these two object the same way may cause more problems than it
  solves ...

Page 7, paragraph 5:

  However, people have asked that the netCDF be able to handle 300,000
  records, each record containing a single 8-bit data element.

We currently round the size of each record up to the next 32-bit
boundary, so you may be trying something too ambitious if you plan to make
this much more space-efficient than under the current implementation.
However, the 50-byte overhead for each record under HDF, if each record is
stored as a VData, does seem too extravagant.
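
The arithmetic behind this comparison is sketched below.  The helper names
are hypothetical; the only figures taken from the discussion are the 32-bit
rounding and the 50-byte overhead for each record.

```c
#include <assert.h>
#include <limits.h>

/* Round a record size in bytes up to the next multiple of 4 (32 bits). */
unsigned long padded_record_size(unsigned long nbytes)
{
    assert(nbytes <= ULONG_MAX - 3UL);  /* guard against wraparound */
    return (nbytes + 3UL) & ~3UL;
}

/* Total bytes used by nrecords records, each padded to 32 bits. */
unsigned long padded_total(unsigned long nrecords, unsigned long nbytes)
{
    return nrecords * padded_record_size(nbytes);
}

/* Total overhead in bytes for a fixed cost per record, for example
   50 bytes if each record is stored as its own VData. */
unsigned long overhead_total(unsigned long nrecords,
                             unsigned long per_record)
{
    return nrecords * per_record;
}
```

For 300,000 records of one byte each, padded_total(300000, 1) gives
1,200,000 bytes, while overhead_total(300000, 50) gives 15,000,000 bytes of
overhead before any data is counted.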

Page 8, paragraph 1:

  The current VGroup access routines would require a linear search through
  the contents of a VGroup when performing lookup functions. ... Because a
  variable's VGroup may contain other elements (dimensions, attributes, etc.
  ...) it is not sufficient to go to the Xth child of the VGroup when
  looking for the Xth record.

As stated above, we think it is very important to preserve direct access to
netCDF data, and to keep hyperslab access efficient.
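
By direct access we mean that the position of any record or element can be
computed rather than searched for.  The sketch below assumes fixed-size
records stored one after another and row-major element order;
record_offset() and element_offset() are hypothetical helpers, not part of
the netCDF interface.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of record number recnum, assuming fixed-size records
   stored one after another starting at byte data_start. */
unsigned long record_offset(unsigned long data_start,
                            unsigned long record_size,
                            unsigned long recnum)
{
    return data_start + recnum * record_size;
}

/* Row-major position of an element within an array of the given shape,
   counted in elements from the start of the array. */
size_t element_offset(size_t ndims, const size_t *shape,
                      const size_t *index)
{
    size_t offset = 0;
    size_t d;

    for (d = 0; d < ndims; d++) {
        assert(index[d] < shape[d]);    /* index must lie inside the array */
        offset = offset * shape[d] + index[d];
    }
    return offset;
}
```

Either calculation takes the same small number of steps no matter how many
records precede the one wanted, whereas a linear search through a VGroup
takes time proportional to the record number.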

Page 8, paragraph 6:

  Furthermore, Unidata is in the process of adding operators to netCDF,
  which may be lost by adopting SILO as a front-end.

The netCDF operators do not currently involve any extensions to the netCDF
library; they are written entirely on top of the current library interface.
It is possible that we will want to add an additional library function later
to provide more efficient support for some of the netCDF operators (e.g.
ncvarcpy(), which would copy a variable from one netCDF file to another
without going through the XDR layer).  We agree with your decision to use
the Unidata netCDF library rather than SILO as the "front-end".

We have set up a mailing list here for Unidata staff who are interested in
the netCDF/HDF project:  netcdf-hdf@xxxxxxxxxxxxxxxxx  Feel free to send
additional responses or draft documents to that address or to individual
Unidata staff members.

----
Russ Rew                                        russ@xxxxxxxxxxxxxxxx
Unidata Program Center    
University Corporation for Atmospheric Research
P.O. Box 3000
Boulder, Colorado 80307-3000