Re: Comments on netCDF/HDF Draft Design

netcdf-hdf group:  

This is a response to the response from the
Unidata team to the netCDF/HDF Design Document.  
The original response was posted to 
netcdf-hdf@xxxxxxxxxxxxxxxx on April 21.

Mike and Chris

=============================================================

Russ et al:

Thanks for your response to the "netCDF/HDF Design Document."  Now that
we have to really get the project going, things aren't nearly so simple,
and this kind of feedback is extremely useful.

We have gone over your response, and I've put together some
responses and clarifications, which follow.

Mike & Chris

>Mike,
>
...
>
>First, it is not clear from the draft what you intend with regard to data
>stored in the current netCDF format.  More specifically, will it be possible
>to use tools written to the planned netCDF/HDF interface on archives of
>current netCDF files?  Or will such archives have to be converted to HDF
>first?  We would naturally prefer that the implementation be able to
>recognize whether it is dealing with HDF or netCDF files.  Symmetrically,
>you should expect that if a program uses the netCDF/HDF interface
>implementation to create a file, then our library should be able to deal
>with it, (though we currently don't have the resources to be able to commit
>to this).  In fact this goal could be stated more strongly:
>
>  Data created by a program that uses the netCDF interface should be
>  accessible to other programs that use the netCDF interface without
>  relinking.
>
>This principle covers portability of data across different platforms, but
>also implies that a consolidated library must handle both netCDF/HDF and
>netCDF file formats and must maintain backward compatibility for archives
>stored using previous versions of the formats.  This level of
>interoperability may turn out to be impractical, but we feel it is a
>desirable goal.

We agree that it is desirable that users not have to think about (or even 
know) how their data is organized.  The difficulties involved in 
maintaining two or more storage formats are ones we already have to
deal with just within HDF.  There are instances where we've
developed newer better ways of organizing a particular object.  
It isn't fun, but so far it's been manageable.  

What worries me about this policy over the long term is the 
cumulative work involved as new platforms get introduced, and 
new versions of operating systems and programming languages 
are introduced.  As these sorts of things happen, we would 
like to not be committed to supporting all old "outdated" 
formats.

Initially we definitely will support both old and new netCDF formats.
We just don't want to guarantee that we will carry it over to new
platforms and machines.

There is another issue that has to do with supporting "old"
things.  Based on feedback we're getting from loyal HDF users, we'll 
probably want to extend that idea to data models, too.  For example,
some heavy users would rather stick with the predefined SDS model
than the more general netCDF model.  In a sense, that's no problem
since netCDF provides a superset of SDS.  We might define SDS as a 
standard netCDF data abstraction for a certain range of applications.
The same has been suggested of raster images.  

Still, this kind of thing could be very confusing to users trying to 
decide whether to use one or the other interface.  In addition, we would 
want all software to know that it could treat something stored as an 
SDS the same way it treats an equivalent netCDF.

I suspect you people have already faced this problem with differently 
defined netCDFs.  My guess would be that the problem is manageable if
the number of different abstractions is small.  I'd be interested in
your observations.

>  It seems to be implied by the sentence on page 4:
>
>  A hybrid implementation will also give HDF users the power of the netCDF
>  tools while at the same time making the HDF tools available to netCDF users.
>
>Note that one possible way to achieve this goal is to recognize the file
>type when a file is opened, and use the current implementations of the HDF
>and netCDF libraries as appropriate.  A new flag used when creating a file
>could specify which type of file representation was desired.

Yes, this would be a way to do it.  I would like to encourage one
format only, however, because in the long run it would make for
greater interoperability among programs.
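
For what it's worth, the open-time check itself looks cheap.  Here is a
rough sketch, assuming the two formats can be told apart by their first
four bytes (current netCDF files start with 'C' 'D' 'F' 0x01, and HDF
files with 0x0e 0x03 0x13 0x01; worth verifying against both format
descriptions).  The names file_format and sniff_format() are made up
for illustration and are not part of either library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: neither library exports these. */
typedef enum { FMT_UNKNOWN, FMT_NETCDF, FMT_HDF } file_format;

/* Classify a file from its first bytes.
 * Assumed magic numbers: netCDF = 'C' 'D' 'F' 0x01,
 *                        HDF    = 0x0e 0x03 0x13 0x01. */
file_format sniff_format(const unsigned char *buf, size_t len)
{
    static const unsigned char cdf_magic[4] = { 'C', 'D', 'F', 0x01 };
    static const unsigned char hdf_magic[4] = { 0x0e, 0x03, 0x13, 0x01 };

    assert(buf != NULL || len == 0);
    if (len < 4)
        return FMT_UNKNOWN;          /* too short to decide */
    if (memcmp(buf, cdf_magic, 4) == 0)
        return FMT_NETCDF;
    if (memcmp(buf, hdf_magic, 4) == 0)
        return FMT_HDF;
    return FMT_UNKNOWN;
}
```

An open routine would read the first four bytes, call sniff_format(),
and hand the file to whichever existing implementation matches.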

>
>This use of two different representations for data accessed by the same
>interface can be justified if each representation has clear benefits;
>otherwise, we should agree on using the single superior representation and
>relegating the other to read-only support as long as useful archives in that
>form exist.  If a representation based on VSets is superior to the current
>netCDF representation in some ways and inferior in other significant ways,
>then the use of both representations is likely to continue.  For example, it
>may be possible to support variable-length records with the VSet
>implementation at the cost of slower hyperslab access.  In such a case,
>users would benefit most if the alternative tradeoffs captured in the two
>different representations were available from a single library at file
>creation time.  

Good example.  I think there will be times when the current netCDF
format is definitely superior.  For example, suppose I have three variables
that share the unlimited dimension; they are stored in an interleaved fashion.
If I access a hyperslab of "records", taking the same slab from all 
three variables, I might be able to avoid the three seeks I would have 
to make using the Vset approach (as currently designed--could change).
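
To make the seek counting concrete, here is a small sketch.  It is only
a model: the slab sizes, the base offsets, and all of the names are
invented for illustration, and headers are ignored.  It computes where
each variable's data for a record would sit under the two layouts and
counts how many seeks a read of several records would need.

```c
#include <assert.h>
#include <stddef.h>

#define NVARS 3

/* One contiguous piece of the file. */
typedef struct { size_t off, len; } extent;

/* Interleaved layout: record `rec` holds one slab of every variable,
 * back to back.  slab[i] is the per-record byte size of variable i. */
extent interleaved_extent(const size_t slab[NVARS], size_t var, size_t rec)
{
    extent e;
    size_t recsize = 0, before = 0;

    assert(var < NVARS);
    for (size_t i = 0; i < NVARS; i++) {
        if (i < var)
            before += slab[i];
        recsize += slab[i];
    }
    e.off = rec * recsize + before;
    e.len = slab[var];
    return e;
}

/* Per-variable layout: all records of variable `var` are contiguous,
 * starting at the (hypothetical) offset base[var]. */
extent separate_extent(const size_t slab[NVARS], const size_t base[NVARS],
                       size_t var, size_t rec)
{
    extent e;

    assert(var < NVARS);
    e.off = base[var] + rec * slab[var];
    e.len = slab[var];
    return e;
}

/* Count seeks: one for the first extent, plus one each time an extent
 * does not start where the previous one ended. */
size_t count_seeks(const extent *e, size_t n)
{
    size_t seeks = 0;

    for (size_t i = 0; i < n; i++)
        if (i == 0 || e[i].off != e[i - 1].off + e[i - 1].len)
            seeks++;
    return seeks;
}
```

In this model, reading the same run of records from all three variables
is one seek in the interleaved layout and one seek per variable in the
separate layout.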

Another option would be to implement the netCDF physical format as an
option within HDF, so that the basic file format would still be HDF
but the physical storage would follow the old netCDF scheme.  (This
is a little tricky for the example I've given, and may be really dumb.)
We already have the option of different physical storage schemes for
individual objects (contiguous, linked blocks, and external), so the
concept is there, sort of.

>Although it may be too early to determine the advantages or
>disadvantages of one representation over the other, perhaps it needs to be
>made more clear how the benefits of the VSet-based implementation compare
>with the implementation costs and the potential space and performance
>penalties discussed in section 3.

Good idea.  We will try to expand that section.  Meantime, it would
help us if you could share with us anything you've written on why
you chose the format you did.  We have tried to determine the strengths
and weaknesses of the current format, but you have certainly thought
about it more than we have.

>
>We could not determine from the draft whether this project includes
>resources for rewriting existing HDF tools to use the netCDF/HDF 
>interface.

That isn't covered in the draft, but in the NSF proposal we say we'll do
that during the second year of the project.  With the EOS decision and
possible extra funding, we may do it sooner.  It depends a lot
on what EOS decides should be given priority.  

We've already had meetings with our tool developers and others about 
doing this, and it seems pretty straightforward, especially if we
ignore attributes that NCSA tools don't yet know about.
By the way, Ben Domenico mentioned some time ago that he might
assign somebody the task of adapting X-DataSlice to read netCDF.
Did that ever happen?

>If so, will these tools also use other HDF interfaces or low-level HDF
>calls? If so, they may not be very useful to the netCDF user community.

Good point.  We now have a situation in which any of a 
number of different types of data can be usefully read by the 
same tool.  8-bit raster, 24-bit raster, 32-bit float, 16-bit
integer, etc., all can be thought of as "images."  How we sort
this out, or let the users sort it out, is going to be tricky.

>This is a question of completeness of the interface.  If the netCDF/HDF
>interface is still missing some functionality needed by the tools and
>requiring the use of other HDF interfaces, perhaps it would be better to
>augment the netCDF/HDF interface to make it completely adequate for such
>tools.

This is an issue that we now need to really tackle.  It highlights
the fact that HDF has a number of interfaces (and correspondingly a
number of data models, I guess), whereas netCDF presents a single
data model (I guess).  There are pros and cons to each approach, 
which we probably should explicate at some point.  Pros and cons
aside,  netCDF seems to cover a goodly portion of what 
the other HDF interfaces cover.  The SDS interface obviously 
fits well into netCDF.  The raster image interface can 
be described in terms of netCDF (8-bit for sure, 24-bit
less well), though it seems to work so well with its current organization
that we'll have to think hard about whether to convert it to netCDF.
Palettes, same.  Annotations, maybe not as well, especially when we
support appending to annotations and multiple annotations per
object.

What's left is Vsets, which we put in to support unstructured
grids, as well as providing a general grouping structure.  Vsets
have become very popular, and seem to fill a number of needs.  I
think the SILO extensions to netCDF may actually give us a nice
"extended" netCDF that will cover many of the high level applications
of Vsets.

We never did think of Vsets as being a high level interface, 
but rather as a collection of routines that would
facilitate building complex organizations for certain applications,
such as graphics and finite element applications.  SILO appears
to give us that higher level extension.

  
>
>Here are some more specific comments on the draft design document, in order
>of appearance in the draft document:
>
>On page 1, paragraph 1, you state:
>
>  [netCDF] has a number of limiting factors.  Foremost among them are
>  problems of speed, extensibility and supporting code.
>
>If the netCDF model permitted more extensibility by allowing users to define
>their own basic data types, for example, it might be impractical to write
>fully general netCDF programs like the netCDF operators we have specified.
>There is a tradeoff between extensibility and generality of programs that
>may be written to a particular data model.  The ultimate extensibility is to
>permit users to write any type of data to a file, e.g. fwrite(), but then
>no useful high-level tools can be written that exploit the data model; it
>becomes equivalent to a low-level data-access interface.  The lack of
>extensibility may thus be viewed as a carefully chosen tradeoff rather than
>a correctable disadvantage.

Good point.  Highlights the fact that HDF concentrated in its early
days on providing a format that would support a variety of data models,
whereas CDF went for a single, more general model, taking the position
that the file format was not nearly as important.  Also highlights
the fact that, for the time being at least, we feel there is enough
value in the multiple-model/extensibility aspects of HDF that we 
want to keep them.  netCDF would be one of several data models 
supported in HDF, at least initially.
  
>
>On page 2, paragraph 2:
>
>  The Unidata implementation only allows for a single unlimited dimension
>  per data set.  Expectations are that the HDF implementation will not have
>  such a limitation.
>
>We are somewhat skeptical about the practicality of supporting both multiple
>unlimited dimensions and efficient direct-access to hyperslabs.  Consider a
>single two-dimensional array with both dimensions unlimited.  Imagine
>starting with a 2 by 2 array, then adding a third column (making it 2 by 3),
>then adding a third row, (making it 3 by 3), then adding a fourth column
>(making it 3 by 4), and so on, until you have an N by N array.  Keeping the
>data contiguous is impractical, because it would require about 2*N copying
>operations, resulting in an unacceptably slow O(N**3) access algorithm for
>O(N**2) data elements.  The alternative of keeping each incremental row and
>column in its own VData would mean that accessing either the first row or
>the first column, for example, would require O(N) reads, and there would be
>no easy way of reading all the elements in the array by row or by column
>that did not require multiple reads for many of the data blocks.  With the
>current implementation, each row requires only 1 read and all the elements
>in the array may be read efficiently from the N row records.

Yes, this was less clear in the paper than it should have been.  For
exactly the reasons you have outlined above, the restriction that
any variable could only have a single unlimited dimension would have
to remain.  However, it should be possible to have a variable X
dependent on unlimited dimension 'time' and a variable Y dependent on
unlimited dimension 'foo' in the same file.
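
As a sanity check on the copying argument, here is a tiny sketch that
counts element copies while growing a 2 by 2 array to N by N.  It
assumes, as a worst case, that every growth step rewrites all existing
elements to keep the array contiguous; copies_to_grow() is a made-up
name for illustration.

```c
#include <assert.h>

/* Count element copies needed to keep a 2-D array contiguous while it
 * grows from 2 x 2 to n x n by alternately adding a column and a row.
 * Assumption: each growth step recopies every existing element. */
unsigned long long copies_to_grow(unsigned n)
{
    unsigned long long copies = 0;
    unsigned rows = 2, cols = 2;

    assert(n >= 2);
    while (rows < n || cols < n) {
        copies += (unsigned long long)rows * cols;  /* move old data */
        if (cols <= rows)
            cols++;                                 /* add a column */
        else
            rows++;                                 /* add a row */
    }
    return copies;
}
```

There are about 2*N growth steps, each moving up to N*N elements, so
the total grows roughly as N**3, which matches your estimate.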

>
>
>Most netCDF programs we have seen use direct access to hyperslabs, and we
>think maintaining efficient direct access to hyperslabs of multidimensional
>data should be an important goal.  If you can eliminate the current netCDF
>restriction of only a single unlimited dimension while preserving efficient
>hyperslab access, we would be very impressed.

So would we :-).

>
>Page 2, paragraph 5:
>
>  One of the primary drawbacks of the existing Unidata implementation is
>  that it is based on XDR.
>
>This is another case where a particular tradeoff can be viewed as a drawback
>or a feature, depending on the requirements.  Use of a single specific
>external data format is an advantage when maintaining the code, comparing
>files written on different platforms, or supporting a large number of
>platforms.  Use of native format and converters, as in HDF, means that the
>addition of a new platform requires writing conversions to all other
>existing representations, whereas netCDF requires only conversion to and
>from XDR.  The performance of netCDF in some common applications relates
>more to the stdio layer below XDR than to XDR: the buffering scheme of stdio
>is not optimal for styles of access used by netCDF.  We have evidence that
>this can be fixed without abandoning XDR or the advantages of a single
>external representation.

Just one clarification here:  HDF offers native mode only on the
condition that there will be no conversion.  Some day we might 
offer conversions from and to all representations, but not now.  We've
only gotten a little flak about that.
>
...

>Page 4, paragraph 6:
>
>  For instance, it will then be possible to associate a 24-bit raster image
>  with a [netCDF] variable.
>
>We're not sure how it would be possible to access such data using the
>existing netCDF interface.  For example, if you used ncvarget(), would you
>have to provide the address of a structure for the data to be placed in?  If
>other new types are added, how can generic programs handle the data?  What
>is returned by ncvarinq() for the type of such data?  Do you intend that
>attributes can have new types like "24-bit raster image" also?  As for
>storing 24-bit data efficiently, we have circulated a proposal for packed
>netCDF data using three new reserved attributes that would support this.
>

Yeah.  Good questions.  We haven't tackled them yet.


...

>Page 8, paragraph 1:
>
>  The current VGroup access routines would require a linear search through
>  the contents of a VGroup when performing lookup functions. ... Because a
>  variable's VGroup may contain other elements (dimensions, attributes, etc.
>  ...) it is not sufficient to go to the Xth child of the VGroup when
>  looking for the Xth record.
>
>As stated above, we think it is very important to preserve direct access to
>netCDF data, and to keep hyperslab access efficient.
>

For the time being, we have decided to place all of a record
variable's data into a single VData.  In doing so, we have retained
fast hyperslab access (in fact it is even faster because all of a
variable's data is contiguous). As a side note, VDatas are able to
efficiently store and retrieve 8-bit data.

It is not yet clear whether people will require the flexibility of
storing data in separate objects.  If it does seem that users wish to
be able to store data spread across several objects, we will add that
capability later.  Rather than using a 'threshold' as outlined in the draft you
received, we are now leaning towards providing a reserved attribute
that the user can set to indicate whether they require all of the data
to be in a single VData or in multiple ones.

The problem with representing this information at the level of an
attribute is how to differentiate between "user" and "system"
attributes.  For instance, if someone writes out some data, goes
into redef() mode, changes the "contiguousness" / packing /
fill-values, and then tries to write more data, things are going to be
all messed up.

Are there plans to logically separate the two types of attributes
(i.e. define_sys_attr() and define_user_attr())? Or is the distinction
just based on syntactic convention (i.e. names with leading
underscores...)?  What happens when the user wants a mutable attribute
whose name has a leading underscore?
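
To illustrate the purely syntactic option, here is a minimal sketch.
It assumes the convention is "a leading underscore marks a system
attribute" and that system attributes are frozen once data has been
written.  Both functions are hypothetical and are not part of the
netCDF interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convention: a leading underscore marks a reserved
 * "system" attribute (for example _FillValue). */
int is_system_attr(const char *name)
{
    assert(name != NULL);
    return name[0] == '_';
}

/* Hypothetical guard for redef mode: a system attribute may not be
 * changed once data has been written; user attributes always may. */
int may_change_attr(const char *name, int data_written)
{
    return !(is_system_attr(name) && data_written);
}
```

Note that under this rule a user attribute named, say, "_notes" would
be frozen after the first write, which is exactly the case the last
question is about.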

>Page 8, paragraph 6:
>
>  Furthermore, Unidata is in the process of adding operators to netCDF,
>  which may be lost by adopting SILO as a front-end.
>
>The netCDF operators do not currently involve any extensions to the netCDF
>library; they are written entirely on top of the current library interface.
>It is possible that we will want to add an additional library function later
>to provide more efficient support for some of the netCDF operators (e.g.
>ncvarcpy() which would copy a variable from one netCDF file to another
>without going through the XDR layer).  We agree with your decision to use
>the Unidata netCDF library rather than SILO as the "front-end".

Because SILO was developed at Lawrence Livermore, it will be
impossible to use the existing SILO code in any public domain
software.  We are currently investigating whether we will even be able
to use the *ideas* developed within the Lab in the public domain.

We plan to release a description of the SILO interface over the netCDF
/ HDF mailing list in the near future to see if anyone has different
suggestions about how to model mesh data within the context of netCDF.



>
>We have set up a mailing list here for Unidata staff who are interested in
>the netCDF/HDF project:  netcdf-hdf@xxxxxxxxxxxxxxxxx  Feel free to send
>additional responses or draft documents to that address or to individual
>Unidata staff members.
>
>----
>Russ Rew                                       russ@xxxxxxxxxxxxxxxx
>Unidata Program Center    
>University Corporation for Atmospheric Research
>P.O. Box 3000
>Boulder, Colorado 80307-3000