Reply to Comments on netCDF/HDF Draft Design

================ Here's the ncsa reply to the Unidata reply ========

>To: Russ Rew <russ@xxxxxxxxxxxxxxxx>
>From: mfolk@xxxxxxxxxxxxx
>Subject: Re: Comments on netCDF/HDF Draft Design
>Cc: netcdf-hdf@xxxxxxxxxxxxxxxx
>
>netcdf-hdf group:  
>
>This is a response to the response from the
>Unidata team to the netCDF/HDF Design Document.  
>The original response was posted to 
>netcdf-hdf@xxxxxxxxxxxxxxxx on April 21.
>
>Mike and Chris
>
>=============================================================
>
>Russ et al:
>
>Thanks for your response to the "netCDF/HDF Design Document."  Now that
>we have to really get the project going, things aren't nearly so simple,
>and this kind of feedback is extremely useful.
>
>We have gone over your response, and I've put together some
>responses and clarifications, which follow.
>
>Mike & Chris
>
>>Mike,
>>
>...
>>
>>First, it is not clear from the draft what you intend with regard to data
>>stored in the current netCDF format.  More specifically, will it be possible
>>to use tools written to the planned netCDF/HDF interface on archives of
>>current netCDF files?  Or will such archives have to be converted to HDF
>>first?  We would naturally prefer that the implementation be able to
>>recognize whether it is dealing with HDF or netCDF files.  Symmetrically,
>>you should expect that if a program uses the netCDF/HDF interface
>>implementation to create a file, then our library should be able to deal
>>with it, (though we currently don't have the resources to be able to commit
>>to this).  In fact this goal could be stated more strongly:
>>
>>  Data created by a program that uses the netCDF interface should be
>>  accessible to other programs that use the netCDF interface without
>>  relinking.
>>
>>This principle covers portability of data across different platforms, but
>>also implies that a consolidated library must handle both netCDF/HDF and
>>netCDF file formats and must maintain backward compatibility for archives
>>stored using previous versions of the formats.  This level of
>>interoperability may turn out to be impractical, but we feel it is a
>>desirable goal.
>
>We agree that it is desirable that users not have to think about (or even 
>know) how their data is organized.  The difficulties involved in 
>maintaining two or more storage formats are ones we already have to
>deal with just within HDF.  There are instances where we've
>developed newer better ways of organizing a particular object.  
>It isn't fun, but so far it's been manageable.  
>
>What worries me about this policy over the long term is the 
>cumulative work involved as new platforms get introduced, and 
>new versions of operating systems and programming languages 
>are introduced.  As these sorts of things happen, we would 
>like to not be committed to supporting all old "outdated" 
>formats.
>
>Initially we definitely will support both old and new netCDF formats.
>We just don't want to guarantee that we will carry it over to new
>platforms and machines.
>
>There is another issue that has to do with supporting "old"
>things.  Based on feedback we're getting from loyal HDF users, we'll 
>probably want to extend that idea to data models, too.  For example,
>some heavy users would rather stick with the predefined SDS model
>than the more general netCDF model.  In a sense, that's no problem
>since netCDF provides a superset of SDS.  We might define SDS as a 
>standard netCDF data abstraction for a certain range of applications.
>The same has been suggested of raster images.  
>
>Still, this kind of thing could be very confusing to users trying to 
>decide whether to use one or the other interface.  In addition, we would 
>want all software to know that it can treat something stored as an 
>SDS the same way it treats an equivalent netCDF.
>
>I suspect you people have already faced this problem with differently 
>defined netCDFs.  My guess would be that the problem is manageable if
>the number of different abstractions is small.  I'd be interested in
>your observations.
>
>>  It seems to be implied by the sentence on page 4:
>>
>>  A hybrid implementation will also give HDF users the power of the netCDF
>>  tools while at the same time making the HDF tools available to netCDF users.
>>
>>Note that one possible way to achieve this goal is to recognize the file
>>type when a file is opened, and use the current implementations of the HDF
>>and netCDF libraries as appropriate.  A new flag used when creating a file
>>could specify which type of file representation was desired.
>
>Yes, this would be a way to do it.  I would like to encourage one
>format only, however, because in the long run it would make for
>greater interoperability among programs.
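>
>To sketch what that recognition could look like: both formats begin
>with a fixed magic number (classic netCDF files start with the bytes
>'CDF' followed by 0x01; HDF files start with 0x0e 0x03 0x13 0x01), so
>an open routine can peek at the first four bytes.  A minimal Python
>sketch (the function name is illustrative, not part of either library):

```python
# Sketch: decide which library should handle a file by its magic number.
# Classic netCDF files begin with b"CDF\x01"; HDF files begin with
# the four bytes 0x0e 0x03 0x13 0x01.

NETCDF_MAGIC = b"CDF\x01"
HDF_MAGIC = b"\x0e\x03\x13\x01"

def detect_format(path):
    """Return 'netcdf', 'hdf', or 'unknown' based on the leading bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head == NETCDF_MAGIC:
        return "netcdf"
    if head == HDF_MAGIC:
        return "hdf"
    return "unknown"
```

>A creation-time flag, as suggested above, would then pick which writer
>to use; this routine only dispatches reads.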
>
>>
>>This use of two different representations for data accessed by the same
>>interface can be justified if each representation has clear benefits;
>>otherwise, we should agree on using the single superior representation and
>>relegating the other to read-only support as long as useful archives in that
>>form exist.  If a representation based on VSets is superior to the current
>>netCDF representation in some ways and inferior in other significant ways,
>>then the use of both representations is likely to continue.  For example, it
>>may be possible to support variable-length records with the VSet
>>implementation at the cost of slower hyperslab access.  In such a case,
>>users would benefit most if the alternative tradeoffs captured in the two
>>different representations were available from a single library at file
>>creation time.  
>
>Good example.  I think there will be times when the current netCDF
>format is definitely superior.  For example, suppose I have three variables
>along the unlimited dimension, stored in an interleaved fashion.
>If I access a hyperslab of "records", taking the same slab from all 
>three variables, I might be able to avoid the three seeks I would have 
>to make using the Vset approach (as currently designed--could change).
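>
>The seek tradeoff in that example can be illustrated with a little
>offset arithmetic (a toy model of the two layouts, not either
>library's actual code):

```python
def record_offsets(nvars, recsize, var_ids, nrecs_read, nrecs_total, interleaved):
    """Byte offsets touched when reading the first nrecs_read records of the
    given variables.  Interleaved = classic netCDF record layout (r0v0 r0v1 ...);
    otherwise each variable occupies its own contiguous block (one VData each)."""
    touched = []
    for r in range(nrecs_read):
        for v in var_ids:
            if interleaved:
                start = (r * nvars + v) * recsize
            else:
                start = (v * nrecs_total + r) * recsize
            touched.extend(range(start, start + recsize))
    return touched

def seeks(offsets):
    """One seek per maximal contiguous run of bytes."""
    offsets = sorted(offsets)
    runs = 1
    for a, b in zip(offsets, offsets[1:]):
        if b != a + 1:
            runs += 1
    return runs
```

>Reading a two-record slab of all three variables costs one seek
>interleaved but three with per-variable VDatas; reading a single
>variable's records, the advantage reverses.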
>
>Another option would be to implement the netCDF physical format as an
>option within HDF, so that the basic file format would still be HDF
>but the physical storage would follow the old netCDF scheme.  (This
>is a little tricky for the example I've given, and may be really dumb.)
>We already have the option of different physical storage schemes for
>individual objects (contiguous, linked blocks, and external), so the
>concept is there, sort of.
>
>>Although it may be too early to determine the advantages or
>>disadvantages of one representation over the other, perhaps it needs to be
>>made more clear how the benefits of the VSet-based implementation compare
>>with the implementation costs and the potential space and performance
>>penalties discussed in section 3.
>
>Good idea.  We will try to expand that section.  Meantime, it would
>help us if you could share with us anything you've written on why
>you chose the format you did.  We have tried to determine the strengths
>and weaknesses of the current format, but you have certainly thought
>about it more than we have.
>
>>
>>We could not determine from the draft whether this project includes
>>resources for rewriting existing HDF tools to use the netCDF/HDF 
>>interface.
>
>That isn't covered in the draft, but in the NSF proposal we say we'll do
>that during the second year of the project.  With the EOS decision and
>possible extra funding, we may do it sooner.  It depends a lot
>on what EOS decides should be given priority.  
>
>We've already had meetings with our tool developers and others about 
>doing this, and it seems pretty straightforward, especially if we
>ignore attributes that NCSA tools don't yet know about.
>By the way, Ben Domenico mentioned some time ago that he might
>assign somebody the task of adapting X-DataSlice to read netCDF.
>Did that ever happen?
>
>>If so, will these tools also use other HDF interfaces or low-level HDF
>>calls? If so, they may not be very useful to the netCDF user community.
>
>Good point.  We now have a situation in which any of a 
>number of different types of data can be usefully read by the 
>same tool.  8-bit raster, 24-bit raster, 32-bit float, 16-bit
>integer, etc., all can be thought of as "images."  How we sort
>this out, or let the users sort it out, is going to be tricky.
>
>>This is a question of completeness of the interface.  If the netCDF/HDF
>>interface is still missing some functionality needed by the tools and
>>requiring the use of other HDF interfaces, perhaps it would be better to
>>augment the netCDF/HDF interface to make it completely adequate for such
>>tools.
>
>This is an issue that we now need to really tackle.  It highlights
>the fact that HDF has a number of interfaces (and correspondingly a
>number of data models, I guess), whereas netCDF presents a single
>data model (I guess).  There are pros and cons to each approach, 
>which we probably should explicate at some point.  Pros and cons
>aside,  netCDF seems to cover a goodly portion of what 
>the other HDF interfaces cover.  The SDS interface obviously 
>fits well into netCDF.  The raster image interface can 
>be described in terms of netCDF (8-bit for sure, 24-bit
>less well), though it seems to work so well with its current organization
>that we'll have to think hard about whether to convert it to netCDF.
>Palettes, same.  Annotations, maybe not as well, especially when we
>support appending to annotations and multiple annotations per
>object.
>
>What's left is Vsets, which we put in to support unstructured
>grids, as well as providing a general grouping structure.  Vsets
>have become very popular, and seem to fill a number of needs.  I
>think the SILO extensions to netCDF may actually give us a nice
>"extended" netCDF that will cover many of the high level applications
>of Vsets.
>
>We never did think of Vsets as being a high level interface, 
>but rather as a collection of routines that would
>facilitate building complex organizations for certain applications,
>such as graphics and finite element applications.  SILO appears
>to give us that higher level extension.
>
>  
>>
>>Here are some more specific comments on the draft design document, in order
>>of appearance in the draft document:
>>
>>On page 1, paragraph 1, you state:
>>
>>  [netCDF] has a number of limiting factors.  Foremost among them are
>>  problems of speed, extensibility and supporting code.
>>
>>If the netCDF model permitted more extensibility by allowing users to define
>>their own basic data types, for example, it might be impractical to write
>>fully general netCDF programs like the netCDF operators we have specified.
>>There is a tradeoff between extensibility and generality of programs that
>>may be written to a particular data model.  The ultimate extensibility is to
>>permit users to write any type of data to a file, e.g. fwrite(), but then
>>no useful high-level tools can be written that exploit the data model; it
>>becomes equivalent to a low-level data-access interface.  The lack of
>>extensibility may thus be viewed as a carefully chosen tradeoff rather than
>>a correctable disadvantage.
>
>Good point.  Highlights the fact that HDF concentrated in its early
>days on providing a format that would support a variety of data models,
>whereas CDF went for a single, more general model, taking the position
>that the file format was not nearly as important.  Also highlights
>the fact that, for the time being at least, we feel there is enough
>value in the multiple-model/extensibility aspects of HDF that we 
>want to keep them.  netCDF would be one of several data models 
>supported in HDF, at least initially.
>  
>>
>>On page 2, paragraph 2:
>>
>>  The Unidata implementation only allows for a single unlimited dimension
>>  per data set.  Expectations are that the HDF implementation will not have
>>  such a limitation.
>>
>>We are somewhat skeptical about the practicality of supporting both multiple
>>unlimited dimensions and efficient direct-access to hyperslabs.  Consider a
>>single two-dimensional array with both dimensions unlimited.  Imagine
>>starting with a 2 by 2 array, then adding a third column (making it 2 by 3),
>>then adding a third row, (making it 3 by 3), then adding a fourth column
>>(making it 3 by 4), and so on, until you have an N by N array.  Keeping the
>>data contiguous is impractical, because it would require about 2*N copying
>>operations, resulting in an unacceptably slow O(N**3) access algorithm for
>>O(N**2) data elements.  The alternative of keeping each incremental row and
>>column in its own VData would mean that accessing either the first row or
>>the first column, for example, would require O(N) reads, and there would be
>>no easy way of reading all the elements in the array by row or by column
>>that did not require multiple reads for many of the data blocks.  With the
>>current implementation, each row requires only 1 read and all the elements
>>in the array may be read efficiently from the N row records.
>
>Yes, this was less clear in the paper than it should have been.  For
>exactly the reasons you have outlined above, the restriction that
>any variable could only have a single unlimited dimension would have
>to remain.  However, it should be possible to have a variable X
>dependent on unlimited dimension 'time' and a variable Y dependent on
>unlimited dimension 'foo' in the same file.
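>
>Your copying argument can be made concrete with a quick count over a
>toy model of a contiguous row-major array that alternately grows by a
>column and by a row:

```python
def contiguous_growth_copies(n):
    """Elements copied while growing a contiguous row-major array from 1x1
    to n x n, alternately adding a column and a row.  Adding a column forces
    every existing element to move (each row must widen in place); appending
    a row adds data at the end and moves nothing."""
    rows = cols = 1
    moved = 0
    while cols < n or rows < n:
        if cols < n:
            moved += rows * cols   # full rewrite to widen each row
            cols += 1
        if rows < n:
            rows += 1              # new rows append at the end
    return moved
```

>The total comes to (n-1)n(2n-1)/6 element moves, i.e. O(n**3) work for
>O(n**2) data -- exactly the objection above, and why the restriction to
>one unlimited dimension per variable stays.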
>
>>
>>
>>Most netCDF programs we have seen use direct access to hyperslabs, and we
>>think maintaining efficient direct access to hyperslabs of multidimensional
>>data should be an important goal.  If you can eliminate the current netCDF
>>restriction of only a single unlimited dimension while preserving efficient
>>hyperslab access, we would be very impressed.
>
>So would we :-).
>
>>
>>Page 2, paragraph 5:
>>
>>  One of the primary drawbacks of the existing Unidata implementation is
>>  that it is based on XDR.
>>
>>This is another case where a particular tradeoff can be viewed as a drawback
>>or a feature, depending on the requirements.  Use of a single specific
>>external data format is an advantage when maintaining the code, comparing
>>files written on different platforms, or supporting a large number of
>>platforms.  Use of native format and converters, as in HDF, means that the
>>addition of a new platform requires writing conversions to all other
>>existing representations, whereas netCDF requires only conversion to and
>>from XDR.  The performance of netCDF in some common applications relates
>>more to the stdio layer below XDR than to XDR: the buffering scheme of stdio
>>is not optimal for styles of access used by netCDF.  We have evidence that
>>this can be fixed without abandoning XDR or the advantages of a single
>>external representation.
>
>Just one clarification here:  HDF offers native mode only on the
>condition that there will be no conversion.  Some day we might 
>offer conversions from and to all representations, but not now.  We've
>only gotten a little flak about that.
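>
>For concreteness, XDR (RFC 1014) fixes one on-disk representation:
>32-bit big-endian two's-complement integers and big-endian IEEE 754
>floats.  Python's struct module can mimic the two simplest cases:

```python
import struct

def xdr_int(x):
    """Encode a 32-bit integer the way XDR does: 4 bytes, big-endian."""
    return struct.pack(">i", x)

def xdr_float(x):
    """Encode a single-precision float as XDR does: IEEE 754, big-endian."""
    return struct.pack(">f", x)
```

>The same bytes come out on a big- or little-endian host, which is the
>portability argument; HDF's native mode skips the conversion but
>produces platform-dependent byte streams.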
>>
>...
>
>>Page 4, paragraph 6:
>>
>>  For instance, it will then be possible to associate a 24-bit raster image
>>  with a [netCDF] variable.
>>
>>We're not sure how it would be possible to access such data using the
>>existing netCDF interface.  For example, if you used ncvarget(), would you
>>have to provide the address of a structure for the data to be placed in?  If
>>other new types are added, how can generic programs handle the data?  What
>>is returned by ncvarinq() for the type of such data?  Do you intend that
>>attributes can have new types like "24-bit raster image" also?  As for
>>storing 24-bit data efficiently, we have circulated a proposal for packed
>>netCDF data using three new reserved attributes that would support this.
>>
>
>Yeah.  Good questions.  We haven't tackled them yet.
>
>
>...
>
>>Page 8, paragraph 1:
>>
>>  The current VGroup access routines would require a linear search through
>>  the contents of a VGroup when performing lookup functions. ... Because a
>>  variable's VGroup may contain other elements (dimensions, attributes, etc.
>>  ...) it is not sufficient to go to the Xth child of the VGroup when
>>  looking for the Xth record.
>>
>>As stated above, we think it is very important to preserve direct access to
>>netCDF data, and to keep hyperslab access efficient.
>>
>
>For the time being, we have decided to place all of a record
>variable's data into a single VData.  In doing so, we have retained
>fast hyperslab access (in fact it is even faster because all of a
>variable's data is contiguous). As a side note, VDatas are able to
>efficiently store and retrieve 8-bit data.
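>
>The direct access this buys is just row-major offset arithmetic; a
>sketch of the address calculation (illustrative, not library code):

```python
def element_offset(index, shape, elem_size):
    """Byte offset of an element in a contiguous row-major array --
    the arithmetic that makes direct hyperslab access one seek per
    contiguous run, rather than a linear search through a VGroup."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i   # Horner's rule over the dimensions
    return offset * elem_size
```

>One multiply-add per dimension, and a hyperslab that is contiguous on
>disk then needs a single seek.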
>
>It is not yet clear whether people will require the flexibility of
>storing data in separate objects.  If it does seem that users wish to
>be able to store data in a distributed fashion, we will add that capability
>later.  Rather than using a 'threshold' as outlined in the draft you
>received, we are now leaning towards providing a reserved attribute
>that the user can set to indicate whether they require all of the data
>to be in a single VData or in multiple ones.
>
>The problem with representing this information at the level of an
>attribute is how to differentiate between "user" and "system"
>attributes.  For instance, if someone writes out some data, goes
>into redef() mode and changes the "contiguousness" / packing /
>fill-values, and tries to write more data, things are going to be all
>messed up.
>
>Are there plans to logically separate the two types of attributes
>(i.e. define_sys_attr() and define_user_attr())? Or is the distinction
>just based on syntactic convention (i.e. names with leading
>underscores...)?  What happens when the user wants a mutable attribute
>whose name has a leading underscore?
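>
>To make the two options concrete, here is a toy contrast of an
>explicit system/user split against the leading-underscore convention
>(all names here are invented for illustration):

```python
class AttributeTable:
    """Toy contrast of the two schemes: an explicit system/user split
    versus reserving names by syntax alone."""

    def __init__(self):
        self.sys = {}
        self.user = {}

    def define_sys_attr(self, name, value):
        # Explicit registration: the library owns this name outright.
        self.sys[name] = value

    def define_user_attr(self, name, value):
        # With an explicit split, even a leading underscore is fine here.
        self.user[name] = value

    @staticmethod
    def looks_reserved(name):
        """The syntactic alternative: reserve any name starting with '_'."""
        return name.startswith("_")
```

>With explicit registration a user can own "_mine" unambiguously; with
>the naming convention alone, that name silently lands in the system
>namespace -- which is exactly the collision asked about above.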
>
>>Page 8, paragraph 6:
>>
>>  Furthermore, Unidata is in the process of adding operators to netCDF,
>>  which may be lost by adopting SILO as a front-end.
>>
>>The netCDF operators do not currently involve any extensions to the netCDF
>>library; they are written entirely on top of the current library interface.
>>It is possible that we will want to add an additional library function later
>>to provide more efficient support for some of the netCDF operators (e.g.
>>ncvarcpy() which would copy a variable from one netCDF file to another
>>without going through the XDR layer).  We agree with your decision to use
>>the Unidata netCDF library rather than SILO as the "front-end".
>
>Because SILO was developed at Lawrence Livermore, it will be
>impossible to use the existing SILO code in any public domain
>software.  We are currently investigating whether we will even be able
>to use the *ideas* developed within the Lab in the public domain.
>
>We plan to release a description of the SILO interface over the netCDF
>/ HDF mailing list in the near future to see if anyone has different
>suggestions about how to model mesh data within the context of netCDF.
>
>>
>>We have set up a mailing list here for Unidata staff who are interested in
>>the netCDF/HDF project:  netcdf-hdf@xxxxxxxxxxxxxxxxx  Feel free to send
>>additional responses or draft documents to that address or to individual
>>Unidata staff members.
>>
>>----
>>Russ Rew                                      russ@xxxxxxxxxxxxxxxx
>>Unidata Program Center    
>>University Corporation for Atmospheric Research
>>P.O. Box 3000
>>Boulder, Colorado 80307-3000
>
>