netcdf-hdf mailing list is no longer active. The list archives are made available for historical reasons.
================ Here's the NCSA reply to the Unidata reply ========

>To: Russ Rew <russ@xxxxxxxxxxxxxxxx>
>From: mfolk@xxxxxxxxxxxxx
>Subject: Re: Comments on netCDF/HDF Draft Design
>Cc: netcdf-hdf@xxxxxxxxxxxxxxxx
>Bcc:
>X-Attachments:
>
>netcdf-hdf group:
>
>This is a response to the response from the
>Unidata team to the netCDF/HDF Design Document.
>The original response was posted to
>netcdf-hdf@xxxxxxxxxxxxxxxx on April 21.
>
>Mike and Chris
>
>=============================================================
>
>Russ et al:
>
>Thanks for your response to the "netCDF/HDF Design Document."  Now that
>we have to really get the project going, things aren't nearly so simple,
>and this kind of feedback is extremely useful.
>
>We have gone over your response, and I've put together some
>responses and clarifications, which follow.
>
>Mike & Chris
>
>>Mike,
>>
>...
>>
>>First, it is not clear from the draft what you intend with regard to data
>>stored in the current netCDF format.  More specifically, will it be possible
>>to use tools written to the planned netCDF/HDF interface on archives of
>>current netCDF files?  Or will such archives have to be converted to HDF
>>first?  We would naturally prefer that the implementation be able to
>>recognize whether it is dealing with HDF or netCDF files.  Symmetrically,
>>you should expect that if a program uses the netCDF/HDF interface
>>implementation to create a file, then our library should be able to deal
>>with it (though we currently don't have the resources to be able to commit
>>to this).  In fact this goal could be stated more strongly:
>>
>>  Data created by a program that uses the netCDF interface should be
>>  accessible to other programs that use the netCDF interface without
>>  relinking.
>>
>>This principle covers portability of data across different platforms, but
>>also implies that a consolidated library must handle both netCDF/HDF and
>>netCDF file formats and must maintain backward compatibility for archives
>>stored using previous versions of the formats.  This level of
>>interoperability may turn out to be impractical, but we feel it is a
>>desirable goal.
>
>We agree that it is desirable that users not have to think about (or even
>know) how their data is organized.  The difficulties involved in
>maintaining two or more storage formats are ones we already have to
>deal with just within HDF.  There are instances where we've
>developed newer, better ways of organizing a particular object.
>It isn't fun, but so far it's been manageable.
>
>What worries me about this policy over the long term is the
>cumulative work involved as new platforms get introduced, and
>new versions of operating systems and programming languages
>are introduced.  As these sorts of things happen, we would
>like not to be committed to supporting all the old "outdated"
>formats.
>
>Initially we definitely will support both old and new netCDF formats.
>We just don't want to guarantee that we will carry that support over to new
>platforms and machines.
>
>There is another issue that has to do with supporting "old"
>things.  Based on feedback we're getting from loyal HDF users, we'll
>probably want to extend that idea to data models, too.  For example,
>some heavy users would rather stick with the predefined SDS model
>than the more general netCDF model.  In a sense, that's no problem,
>since netCDF provides a superset of SDS.  We might define SDS as a
>standard netCDF data abstraction for a certain range of applications.
>The same has been suggested of raster images.
>
>Still, this kind of thing could be very confusing to users trying to
>decide whether to use one or the other interface.
>In addition, we would
>want all software to know that it could treat something stored as an
>SDS the same way it would treat an equivalent netCDF.
>
>I suspect you people have already faced this problem with differently
>defined netCDFs.  My guess would be that the problem is manageable if
>the number of different abstractions is small.  I'd be interested in
>your observations.
>
>>It seems to be implied by the sentence on page 4:
>>
>>  A hybrid implementation will also give HDF users the power of the netCDF
>>  tools while at the same time making the HDF tools available to netCDF
>>  users.
>>
>>Note that one possible way to achieve this goal is to recognize the file
>>type when a file is opened, and use the current implementations of the HDF
>>and netCDF libraries as appropriate.  A new flag used when creating a file
>>could specify which type of file representation was desired.
>
>Yes, this would be a way to do it.  I would like to encourage one
>format only, however, because in the long run it would make for
>greater interoperability among programs.
>
>>
>>This use of two different representations for data accessed by the same
>>interface can be justified if each representation has clear benefits;
>>otherwise, we should agree on using the single superior representation and
>>relegating the other to read-only support as long as useful archives in that
>>form exist.  If a representation based on VSets is superior to the current
>>netCDF representation in some ways and inferior in other significant ways,
>>then the use of both representations is likely to continue.  For example, it
>>may be possible to support variable-length records with the VSet
>>implementation at the cost of slower hyperslab access.  In such a case,
>>users would benefit most if the alternative tradeoffs captured in the two
>>different representations were available from a single library at file
>>creation time.
>
>Good example.
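[The open-at-read-time recognition Unidata suggests above can be sketched by inspecting a file's first bytes. A minimal illustration, assuming the classic netCDF signature (the bytes 'C', 'D', 'F' plus a version byte) and the HDF (HDF4) signature 0x0E 0x03 0x13 0x01; the helper name `detect_format` is hypothetical, not part of either library:]

```python
import os
import tempfile

def detect_format(path):
    """Guess whether a file is classic netCDF or HDF from its magic number.

    Assumed signatures: classic netCDF files begin with the bytes
    'C', 'D', 'F' followed by a version byte; HDF (HDF4) files begin
    with the four bytes 0x0E 0x03 0x13 0x01.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic[:3] == b"CDF":
        return "netcdf"
    if magic == b"\x0e\x03\x13\x01":
        return "hdf"
    return "unknown"

# Demonstration on synthetic files carrying each signature.
with tempfile.TemporaryDirectory() as d:
    nc_path = os.path.join(d, "a.nc")
    hdf_path = os.path.join(d, "b.hdf")
    with open(nc_path, "wb") as f:
        f.write(b"CDF\x01" + b"\x00" * 8)
    with open(hdf_path, "wb") as f:
        f.write(b"\x0e\x03\x13\x01" + b"\x00" * 8)
    print(detect_format(nc_path), detect_format(hdf_path))  # netcdf hdf
```

[A create-time flag, as the Unidata comment suggests, would then select which writer to use when the file does not yet exist.]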
>I think there will be times when the current netCDF
>format is definitely superior.  For example, suppose I have three variables
>with the unlimited dimension, and they are stored in an interleaved fashion.
>If I access a hyperslab of "records", taking the same slab from all
>three variables, I might be able to avoid the three seeks I would have
>to make using the Vset approach (as currently designed--could change).
>
>Another option would be to implement the netCDF physical format as an
>option within HDF, so that the basic file format would still be HDF
>but the physical storage would follow the old netCDF scheme.  (This
>is a little tricky for the example I've given, and may be really dumb.)
>We already have the option of different physical storage schemes for
>individual objects (contiguous, linked blocks, and external), so the
>concept is there, sort of.
>
>>Although it may be too early to determine the advantages or
>>disadvantages of one representation over the other, perhaps it needs to be
>>made more clear how the benefits of the VSet-based implementation compare
>>with the implementation costs and the potential space and performance
>>penalties discussed in section 3.
>
>Good idea.  We will try to expand that section.  Meantime, it would
>help us if you could share with us anything you've written on why
>you chose the format you did.  We have tried to determine the strengths
>and weaknesses of the current format, but you have certainly thought
>about it more than we have.
>
>>
>>We could not determine from the draft whether this project includes
>>resources for rewriting existing HDF tools to use the netCDF/HDF
>>interface.
>
>That isn't covered in the draft, but in the NSF proposal we say we'll do
>that during the second year of the project.  With the EOS decision and
>possible extra funding, we may do it sooner.  It depends a lot
>on what EOS decides should be given priority.
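[Mike's interleaving example earlier in this message can be made concrete with a toy layout model: charge one seek whenever the next wanted block is not adjacent on disk to the previous one. The helper and the two layouts below are illustrative sketches, not code from either library:]

```python
def count_seeks(layout, wanted):
    """Count the disk seeks needed to read a set of (variable, record) blocks.

    `layout` is the on-disk order of (variable, record) blocks; a seek is
    charged whenever the next wanted block is not adjacent to the one just
    read.  A toy model of the storage tradeoff, not real I/O code.
    """
    positions = {blk: i for i, blk in enumerate(layout)}
    offsets = sorted(positions[b] for b in wanted)
    seeks = 1
    for prev, cur in zip(offsets, offsets[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks

variables = ["X", "Y", "Z"]
records = range(4)

# Classic netCDF layout: records interleaved across the record variables.
interleaved = [(v, r) for r in records for v in variables]
# One-Vdata-per-variable layout: each variable stored contiguously.
separate = [(v, r) for v in variables for r in records]

wanted = [(v, 2) for v in variables]     # record 2 of all three variables
print(count_seeks(interleaved, wanted))  # 1: the three blocks are adjacent
print(count_seeks(separate, wanted))     # 3: one seek per variable
```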
>
>We've already had meetings with our tool developers and others about
>doing this, and it seems pretty straightforward, especially if we
>ignore attributes that NCSA tools don't yet know about.
>By the way, Ben Domenico mentioned some time ago that he might
>assign somebody the task of adapting X-DataSlice to read netCDF.
>Did that ever happen?
>
>>If so, will these tools also use other HDF interfaces or low-level HDF
>>calls?  If so, they may not be very useful to the netCDF user community.
>
>Good point.  We now have a situation in which any of a
>number of different types of data can be usefully read by the
>same tool.  8-bit raster, 24-bit raster, 32-bit float, 16-bit
>integer, etc., all can be thought of as "images."  How we sort
>this out, or let the users sort it out, is going to be tricky.
>
>>This is a question of completeness of the interface.  If the netCDF/HDF
>>interface is still missing some functionality needed by the tools and
>>requiring the use of other HDF interfaces, perhaps it would be better to
>>augment the netCDF/HDF interface to make it completely adequate for such
>>tools.
>
>This is an issue that we now need to really tackle.  It highlights
>the fact that HDF has a number of interfaces (and correspondingly a
>number of data models, I guess), whereas netCDF presents a single
>data model (I guess).  There are pros and cons to each approach,
>which we probably should explicate at some point.  Pros and cons
>aside, netCDF seems to cover a goodly portion of what
>the other HDF interfaces cover.  The SDS interface obviously
>fits well into netCDF.  The raster image interface can
>be described in terms of netCDF (8-bit for sure, 24-bit
>less well), though it seems to work so well with its current organization
>that we'll have to think hard about whether to convert it to netCDF.
>Palettes, same.  Annotations, maybe not as well, especially when we
>support appending to annotations and multiple annotations per
>object.
>
>What's left is Vsets, which we put in to support unstructured
>grids, as well as providing a general grouping structure.  Vsets
>have become very popular, and seem to fill a number of needs.  I
>think the SILO extensions to netCDF may actually give us a nice
>"extended" netCDF that will cover many of the high-level applications
>of Vsets.
>
>We never did think of Vsets as being a high-level interface,
>but rather as a collection of routines that would
>facilitate building complex organizations for certain applications,
>such as graphics and finite element applications.  SILO appears
>to give us that higher-level extension.
>
>
>>
>>Here are some more specific comments on the draft design document, in order
>>of appearance in the draft document:
>>
>>On page 1, paragraph 1, you state:
>>
>>  [netCDF] has a number of limiting factors.  Foremost among them are
>>  problems of speed, extensibility and supporting code.
>>
>>If the netCDF model permitted more extensibility by allowing users to define
>>their own basic data types, for example, it might be impractical to write
>>fully general netCDF programs like the netCDF operators we have specified.
>>There is a tradeoff between extensibility and generality of programs that
>>may be written to a particular data model.  The ultimate extensibility is to
>>permit users to write any type of data to a file, e.g. fwrite(), but then
>>no useful high-level tools can be written that exploit the data model; it
>>becomes equivalent to a low-level data-access interface.  The lack of
>>extensibility may thus be viewed as a carefully chosen tradeoff rather than
>>a correctable disadvantage.
>
>Good point.  Highlights the fact that HDF concentrated in its early
>days on providing a format that would support a variety of data models,
>whereas CDF went for a single, more general model, taking the position
>that the file format was not nearly as important.
>Also highlights
>the fact that, for the time being at least, we feel there is enough
>value in the multiple-model/extensibility aspects of HDF that we
>want to keep them.  netCDF would be one of several data models
>supported in HDF, at least initially.
>
>>
>>On page 2, paragraph 2:
>>
>>  The Unidata implementation only allows for a single unlimited dimension
>>  per data set.  Expectations are that the HDF implementation will not have
>>  such a limitation.
>>
>>We are somewhat skeptical about the practicality of supporting both multiple
>>unlimited dimensions and efficient direct access to hyperslabs.  Consider a
>>single two-dimensional array with both dimensions unlimited.  Imagine
>>starting with a 2 by 2 array, then adding a third column (making it 2 by 3),
>>then adding a third row (making it 3 by 3), then adding a fourth column
>>(making it 3 by 4), and so on, until you have an N by N array.  Keeping the
>>data contiguous is impractical, because it would require about 2*N copying
>>operations, resulting in an unacceptably slow O(N**3) access algorithm for
>>O(N**2) data elements.  The alternative of keeping each incremental row and
>>column in its own VData would mean that accessing either the first row or
>>the first column, for example, would require O(N) reads, and there would be
>>no easy way of reading all the elements in the array by row or by column
>>that did not require multiple reads for many of the data blocks.  With the
>>current implementation, each row requires only 1 read and all the elements
>>in the array may be read efficiently from the N row records.
>
>Yes, this was less clear in the paper than it should have been.  For
>exactly the reasons you have outlined above, the restriction that
>any variable could only have a single unlimited dimension would have
>to remain.  However, it should be possible to have a variable X
>dependent on unlimited dimension 'time' and a variable Y dependent on
>unlimited dimension 'foo' in the same file.
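[Unidata's copying argument can be checked with a small counting sketch: grow a contiguous row-major array by alternately adding columns (each of which forces every existing element to move) and rows (a cheap append at the end of the file). The helper below is a toy model of that argument, not library code:]

```python
def copies_to_grow_square(n):
    """Element copies needed to grow a contiguous row-major array from
    2 x 2 to n x n by alternately adding a column and a row.

    Adding a row to a row-major array is a cheap append; adding a
    column forces every existing element to be rewritten.
    """
    rows = cols = 2
    copies = 0
    while cols < n or rows < n:
        if cols < n:
            copies += rows * cols  # a new column rewrites the whole array
            cols += 1
        if rows < n:
            rows += 1              # a new row is a contiguous append
    return copies

# The copy count grows like n**3 even though the array holds only n**2
# elements -- roughly a factor of n more work than writing the data once,
# which is what appending along a single unlimited dimension costs.
for n in (10, 100, 1000):
    print(n, copies_to_grow_square(n))
```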
>
>>
>>
>>Most netCDF programs we have seen use direct access to hyperslabs, and we
>>think maintaining efficient direct access to hyperslabs of multidimensional
>>data should be an important goal.  If you can eliminate the current netCDF
>>restriction of only a single unlimited dimension while preserving efficient
>>hyperslab access, we would be very impressed.
>
>So would we :-).
>
>>
>>Page 2, paragraph 5:
>>
>>  One of the primary drawbacks of the existing Unidata implementation is
>>  that it is based on XDR.
>>
>>This is another case where a particular tradeoff can be viewed as a drawback
>>or a feature, depending on the requirements.  Use of a single specific
>>external data format is an advantage when maintaining the code, comparing
>>files written on different platforms, or supporting a large number of
>>platforms.  Use of native format and converters, as in HDF, means that the
>>addition of a new platform requires writing conversions to all other
>>existing representations, whereas netCDF requires only conversion to and
>>from XDR.  The performance of netCDF in some common applications relates
>>more to the stdio layer below XDR than to XDR: the buffering scheme of stdio
>>is not optimal for styles of access used by netCDF.  We have evidence that
>>this can be fixed without abandoning XDR or the advantages of a single
>>external representation.
>
>Just one clarification here: HDF offers native mode only on the
>condition that there will be no conversion.  Some day we might
>offer conversions from and to all representations, but not now.  We've
>only gotten a little flak about that.
>>
>...
>
>>Page 4, paragraph 6:
>>
>>  For instance, it will then be possible to associate a 24-bit raster image
>>  with a [netCDF] variable.
>>
>>We're not sure how it would be possible to access such data using the
>>existing netCDF interface.  For example, if you used ncvarget(), would you
>>have to provide the address of a structure for the data to be placed in?
>>If
>>other new types are added, how can generic programs handle the data?  What
>>is returned by ncvarinq() for the type of such data?  Do you intend that
>>attributes can have new types like "24-bit raster image" also?  As for
>>storing 24-bit data efficiently, we have circulated a proposal for packed
>>netCDF data using three new reserved attributes that would support this.
>>
>
>Yeah.  Good questions.  We haven't tackled them yet.
>
>
>...
>
>>Page 8, paragraph 1:
>>
>>  The current VGroup access routines would require a linear search through
>>  the contents of a VGroup when performing lookup functions.  ...  Because a
>>  variable's VGroup may contain other elements (dimensions, attributes, etc.
>>  ...) it is not sufficient to go to the Xth child of the VGroup when
>>  looking for the Xth record.
>>
>>As stated above, we think it is very important to preserve direct access to
>>netCDF data, and to keep hyperslab access efficient.
>>
>
>For the time being, we have decided to place all of a record
>variable's data into a single VData.  In doing so, we have retained
>fast hyperslab access (in fact it is even faster, because all of a
>variable's data is contiguous).  As a side note, VDatas are able to
>efficiently store and retrieve 8-bit data.
>
>It is not yet clear whether people will require the flexibility of
>storing data in separate objects.  If it does seem that users wish to
>be able to store data in a distributed fashion, we will add that capability
>later.  Rather than using a 'threshold' as outlined in the draft you
>received, we are now leaning towards providing a reserved attribute
>that the user can set to indicate whether they require all of the data
>to be in a single VData or in multiple ones.
>
>The problem with representing this information at the level of an
>attribute is how to differentiate between "user" and "system"
>attributes.
>For instance, if someone writes out some data, goes
>into redef() mode, and changes the "contiguousness" / packing /
>fill values and then tries to write more data, things are going to be all
>messed up.
>
>Are there plans to logically separate the two types of attributes
>(i.e. define_sys_attr() and define_user_attr())?  Or is the distinction
>just based on syntactic convention (i.e. names with leading
>underscores...)?  What happens when the user wants a mutable attribute
>whose name has a leading underscore?
>
>>Page 8, paragraph 6:
>>
>>  Furthermore, Unidata is in the process of adding operators to netCDF,
>>  which may be lost by adopting SILO as a front-end.
>>
>>The netCDF operators do not currently involve any extensions to the netCDF
>>library; they are written entirely on top of the current library interface.
>>It is possible that we will want to add an additional library function later
>>to provide more efficient support for some of the netCDF operators (e.g.
>>ncvarcpy(), which would copy a variable from one netCDF file to another
>>without going through the XDR layer).  We agree with your decision to use
>>the Unidata netCDF library rather than SILO as the "front-end".
>
>Because SILO was developed at Lawrence Livermore, it will be
>impossible to use the existing SILO code in any public domain
>software.  We are currently investigating whether we will even be able
>to use the *ideas* developed within the Lab in the public domain.
>
>We plan to release a description of the SILO interface over the
>netCDF/HDF mailing list in the near future to see if anyone has different
>suggestions about how to model mesh data within the context of netCDF.
>
>
>
>>
>>We have set up a mailing list here for Unidata staff who are interested in
>>the netCDF/HDF project: netcdf-hdf@xxxxxxxxxxxxxxxxx  Feel free to send
>>additional responses or draft documents to that address or to individual
>>Unidata staff members.
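[A footnote on the system-versus-user attribute question raised above: the syntactic convention under discussion amounts to a one-line test on the attribute name. Of the names below, only _FillValue is an actual netCDF reserved attribute; _Contiguous is a hypothetical example of the reserved attribute Mike describes:]

```python
def is_reserved_attribute(name):
    """Apply the syntactic convention discussed above: attribute names
    beginning with an underscore are treated as reserved "system"
    attributes (netCDF's _FillValue is the canonical example)."""
    return name.startswith("_")

attrs = ["units", "_FillValue", "valid_range", "_Contiguous"]
print([a for a in attrs if is_reserved_attribute(a)])  # ['_FillValue', '_Contiguous']
```

[The last question in the message -- a user attribute whose name happens to begin with an underscore -- is exactly the case this purely syntactic rule misclassifies.]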
>>
>>----
>>Russ Rew                                          russ@xxxxxxxxxxxxxxxx
>>Unidata Program Center
>>University Corporation for Atmospheric Research
>>P.O. Box 3000
>>Boulder, Colorado 80307-3000
>
>