CDF, netCDF and HDF

I have many thoughts about some of the on-going discussions on this and
related issues in the netCDF mail groups, but have been too busy to give a
coherent and useful response.  However, after seeing Hank Griffioen's
comment after a recent posting I must put in my two cents to try to clarify
things.  After all, I have some passing familiarity with the subject.  I'm also
cc'ing this to Greg Goucher at NSSDC, who is responsible for CDF, since I don't
know if he subscribes to this mailgroup.  I would encourage you to contact him
directly for the latest information.

Hence, on soapbox...

Both netCDF and CDF support the same conceptual data model -- the idea of a
data abstraction for supporting multidimensional blocks of data -- since netCDF
is a separate and more recent implementation of the ideas that were developed
in the original VAX/VMS FORTRAN version of CDF many years ago at the NSSDC at
NASA/GSFC.  Although the model is the same, the interfaces and the physical
formats are quite different.  The current (major) release of CDF is much newer
than that of netCDF.

NetCDF has only one physical form -- a single XDR file with the multi-
dimensional arrays written by C convention (row major -- last dimension varies
fastest).  CDF supports multiple physical forms:  XDR or native, single or
multiple file (one header file and one file for each variable), row (i.e., by
C convention) or column major (i.e., by FORTRAN convention -- first dimension
varies fastest) organization and the ability to interoperate between them.
At last check, I think CDF supported a few additional data type primitives,
but that's relatively unimportant.  Although not relevant to this discussion
it also supports the original VMS format of CDF V1 (so-called CDFobsolete).
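
To make the row-major versus column-major distinction concrete, here is a
minimal sketch of how the linear offset of an array element differs between
the two conventions.  It illustrates the general idea only; the function
names are made up for this example and are not part of either the CDF or
the netCDF interface.

```python
def row_major_offset(index, shape):
    """Linear offset with the last dimension varying fastest (C convention)."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def column_major_offset(index, shape):
    """Linear offset with the first dimension varying fastest (FORTRAN convention)."""
    offset = 0
    for i, n in zip(reversed(index), reversed(shape)):
        offset = offset * n + i
    return offset


shape = (2, 3)      # a 2 x 3 array
element = (1, 0)    # zero-based index: second row, first column
print(row_major_offset(element, shape))     # 1*3 + 0 = 3
print(column_major_offset(element, shape))  # 1 + 2*0 = 1
```

The same element lands at a different position in the file depending on the
convention, which is why interoperating between the two organizations takes
explicit support rather than a simple byte copy.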

Both netCDF and CDF have a similar number of officially supported ports on
more or less the same operating systems.  NetCDF has more ports done by the
user community than CDF, primarily because the netCDF implementation has
been completed for some time -- certainly an important point, since it applies
to the HDF implementation as well.

As Hank alludes to, CDF has a large and growing collection of both utilities
and sophisticated general-purpose applications (some portable and some VMS-
specific from the old days).  Some of this functionality overlaps the proposed
or in-development netCDF operators and the Y0 tools that Unidata will supply.
There is some overlap between CDF's CXIT tool and NCSA Image, for example.
The GEDEX CD-ROM that Hank cites is a collection of climatological data sets
that support an on-going Greenhouse Effect Detection EXperiment.  Of course,
CDF is also used for a much wider range of earth science data via the NASA
Climate Data System at NASA/GSFC.  This data system is evolving to
support the eventual Earth Observing System.  CDF is also the standard for a
NASA flight program in space plasma physics called the International Solar
Terrestrial Physics Project, which involves a suite of international
spacecraft.

There is a key issue that needs to be raised, about which I have seen too
little discussion.  It relates to the notion of implementation scaling.  The
problem is that an abstraction like netCDF/CDF or the multiple abstractions,
if you will, that HDF supports must be able to scale to large, complex data
sets.  One aspect of that was the reason for supporting interoperability among
multiple physical forms in CDF, given limits in most file systems (e.g., file
size) coupled with the way that many scientists utilize data.

A second issue is data structure residency and how it is supported.  For
scaling to any reasonably interesting data set by size, structure and breadth
(i.e., number of parameters/variables/fields), data structures must be disk-
resident and have a built-in caching mechanism appropriate for those
structures.  Both CDF and netCDF attempt to do this.  In addition, transaction-
like operations on data must be supported.  In other words, the ability to
query, update/modify, delete data in-place is required.  If a substantial
investment in building a large data set is made, it is too expensive to make
updates via copying.  If I am current in my knowledge of the netCDF and CDF
implementations then this is supported in CDF and not in netCDF.  In the HDF
case none of these ideas apply because the data structures are memory resident.
(Russ, Greg and Mike please correct me if I am wrong and discuss your current
thinking on the subject).  None of these notions are new -- just ask anyone in
the DBMS community.  The difference is the data model.
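
As a rough illustration of why in-place update matters, the sketch below
overwrites a single record in a disk-resident file by seeking to it, rather
than rewriting the whole file.  The record layout (one 8-byte float per
record) and all function names are assumptions made for this example; this
is not the actual CDF or netCDF file format or API.

```python
import os
import struct
import tempfile

# Hypothetical layout: one little-endian 8-byte float per record.
RECORD = struct.Struct("<d")


def write_records(path, values):
    """Create the file by writing every record once."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def update_in_place(path, record_number, value):
    """Overwrite one record; only RECORD.size bytes are written."""
    with open(path, "r+b") as f:
        f.seek(record_number * RECORD.size)
        f.write(RECORD.pack(value))


def read_record(path, record_number):
    """Read one record directly from disk without loading the rest."""
    with open(path, "rb") as f:
        f.seek(record_number * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))[0]


path = os.path.join(tempfile.mkdtemp(), "demo.dat")
write_records(path, [0.0] * 1000)
update_in_place(path, 500, 42.5)
print(read_record(path, 500), os.path.getsize(path))
```

The cost of the update is independent of the size of the data set, whereas
an update by copying scales with the whole file.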

A third area of scaling relates to ease of access by the end user.  The
CDF/netCDF approach provides a uniform access mechanism via a well-defined
model to arbitrary data that fits within that model.  I believe that this
is a simplifying approach for data access.  Unfortunately the data model is
too limited for many kinds of data.  This has been a focus of some of the work
in the group that I am in (the Scientific Visualization Systems Group at IBM
T. J. Watson Research Center -- we developed the IBM Data Explorer visualiza-
tion software and the IBM POWER Visualization System, a coarse-grain shared
memory parallel computational server).  The problem is how to provide
uniform access to data "objects" independent of their underlying mesh/grid
structure, level of aggregation or hierarchical nature.  HDF Vset is one
attempt to do so for a class of such objects.  Generalization of CDF/netCDF
arrays to non-rectilinear meshes can be accomplished by conventions for
attribute and variable specifications.  I did this myself for the original CDF
implementation eons ago and extended it to include simple irregular and sparse
meshes.  However, the underlying semantics of the netCDF/CDF data model
severely limit how far this can go.  Our approach has been to define a more
comprehensive data model than is used in netCDF/CDF.  To date, the results
have shown promise.
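
For readers unfamiliar with the "conventions for attribute and variable
specifications" approach mentioned above, here is a minimal sketch of the
idea for a curvilinear mesh.  The in-memory dictionary stands in for a
CDF/netCDF-style dataset, and the attribute name "mesh_coordinates" is a
hypothetical convention invented for this example, not one defined by CDF,
netCDF or the original work described here.

```python
# Hypothetical in-memory stand-in for a CDF/netCDF-style dataset:
# named variables, each with dimensions, attributes and data.
dataset = {
    "x": {"dims": ("j", "i"), "attrs": {},
          "data": [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5]]},
    "y": {"dims": ("j", "i"), "attrs": {},
          "data": [[0.0, 0.1, 0.2], [1.0, 1.1, 1.2]]},
    "temperature": {
        "dims": ("j", "i"),
        # Assumed convention: this attribute names the variables
        # that hold the position of each mesh node.
        "attrs": {"mesh_coordinates": "x y"},
        "data": [[280.0, 281.0, 282.0], [283.0, 284.0, 285.0]],
    },
}


def node(dataset, name, j, i):
    """Return (position, value) for one node of a variable on the mesh."""
    var = dataset[name]
    coord_names = var["attrs"]["mesh_coordinates"].split()
    position = tuple(dataset[c]["data"][j][i] for c in coord_names)
    return position, var["data"][j][i]


print(node(dataset, "temperature", 1, 2))
```

The mesh geometry lives in ordinary variables and is tied to the data only
by an agreed attribute, which is workable for simple cases but leaves the
meaning of the mesh outside the data model itself.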

Words with regard to scaling are insufficient.  Therefore, let me conclude
my ramblings by resurrecting ideas discussed at the SIGGRAPH '90 workshop on
data structures and access software for visualization that I chaired, where
Greg, Russ and Mike among others were active participants.  We need to quantify
these notions of scaling with "benchmark" structures/data sets and operations.
I would be very happy to discuss such metrics with anyone interested.

Off soapbox...

Thanks for any comments that anyone may have.

Lloyd Treinish

Date: 19 Dec 2003 06:15:30 -0700
From: Ed Hartnett <ed@xxxxxxxxxxxxxxxx>
To: netcdf-hdf@xxxxxxxxxxxxxxxx
Subject: tagged netcdf-4 in cvs - netcdf-4_0_75
Organization: UCAR/Unidata

For anyone who might care, I've just tagged the netcdf-4 cvs archive
with tag netcdf-4_0_75. This stands for version 0.75 of netcdf-4.

The tagged version passes all of nc_test, as has been noted before,
and has at least reasonable performance for reads and writes.

Ed