CDF, netCDF and HDF

Lloyd,

> I assume by changing shape you mean NOT considering the unlimited dimension.

That's right.  I think Glenn was using `shape' of a netCDF file loosely to
mean what I call the `schema': the dimensions, variables, and attributes of
the netCDF.  The unlimited dimension is the only dimension along which data
can be appended to an existing netCDF file.  Even creating a new variable
that uses existing dimensions requires copying.  I believe this is one of
the differences between netCDF and CDF.  Doesn't CDF allow adding new
variables without copying, at least in the multi-file implementation?
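
To illustrate the asymmetry, here is a toy sketch in Python.  It is not
the netCDF format or interface: the class `ToyRecordFile`, its methods, and
the layout (records of float64 values packed back to back) are all
hypothetical, chosen only to show why appending a record touches no
existing bytes while adding a variable forces every record to be rewritten.

```python
import struct


class ToyRecordFile:
    """Hypothetical layout: [record 0][record 1]...
    Each record holds one float64 per variable, in schema order."""

    def __init__(self, variables):
        self.variables = list(variables)  # the "schema" (must be non-empty)
        self.data = bytearray()           # stands in for the bytes on disk
        self.bytes_rewritten = 0          # existing bytes that had to be copied

    @property
    def record_size(self):
        return 8 * len(self.variables)

    @property
    def num_records(self):
        return len(self.data) // self.record_size

    def append_record(self, values):
        """Grow along the record dimension: only new bytes are written."""
        ordered = [values[name] for name in self.variables]
        self.data += struct.pack(f"<{len(ordered)}d", *ordered)

    def add_variable(self, name, fill=0.0):
        """Change the schema: every record changes size, so copy them all."""
        old_fmt = f"<{len(self.variables)}d"
        old_size = self.record_size
        copied = bytearray()
        for i in range(self.num_records):
            old = struct.unpack_from(old_fmt, self.data, i * old_size)
            copied += struct.pack(f"<{len(old) + 1}d", *old, fill)
        self.bytes_rewritten += len(self.data)
        self.variables.append(name)
        self.data = copied


f = ToyRecordFile(["temp", "pressure"])
for t in range(3):
    f.append_record({"temp": 280.0 + t, "pressure": 1000.0 - t})
rewritten_after_appends = f.bytes_rewritten   # appends copied nothing

f.add_variable("humidity")
rewritten_after_add = f.bytes_rewritten       # the whole file was copied
```

In this toy model the cost of an append is independent of the file size,
whereas the cost of `add_variable` grows with the number of records already
written.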

In designing netCDF, we considered the trade-offs and concluded that adding
new variables to an existing data file was not a common enough operation
among our users and applications to justify a multi-file implementation,
especially since users can split a dataset across several files of their
own design and thereby also get the benefit of multiple unlimited
dimensions.  NetCDF was not designed to be a database system supporting
frequently changing schemas, nested transactions, or other such database
features.

>                                        ... On the other hand, deleting an
> instance (i.e., a record in the conceptual equivalent in the CDF parlance) of
> a variable would also change the shape.  Is this supported in netCDF without
> copying?  

No, there is no `delete variable' (or `delete dimension') operation in the
netCDF interface, though we do support `delete attribute'.  The choice not
to provide support in the interface for deleting dimensions or variables was
again deliberate, based on the trade-offs and the uses we had in mind for
the interface.  There is also no compression or garbage collection
after an attribute is deleted, except by copying.  You are right to point
out that these operations can be expensive for large datasets represented
as single netCDF files, but our philosophy has been to support the most
common operations efficiently and warn users about what is costly.  Some
datasets are better represented as several medium-sized files rather than a
single large file, and this also gives users some flexibility in changing
the data schema.
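
As a rough sketch of the several-files approach, the Python below keeps
each variable in its own file, so that adding a variable means writing a
new file and leaving the existing ones alone.  The helper names
(`write_variable_file`, `read_variable_file`) and the raw float64 layout
are assumptions made for illustration; in practice each file would be a
netCDF file with its own schema.

```python
import struct
import tempfile
from pathlib import Path


def write_variable_file(directory, name, values):
    """Write one variable to its own file (hypothetical raw float64 layout)."""
    path = Path(directory) / f"{name}.bin"
    path.write_bytes(struct.pack(f"<{len(values)}d", *values))
    return path


def read_variable_file(path):
    """Read back every float64 stored in a file written by the helper above."""
    raw = path.read_bytes()
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


with tempfile.TemporaryDirectory() as directory:
    temp_path = write_variable_file(directory, "temp", [280.0, 281.0, 282.0])
    before = temp_path.read_bytes()

    # "Schema change": the new variable becomes a new file.
    hum_path = write_variable_file(directory, "humidity", [0.4, 0.5, 0.6])

    temp_unchanged = temp_path.read_bytes() == before  # old file untouched
    humidity = read_variable_file(hum_path)
    files = sorted(p.name for p in Path(directory).iterdir())
```

The price of this flexibility is that the application, not the library,
must keep track of which files together make up the dataset.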

I'm not convinced we want to add most of the functionality of a database
system, including the ability to change the schema efficiently for large
datasets.  The complexity this adds to both the interface and implementation
seems like too high a price to pay, especially when users who need to change
the schema of a netCDF file can do so by copying the data.  Users must put
more thought into the original schema design if they don't have the luxury
of cheap changes to the schema, but that may be an advantage.

--Russ