Re: [cf-pointobsconvention] Draft 2 comments

Hi John,

NetCDF works quite well as a scientific data exchange format. It doesn't work as well as a container format. The data model for in-situ observations is much simpler if you use netCDF to represent one profile/trajectory/sounding and bundle a collection of these observations using some kind of archive format (could be zip or jar or tar or gzip).

There are libraries for reading and writing directly to and from zip archives (I've seen Java, Ruby, and Python versions), so you don't have to zip/unzip each time you modify the archive. You also don't run into the inode exhaustion problem that occurs (as you mentioned) when storing a large number of small files if you use the archive directly.

I already use zip archives for storing netCDF files for our Dapper server. I store several million netCDF files containing profiles from the World Ocean Database, for instance, in zip files that were directly generated (no intermediate files) from a Python script. Some of the zip files contain up to 250000 profiles. If a profile turns out to be bad, it's very easy to remove it from the archive. Haven't had any problems with this scheme.
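
As a rough illustration of the scheme described above, here is a minimal sketch using Python's standard zipfile module. The function names and member names are hypothetical, and the payloads are placeholder bytes standing in for serialized netCDF profiles (this is not the actual Dapper script). As far as I know the standard module has no in-place delete, so removal is sketched here as a copy that skips the bad member.

```python
import io
import zipfile


def write_profiles(archive, profiles):
    """Write each profile's bytes straight into the zip (no intermediate files)."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in profiles.items():
            zf.writestr(name, payload)


def read_profile(archive, name):
    """Read one member directly from the archive without unpacking the rest."""
    with zipfile.ZipFile(archive, "r") as zf:
        return zf.read(name)


def remove_profile(src, dst, bad_name):
    """Copy every member except bad_name from src into a new archive dst."""
    with zipfile.ZipFile(src, "r") as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            if info.filename != bad_name:
                # Passing the ZipInfo keeps the member's name and compression type.
                zout.writestr(info, zin.read(info.filename))


# Usage with in-memory archives; real code would pass file paths instead.
archive = io.BytesIO()
write_profiles(archive, {"profile_0001.nc": b"placeholder 1",
                         "profile_0002.nc": b"placeholder 2"})
cleaned = io.BytesIO()
remove_profile(archive, cleaned, "profile_0001.nc")
```

The whole collection lives in a single file on disk, which is why the inode count stays flat however many profiles are stored.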

Cheers, Joe

John Caron wrote:
Hi Joe, comments in line:


Joe Sirott wrote:
Hi John,

Thanks for taking the time to come up with this specification. It
looks like a good start. I do have some concerns about the complexity
of the spec, though, and would like to suggest a few changes that
might make it easier to use.

I believe that this spec is too complicated for most potential
users. For instance, it appears that any software that is able to read
these collections will have to parse a SQL-like expression in
order to interpret a collection.

Well, it's a simple syntax: "XXXX <dim_name> XXXX <dim_name> XXXX <variable_name>"

But I only threw it in to have something concrete. One could use 3 separate attributes.


Another source of complexity is the
varying dimensionality of the dimensions and observations (either 1D
or 2D depending on the type of data).

Yes, actually I think you could probably have any number of dimensions.

Still another example is the use
of character variables for storing attribute data for collections
(should software assume that any character variable is an attribute)?

I don't understand this; do you have an example?


It's also difficult to edit data with this convention. How would I remove an
individual profile from a collection? Or, worse, what if points needed
to be added or removed from an individual profile? I'd have to
regenerate the entire netCDF file in the latter case. That makes this
convention only practical as an archive format.

Some variants are optimal for archival, others for dynamic modification. The backwards linked list is optimal for adding arbitrary amounts of data efficiently, but it's pretty bad when you read it. My intention is to give standard options that the user can choose depending on need. If you want to throw me a use case, I'll try to give you a concrete solution.
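
To make the trade-off concrete, here is a minimal in-memory sketch. It assumes "backwards linked list" means that each observation stores the index of the previous observation in the same profile; the class and attribute names are hypothetical and not taken from the draft.

```python
class BackwardLinkedObs:
    """Observations in arrival order, chained backwards per profile."""

    def __init__(self):
        self.values = []  # one entry per observation, in arrival order
        self.prev = []    # index of the previous obs in the same profile, or -1
        self.last = {}    # profile id -> index of its most recent obs

    def append(self, profile_id, value):
        """Constant-time append: nothing already written has to move."""
        idx = len(self.values)
        self.values.append(value)
        self.prev.append(self.last.get(profile_id, -1))
        self.last[profile_id] = idx

    def read_profile(self, profile_id):
        """Follow the chain backwards, then reverse into arrival order."""
        out = []
        idx = self.last.get(profile_id, -1)
        while idx != -1:
            out.append(self.values[idx])
            idx = self.prev[idx]
        out.reverse()
        return out


store = BackwardLinkedObs()
store.append("A", 10.0)
store.append("B", 20.0)
store.append("A", 11.0)
profile_a = store.read_profile("A")
```

Appending only touches the end of the arrays, but reading one profile jumps around the whole observation array, which is the part that hurts on disk.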


An alternative would be to store each individual
profile/trajectory/time series in a separate netCDF file. Collections
would consist of a set of netCDF files stored in a zip or jar
file. The zip file could also contain some sort of (XML?) manifest
file that could contain metadata about the collection as a whole. Any
metadata associated with an individual profile would be stored as a
global attribute in the appropriate netCDF file. Editing a profile
would be as simple as extracting the netCDF file from the archive,
rewriting it, and then storing it back in the jar file.

This is a good solution sometimes, but not generally. Many small files are not optimal for large archives. We are having trouble on motherlode right now with excessive inode consumption. Unzipping is too costly if the data is accessed often.


To make it even easier for consumers of this data, I would also
restrict the data type of all variables to double. Also, all four x,y,z,t coordinates
would be required.

I also lean toward requiring x,y,z,t coordinates, but others aren't so sure. Note this is not the same as having x,y,z,t dimensions. In fact this is a very important part of the proposal that deserves to be highlighted.

I'm claiming that the general way to do coordinate systems for this kind of data looks something like

variables:
float lon(obs);
float lat(obs);
float z(obs);
double time(obs);

float dataVar(obs);
 dataVar:coordinates = "lon lat z time";

rather than follow gridded data conventions like COARDS and use variations of:

float dataVar(t,z,y,x);

I think this is what you are saying below.
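
A small sketch of how a reader might use the coordinates attribute in this layout. Plain dictionaries stand in for a netCDF file here, and the function name and all values are made up for illustration.

```python
# Hypothetical in-memory stand-in for a file using the (obs) layout.
variables = {
    "lon": {"dims": ("obs",), "attrs": {}, "data": [-150.0, -149.5, -149.0]},
    "lat": {"dims": ("obs",), "attrs": {}, "data": [10.0, 10.5, 11.0]},
    "z": {"dims": ("obs",), "attrs": {}, "data": [0.0, 5.0, 10.0]},
    "time": {"dims": ("obs",), "attrs": {}, "data": [0.0, 60.0, 120.0]},
    "dataVar": {
        "dims": ("obs",),
        "attrs": {"coordinates": "lon lat z time"},
        "data": [21.3, 21.1, 20.8],
    },
}


def located_observations(variables, name):
    """Pair each value of `name` with the coordinates listed in its attribute."""
    var = variables[name]
    coord_names = var["attrs"]["coordinates"].split()
    rows = []
    for i, value in enumerate(var["data"]):
        # Every coordinate shares the obs dimension, so index i lines up.
        row = {c: variables[c]["data"][i] for c in coord_names}
        row[name] = value
        rows.append(row)
    return rows


rows = located_observations(variables, "dataVar")
```

Each observation carries its own position and time, so nothing requires the points to fall on a grid.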


Some examples (from your CDL examples):

Collection of point data
------------------------
Unchanged (just one file in archive)

Collection of profile data
--------------------------
For each netCDF file:

variables:
 double lon(1);
 double lat(1);
 double z(obs);
 double time(1);

 double humidity(obs);
 double temperature(obs);
 double pressure(obs);

Collection of trajectories
--------------------------
For each netCDF file:

variables:
 double lon(obs);
 double lat(obs);
 double z(obs);
 double time(obs);

 double humidity(obs);
 double temperature(obs);
 double pressure(obs);

Station time series
-------------------
variables:
 double lon(1);
 double lat(1);
 double z(1);
 double time(obs);

 double temperature(obs);
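
One consequence of the layouts above is that the kind of collection can be read off from which coordinates vary along obs. A hypothetical sketch of that check (the function name and return strings are illustrative only, and the point-data case is left out because its layout is not spelled out here):

```python
def classify(coord_dims):
    """Guess the feature type from each coordinate's dimension name.

    coord_dims maps a coordinate name to "obs" (varies per observation)
    or "1" (a single value for the whole file).
    """
    varying = frozenset(name for name, dim in coord_dims.items() if dim == "obs")
    if varying == {"z"}:
        return "profile"
    if varying == {"lon", "lat", "z", "time"}:
        return "trajectory"
    if varying == {"time"}:
        return "station time series"
    return "unknown"


kind = classify({"lon": "1", "lat": "1", "z": "obs", "time": "1"})
```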

I think this looks fine, except I also want to cover the case where someone needs to put more than one thing in a file.

Thanks for your input.

John