[cf-pointobsconvention] Draft 2 comments
Bryan Lawrence
b.n.lawrence at rl.ac.uk
Mon Sep 24 08:10:24 MDT 2007
(not that I'm following this closely, would that I had the time ...)
1) I don't want to unzip a million files, and then do a million file opens and
closes to find the location of a million stations and plot them on a
map ... :-( That's my vote for multiple things in one file ...
2) I don't want to be forced to put z in a file (singly or multiple valued)
unless I have it ...
Cheers
Bryan
On Sunday 23 September 2007 01:00:46 Joe Sirott wrote:
> Hi John,
>
> NetCDF works quite well as a scientific data exchange format. It doesn't
> work as well as a container format. The data model for in-situ
> observations is much simpler if you use netCDF to represent one
> profile/trajectory/sounding and bundle a collection of these
> observations using some kind of archive format (could be zip or jar or
> tar or gzip).
>
> There are libraries for reading and writing directly to and from zip
> archives (I've seen Java, Ruby, and Python versions), so you don't have
> to zip/unzip each time you modify the archive. You also don't run into
> the inode exhaustion problem that occurs (as you mentioned) when storing
> a large number of small files if you use the archive directly.
>
> I already use zip archives for storing netCDF files for our Dapper
> server. I store several million netCDF files containing profiles from
> the World Ocean Database, for instance, in zip files that were directly
> generated (no intermediate files) from a Python script. Some of the zip
> files contain up to 250000 profiles. If a profile turns out to be bad,
> it's very easy to remove it from the archive. Haven't had any problems
> with this scheme.
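A minimal sketch of the archive-as-container idea described above, using only Python's standard `zipfile` module. The member names and the placeholder byte payloads are assumptions made for illustration; in practice each payload would be the bytes of one netCDF file. This is not the script used for the Dapper server. Removal is shown here by copying the remaining members into a new archive.

```python
import io
import zipfile


def build_archive(profiles):
    """Write each payload straight into an in-memory zip, with no intermediate files."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in profiles.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def list_members(archive_bytes):
    """Return the member names stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.namelist()


def read_member(archive_bytes, name):
    """Read one member without extracting the others."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return zf.read(name)


def drop_member(archive_bytes, bad_name):
    """Return a new archive holding every member except bad_name."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
            zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename != bad_name:
                dst.writestr(info, src.read(info.filename))
    return out.getvalue()
```

The same functions would work on an archive on disk by passing a file path to `zipfile.ZipFile` instead of a `BytesIO` buffer.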
>
> Cheers, Joe
>
> John Caron wrote:
> > Hi Joe, comments in line:
> >
> > Joe Sirott wrote:
> >> Hi John,
> >>
> >> Thanks for taking the time to come up with this specification. It
> >> looks like a good start. I do have some concerns about the complexity
> >> of the spec, though, and would like to suggest a few changes that
> >> might make it easier to use.
> >>
> >> I believe that this spec is too complicated for most potential
> >> users. For instance, it appears that any software that is able to read
> >> these collections will have to parse an SQL-like expression in
> >> order to interpret a collection.
> >
> > Well, it's a simple syntax: "XXXX <dim_name> XXXX <dim_name> XXXX
> > <variable_name>"
> >
> > But I only threw it in to have something concrete. One could use 3
> > separate attributes.
> >
> >> Another source of complexity is the
> >> varying dimensionality of the dimensions and observations (either 1D
> >> or 2D depending on the type of data).
> >
> > Yes, actually I think you could probably have any number of dimensions.
> >
> >> Still another example is the use
> >> of character variables for storing attribute data for collections
> >> (should software assume that any character variable is an attribute?).
> >
> > I don't understand this; do you have an example?
> >
> >> It's also difficult to edit data with this convention. How would I
> >> remove an individual profile from a collection? Or, worse, what if
> >> points needed to be added or removed from an individual profile? I'd
> >> have to regenerate the entire netCDF file in the latter case. That
> >> makes this convention only practical as an archive format.
> >
> > Some variants are optimal for archival, others for dynamic
> > modification. The backwards linked list is optimal for adding
> > arbitrary amounts of data efficiently, but it's pretty bad when you
> > read it. My intention is to give standard options that the user can
> > choose depending on need.
> > If you want to throw me a use case, I'll try to give you a concrete
> > solution.
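One possible reading of the backwards linked list trade-off mentioned above, sketched in plain Python. The class and field names (`BackwardLinkedObs`, `prev_obs`, `last_obs`) are hypothetical and are not taken from the draft spec. Each observation records the index of the previous observation for the same station, so appending touches only the end of the arrays, while reading one station's series means following the chain backwards through the file.

```python
class BackwardLinkedObs:
    """Observations stored in arrival order, chained backwards per station."""

    def __init__(self):
        self.values = []     # one entry per observation, in arrival order
        self.prev_obs = []   # index of the previous obs for the same station, or -1
        self.last_obs = {}   # station id -> index of its most recent obs

    def append(self, station, value):
        """Add one observation; nothing already written is modified."""
        index = len(self.values)
        self.values.append(value)
        self.prev_obs.append(self.last_obs.get(station, -1))
        self.last_obs[station] = index
        return index

    def series(self, station):
        """Walk the chain backwards, then reverse into arrival order."""
        out = []
        i = self.last_obs.get(station, -1)
        while i != -1:
            out.append(self.values[i])
            i = self.prev_obs[i]
        out.reverse()
        return out
```

In a file, the chain walk in `series` would translate into scattered reads, which is one way to understand why this layout is cheap to extend but slow to read.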
> >
> >> An alternative would be to store each individual
> >> profile/trajectory/time series in a separate netCDF file. Collections
> >> would consist of a set of netCDF files stored in a zip or jar
> >> file. The zip file could also contain some sort of (XML?) manifest
> >> file that could contain metadata about the collection as a whole. Any
> >> metadata associated with an individual profile would be stored as a
> >> global attribute in the appropriate netCDF file. Editing a profile
> >> would be as simple as extracting the netCDF file from the archive,
> >> rewriting it, and then storing it back in the jar file.
> >
> > This is a good solution sometimes, but not generally. Many small files
> > are not optimal for large archives. We are having trouble on
> > motherlode right now with excessive inode consumption. Unzipping is
> > too costly if the data is accessed often.
> >
> >> To make it even easier for consumers of this data, I would also
> >> restrict the data type of all variables to double. Also, all four
> >> x,y,z,t coordinates would be required.
> >
> > I also lean toward requiring x,y,z,t coordinates, but others aren't so
> > sure. Note this is not the same as having x,y,z,t dimensions. In fact
> > this is a very important part of the proposal that deserves to be
> > highlighted.
> >
> > I'm claiming that the general way to do coordinate systems for this
> > kind of data looks something like
> >
> > variables:
> > float lon(obs);
> > float lat(obs);
> > float z(obs);
> > double time(obs);
> >
> > float dataVar(obs);
> > dataVar:coordinates = "lon lat z time";
> >
> > rather than follow gridded data conventions like COARDS and use
> > variations of:
> >
> > float dataVar(t,z,y,x);
> >
> > I think this is what you are saying below.
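A small sketch of how a reader might use the `coordinates` attribute shown above to find the location of each observation. The nested dictionary is a hypothetical in-memory stand-in for a netCDF file, and the function name and sample values are made up for illustration; only the variable names and the attribute string come from the CDL above.

```python
def coordinates_for(dataset, var_name, obs_index):
    """Return {coordinate name: value} for one observation of var_name."""
    attr = dataset[var_name]["attributes"]["coordinates"]
    # The attribute is a blank-separated list of coordinate variable names.
    return {name: dataset[name]["data"][obs_index] for name in attr.split()}


# Hypothetical stand-in for a file with three observations along "obs".
dataset = {
    "lon": {"data": [-105.0, -104.5, -104.0], "attributes": {}},
    "lat": {"data": [40.0, 40.5, 41.0], "attributes": {}},
    "z": {"data": [1600.0, 1650.0, 1700.0], "attributes": {}},
    "time": {"data": [0.0, 3600.0, 7200.0], "attributes": {}},
    "dataVar": {
        "data": [280.1, 281.4, 279.9],
        "attributes": {"coordinates": "lon lat z time"},
    },
}
```

The point of the sketch is that the data variable has a single `obs` dimension and its coordinates are looked up by name, rather than being implied by `t`, `z`, `y`, `x` dimensions.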
> >
> >> Some examples (from your CDL examples):
> >>
> >> Collection of point data
> >> ------------------------
> >> Unchanged (just one file in archive)
> >>
> >> Collection of profile data
> >> --------------------------
> >> For each netCDF file:
> >>
> >> variables:
> >> double lon(1);
> >> double lat(1);
> >> double z(obs);
> >> double time(1);
> >>
> >> double humidity(obs);
> >> double temperature(obs);
> >> double pressure(obs);
> >>
> >> Collection of trajectories
> >> --------------------------
> >> For each netCDF file:
> >>
> >> variables:
> >> double lon(obs);
> >> double lat(obs);
> >> double z(obs);
> >> double time(obs);
> >>
> >> double humidity(obs);
> >> double temperature(obs);
> >> double pressure(obs);
> >>
> >> Station time series
> >> -------------------
> >> variables:
> >> double lon(1);
> >> double lat(1);
> >> double z(1);
> >> double time(obs);
> >>
> >> double temperature(obs);
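The three per-file layouts above differ only in which coordinates have length 1 and which vary along `obs`. A reader could, under that assumption, treat them uniformly by repeating the length-1 coordinates for every observation. The function name `as_points` and the sample values are hypothetical; this is a sketch of one way to consume such files, not part of the proposal.

```python
def as_points(coords, n_obs):
    """Expand coordinates to one (lon, lat, z, time) tuple per observation.

    coords maps each coordinate name to a list of length 1 or n_obs.
    """
    def expand(name):
        values = coords[name]
        if len(values) == n_obs:
            return list(values)
        if len(values) == 1:
            return list(values) * n_obs  # repeat the single value for every obs
        raise ValueError(f"{name} has length {len(values)}, expected 1 or {n_obs}")

    columns = [expand(name) for name in ("lon", "lat", "z", "time")]
    return list(zip(*columns))


# A profile: fixed lon, lat and time, with z varying along obs.
profile = {"lon": [10.0], "lat": [50.0], "z": [0.0, 5.0, 10.0], "time": [100.0]}
points = as_points(profile, 3)
```

A trajectory would pass four lists of length `n_obs`, and a station time series would pass length-1 `lon`, `lat` and `z` with a full-length `time`.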
> >
> > I think this looks fine, except I want to also cover the case where
> > someone needs to put more than one thing in a file.
> >
> > Thanks for your input.
> >
> > John
>