Review of Proposed NetCDF Conventions

John Sheldon (jps@GFDL.GOV)
Tue, 8 Jul 1997 18:12:58 -0400 (EDT)

Hi again Jonathan,

As promised, I am sending along a more detailed review of your proposed
netCDF conventions. (For those reading along, this proposal is
available at http://www-pcmdi.llnl.gov/drach/netCDF.html)  This is
prefaced by a few general comments, along with some musings about how
netCDF might best satisfy the needs of users and applications
developers.

General Comments
----------------

 1. Conventions ought to be as simple and undemanding as possible,
     to make the use of netCDF as easy as possible.  This may sound
     like a platitude, but one reason for netCDF's popularity to date
     has been the ease with which users can get started.  And we all
     know how critical it is for the viability of a software product
     that it first be widely popular.

 2. We (the netCDF community, or at least the ocean-atmosphere subset
     of us) might want to consider defining a "core" set of basic
     conventions, with extensions to suit specific groups.  These core
     conventions should be broadly applicable and include basic
     principles for storing and documenting certain types of
     information.  Groups specializing in climate, NWP, oceanography,
     observational datasets, etc. could define additional conventions to
     facilitate data exchange amongst themselves, so long as the files
     are consistent with broader conventions.

 3. As I mentioned in my previous mail, I am, in general, opposed to
     the use of external tables.  While they can be handy for centers
     which exchange much the same data all the time, it is problematic
     for researchers who can sometimes get quite "imaginative" and end
     up with things that aren't in the "official" table.  Complicating
     things is the fact that there often tends to be more than one
     "official" table. However, the fact that you are not replacing a
     description with a code from a table reduces the problem
     tremendously, so I'll go along on this one given its evident
     utility.

 4. There seems to be a preference in your proposal for associating
     additional qualities with the axes/coordinate-variables themselves
     (eg, contraction, bounds, subcell, ancillary coordinate
     variables).  While this might be a clever way to associate the
     added info with a multitude of data variables, it may also lead to
     an expansion in the number of dimensions, since all this
     additional information may not be applicable to every data
     variable employing that dimension.  In that case, a new dimension
     which is essentially identical to an existing dimension will have
     to be created.  The alternative, historically, has been to use
     referential attributes attached to the data variables to
     specifically associate alternate sources of information.  (See
     http://www.unidata.ucar.edu/software/netcdf/NUWG/draft.html)  These
     are also more general, as they are not limited to 1-D.  (A sketch
     follows these general comments.)

 5. Your proposal does not rule out the use of referential attributes,
     but neither does it endorse or exploit them. Any particular
     reason?  More generally, it would certainly be helpful (and brave)
     if you would let us all know your thoughts along the lines of the
     recent discussion concerning multidimensional coordinates.

 6. Please, give us some (many!) CDL examples!
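
To illustrate comment 4 (and to honor comment 6!), here is a minimal
sketch of a referential attribute in the NUWG style; the names are
purely illustrative:

         dimensions:
             lat = 10;
             lev = 14;

         variables:
             float temp(lev,lat) ;
                   temp:lev = "sigma" ;   // attribute named for the "lev"
                                          //  dimension points at the
                                          //  coordinate info for "temp"
             float sigma(lev) ;
             float lat(lat) ;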



Specific comments:
------------------

 - Section 3: Data types
    I am not really a Comp-Sci person, so it's possible I'm missing
    something critical about "byte" vs. "char" here. I've already
    learned to cope with the signedness differences between our SGI's
    and our Cray T90, and it didn't seem that difficult.  But the
    proposed change does mean existing applications will have to be
    modified, never an exciting prospect.  Also, I'm not sure how to
    handle "char" in any mathematical expressions (eg, decompression).
    What is it that "may become ambiguous in future versions of netCDF"
    that is driving this?

 - Section 8: Axes and dimensionality of a data variable, and
   Section 9: Coordinate variables and topology
    In the spirit of simplicity, I don't think I would make storage of
    coordinate variables mandatory if they are simply 1,2,3,... (though
    *we* always do store them).  The generally accepted default of
    assigning coordinates of "1, 2, 3, etc." seems reasonable, and most
    software already seems to handle this.

    I suppose the ability to define 0-dimensional variables could come
    in handy, though such a quantity is probably more appropriately
    stored as a global attribute.  At least one plotting package here
    (Ferret) cannot handle this, however.  BTW, 0-D variables are an
    extension to COARDS - you do not call attention to this.

    I can see that there might be some use for variables with
    more than 4 dimensions, but this is likely to frustrate some
    existing utilities.

    I very much like singleton dimensions. People and utilities are too
    quick to toss a dimension when "contracting" (eg, averaging,
    extracting), when there is still usable placement information.
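
    For example, after averaging in pressure I would want to be able
    to keep something like this (a sketch; the value is illustrative):

         dimensions:
             pres = 1;
             lat = 10;

         variables:
             float ubar(pres,lat) ;   // wind averaged onto a single level
             float pres(pres) ;       // singleton axis still records the
                                      //  vertical placement, eg, 500.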

 - Section 11: Units
    Exploiting the units database of UDUNITS is fine, but I am less
    comfortable relying on the syntax of any particular package.  What
    does this gain us, especially if this is not an approved method of
    "compression" (although it does serve as such)?

 - Section 12: Physical quantity of a variable
    - "units": I would like to see "none" added as a legitimate
       characterization, as it would serve as a definite affirmation
       that the variable really does have no units.
    - "quantity": any time it is proposed that something be made
       *mandatory*, I have to consider it long and hard.  In this case,
       it seems that the existing COARDS approach is already adequate
       for deducing the sense of lat/lon/vertical/time dimensions.  Why
       is "quantity" so necessary?  There also seems to be a potential
       failure mode here in that someone could encode incompatible
       "units" and "quantity".  Nevertheless, I must concede that use
       of a "quantity" out of some "official" table would make the
       precise definition less prone to misinterpretation.
    - "modulo": simple as it is, this is the clearest explanation I've
       seen.

 - Section 16: Vertical (height or depth) dimension
    First, it seems as though you are proposing that the utility of the
    COARDS "positive" attribute be replaced by the sense conferred on
    the axis by its "quantity".  If so, I don't agree.  The presence of
    "positive" in the file is preferable to a look-up in an extra
    table.

    Second, the direction of "positive" is not merely a display issue.
    It is a critical piece of information that defines the "handedness"
    of the coordinate system being defined.

    Third, I'm not sure that the vertical dimension is necessarily
    recognizable from the order of the dimensions.  What about a "mean
    zonal wind" that is Y-Z?

    Fourth, you rightly note that requiring units for a vertical
    coordinate which has no units "means defining dimensionless units
    for any dimensionless quantity one might wish to use for that
    axis".  However, rather than be concerned with some inconsistency
    of treatment relative to data variables, this brings up a larger
    issue, namely:  How does one recognize that an axis is "vertical"
    if it is not a standard "quantity" and does not employ units that
    look "vertical"?  *Furthermore*, how does one recognize what
    direction *any* axis points in if the "quantity" is not
    georeferenced and the units are nondescript?  For example, a
    channel model here uses Cartesian horizontal coordinates
    (units=meters) and a "zeta" hybrid vertical coordinate.  Our local
    solution to this dilemma is to attach an attribute "cartesian_axis"
    to each coordinate variable that indicates to downstream
    applications which (if any) cartesian axis each dimension is
    associated with (values are "X|Y|Z|T|N").  Without this
    information, we'd have to simply assume that the axes are X-Y-Z
    order (ie, we can't tell that "zonal mean wind" is oriented Y-Z).
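
    In CDL, our local convention looks something like this (a sketch;
    the names are illustrative):

         dimensions:
             y = 30;
             zeta = 14;

         variables:
             float u(zeta,y) ;    // "zonal mean wind", oriented Y-Z
             float y(y) ;
                   y:units = "meters" ;
                   y:cartesian_axis = "Y" ;
             float zeta(zeta) ;   // dimensionless hybrid coordinate
                   zeta:cartesian_axis = "Z" ;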

 - Section 17: Ancillary coordinate variables
    You might want to emphasize that this is a lead-in to sections
    18-21, which describe different kinds of "ancillary coordinate
    variables".  One possible problem with the proposed definition is
    that it is limited to 1-D.  Thus, even if I calculate and store the
    altitude of all the points in my sigma model, I can't associate it
    with the arrays of temperature in sigma coordinates, etc.

    Another possible problem is that this ancillary information might
    not be applicable to all data variables employing that dimension.
    (See general comments above.)  

 - Section 19: Associated coordinate variables
    There is already a mechanism for "associating" additional
    coordinate information without requiring yet another defined
    attribute name: Referential attributes.  I have typically seen them
    attached to data variables, but I see no reason why they could not
    be attached to main coordinate variables, too.
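
    For example (a sketch, with illustrative names):

         dimensions:
             lev = 14;

         variables:
             float lev(lev) ;            // model level number
                   lev:lev = "height" ;  // referential attribute on the
                                         //  main coordinate variable
             float height(lev) ;         // associated coordinate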

 - Section 21: Boundary coordinate variables
    Is there any particular reason why you made the additional
    dimension the slowest varying dimension?  The information is all
    there, of course, but my intuition would like to see the min's and
    max's interleaved.
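
    That is, I would find something like this more natural (a sketch):

         dimensions:
             lat = 10;
             bound = 2;

         variables:
             float lat(lat) ;
             float lat_bound(lat,bound) ;   // (min,max) pairs interleaved
                                            //  (fastest varying), rather
                                            //  than lat_bound(bound,lat)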

 - Section 23: Contracted dimensions
    I definitely like the idea of a "contraction" attribute to
    document, in a general way, the operation that was performed.
    Although I haven't tried either this approach or that from the NCAR
    CSM conventions, I think this will be more general.  We should,
    though, get together and agree on a set of valid strings (eg, "min"
    vs. "minimum").

    However, there might be a problem with the assertion that the
    contracted dimension is of size 1.  How would I store, and
    document, say, a time-series of monthly means?
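
    What I would want, I think, is something like this (a sketch,
    assuming the "contraction" attribute attaches to the coordinate
    variable, and one which seems to run afoul of the size-1
    assertion):

         dimensions:
             time = 12;
             lat = 10;
             lon = 20;

         variables:
             float tbar(time,lat,lon) ;   // twelve monthly means
             double time(time) ;
                    time:contraction = "mean" ;   // contracted over days,
                                                  //  yet "time" has size 12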

    I'm still not sure I understand simultaneous contractions. A CDL
    example would help here.

    I am still trying to figure out how I would use these bits of
    metadata to store some data that I, personally, have had to deal
    with.  We take 3-D winds and calculate kinetic energy, then take a
    vertical average (mean) over pressure from some pressure level
    (say, 50mb) down to the earth's surface.  Now, since the surface
    pressure varies two-dimensionally, it seems that a dimension, being
    1-D, will not be adequate to store the information about the
    integration bounds.  Any idea how I would do this? 

 - Section 24: Time axes
    Suppose I have an idealized model that is not associated with any
    calendar - it just ticks off hour after hour.  How would I be
    allowed to specify time in this case?
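
    That is, I would like to be allowed to write simply (a sketch):

         dimensions:
             time = UNLIMITED;

         variables:
             double time(time) ;
                    time:units = "hours" ;   // no reference time,
                                             //  no calendar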

 - Section 26: Non-Gregorian Calendars
    You picked the string "noleap" to indicate a calendar with 365 days
    per year, but UDUNITS already has "common_year" defined for this.
    Any particular reason for not using that?

    Also, I would like to lobby to add "perpetual" (some of our
    experiments use, eg, perpetual-January conditions), "none" (see
    above), and "model".  A "model" calendar would take the place of
    your "360" and would allow for some number other than 360.  Of
    course, you'd need an additional auxiliary attribute to specify
    just what that number is, maybe in terms of "days per month".  For
    a "perpetual" calendar, you'll also need an additional attribute to
    indicate the Julian day of the year for which the "perpetual"
    conditions are held constant.
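
    In CDL, I imagine something along these lines (the auxiliary
    attribute names are purely illustrative):

         dimensions:
             time = UNLIMITED;
             ptime = 124;

         variables:
             double time(time) ;
                    time:calendar = "model" ;
                    time:days_per_month = 30 ;   // would replace "360"
             double ptime(ptime) ;
                    ptime:calendar = "perpetual" ;
                    ptime:perpetual_day = 15 ;   // Julian day of the year
                                                 //  held constant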

 - Section 27: Unimonth calendar
    I see the attraction of being able to convert between any calendar
    and "unitime".  This is the same thing many of us do when we store
    date/time as YYYYMMDDHHNNSS.  I'm not opposed to this, but I
    wouldn't want to use it in place of a UDUNITS-type representation.
    Hopefully, UDUNITS will someday handle additional calendars. (Hint,
    hint!)

    One difficulty with this sort of an axis is that an otherwise
    continuous coordinate now becomes non-continuous; ie, there are lots
    of what appear to be "breaks" in the timeline.  Utilities can be
    made to realize that some processing is needed, but this will
    require more work.

    A (much) more significant difficulty w.r.t. calendars is comparing
    variables which use two different calendars; eg, a climate model
    run using 360-day years vs. observed data using the Gregorian
    calendar.  As far as I know, there really hasn't been any definitive
    method laid out for doing this.  My intuition tells me that there
    really is no universal way to do it, since these two "planets"
    don't really move the same way - each quantity is going to have
    problems with one cycle or another.  You might still have some luck
    defining a method for certain types of quantities by converting to,
    say, "fractional year". Annual and seasonal (ie, 1/4-year) averages
    might be OK.  But how would you define and compare June averages
    from planet "A" with planet "B"?  Suppose you wanted to calculate
    4xdaily difference maps over the course of a year?  If you do it
    based on the definition of a day, one planet gets back to winter
    slightly ahead of the other.  If you do it in some sort of
    fractional year, the diurnal cycles get out of synch.  Any ideas?

 - Section 28: Intervals of Time
    Here we might have a conflict with my desire to be able to store
    time simply as "days" or "hours" in the case of an idealized model
    which has no calendar-sense. (See section 24 comments.)  With your
    proposal, such units would be interpreted as an *interval* of
    time.  Of course, that's sort of what it is if you take "time" as
    being relative to the start of an integration, but I don't think
    that's what you want, since one might still wish to calculate a
    time "interval" for idealized cases, too.

    I'm not sure how else one might handle this, though...

    Also, unless I'm missing something, storing "monthly" data by
    defining a "unitime" axis with units of "days" doesn't necessarily
    buy us more than a "time" axis with units of "months".  Both are
    "non-standard" units that can be interpreted only after determining
    the calendar type.

 - Section 29: Multiple time axes and climatological time
    It took me some time to grasp (I think) what you are doing here.
    This seems a clever way to specify the time of data points by using
    a set of "benchmark" points and another set of points that measure
    time "wrt" those benchmark points.  But some CDL examples
    demonstrating its usefulness are critical here.  It would seem that
    the same information could be stored using other constructs in your
    proposal, without the (significant) complications introduced by a
    two-dimensional time coordinate.  What would a time series of June
    means look like with and without "wrt"?
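
    Without "wrt", my guess is that it would look something like this
    (a sketch using the contraction machinery of section 23; no doubt
    wrong in the details):

         dimensions:
             time = 10;   // ten Junes
             lat = 10;
             lon = 20;

         variables:
             float tbar(time,lat,lon) ;
             double time(time) ;
                    time:units = "days since 1990-1-1" ;
                    time:contraction = "mean" ;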

 - Section 30: Special surfaces
    Here, again, I am uncomfortable with the use of an external table,
    at least in the context of any "core" conventions.  If it were
    necessary to document a variable which is located at, eg, the
    earth's "surface", one could use a referential attribute to record
    the vertical location of each point:
    
         dimensions:
	     lon = 20;
	     lat = 10;
	     zsfc = 1;
	 
	 variables:
	     float tstar(zsfc,lat,lon) ;
	           tstar:zsfc="zstar" ;

	     float zstar(lat,lon) ;
	     float lat(lat) ;
             float lon(lon) ;


 - Section 31: Invalid values in a data variable
    As I mentioned before, I heartily agree with your distinction
    between "invalid" and "missing" values.  In practice, both are (or
    should be) outside the "valid_range", in which case (most)
    utilities/packages know to ignore them.  But this is real, valuable
    information content that has not been exploited in any conventions
    to date.

    I'm not sure I agree with the inferences you are requiring on the
    part of generic applications when it comes to using _FillValue to
    define a default "valid_range".  I guess if I were writing such an
    application, this could make some sense.  Still, the definition of 
    _FillValue as an invalid value should be enough to simply tell me
    to ignore that point, nothing more.  
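
    That is, I would rather see things spelled out explicitly (a
    sketch):

         dimensions:
             lat = 10;
             lon = 20;

         variables:
             float temp(lat,lon) ;
                   temp:valid_range = 170.0f, 350.0f ;  // explicit; nothing
                                                        //  to infer
                   temp:_FillValue = -1.0e30f ;         // invalid; just
                                                        //  ignore the point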

 - Section 32: Missing values in a data variable
    I think that the data should be checked against the "missing_value"
    *before* unpacking.  First, I think there is already a pretty
    strong convention that "missing_value" be of the same type as the
    data.  Second, some packages simply display the packed values, and
    they wouldn't be able to detect missing values. Third, I've been
    burned and confused often enough by varying machine precision to be
    quite shy of comparing computed values.

    However, handling missing values when unpacking packed data does
    present a real problem!  Imagine a subroutine which unpacks, say,
    SHORT values into a FLOAT array.  This routine will be able to
    reliably detect missing values, but what value is it to put in the
    FLOAT array?  We solve this by storing a global FLOAT attribute
    which specifies this number.  If a file has no such attribute, we
    stuff a default value in it.  In any case, we inform the user of
    what was used.
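
    In CDL, our approach looks roughly like this (the global attribute
    name is our own invention):

         dimensions:
             lat = 10;
             lon = 20;

         variables:
             short temp(lat,lon) ;
                   temp:scale_factor = 0.01f ;
                   temp:add_offset = 273.15f ;
                   temp:missing_value = -32767s ;   // same type as the
                                                    //  packed data

         // global attributes:
                   :unpacked_missing_value = -1.0e30f ;  // stuffed into the
                                                         //  unpacked FLOAT
                                                         //  array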

 - Section 33: Compression by gathering
    This seems like a compact way to store variables with large areas
    of no interest, but it is kind of complicated. Something like the
    following might be more intuitive:
    
         dimensions:
	     lon = 20;
	     lat = 10;
	     landpoint=96;
	     zsfc = 1;
	 
	 variables:
	     float soilt(zsfc,landpoint) ;
	           soilt:zsfc="zstar" ;
		   soilt:landpoint="mask" ;  // keyword = "mask" means to 
		   soilt:mask="landmask" ;   //  look for attrib="mask"

	     float zstar(lat,lon) ;
	     byte  landmask(lat,lon) ;   // contains 96 1's, the rest 0's
	     float lat(lat) ;
             float lon(lon) ;

------------------------------------------------------------------------

That should do it!  I hope I don't sound too negative. Quite the
opposite, in fact: I sincerely hope that your work prompts an update
to the existing conventions.  Much of what you propose is new, and
quite necessary.

Cheers-

John P. Sheldon 
(jps@gfdl.gov) 
Geophysical Fluid Dynamics Laboratory/NOAA 
Princeton University/Forrestal Campus/Rte. 1 
P.O. Box 308 
Princeton, NJ, USA  08542

(609) 987-5053 office
(609) 987-5063 fax
---
    No good deed goes unpunished.
---