netCDF Conventions

Harvey DAVIES (hld@dit.csiro.au)
Tue, 15 Jul 1997 00:07:04 +1000 (EST)
Next message: Brian Eaton: "Generalized coordinate variables"
Previous message: Gary Granger: "global map attributes"
Next in thread: Gary Granger: "Re: netCDF Conventions"
It has been pleasing to read all the recent postings on proposed conventions.
A lot of good work has gone into these.  It is obvious that there is a real
felt need to define and refine netCDF conventions.  I hope these comments
of mine will facilitate this process.


ABBREVIATIONS USED

I will refer to Gregory, Drach and Tett's "Proposed netCDF conventions for
climate data" using their initials 'GDT'.  I will abbreviate "netCDF User's
Guide for C" to 'NUGC'.


WHAT KIND OF CONVENTIONS ARE DESIRABLE?

The relevance of many of the issues raised by GDT & others is not restricted
to climate data.  I would like to see some (but not too many) additional
generic conventions adopted with the same status as NUGC Sections 2.3.1
(Coordinate Variables) and 8.1 (Attribute Conventions).  I suggest there
should be a separate chapter in NUGC for conventions, including those now in
Sections 2.3.1 and 8.1.  There is no reason why the only standard names should
be those of attributes.  So dimensions and variables could also have standard
names.

I develop generic software and the existence of non-generic conventions
worries me because my software does not take these into account.  Some
conventions (e.g. using 'lat' as a standard name for latitude) are no problem
if this is just to assist humans.  But my software certainly does not treat
'lat' as special and I suspect it is unreasonable to expect it to do so.

It is important that software documentation specify conventions used.  So, for
example, a particular geographically oriented package might have conventions
such as:
- the maximum rank is 4
- dimensions must be named 'time', 'height', 'latitude', 'longitude'
- dimensions must appear in this order (but some may be omitted)

I consider such conventions too restricting for a field as broad as 'climate
data'.  It would be useful to have lists of conventions adopted by various
packages.  It might then be possible to find a reasonable set of common
conventions.

I fully agree with John Sheldon (ncdigest 405) that "Conventions ought to be
as simple and undemanding as possible".  They should be as few, general and
orthogonal as possible.  I have been disappointed at the poor level of
conformance to the current conventions in NUGC.  Section 8.1 is quite short
and standardises only 15 attribute names (of which several are specified as
ignorable by generic applications).

Yet I have encountered far too many netCDF files which contravene Section 8.1
in some way.  For example, we are currently processing the NCEP data set from
NCAR.  An extract follows. It is obvious that a great deal of effort has gone
into preparing this data with lots of metadata and (standard and non-standard)
attributes, etc.  But it is also obvious that there cannot be any valid data
because the valid minimum (87000) is greater than the maximum short (32767)!
And Section 8.1 states that the type of valid_range should match that of the
parent variable i.e. should be a short not a float.  Obviously the values
given are unscaled external data values rather than internal scaled values.

          short slp(time, lat, lon) ;
                slp:long_name = "4xDaily Sea Level Pressure" ;
                slp:valid_range = 87000.f, 115000.f ;
                slp:actual_range = 92860.f, 111360.f ;
                slp:units = "Pascals" ;
                slp:add_offset = 119765.f ;
                slp:scale_factor = 1.f ;
                slp:missing_value = 32766s ;
                slp:precision = 0s ;

It would be useful to have a utility which checked netCDF files for
conformance to these conventions.  It could also provide other data for
checking validity such as counting the number of valid and invalid data
elements.
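
As a rough illustration of what such a checker might test, the following
Python sketch applies two of the Section 8.1 rules to the slp extract above.
The names TYPE_LIMITS, check_valid_range and to_internal are invented for this
example and do not belong to any netCDF library; the attribute values are
simply typed in by hand rather than read from a file.

```python
# Representable range of each packed integer type (CDL type names).
TYPE_LIMITS = {
    "byte": (-128, 127),
    "short": (-32768, 32767),
    "int": (-2147483648, 2147483647),
}

def check_valid_range(var_type, attr_type, valid_range):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    # Section 8.1: valid_range should have the same type as its variable.
    if attr_type != var_type:
        problems.append(f"valid_range is {attr_type} but variable is {var_type}")
    # If the whole range lies outside what the type can hold, nothing is valid.
    low, high = TYPE_LIMITS[var_type]
    vmin, vmax = valid_range
    if vmin > high or vmax < low:
        problems.append("valid_range excludes every value the type can hold")
    return problems

def to_internal(external, scale_factor, add_offset):
    """Convert an external (unpacked) value to the internal (packed) value."""
    return round((external - add_offset) / scale_factor)

# Values copied from the slp extract.
problems = check_valid_range("short", "float", (87000.0, 115000.0))
internal_range = (to_internal(87000.0, 1.0, 119765.0),
                  to_internal(115000.0, 1.0, 119765.0))
```

Both checks fail for slp.  If the attribute values are instead read as external
values, the equivalent internal range works out to -32765 to -4765, which does
fit in a short.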

I guess I have to take some of the blame.  I was one of the authors of NUGC and
I was largely responsible for rewriting Section 8.1 last year while I was
working at Unidata.  I tried to make it clearer and simpler.  In particular, I
tried to simplify the relationship between valid_range, valid_min, valid_max,
_FillValue and missing_value.  But it seems that we have failed to make the
current conventions sufficiently clear and simple.

And we need to be careful not to make it even harder for the writers of netCDF
files by defining so many conventions that they need to be 'netCDF lawyers'.


NAMING CONVENTIONS AND RECOMMENDATIONS

I like Russ Rew's suggestion of some use for global attributes whose names
match those of variables or dimensions.  But I am not sure what the best use
might be!

I also like Russ Rew's suggestion of allowing periods ('.'s) in names.

I think there should be a recommendation that names consist of whole words
unless there is some strong reason to do otherwise.  So 'latitude' would be
preferred to 'lat'.  Note that such full-word variable names often obviate the
need for a 'long_name' attribute.

GDT suggest avoiding case sensitivity.  I do not think this is the kind of
thing which should be standardised in a convention.  Instead it should be
recommended as good practice.  But it is sometimes natural to use names such
as "x" and "X".

I suggest recommending that dimension names should be singular rather
than plural.  Thus 'point' is better than 'points'.


COORDINATE VARIABLES

There is a need to allow string values (e.g. Station names) for coordinate
variables.  But I disagree with Russ Rew on allowing 2D NUMERIC coordinate
variables for such things as dates.

Such names are essentially nominal values with no inherent ordering.  But
dates are ordered and more simply represented by a single number than by some
kind of multi-base number.  I am yet to be convinced that any of the
multi-dimensional coordinate variable ideas is basic enough to deserve
adoption.

GDT suggest several new attributes for coordinate variables. In particular my
impression is that they propose representing longitude somewhat as follows:

float lon(lon);
    lon:long_name = "longitude";
    lon:quantity = "longitude";
    lon:topology = "circular";
    lon:modulo = 360.0f;
    lon:units = "degrees_east";

There is a lot of redundancy here, especially if 'lon' is the standard name
for longitude. I would prefer to replace the above by:

float longitude(longitude);
    longitude:modulo = 360.0f;
    longitude:units = "degrees_east";

Here 'longitude' is the standard name for longitude but this is relevant only
to users, not software.  The special properties which software needs to know
about are given by the attributes 'modulo' and 'units'.

The other proposed attributes 'quantity' and 'topology' do not appear to
provide any useful additional information. But I do like the idea of 'modulo'
for cyclic variables such as longitude.  I suggest the monotonicity
requirement should be relaxed if modulo is specified. Instead there should be
a uniqueness requirement.  So the longitudes could be (0 90 180 -90) but not
(0 90 180 270 360) since 360 is equivalent to 0. The uniqueness requirement
would disallow any value included in a previous interval.  So the total range
would have to be less than 360.  The following would be illegal:  (-180 -90 0
315) since 315 is equivalent to -45 which is covered by the second interval
from -90 to 0.
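
One possible reading of this uniqueness rule, sketched in Python: unwrap each
value to the smallest equivalent value greater than its predecessor, then
require the total span to be less than the modulo.  The function names are
hypothetical, and the sketch assumes the coordinate runs in the increasing
direction.

```python
import math

def unwrap_increasing(values, modulo):
    """Replace each value by the smallest equivalent value (mod `modulo`)
    that is strictly greater than the previous unwrapped value."""
    unwrapped = [values[0]]
    for v in values[1:]:
        prev = unwrapped[-1]
        k = math.floor((prev - v) / modulo) + 1
        unwrapped.append(v + k * modulo)
    return unwrapped

def is_valid_cyclic(values, modulo):
    """True if no value repeats or falls inside an earlier interval."""
    unwrapped = unwrap_increasing(values, modulo)
    return unwrapped[-1] - unwrapped[0] < modulo

ok = is_valid_cyclic([0, 90, 180, -90], 360)            # -90 unwraps to 270
wraps = is_valid_cyclic([0, 90, 180, 270, 360], 360)    # 360 repeats 0
overlap = is_valid_cyclic([-180, -90, 0, 315], 360)     # 315 repeats -45
```

With the three longitude lists discussed above, only the first passes.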

Some cyclic variables (e.g. month of year) have a non-zero origin.  So we
could also have an attribute called say 'modulo_origin' (default 0) as in:

short moy(moy);
   moy:long_name = "month of year";
   moy:modulo = 12;
   moy:modulo_origin = 1;

but I doubt whether this is really worthwhile.

I wish to propose allowing missing (invalid) values in coordinate variables.
All corresponding data in the main variable would also have to be missing.  In
particular this would simplify the problem of calendar dimensions which GDT
discuss.  You could simply allocate 31 days to every month and set data for
illegal dates (e.g. 30 Feb) to a missing value.  Note that the extra space
required is only 1.8%.
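
The 1.8% figure can be checked with a few lines of Python.  This sketch assumes
the figure is the share of the 31-day slots that are padding, and that a year
averages 365.25 days; both are interpretations rather than anything stated
above.

```python
slots_per_year = 12 * 31        # every month is allocated 31 days
mean_days_per_year = 365.25     # assumption: one leap year in four

# Fraction of the allocated slots that hold no real date.
padding_fraction = (slots_per_year - mean_days_per_year) / slots_per_year
print(f"{padding_fraction:.1%}")   # prints 1.8%
```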

I disagree with GDT's suggestion that every dimension have a coordinate
variable.  This would triple the space required for the following time series:

    dimensions:
	time = 1000000;
    variables:
	float time(time);
	short temperature(time);
	    temperature:add_offset = 300.0f;
	    temperature:scale_factor = 0.1f;

Note that it is not possible in this case to use a short for time, since
1000000 different values are needed.

It would be nice (especially for such time series) to have an efficient way of
specifying a  coordinate variable with constant steps i.e. an arithmetic
progression (AP).  I propose doing this by replacing the rule that the shape
of a coordinate variable consist of just a dimension with the same name. The
new rule should allow any single dimension with any size (including 0).
(There is of course also the issue of whether multiple dimensions should be
allowed.)  Then any trailing undefined elements of a coordinate variable
would be defined as an AP as follows:

If the coordinate variable has size > 1 then:
    AP = var(0), var(1), ..., var(size-1), a+d, a+2d, a+3d, ...
	where a = var(size-1)
	and   d = var(size-1) - var(size-2)

If size = 1 then d defaults to 1 so
    AP = var(0), var(0)+1, var(0)+2, var(0)+3, ...

If size = 0 then a defaults to 0 and d defaults to 1 so
    AP = 0, 1, 2, 3, ...

If no coordinate variable then again
    AP = 0, 1, 2, 3, ...
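
These rules are simple enough to sketch in Python.  The function name
expand_coordinate is hypothetical; 'stored' stands for the values actually
held in the coordinate variable (or an empty list if there is no coordinate
variable) and 'length' for the size of the dimension being described.

```python
def expand_coordinate(stored, length):
    """Extend the stored coordinate values to `length` elements by
    continuing them as an arithmetic progression (AP)."""
    if not stored:
        # Size 0, or no coordinate variable at all: 0, 1, 2, 3, ...
        return list(range(length))
    if len(stored) == 1:
        step = 1                        # d defaults to 1
    else:
        step = stored[-1] - stored[-2]  # d = var(size-1) - var(size-2)
    last = stored[-1]                   # a = var(size-1)
    values = list(stored[:length])
    n_generated = length - len(values)
    values.extend(last + step * k for k in range(1, n_generated + 1))
    return values

from_two = expand_coordinate([100.0, 100.5], 5)
from_one = expand_coordinate([7], 3)
from_none = expand_coordinate([], 4)
# A size-0 variable with scale_factor 0.5 and add_offset 100.0:
scaled = [100.0 + 0.5 * v for v in expand_coordinate([], 3)]
```

Storing two values, or storing none and relying on scale_factor and
add_offset, both describe the same progression 100, 100.5, 101, ...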

So if the time vector is the AP (100, 100.5, 101, ...)  days, then the above
example could be written as either:

    dimensions:
        time = 1000000;
	zero = 0;
    variables:
        int time(zero);  // datatype is irrelevant
	    time:add_offset = 100.0;
	    time:scale_factor = 0.5;
	    time:units = "days";
        short temperature(time);
            temperature:add_offset = 300.0f;
            temperature:scale_factor = 0.1f;

or:

    dimensions:
        time = 1000000;
	two = 2;
    variables:
        double time(two);
	    time:units = "days";
        short temperature(time);
            temperature:add_offset = 300.0f;
            temperature:scale_factor = 0.1f;
    data:
	time = 100.0, 100.5;


UNITS

I often see netCDF files with units unknown to udunits.  I just want to
underline GDT's specification that the only legal units are those in the
current standard udunits.dat file.  Any other 'units' should be specified in
some other manner e.g. by some non-standard attribute.

I suggest recommending plural rather than singular units (if the unit is an
English word).  Thus 'days' rather than 'day'.  But do not attempt to
pluralise something which is not a word like 'degC'!

John Sheldon in ncdigest 405 suggested allowing 
units = "none".  
I prefer
units = " ",
which already works with the current version of udunits.


STEVE EMMERSON'S POSTING TO NCDIGEST 408

I found Steve's distinction between 'manifold' and 'base' useful.  I agree
that the netCDF coordinate variable convention has caused confusion by using
the same name for both.  The convention works fine for the traditional
(John Caron's "classic # 1") case, but does not generalise naturally.

I look forward to the next installment when Steve finally reveals what in
the world the third element 'WORLD' is!! :-)


JOHN CARON'S MOTIVATING EXAMPLES

I like the idea of this list. 

John's examples 2 and 9 are both examples of non-gridded (irregular or
scattered) data.  The same single index is used in separate vectors to get
coordinates and data values.  Example 3 simply generalises this to
multiple subscripts.  A satellite data example is:

    float radiance(line, pixel);
    float latitude(line, pixel);
    float longitude(line, pixel);

which is essentially the same as:

    float radiance(point);
    float latitude(point);
    float longitude(point);

It seems to me that examples 4 and 10 are just mixtures of these with the
classical 1.  

Example 5 needs more detail.  I assume var has other dimensions.  I seem to
remember Russ Rew suggesting a 2D coordinate variable for this case along the
lines of:

    dimensions:
	latitude = 56;
	longitude = 64;
	level = 10;
	range = 2; 
    variables:
	float level(level, range);  // level(k,0) = bottom, level(k,1) = top
	float var(latitude, longitude, level);

This has some appeal, but it does not seem basic enough to justify
generalising coordinate variables to 2D.

Example 6 has too little information for me to understand.  If you simply want
a non-georeferencing example then why not use Steve Emmerson's (now already
famous) spiral wire example.  But this is essentially the same as 2 and 9.

I found example 8 unnecessarily complex.  I assume corr_var(lat1, lon1, lat2,
lon2) gives the correlation between the point (lat1, lon1) and the point
(lat2, lon2).  A simpler case is the following involving annual precipitation
measured at 100 stations:

    dimensions:
	year = UNLIMITED;
	station = 100;
    variables:
	float precipitation(year, station);
	float precipitation.correlation(station, station);

where precipitation.correlation(i,j) is the correlation between precipitation
at station i and precipitation at station j.

This could also be used in place of Example 7.  The precipitation for each
year is a vector of 100 elements.  The calculation of a correlation matrix
requires that these 100 all be in the same array, rather than 100 separate
variables.

An example I would like to add is the following, which is a variation of one
someone (I forget who) posted recently.  Note that this is gridded data,
unlike the above examples.  Let's call it the 'Sparse Gridded' example, which
in this case obviates the need to store missing values for ocean points in the
main array (at the lesser cost of storing missing values in a pointer array):

    dimensions:
	time = UNLIMITED;
	latitude = 56;
	longitude = 64;
	land_point = 1100; // about 30% of 56*64
    variables:
        float latitude(latitude);
        float longitude(longitude);
	short land_index(latitude, longitude);  // value of 'land_point' index
	    land_index:valid_min = 0;
	    land_index:missing_value = -1;      // -1 = ocean point
	float soil.temperature(time, land_point);
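
To show how a reader of such a file might rebuild the full grid, here is a
small Python sketch using plain lists.  The function name expand_to_grid and
the tiny 2 x 3 grid are made up for illustration; a real file would use the
56 x 64 grid above.

```python
OCEAN = -1  # missing_value of land_index, marking an ocean point

def expand_to_grid(land_index, land_values, fill=None):
    """Return a latitude x longitude grid built from per-land-point values.
    Ocean points receive `fill`."""
    return [
        [land_values[i] if i != OCEAN else fill for i in row]
        for row in land_index
    ]

# land_index(latitude, longitude): position of each cell along 'land_point'.
land_index = [[-1, 0, 1],
              [2, -1, -1]]
# soil.temperature for one time step: three land points.
soil_temperature = [281.5, 283.0, 279.25]

grid = expand_to_grid(land_index, soil_temperature)
```

Only the three land values are stored for each time step; the ocean cells are
recovered from the pointer array.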


COMMENTS ON SPECIFIED SECTIONS OF GDT

SECTION 5: Global Attributes

I have found the history attribute especially useful for variables calculated
from some other variable (in the same or a different file).  This provides the
information which GDT suggest putting into an attribute called 'contraction'
(See Section 23 and Appendix B).  This raises the possibility of allowing a
history attribute for each variable as well as a global history attribute. The
wording (mine I must confess!) in Section 8.1 of NUGC needs changing to make it
clear that a line should be appended whenever a variable is created or
modified.

I'm afraid I don't like any of the proposed other new attributes.  I prefer
the name 'source' (as in CSM Conventions) in place of 'institution' and
'production'.  The 'conventions' attribute should include any version numbers,
etc. rather than having additional attributes such as 'appendices'.  

I am not convinced of the need for a 'calendar' global attribute.  Calendars
are discussed further below.

SECTION 8: Axes and dimensionality of a data variable

I like the distinction between 'axes' and 'dimensions'.  But it may be too
late to change our terminology.  In fact, the very first sentence in this
section uses 'dimensions' when 'axes' seems to be implied!

I would state that each axis is associated with a dimension and it is possible
for several axes to share the same dimension.  I prefer the normal term 'rank'
to the rather antiquated and clumsy term 'dimensionality'.

The only apparent reason to limit the rank or specify the order of axes would
be for compatibility with specific packages.  One can always use programs such
as my ncrob to reduce the rank or transpose axes to any desired order.

Dimension sizes of 0 (as well as 1) should be allowed.  The most common 0 case
is for the unlimited dimension, but others are occasionally useful (e.g.  my
proposal above defining a coordinate variable as an AP).  This is really just
one possible cause of no valid data - a situation which software should be
able to handle.

SECTION 12: Physical quantity of a variable

I am unhappy with this proposed 'quantity' attribute.  E.g.

    float z;
        z:quantity = "height";

Why not simply standardise the name of the variable?  E.g.

    float height;

In cases where there is a need to distinguish between similar variables, there
could be a convention specifying part of the name.  E.g. temperatures could be
indicated by the suffix '.temperature' as in:

    float surface.air.temperature;
    float soil.temperature;

using periods as suggested in Russ Rew's posting to ncdigest 405.  And if
there were two different latitude variables, these could be named say
latitude.1 and latitude.2.

But I do agree that there is a need for something more than just the 'units'
attribute to give information about the nature of a variable.

In particular a statistical procedure may want to calculate measures of
location, dispersion, etc. appropriate to the level of measurement (nominal,
ordinal, interval or ratio).  For example:

If level = ratio then calculate geometric-mean and coefficient-of-variation.
If level = interval then calculate arithmetic-mean and standard-deviation.
If level = ordinal then calculate median and semi-inter-quartile-range.
If level = nominal then calculate mode.

So I propose an attribute 'measurement_level' with the value "ratio",
"interval", "ordinal" or "nominal".  The default should be "interval", since
- this includes "ratio" and thus covers most physical measurements
- the "interval" property is adequate for most purposes (as seen by the
  ubiquity of the arithmetic-mean and standard-deviation).
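
A generic application could dispatch on such an attribute roughly as follows.
This is only a sketch using Python's standard statistics module (version 3.8
or later); the function name summarise is hypothetical, and sample (n-1)
statistics are an arbitrary choice.

```python
import statistics

def summarise(values, measurement_level="interval"):
    """Return (location, dispersion) appropriate to the measurement level."""
    if measurement_level == "ratio":
        location = statistics.geometric_mean(values)
        # Coefficient of variation: standard deviation relative to the mean.
        dispersion = statistics.stdev(values) / statistics.mean(values)
    elif measurement_level == "interval":
        location = statistics.mean(values)
        dispersion = statistics.stdev(values)
    elif measurement_level == "ordinal":
        location = statistics.median(values)
        q1, _, q3 = statistics.quantiles(values, n=4)
        dispersion = (q3 - q1) / 2      # semi-inter-quartile range
    elif measurement_level == "nominal":
        location = statistics.mode(values)
        dispersion = None               # no dispersion measure for names
    else:
        raise ValueError(f"unknown measurement level: {measurement_level}")
    return location, dispersion

interval_summary = summarise([2.0, 4.0, 6.0])            # default level
nominal_summary = summarise(["a", "b", "a"], "nominal")
```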

I have never been happy with having both the FORTRAN_format and the C_format
giving essentially the same information.  (Although it is usually possible to
derive one from the other.)  It might be better to replace

z:FORTRAN_format = "F8.2";
z:C_format = "%8.2f";

by some language-independent attributes such as 

z:format_type = "F";
z:format_width = 8;
z:format_decimal_places = 2;

and if the variable is scaled (using scale_factor and add_offset) these should
apply to the external value rather than the internal value (as C_format
does).
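
Both language-specific strings could then be generated on demand.  A minimal
Python sketch, assuming the three attribute names suggested above and handling
only the real-number descriptors F, E and G (the helper names are invented):

```python
REAL_DESCRIPTORS = {"F", "E", "G"}

def _check(format_type):
    """Reject descriptors this sketch does not know how to translate."""
    if format_type.upper() not in REAL_DESCRIPTORS:
        raise ValueError(f"unsupported format_type: {format_type}")

def c_format(format_type, width, decimal_places):
    """Build a C printf conversion such as '%8.2f'."""
    _check(format_type)
    return f"%{width}.{decimal_places}{format_type.lower()}"

def fortran_format(format_type, width, decimal_places):
    """Build a FORTRAN edit descriptor such as 'F8.2'."""
    _check(format_type)
    return f"{format_type.upper()}{width}.{decimal_places}"

c_string = c_format("F", 8, 2)
fortran_string = fortran_format("F", 8, 2)
```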

SECTION 16. Vertical (height or depth) dimension

Is there any reason why one could not simply adopt the single standard
variable name 'height' and handle depths as negative heights?  E.g.

float height;
    height:long_name = "depth";
    height:units = "-1 * metres";  // udunits can handle this

I suspect this also obviates the need for the "positive" attribute. 

SECTION 20. Bundles

I agree that there is a need for string-valued coordinate variables.  These
are an example of nominal measurement level.

The issue of whether a variable is continuous or discrete is related to
measurement level to some extent (A nominal variable cannot be continuous).
But even many ratio variables (e.g.  counts) are discrete.

SECTION 22. Point values versus average values

I agree that this distinction is important. The rainfall example suggests a
third alternative - a value integrated (accumulated) over intervals along one
or more axes.  I don't like the name 'subcell' - it does not have the desired
connotations to me. Maybe something like
    <var>:point_value = 0;   // 0 for false,  1 for true
would be clearer.

SECTION 23. Contracted dimensions

Processes (e.g. sum) which reduce the rank are often called 'reductions'. The
proposal here is to contract dimensions (axes?) to size 1 rather than
eliminating them, so I suppose 'contraction' may be a reasonable term.  This
does document how such a variable was calculated (especially if boundaries are
given via a 2D coordinate variable).  But surely the 'history' attribute
should provide this information.  However as I mentioned before, I can see
that a file with many variables could have a very long global history
attribute and it might be better to also allow each variable to have its own
'history' attribute. (I prefer to limit the number of main variables in a file
to a very small number, usually one.)

The 'history' attribute should provide a complete audit-trail of every task
(e.g. input, copy, reduction, other arithmetic) which created and modified the
variable.  But there are real problems when the whole process involves a
series of tasks creating temporary values in memory, etc.  I suspect the best
solution is to write a full log to the global attribute 'history'.

I can see benefit in standardising the names of reductions (contractions) to
say:

sum
mean or am: arithmetic mean
gm: geometric mean
median
mode
prod or product
count: number of non-missing values
var: variance
sd: standard deviation
min: minimum
max: maximum

But I would suggest using these as part of a standard variable naming
convention such as that shown by the following examples:

min.soil.temperature
weighted.mean.sea.surface.temperature

The obvious problem here is that names may become inconveniently long.
Perhaps we could use standard abbreviations for the variable name itself, but
use full words for the 'long_name'.  E.g.

float sst(latitude, longitude);
    sst:long_name = "sea surface temperature";
float wam.sst;
    wam.sst:long_name = "weighted arithmetic mean of sea surface temperature";

based on a naming convention for reductions where the prefix 'w' means
'weighted', so 'wsum' means 'weighted sum' and 'wam' means 'weighted
arithmetic mean'.

SECTION 24. Time Axes

The CSIRO Division of Atmospheric Research (DAR) routinely uses two time axes,
'year' and 'month'.  Even so, I agree with GDT that there should be only one
time axis.  Obviously there must then be convenient ways of calculating
reductions for particular months, etc.

Again I would prefer to standardise the variable name rather than introduce
yet another attribute i.e. 'quantity'.  I suggest both 'time' and
'<something>.time' should be allowed.

Note that climatologists often use the units 'year' and 'month' in a
deliberately imprecise manner.  The exact definitions of month and year are
irrelevant.  All that matters is that there are 12 months in a year.

Climate data should normally use a time axis with a unit of a day, month or
year (or some multiple of these).

If the unit is a day then there should be a fixed number of days (31 for
'normal' calendars such as Gregorian) in each month.  The time coordinate variable
should have a missing value for each day which does not exist in the calendar
used.  I think this obviates the need for the 'calendar' global attribute and
allows for most kinds of calendars without having to hard-code them into a
standard.

I agree that date/time should be represented by a single number.  I suggest
the form YYYYMMDD.d where d is a fraction of a day.  So 19970310.5 represents
noon on March 10, 1997.  Similarly year/month is represented by YYYYMM.  But
such values are not acceptable to udunits and therefore cannot be used for
time coordinate variables.
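
Decoding such a value is straightforward.  A small Python sketch, with a
hypothetical function name, assuming positive dates in a calendar that Python's
datetime module understands:

```python
from datetime import datetime, timedelta

def decode_date(value):
    """Convert a YYYYMMDD.d number into a datetime."""
    day_number = int(value)          # YYYYMMDD part
    fraction = value - day_number    # fraction of a day
    year = day_number // 10000
    month = (day_number // 100) % 100
    day = day_number % 100
    return datetime(year, month, day) + timedelta(days=fraction)

noon = decode_date(19970310.5)
```

Note that a 4-byte float cannot hold values of this size exactly (it has only
24 bits of precision), so a double would be needed for the YYYYMMDD.d form.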

int time(time);
    time:units = "days since 1995-01-01 00:00";
    time:valid_min = 0;
    time:missing_value = -1;      // for day which does not exist in calendar
int YYYYMMDD(time);
    YYYYMMDD:long_name = "date in form YYYYMMDD";
    YYYYMMDD:valid_min = 00010101;
    YYYYMMDD:missing_value = 0;  // for day which does not exist in calendar
data:
    time     =        0,        1, ..,       27,       28,       29,       30,
                     31,       32, ..,       58,       -1,       -1,       -1,
                     59,       60, ..,
		 
    YYYYMMDD = 19950101, 19950102, .., 19950128, 19950129, 19950130, 19950131,
               19950201, 19950202, .., 19950228,        0,        0,        0,
               19950301, 19950302,  ..

Thus a package might provide a (binary?) search function 'index' which could
be used as follows to calculate the arithmetic mean of sst values for JJA
(June, July, August) 1995:

mean = am(sst(index(YYYYMMDD, 19950601) : index(YYYYMMDD, 19950831)))
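
The same idea in runnable form, as a Python sketch.  The helper padded_dates
and the function am_between are invented here, and 'index' uses a linear
search because the zeros that mark non-existent days mean the YYYYMMDD vector
is not strictly sorted, so a plain binary search on it would not be safe.

```python
import calendar

MISSING_DATE = 0   # missing_value of YYYYMMDD

def padded_dates(year):
    """Return 12 * 31 YYYYMMDD values, with MISSING_DATE for days that do
    not exist (e.g. 30 Feb)."""
    dates = []
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, 32):
            if day <= days_in_month:
                dates.append(year * 10000 + month * 100 + day)
            else:
                dates.append(MISSING_DATE)
    return dates

def index(dates, target):
    """Position of `target` along the time dimension (linear search)."""
    return dates.index(target)

def am_between(values, dates, first, last):
    """Arithmetic mean of values from `first` to `last` inclusive,
    skipping slots whose date is missing."""
    i, j = index(dates, first), index(dates, last)
    valid = [v for v, d in zip(values[i:j + 1], dates[i:j + 1])
             if d != MISSING_DATE]
    return sum(valid) / len(valid)

dates = padded_dates(1995)
sst = [15.0] * len(dates)    # made-up data: one value per slot
jja_mean = am_between(sst, dates, 19950601, 19950831)
```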

SECTION 25. Gregorian calendar

I don't like the mixed Gregorian/Julian calendar (with a fixed conversion date
of 1582-10-15) apparently used by udunits.  I would prefer it to assume
Gregorian unless explicitly specified otherwise such as follows:
    units = "days since 1995-01-01 00:00 Julian"

I suspect that the most likely use of the Julian calendar would be for places
such as Russia which I believe used it up until the Revolution early this
century!

But what about other calendars?  There are calendars in China and India which
are still very widely used.  A Chinese oceanographer colleague informs me that
the Chinese calendar is still used in oceanography, in particular for tide
work.  I am not suggesting this is a high-priority item, but udunits should
allow for the incorporation of such calendars in the future.

SECTION 27. Unimonth calendar

I prefer the above "fixed length month with missing days" proposal.

SECTION 31. Invalid values in a data variable

There is a need to clarify terminology.  I use 'missing' and 'invalid'
interchangeably. But I do appreciate that it might be better English to
consider a missing value as a special kind of valid value.  But the NUGC 8.1
conventions state that
- all values outside the valid range should be considered missing values 
- any specified missing_value attribute should be outside the valid range.  
I assume any value inside the valid range is valid and any value outside it is
invalid.  (I must confess that I may be partly to blame for this confusion in
terminology in 8.1.)

So I suspect we need another term for 'bad' data due to some error. Terms
which come to mind include 'error values' and 'bad data'.  Australians
might call it 'crook data'! :-(

The process of validating data should test for such bad data.  But it is
unreasonable to expect generic applications to do more than test whether
values are within the valid range.  And it is preferable to specify only one
of valid_min or valid_max so that the application has to do only one test
rather than two.

As mentioned above, I strongly believe that coordinate variables should be
allowed to have missing values.  Interpolation should act as if the missing
slabs were deleted.
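
For example, linear interpolation along such a coordinate could first discard
the missing slabs.  A minimal Python sketch, assuming the valid coordinate
values are increasing and that anything below valid_min is missing (the
function name is hypothetical):

```python
def interpolate(coords, data, x, valid_min=0):
    """Linearly interpolate data at coordinate x, ignoring every slab
    whose coordinate value is missing (below valid_min)."""
    points = [(c, d) for c, d in zip(coords, data) if c >= valid_min]
    for (c0, d0), (c1, d1) in zip(points, points[1:]):
        if c0 <= x <= c1:
            return d0 + (d1 - d0) * (x - c0) / (c1 - c0)
    raise ValueError("x lies outside the valid coordinate range")

coords = [0, 1, -1, 3]               # -1 marks a missing coordinate
data = [10.0, 20.0, -999.0, 40.0]    # data for the missing slab is ignored
value = interpolate(coords, data, 2)
```

Here the value at 2 is interpolated between the slabs at 1 and 3, exactly as
if the missing slab had never been stored.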

Note that all the following should be of the same data type as the parent
variable:  _FillValue, missing_value, valid_range, valid_min, valid_max.

The final paragraph is similar to the rule in NUGC 8.1 for defining a default
valid range in the absence of any of valid_range, valid_min or valid_max.  It
ensures that _FillValue is invalid.

SECTION 32. Missing values in a data variable

Note that NUGC 8.1 states that missing_value should be ignored by software.
Its purpose is merely to inform humans that this special (invalid) value is
used to represent missing data.  The fact that the value is invalid implies
that it will be treated as missing when read.  So I disagree with the last two
sentences of this section, which refer to software using the missing_value.

Applications will typically write a value of _FillValue to represent undefined
or missing data.  But there is nothing to stop them writing any other invalid
values, including any of the values in the missing_value attribute (which can
be a vector).