Re: "Contractions"

Jonathan Gregory (jmgregory@meto.gov.uk)
Fri, 02 Jan 1998 18:32:36 +0000 (GMT)

Dear John

Happy New Year!

Thanks for your lengthy contribution to our discussion about representing
contracted axes for time coordinates. You raise some useful points about what
information we need to record.

The main difference between your approach and ours is that you have chosen to
attach information about the contracted axis to another axis, whereas we
proposed to store it as a singleton axis of its own. For instance, in the first
example of a timeseries of 12 monthly average temperatures derived from daily
means,

* you have Tavg(avgtime,lat,lon), where avgtime is a 12-element time axis, and
has attributes describing how each month was derived from days, whereas

* we have Tavg(day,month,lat,lon), where month has a dimension of 12 and day a
dimension of 1, the latter recording the information about the contraction of
daily means into a monthly mean.

One reason for adopting your approach is that you say that existing software
would not be able to handle our approach, with its two (or more) time
axes. That would undeniably be an obstacle to adopting our scheme. What is the
problem, in general terms?  Apart from this problem, you argue that our
approach is more difficult to understand. I would like to argue that it has
important advantages of flexibility and consistency.

* It is flexible because it can easily be extended. In your system, you can
describe a timeseries of 12 monthly means, each derived from daily means; and
you can describe a timeseries of climatological daily means for particular days
of the year, each derived from a mean of corresponding days from several
years. But I do not see how you can describe a combination of these two
e.g. the climatological maximum daily mean for each month of the year, i.e. (1)
for each day, calculate the mean; (2) within each month, find the maximum daily
mean; (3) compute the mean of these maxima for corresponding months over
several years. This is represented in our scheme by three time axes
(day,month,year), where both day and year are contracted singleton axes. I
think your inclusion of the attribute contraction_itemiscontraction indicates
that you recognise a need for contractions within contractions; but I would
argue that information of the same detail and kind needs to be recorded for
each one, and it is simplest to use the same structures to do it.

* It is consistent because it is the same as the approach we propose for
contracted spatial axes. In our scheme, from a two-dimensional variable
(lat,lon) with lat=72, lon=96, we derive a zonal-mean field (lat,con_lon), with
con_lon=1 as a contracted axis recording information about the range and
spacing of longitudes over which the average is formed.  This is exactly the
same as our treatment of a time contraction.  In your example 5, you have
treated a contraction of a pressure axis in just this way, leaving a contracted
singleton axis. But in your treatment of time contractions, you don't do this:
you record the contraction using attributes on the remaining uncontracted time
axis, not using a contracted singleton axis.
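
To make the three-step example in the first point concrete, here is a minimal
sketch in Python of the nested contraction (time-within-day, then
day-within-month, then year). The function name, the tuple layout of the
samples and the sample values are all invented for illustration; they are not
part of either proposal.

```python
from collections import defaultdict
from statistics import mean


def climatological_monthly_max(samples):
    """samples: iterable of (year, month, day, value) sub-daily readings."""
    # (1) For each day, calculate the mean of its readings.
    by_day = defaultdict(list)
    for year, month, day, value in samples:
        by_day[(year, month, day)].append(value)
    daily_mean = {key: mean(vals) for key, vals in by_day.items()}

    # (2) Within each month of each year, find the maximum daily mean.
    monthly_max = {}
    for (year, month, _day), m in daily_mean.items():
        key = (year, month)
        monthly_max[key] = max(m, monthly_max.get(key, m))

    # (3) Average the maxima for corresponding months over the years.
    by_month = defaultdict(list)
    for (_year, month), mx in monthly_max.items():
        by_month[month].append(mx)
    return {month: mean(vals) for month, vals in sorted(by_month.items())}


# Hypothetical readings: two per day, two days in January, two years.
samples = [
    (1996, 1, 1, 2.0), (1996, 1, 1, 4.0),  # daily mean 3.0
    (1996, 1, 2, 5.0), (1996, 1, 2, 7.0),  # daily mean 6.0
    (1997, 1, 1, 1.0), (1997, 1, 1, 3.0),  # daily mean 2.0
    (1997, 1, 2, 0.0), (1997, 1, 2, 2.0),  # daily mean 1.0
]
result = climatological_monthly_max(samples)  # one value per month
```

Each of the three steps removes one level of the time decomposition, which is
why I would like each to be recorded in the same way.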

From several of your mails, it is clear that you regard our scheme of
multidimensional time axes as difficult to understand and to process. In fact, I
didn't think of it as "multidimensional" in the first place; I regard it as
more of a "decomposition". Anyway, perhaps I could propose an alternative which
might be a bit easier to come to grips with. I suspect that one of the main
reasons why it seems complicated is that we have to add up the coordinates in
the various time axes. The reason why we do this is that it's more general.  It
allows us to describe, for instance, a mean of five 17-day periods spaced at
63-day intervals. But maybe this is unnecessary. The reason why time needs
special handling is that it has two natural cycles (seasonal and diurnal)
and we frequently want to contract over these. In addition, the double-
contraction example above shows that the within-month "cycle", while not
natural (I think the lunar cycle is only a cultural link), is fairly common in
climatology. But my 17- and 63-day example here does not refer to any of these
cycles, and so is no more likely than wanting to make periodic means of
arbitrary length in some other coordinate, such as longitude. We have not made
special provision for such means. Maybe we should in fact make our current
handling of multidimensional time into something of more general application
for arbitrary linear axes.

In that case, as regards time specifically, we can restrict our attention to
those natural cycles. It is then convenient for us just to "decompose" it into:
year, day-within-year, and time-within-day. Day-within-year may further be
decomposed into month and day-within-month, if necessary. These quantities can
be put back together straightforwardly. Would you be any happier with dimensions
(year,mmdd,hhmmss) than a "three-dimensional time axis"? In terms of what you can
do with them, there is no real difference; the difference lies in the
representation of the coordinates and hence how you combine them. With this
scheme,

(a) 12 monthly means derived from daily means would have a singleton year axis
(not a contraction, presuming they came from a single year), a 12-element mmdd
axis, and a singleton hhmmss axis, the result of contracting an axis with
separate elements for each day. I feel that once the axis is contracted it is
not necessary to say how many days were in each month.

(b) the climatological daily means for particular days of the year, each
derived from a mean of corresponding days from several years, would be
described by a contracted year axis and an uncontracted mmdd axis.

(c) the climatological maximum daily mean for each month of the year would have
a contracted year axis, an uncontracted month axis, a contracted
day-within-month axis (the "maximum" contraction), and perhaps a contracted
hhmmss axis showing how each daily value was obtained.
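
As a rough sketch of what "decomposing" and "putting back together" could mean
in practice, the following Python assumes that mmdd and hhmmss are stored as
plain integers (month*100+day and hour*10000+minute*100+second). That encoding
is only an assumption made for this illustration, and the function names are
invented.

```python
from datetime import datetime


def decompose(t):
    """Split a timestamp into (year, mmdd, hhmmss) integer coordinates."""
    mmdd = t.month * 100 + t.day
    hhmmss = t.hour * 10000 + t.minute * 100 + t.second
    return t.year, mmdd, hhmmss


def recompose(year, mmdd, hhmmss):
    """Put the three coordinates back together into one timestamp."""
    month, day = divmod(mmdd, 100)
    hour, rest = divmod(hhmmss, 10000)
    minute, second = divmod(rest, 100)
    return datetime(year, month, day, hour, minute, second)


t = datetime(1998, 1, 2, 18, 32, 36)
parts = decompose(t)          # (1998, 102, 183236)
roundtrip = recompose(*parts)  # equal to t
```

Combining the coordinates is then a matter of unpacking fields rather than
adding up offsets from several time axes.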

When we made our GDT proposal, we were thinking principally about data
exchange. Our criterion for what metadata to include was therefore to suggest
the minimum necessary to distinguish quantities which one might want to give to
another climate centre e.g. for CMIP. For this reason, we chose not to keep
much information about what the coordinates were before the contraction. Our
proposal only records the range and the minimum and maximum spacing of these
coordinates. We thought that, for instance, it would be sufficient to describe
a quantity as a vertical average of relative humidity between 100 mbar and 850
mbar made from levels having separations of between 50 mbar and 100 mbar
(say). We considered it unlikely that in a single dataset one would have two
different sets of levels which upon contraction would have just the same
description in these terms; hence we decided there was no need to record any
more information for the sake of distinguishing variables.
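
For illustration, the summary we have in mind (range plus minimum and maximum
spacing) could be computed from the original coordinates as in the Python
sketch below. The function name, the dictionary keys and the pressure levels
are hypothetical; they are not attribute names from the GDT proposal.

```python
def contraction_summary(coords):
    """Summarise pre-contraction coordinates by range and spacing only."""
    ordered = sorted(coords)
    if len(ordered) < 2:
        raise ValueError("need at least two coordinates to define a spacing")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "range": (ordered[0], ordered[-1]),
        "min_spacing": min(gaps),
        "max_spacing": max(gaps),
    }


# Hypothetical pressure levels in mbar, 100 mbar apart low down and
# 50 mbar apart higher up.
levels = [850, 750, 650, 550, 450, 350, 250, 200, 150, 100]
summary = contraction_summary(levels)
```

The individual levels cannot be recovered from the summary, which is the
information we chose to give up.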

If you broaden the uses of the convention, and want to adopt it as a general-
purpose data format, I agree that you might sometimes want to record more
information. I think your categories of continuity and placement are sensible,
and I would be in favour of including optional attributes of these names, with
the possible values you suggest of "contiguous" vs "disjoint" and "uniform" vs
"arbitrary".

However, if you actually want to record the original uncontracted coordinates,
as you do in your examples 4 and 5, I still prefer the suggestion I made last
time of using a separate axis from the contracted axis, and pointing to it with
an attribute (e.g. "expand") of the contracted axis. In my view, the
contraction is unaffected, and should still be represented as a singleton
axis. What is different is that additional information is being supplied about
the axis before it was contracted.  I prefer this approach because it avoids
defining new kinds of attributes for this purpose, when one can reuse all the
definitions we already have for supplying axis coordinates, together with their
bounds and perhaps components and so on. The axis named by the expand attribute
of the contracted axis would be identical in all attributes to the axis before
it was contracted.

I would argue also against encoding the uncontracted coordinates, in the
"uniform" case, in terms of a starting value and step. This is indeed very
tempting. Why don't we do it, then, for ordinary axes? Many of our axes have
evenly spaced coordinates, and we record them explicitly in our coordinate
variables, when we could instead do it with a starting value and a step. I
think the reason we don't do it is because it complicates the code. We cannot
avoid having to handle the case of arbitrary coordinates.  If we decide to
support start-and-step as well, we either write two separate blocks of code, or
we expand start-and-step into a vector of coordinates before processing it
further. In either case, extra programming is needed. This is an overhead, and
it's not worth doing because the amount of space saved by encoding a coordinate
variable in this way is trivial. In most cases, the space required by
coordinate variables is dwarfed by that of data variables. I feel that it is
simpler and better to store all coordinate variables explicitly, and I think
the same arguments apply to the record of the coordinates before contraction,
if you want to store them. This approach also avoids defining more kinds of
attributes.
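
The programming overhead I mean can be seen in a small Python sketch. Code that
already handles arbitrary coordinates (here the invented cell_widths) still
needs an extra expansion step (the invented expand_start_step) before it can
accept the compact form. The 3.75-degree spacing assumes 96 evenly spaced
longitudes.

```python
def expand_start_step(start, step, count):
    """Expand a start-and-step description into an explicit coordinate list."""
    return [start + i * step for i in range(count)]


def cell_widths(coords):
    """General-purpose code: works on any explicit coordinate vector."""
    return [b - a for a, b in zip(coords, coords[1:])]


# Explicit storage: usable directly.
explicit = [0.0, 3.75, 7.5, 11.25]
widths_from_explicit = cell_widths(explicit)

# Start-and-step storage: must be expanded first, then treated the same way.
compact = expand_start_step(0.0, 3.75, 4)
widths_from_compact = cell_widths(compact)
```

Both routes end with the same vector, so the compact form saves a few numbers
on disk at the cost of one more code path.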

In cases 4 and 6, you have introduced another idea GDT did not consider.  Here
you wish to record an operation of averaging the data points in groups of varying
size according to the value of some other data variable (cloud cover). I think
it is unlikely that you would need to supply this kind of information to
distinguish between quantities, so I would suggest that we don't need to deal
with it for the moment in a convention. I tend to feel that this is rather a
specialised operation which does not comfortably fit into the general framework
of contractions, as we proposed them, since it does not result in a singleton
axis. Your representation of this operation also has the slight awkwardness of
having to guess a maximum size for these groups, and needing to supply null
values to fill up a two-dimensional array. To avoid this, I'd like to suggest
an alternative way of doing it for your consideration:

  dimensions:
    time=10;
    avgtime=3;
  variables:
    double time(time);
      time:units="hours since 1-1-1990";
    long timeindex(time);
    double avgtime(avgtime);
      avgtime:units="hours since 1-1-1990";
    float Tavg(avgtime);
      Tavg:quantity="temperature";
      Tavg:units="deg_K";
      Tavg:contraction="mean";
      Tavg:group="timeindex";
      Tavg:expand="time";
  data:
    time=0.5,2.0, 3.5,4.0,4.25,4.5, 5.5,7.0,8.0;
    timeindex=0,0, 1,1,1,1, 2,2,2,2;

This scheme proposes that the presence of a "group" attribute means that the
"contraction" has not reduced an axis to size one, but has first divided it up
into groups and then contracted each group. The variable named by the group
attribute shows the allocation of the original axis into groups. As before, the
variable named by the expand attribute gives the uncontracted coordinates. I
think of this as reminiscent of SQL, contrasting plain "select avg(time)"
with "select avg(time) group by timeindex", where time and timeindex are two
columns of a table.
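
A minimal Python sketch of the "group by" reading of this scheme follows. The
function name and all the numbers are invented for illustration and are not
taken from the listing above; the point is only that one index per original
point is enough, with no padded two-dimensional array.

```python
from collections import defaultdict
from statistics import mean


def grouped_mean(values, group_index):
    """Contract each group of points to its mean, in group-index order."""
    if len(values) != len(group_index):
        raise ValueError("need exactly one group index per value")
    groups = defaultdict(list)
    for value, group in zip(values, group_index):
        groups[group].append(value)
    return [mean(groups[g]) for g in sorted(groups)]


# Hypothetical uncontracted axis (the "expand" variable) and its grouping
# (the "group" variable); the groups have sizes 2, 3 and 1.
time = [1.0, 3.0, 4.0, 5.0, 6.0, 9.0]
timeindex = [0, 0, 1, 1, 1, 2]
temperature = [270.0, 272.0, 280.0, 281.0, 282.0, 275.0]

avgtime = grouped_mean(time, timeindex)      # one coordinate per group
tavg = grouped_mean(temperature, timeindex)  # one mean per group
```

The same grouping variable serves both the coordinate and the data, which is
the analogue of grouping two columns of a table by a third.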

Best wishes,

Jonathan