Re: Replies to comments on GDT

John Sheldon (jps@gfdl.gov)
Fri, 1 Aug 1997 11:53:42 -0400 (EDT)

Hi again Jonathan-

Sorry for the long delay in getting back to you.  However, with the
exception of a couple of topics needing some clarification, I thought
that we were in pretty good agreement on just about all of the issues
we covered in our (rather lengthy) comments/replies. Unfortunately,
I seem to be continuing in the tradition of lengthy replies...sorry!

First, let me dispense with a few of the more minor issues and then get
into some more "thorny" topics...

> > Take a vertical average (mean) over pressure from some pressure level (say,
> > 50mb) down to the earth's surface.  Now, since the surface pressure varies
> > two-dimensionally, it seems that a dimension, being 1-D, will not be adequate
> > to store the information about the integration bounds. [JS]
> 
> I agree, this is a problem. You really want to be able to store a contracted
> pressure coordinate with "bounds_pressure=surface,50 mb;". One way - rather
> contrived - I can see to do this is suggested by the use of a hybrid vertical
> pressure-sigma coordinate, where the pressure at level i is given as
> "pressure=pressure_level(i)+sigma_level(i)*surface_pressure". Your upper level
> is a pure pressure level, with pressure_level=50 mb, sigma_level=0. Your lower
> level is a pure sigma level with pressure_level=0 mb, sigma_level=1.  This
> could be recorded in our conventions using a component attribute, thus:
> 
>   dimensions:
>     con_eta=1;
> 
>   variables:
>     float ke(con_eta,lat,lon);
>     float con_eta(con_eta);
>       con_eta:component="pressure_level,sigma_level";
>       con_eta:quantity="hybrid sigma-pressure";
>       con_eta:contraction="mean";
>     float pressure_level(con_eta);
>       pressure_level:quantity="pressure";
>       pressure_level:units="Pa";
>       pressure_level:bounds="bounds_pressure_level";
>     float bounds_pressure_level(2,con_eta);
>     float sigma_level(con_eta);
>       sigma_level:quantity="sigma";
>       sigma_level:bounds="bounds_sigma_level";
>     float bounds_sigma_level(2,con_eta);
> 
>   data:
>     bounds_pressure_level=0,50;
>     bounds_eta_level=1,0;
> 
> This approach does require the application to know how to work out the
> pressure, if it needs to, from the pressure_level and the sigma_level.

Wow! That was very clever!  But also very complicated!  And specific to
the case where the limits are sigma=0 to 1.  I, myself, can't think of
any other way to do it, so I suspect I would likely take the coward's
way out and just add information to the "long_name" or a "comment"
describing the average.  I'll have to think about it some more...


> Section 24: Time axes
> 
> > Suppose I have an idealized model that is not associated with any calendar -
> > it just ticks off hour after hour.  How would I be allowed to specify time
> > is this case? [JS]
> 
> I think you specify it as a pure interval of time, with units="hours". There
> is no problem with this, because if it is really not associated with any
> calendar you will not want to add it to a date, which is when the problems
> arise.

The reason I raised this issue was that you seemed to propose that the
absence of a "reference" time implied a time *interval* (section 28).
Again, this is no big deal, since all "time" is really an interval from
some reference time - in this case the beginning of a model
integration - but it is possible that someone might misinterpret it
somehow.


> Section 26: Non-Gregorian Calendars
> 
> > UDUNITS already has "common_year" defined indicate a calendar with 365 days.
> > [JS]
> 
> We used "noleap" for compatibility with LATS.

I'd prefer to drive the conventions with UDUNITS than with LATS.

> Section 27: Unimonth calendar
> 
> The problem we are trying to address here is that of applications trying to
> convert between different representations of time, where
> ...
> This may help a bit with John Sheldon's issue of comparing calendars, although
> it is still a thorny problem.

I don't think it will help with comparing calendars.  One still cannot
compare, say, observed data for December 31 to a 360-days-per-year
model, because the latter simply has no "December 31".


> Section 32: Missing values in a data variable
> 
> > I think that the data should be checked against the "missing_value" *before*
> unpacking. [JS]
> 
> Yes, you may well be correct. Thanks.

The problem then becomes: what will you put in the array of unpacked
data if you find a missing value in the packed data?  We store a global
attribute to hold this value (say, -1.E30).  In the absence of this
global attribute, we simply stuff in a fill-value, which is OK, but you
lose the distinction between intentionally and unintentionally missing
data.  In any case, we tell the calling routine what float values we
used in both cases.


----------------

Now, the tough stuff...

Mostly, this is an issue of storing averages (at least it is for us).
But one can't really address this without at least some consideration
of coordinate systems and multi-dimensional coordinates.  Without
getting too deeply into the details, let me take a minute summarize my
thoughts at the moment.

 ( John Caron's recent mail ("songs about coordinate systems and
   buildings") came in while I was composing this.  I hate to make this
   mail even *longer* >:-{, so I'll address that mail separately,
   though it is obviously related to what I'm discussing here.   )


----
Re: multidimensional coords

   - This is not our (ie, GFDL's) top priority now
   - Our main concern is that existing applications don't break if
        we want to use them someday and also want abide by conventions
   - We *do* use these in one significant instance: Trajectories
           eg:  Q(i) w/ {X(i),Y(i),Z(i),t(i)}
      but this is a relatively straightforward, non-controversial case

----
Re: Coordinates, in general:

 I think we should have some way to:
  - specify the coordinate system occupied by a variable 
      (Cartesian, cylindrical, spherical, etc.)
  - specify coordinate combinations, a la Gray Granger's proposal.
      Note that this is not intended to be exclusive; ie user can try
      other combinations, if appropriate.
  - preserve the author's power to mandate coordinate system
      (ie, can't be purely the choice of the viewer or application)

  Perhaps some combination of coordinate-system and axis-combination
  specification could be devised, such as in the following
  (abbreviates) example:

    dimensions:
       i=100;

    variables:
       float QC(i);
             QC:coordinates="cartesian_1" // point to a "map", ala Granger
       float QG1(i);
             QG1:coordinates="geodetic_1" // first part of string 
       float QG2(i);                      //  indicates system type
             QG2:coordinates="geodetic_2"
       float QY(i);
             QY:coordinates="cylindrical_1"
       
       :cartesian_1="X=x_dist, Y=y_dist, Z=z_dist" // X,Y,Z are keywords
       :geodetic_1="latitude=lat1, longitude=lon1, altitude=pressure"
       :geodetic_2="latitude=lat1, longitude=lon1, altitude=height"
       :cylindrical_1="rho=rho1, theta=theta1, z=cylz"
       
       float x_dist(i);
       float y_dist(i);
       float z_dist(i);
       
       float lat1(i);
       float lon1(i);
       float pressure(i);
             pressure:positive="down";
       float height(i);
             height:positive="up";
       float rho1(i);
       float theta(i);
       float cylz(i);

   You can see that I like Gary Granger's idea of defining a
   "coordinate map", then having variables reference it.  I find this
   preferable to possibly having to define the same coordinate mapping
   several times for each variable in a subset of variables. (If all
   the variables use the same coordinates, this is obviously not a
   problem, as one could just make "coordinates" a global attribute.)

   The prefix of the "coordinate map" name tells what the basic
   coordinate system is, and the user-defined suffix distinguishes
   between multiple maps using the same coordinate system.

   The benefit of the keywords in the map definitions is clarity, and it
   also allows for a case where all variables in the file employ less
   than all three dimensions (four if we include time).

   There was some discussion a while back about the choice of which
   coordinate to use being up to the user or application.  Note that
   the choice of using vertical coordinate "pressure" or "height" above
   (for example) is *not* arbitrary - one could be flat-out *wrong*!
   The values in "pressure" are not necessarily coincident with those
   in "height"; ie, they are not interchangable representaions of
   "altitude".  Only the originator of the data knows for sure.

   Now, all this gets a lot more complicated when mapping multiple
   dimensions into multiple coordinates - I need to put some more
   thought into how that would work, but I think it's feasible.


----------------------
Re: Averaged quantities:  *******!!
    ------------------

   - This is a current (urgent!) concern for us here at GFDL
   - I do not much like the idea of a time-axis being 2-D
         (as is needed to use the "wrt" attribute)
   - Because the "contraction" attribute is proposed to refer to
        a dimension of length 1, this will not help in the case
	of a series of time-means (eg, time-series of monthly
	average temperature);
   - I can, with our current local conventions, document a certain 
        class of averages now, without "wrt" (example below)
   - But, I have trouble with the type of average that "extracts" 
        subsets of points along the time-line (example below)

       **NOTE: this is a general problem along *any* axis!
                  (Examples below...)


 First, the easy case (by way of example):
 -----
   ** A time series (3) of January average temperature, where each **
   ** monthly mean is derived from daily means.                    **

 Our local approach uses up to 6 additional quantities to store the
 required information.  The first 3 of these are used in all cases.  
 The last 3 can be used to describe the items comprising the average.

  dimensions:
     time = 3;
     day  = 31;
  
  variables:
     float Tavg(time);
           Tavg:long_name="Average monthly temperature" ;
           Tavg:units="deg_K"
           Tavg:average_info="T1, T2, nitems, time_of_item, \
	                            item_is_avg, dt_item";
     double time(time);
            time:units="days since 1-1-1990";
	    time:calendar="common_year";

     double T1(time);
            T1:long_name="starting time of average";
            T1:units="days since 1-1-1990";
	    T1:calendar="common_year";
     
     double T2(time);
            T2:long_name="ending time of average";
            T2:units="days since 1-1-1990";
	    T2:calendar="common_year";

     long nitems(time);
          nitems:long_name="Number of items in average";
      
     float time_of_item(day,time);
           time_of_item:long_name="time of individual items comprising average"
           time_of_item:units="days since 1-1-1990";

     short item_is_avg(day,time);
           item_is_avg:long_name="flag indicating whether item in average is itself an average";

     double dt_item(day,time);
            dt_item:long_name="length of time over which the items comprising average are representative";
            dt_item:units="days";

   data:
      time = 15.5, 380.5, 745.5 ;
      
      T1 =    0.,  365.,  730. ;
      T2 =   31.,  396.,  761. ;

      nitems = 31, 31, 31;

      time_of_item =   0.5,   1.5,   2.5, ...  30.5,
                     365.5, 366.5, 367.5, ... 395.5,
		     730.5, 731.5, 732.5, ... 760.5 ;

      item_is_avg = 1, 1, 1, ... 1, 
                    1, 1, 1, ... 1, 
		    1, 1, 1, ... 1 ;

      dt_item = 1., 1., 1., ... 1., 
                1., 1., 1., ... 1., 
	        1., 1., 1., ... 1. ;


  This works fine, because each mean is taken over a continuous span of
  time (ie, all of a January). "T1" and "T2" bracket the period.  The
  "time" value is only somewhat arbitrary.  (It seems logical to me
  that it be the midpoint of the averaging period, but I've heard
  others argue for assigning it a time equal to the starting or ending
  time of each period.)  It is flexible enough to handle disparate
  items included in the average.  And, "time" stays 1-D.
  
  * How would you handle this case using "contraction" and "wrt"?


-------
 NOW, the hard case (again, by way of example):
 ---
  **  5-year average of the daily avg Temperature for each **
  **   of January 1,2,3  (ignoring any 2-D location)       **

         (This case was inspired by *your* example in your
	   response to my question about the first situation.)

  Here's what I would do with our local approach:

   dimensions:
     time=3;
     nyr=5;

   variables:
     float Tavg(time);
           Tavg:long_name="Average monthly temperature" ;
           Tavg:units="deg_K"
           Tavg:average_info="T1, T2, nitems, time_of_item, \
	                            item_is_avg, dt_item";
      
     double time(time);
            time:units="days since 1-1-1990";
            time:calendar="common_year";
      
     float T1(time);
            T1:long_name="starting time of average";
            T1:units="days since 1-1-1990";
	    time:calendar="common_year";
      
     float T2(time);
            T2:long_name="ending time of average";
            T2:units="days since 1-1-1990";
	    time:calendar="common_year";
      
     long nitems(time);
          nitems:long_name="Number of items in average";
      
     float time_of_item(nyr,time);
           time_of_item:long_name="time of individual items comprising average"
           time_of_item:units="days since 1-1-1990";

     short item_is_avg(nyr,time);
           item_is_avg:long_name="flag indicating whether item in average is itself and average";

     double dt_item(nyr,time);
            dt_item:long_name="length of time over which the items comprising average are representative";
            dt_item:units="days";


   data:
      time = 0.5, 1.5, 2.5 ;  // where do you place these in time?
                              //  - personal choice?  case-driven?
      
      T1 =    0.,    1.,    2. ;   // How do you express the time interval
      T2 = 1826., 1827., 1828. ;   //  over which the average was taken?
 / or ??:
|    T1 =    0.,    1.,    2. ;
|    T2 =    1.,    2.,    3. ;
 \
      nitems = 5,5,5;
      time_of_item = 0.5, 365.5, 730.5, 1095.5, 1460.5, 1825.5,
                     1.5, 366.5, 731.5, 1096.5, 1461.5, 1826.5,
		     2.5, 367.5, 732.5, 1097.5, 1462.5, 1827.5 ;

      item_is_avg = 1, 1, 1, 1, 1, 
                    1, 1, 1, 1, 1, 
		    1, 1, 1, 1, 1 ;

      dt_item = 1., 1., 1., 1., 1., 
                1., 1., 1., 1., 1., 
	        1., 1., 1., 1., 1. ;


  Admittedly, this is not a very robust solution. But the principal
  problems are in specifying the "time" coordinate to assign to each
  point, and how to specify the boundaries of the period over which the
  average was taken.  And these are only problems because we've
  picked items out of a continuous stream and processed only them.

  The decomposition of the time axis into 2 dimensions using the "wrt"
  approach seems to solve the latter problem to a large extent (at the
  expense of added complexity (IMHO:-) and the necessity of dealing
  with a 2-D time axis).  The starting and ending "times" of the
  average are (effectively) "1990" and "1994".  But we still have the
  problem of what "time" coordinate value to assign to the data.  Your
  example indicated that you thought it should be the starting
  end-point of the averaging period.

  **
  ** What we lack is a way to express the fact the we have "extracted"
  **  certain points out of a continuum and averaged only those points.
  **  ie, the average was not truly *along* the "time" axis!
  **
  ** Again, there are 2 problems associated with this type of average: 
  **   1. *where* to "locate" the data along the contracted axis;
  **   2. how to document the span of coordinate values over which
  **         the average was taken (since part of the total span
  **         isn't actually used in the calculation)
  **

  There probably is no "right" answer to those questions, but perhaps
  we could clarify things simply by adding a descriptive attribute to
  the "time" coordinate variable.  If the time span represented in the
  average were continuous (as in the earlier example), something like
  the following would suffice:

     time:domain_type="continuous"; \\ any "span" of times includes
				    \\ all points in between

  This says that the points along the "time" axis describe a continuum,
  so that specifying a T1 and T2 for an average implicitly includes
  all the "times" in-between.

  In a case (like the "hard" example above) where data is "extracted"
  in some periodic fashion out of the continuous time-line and
  "contracted", we could have:

     time:domain_type="regular";

  Here, the "times" are disjoint, but regular.  We would use the
  variables "time_of_item", "item_is_avg", and "dt_item" to record the
  location along the time axis of the extracted/contracted data, the
  delta-t for which the average is considered representative, and
  whether or not the data values comprising the average were,
  themselves, averages (eg, taking an average of the daily average
  temperature for January 1, over the years 1990-1994).  Now, we can
  specify "T1(1)=0., T2=1826." and know that while these are the
  endpoints of the time period for which the January 1 average was
  constructed, the average does *not* include all data between
  0.<time<1826.  We can look at "time_of_item", "item_is_avg", and
  "dt_item" to get the necessary characteristics of the quantity.

  It might be possible to simplify this a bit if we specify:

     time:domain_type="regular: <start>, <interval>, <width>";

  and the average can be characterized by specifying a starting
  coordinate for the first point, an interval between points, and the
  width of the "time" window at each time.

  If the "extracted" data were not regular, or did not have the same
  delta-t at each sample window, we could use:

     time:domain_type="irregular";

I still need to think about the most concise way to merge all this and
rationalize it with your "bounds", "subcell" and "contraction"
concepts.  I would be very interested in hearing your thoughts.

---

BTW (though I'm sure this is obvious to you also), this problem is not
necessarily limited to the time axis. Some examples (simple and
contrived) of averages of "extracted" data:

  a)  mean zonal wind within the two principal storm tracks
       (longitude=160-240E and 280-350E)

  b) mean combined cloudiness for the two layers 200-400mb and
       700-850mb


Cheers-
John