[cf-pointobsconvention] Draft 2
John Caron
caron at unidata.ucar.edu
Fri Sep 21 17:12:49 MDT 2007
Hi Jonathan:
comments inline
Jonathan Gregory wrote:
> Dear John
>
> (1) Thank you for your careful analysis and examples of the needs for this
> kind of data. I've been wondering how one can characterise it in
> general. Although most cases are "point" data of some kind, perhaps it would
> be right to describe all these kinds of data as "ungridded". That's a word
> which is used in various ways but it seems most apt to me for this
> situation. In terms of structure, what you are describing are kinds of data
> where the size of one dimension may vary as a function of index along another
> dimension. That is "ungridded" in a deeper sense than data which is not evenly
> arranged in x and y but is still contained within a rectangular array, for
> instance.
Yes, this is not the same as "irregularly spaced grids". I was trying to
emphasize the kinds of data I thought this would cover, including "station",
"trajectory", "profile", etc. Perhaps "ungridded" is better than "point"
(do others have an opinion?).
I've been calling the case where the "size of one dimension may vary as a
function of index along another dimension" "ragged arrays" (versus the usual
"rectangular arrays"). That is the difficult case to solve, but this spec would
also cover cases where a rectangular array is useful, e.g. if all profiles had
the same number of vertical coords. I should make that clearer.
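
A minimal sketch of the ragged versus rectangular distinction, in Python with
made-up profile data. The names (profiles, flat_values, profile_index) are
illustrative assumptions, not part of the proposal:

```python
# Hypothetical data: three profiles with different numbers of vertical levels.
profiles = [
    [1000.0, 850.0, 500.0],
    [1000.0, 925.0],
    [1000.0, 850.0, 700.0, 500.0],
]

# Ragged storage: one flat array of values plus, for each value,
# the index of the profile it belongs to.
flat_values = [v for p in profiles for v in p]
profile_index = [i for i, p in enumerate(profiles) for _ in p]


def rows_from_index(values, index, nrows):
    """Rebuild the ragged rows from the flat values and their parent index."""
    rows = [[] for _ in range(nrows)]
    for v, i in zip(values, index):
        rows[i].append(v)
    return rows


def is_rectangular(rows):
    """True when every row has the same length, so a plain 2D array would do."""
    return len({len(r) for r in rows}) <= 1
```

When is_rectangular holds, an ordinary two-dimensional variable is enough;
otherwise the flat layout with a parent index avoids padding.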
>
> (2) While the analogy to tables and SQL is interesting, personally I find the
> CDL expression most obvious. Moreover, it would be a fairly small extension to
> CF to include this kind of indirection. It is rather like the method described
> for compression by gathering in CF 8.2:
>
> dimensions:
> lat=73;
> lon=96;
> landpoint=2381;
> depth=4;
> variables:
> int landpoint(landpoint);
> landpoint:compress="lat lon";
> float landsoilt(depth,landpoint);
> landsoilt:long_name="soil temperature";
> landsoilt:units="K";
> float depth(depth);
> float lat(lat);
> float lon(lon);
>
> Here the coordinate variable of the "gather" dimension (landpoint) is an index
> into the two dimensions which were jointly compressed by the gathering. As you
> mention, we could represent the ungridded case in a very inefficient way by
> constructing a coordinate variable which contains all possible values of the
> variable-size dimension, for instance:
>
> dimensions:
> station=10;
> pressure=11;
> allpossibletimes=6289; // for instance
> variables:
> double allpossibletimes(allpossibletimes);
> float pressure(pressure);
> float latitude(station);
> float humidity(pressure,allpossibletimes,station);
>
> and then compress it to eliminate the (time,station) combinations which don't
> occur:
>
> dimensions:
> station=10;
> pressure=11;
> allpossibletimes=6289;
> record=7478; // for instance
> variables:
> double allpossibletimes(allpossibletimes);
> float pressure(pressure);
> float latitude(station);
> int record(record);
> record:compress="allpossibletimes station";
> float humidity(pressure,record);
>
> That would be workable for the ungridded case. It can even be more efficient
> than the schemes you describe, as it allows reuse of times that are common to
> more than one station, but it doesn't seem natural, as you don't really regard
> ungridded data as a compression of a huge sparse array. Instead of combining
> indices to station and time, you prefer to keep them separate:
>
> dimensions:
> station=10;
> pressure=11;
> record=7478;
> variables:
> float latitude(station);
> int station_index(record);
> station_index:compress="station";
> double times(record);
> float pressure(pressure);
> float humidity(pressure,record);
> humidity:coordinates="station_index";
>
> This is not the purpose for which the compress attribute was defined, but what
> we need here is similar. The compress attribute indicates that the value of
> its variable is an index into the dimensions listed, and if only one dimension
> is listed, it must be a 1D index. In this application the index will have many
> repeated values, because it's doing few->many by duplication rather than
> many->few by eliminating unused entries as it does when gathering. We could
> give the attribute a different name, since it's being used for a different
> purpose, and because it's being attached to an auxiliary coordinate variable
> rather than a coordinate variable.
I had actually mentioned section 8.2 in draft 1, but took it out because I
didn't understand it very well, so thanks for explaining that. I hadn't thought
of it as a "sparse array"; that's an interesting POV.
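
For reference, a small Python sketch of how a gathered index such as
record:compress="allpossibletimes station" might be unpacked. It assumes the
compressed dimensions are flattened in CDL order, with the last one (station)
varying fastest; the function name is hypothetical:

```python
def uncompress_index(record_value, n_station):
    """Split a gathered index into (time_index, station_index).

    Assumes row-major order over (allpossibletimes, station),
    i.e. record_value == time_index * n_station + station_index.
    """
    return divmod(record_value, n_station)


# With station=10 as in the quoted example:
time_idx, station_idx = uncompress_index(37, 10)
```

Only the (time, station) pairs that actually occur need an entry in record,
which is what makes this a compression of the full sparse array.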
I'm actually proposing a new "table" data type with "index joins" as a way to
think about these types of data. This does look like your example above
(though you need to keep the record dimension as the outer dimension):
float latitude(station);
double times(record);
int station_index(record);
station_index:compress="station";
float humidity(record,pressure);
humidity:coordinates="station_index";
float pressure(pressure);
where I've grouped variables by outer dimension. Instead of thinking of it as a
compression, though, you just think of it as connecting "tables" together.
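
A rough sketch of the "index join" idea in Python, with made-up values. The
variable names mirror the CDL above; the helper functions are hypothetical:

```python
# "Station table": one row per station (hypothetical values).
latitude = [40.0, 51.5, -33.9]

# "Record table": one row per observation; station_index points into
# the station table.
times = [0.0, 0.0, 3600.0, 3600.0, 7200.0]
station_index = [0, 2, 0, 1, 2]


def join_records(times, station_index, latitude):
    """Attach each record's station latitude by following station_index."""
    return [
        {"time": t, "station": s, "latitude": latitude[s]}
        for t, s in zip(times, station_index)
    ]


def records_for_station(station_index, station):
    """Go the other way: list the record numbers belonging to one station."""
    return [r for r, s in enumerate(station_index) if s == station]


joined = join_records(times, station_index, latitude)
```

The index has many repeated values because each station row is shared by all
of its records, which is the few->many case described above.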
>
> (3) In your examples you have auxiliary coordinate variables such as
> z(sample,z). In CF we recommend against giving an aux coord var the same name
> as a dimension, because this could confuse any software that was looking for
> (Unidata) coord vars but didn't check how many dimensions they had.
Thanks, I will change the names to avoid confusion.
>
> (4) Much of the subsequent discussion has been about your proposed dataset
> classification. I think that the quantity of discussion indicates that the
> distinction is hard to draw, because it's one of interpretation and purpose
> rather than structure. I believe you intend this attribute as discovery
> metadata, don't you.
It is a discovery attribute and also used by clients to know how to interpret
the connectivity of the data. Without it, one could not (for example)
distinguish a collection of earthquake data from a trajectory. Both look like:
variables:
float lon(obs);
float lat(obs);
float z(obs);
double time(obs);
float dataVar(obs);
dataVar:coordinates = "lon lat z time";
> Is it possible you could store such a description in one
> of the existing global attributes whose contents aren't standardised by CF?
>
> Best wishes
>
> Jonathan
Thanks for your feedback.