Do blogs about Coordinates need to be monotonous?

Last year we realized that much of the GRIB model output from NCEP forecast models does not contain instantaneous time values, but rather values over a time interval, e.g. averages or accumulations. The CDM library essentially reads collections of GRIB records and makes them available as if they were CF-compliant netCDF datasets. It was embarrassing to have had the basic coordinate system wrong for the past 7 years for about 25% of NCEP GRIB variables, but I guess "caveat emptor", or maybe "you get what you pay for" (since all our software is free). Probably most people just want to look at pretty pictures in the IDV or something, so hopefully no science was done with this faulty metadata.

At first, we saw time intervals that looked like this:

(0.000000,0.000000)
(0.000000,1.000000)
(0.000000,2.000000)
(0.000000,3.000000)
(3.000000,4.000000)
(3.000000,5.000000)
(3.000000,6.000000)
(6.000000,7.000000)
(6.000000,8.000000)
(6.000000,9.000000)
(9.000000,10.000000)
(9.000000,11.000000)
(9.000000,12.000000)
(12.000000,13.000000)
(12.000000,14.000000)
(12.000000,15.000000)
(15.000000,16.000000)
(15.000000,17.000000)
(15.000000,18.000000)

where each pair of values represents a single time coordinate, defining the beginning and ending time in hours from the run time. Note that the ending times are unique, and correspond to the "forecast time". So at first it seemed that these time interval coordinates were just a variation of the instantaneous time coordinates, which in this case would consist of just the ending "forecast time" values. This example comes from the RUC2 model, variable Water_equivalent_of_accumulated_snow_depth.
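
To make that concrete, here is a minimal sketch in plain Java (an illustration, not actual CDM code) that checks that the ending times of these intervals are unique:

import java.util.LinkedHashSet;
import java.util.Set;

public class EndTimeCheck {
    public static void main(String[] args) {
        // (start, end) pairs in hours from the run time, copied from the
        // Water_equivalent_of_accumulated_snow_depth example above
        double[][] intervals = {
            {0, 0}, {0, 1}, {0, 2}, {0, 3}, {3, 4}, {3, 5}, {3, 6},
            {6, 7}, {6, 8}, {6, 9}, {9, 10}, {9, 11}, {9, 12},
            {12, 13}, {12, 14}, {12, 15}, {15, 16}, {15, 17}, {15, 18}
        };

        Set<Double> endTimes = new LinkedHashSet<>();
        for (double[] iv : intervals)
            endTimes.add(iv[1]);   // collect the ending times

        // 19 intervals, 19 distinct end times: the end values alone
        // could serve as a conventional "forecast time" coordinate
        System.out.println(intervals.length + " intervals, "
                + endTimes.size() + " distinct end times");
    }
}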

However, we then noticed that there were other variables with more complicated interval coordinates, e.g. the Convective_precipitation from the same model:

(0.000000,0.000000)
(0.000000,1.000000)
(0.000000,2.000000)
(1.000000,2.000000)
(0.000000,3.000000)
(2.000000,3.000000)
(0.000000,4.000000)
(3.000000,4.000000)
(0.000000,5.000000)
(3.000000,5.000000)
(4.000000,5.000000)
(0.000000,6.000000)
(3.000000,6.000000)
(5.000000,6.000000)
(0.000000,7.000000)
(6.000000,7.000000)
(0.000000,8.000000)
(6.000000,8.000000)
(7.000000,8.000000)
(0.000000,9.000000)
(6.000000,9.000000)
(8.000000,9.000000)
(0.000000,10.000000)
(9.000000,10.000000)
(0.000000,11.000000)
(9.000000,11.000000)
(10.000000,11.000000)
(0.000000,12.000000)
(9.000000,12.000000)
(11.000000,12.000000)
(0.000000,13.000000)
(12.000000,13.000000)
(0.000000,14.000000)
(12.000000,14.000000)
(13.000000,14.000000)
(0.000000,15.000000)
(12.000000,15.000000)
(14.000000,15.000000)
(0.000000,16.000000)
(15.000000,16.000000)
(0.000000,17.000000)
(15.000000,17.000000)
(16.000000,17.000000)
(0.000000,18.000000)
(15.000000,18.000000)
(17.000000,18.000000)

At first we tried to divide these into multiple variables with fixed-length interval times (1-hour accumulations, 2-hour accumulations, etc.). But if you look closely, they don't really divide into such neat categories. The NCEP group gave a good explanation of why they output this mix of intervals, and the goal of the CDM library is to expose all the data and let the user decide how to use it. So we decided to give up trying to be clever and just combine all the intervals into a single time coordinate variable.

I know, I know, you are worried that this violates one of the requirements for coordinate variables, defined in the CF spec as "a one-dimensional variable with the same name as its dimension ... with values that are ordered monotonically." Here we don't have monotonically ordered values. But, um, why do we need monotonic coordinate variables?

The first part of the answer is that we want coordinate values to be unique, so that we can find the index that corresponds to a given value. The second part is that if we think of a variable as a sampled function, then we want to be able to interpolate between the samples. If you apply both requirements to a one-dimensional variable, you get the requirement that the values of the coordinate variable must be strictly monotonic, either increasing or decreasing (i.e. c1 < c2 < c3 < ... or c1 > c2 > c3 > ...).
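
As an illustration of how those two requirements play out in code (a plain Java sketch, not CDM internals), a strictly monotonic coordinate can be inverted with binary search, and a value that falls between samples gets bracketed for interpolation:

import java.util.Arrays;

public class MonotonicLookup {
    public static void main(String[] args) {
        double[] coord = {0.0, 1.0, 2.0, 3.0, 4.5, 6.0};  // strictly increasing

        // exact hit: unique values mean a unique index
        System.out.println(Arrays.binarySearch(coord, 3.0));  // prints 3

        // miss: binarySearch returns -(insertionPoint) - 1, which
        // brackets the value between two samples for interpolation
        int r = Arrays.binarySearch(coord, 5.0);
        int upper = -r - 1;  // 5, so 5.0 lies between coord[4] and coord[5]
        System.out.println("5.0 is between " + coord[upper - 1]
                + " and " + coord[upper]);
    }
}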

More formally, if you think of a coordinate variable as a function from index space to coordinate space, then to make that function invertible (so that one can unambiguously map from a coordinate back to an index) you need the values to be monotonic. Below are two functions, one monotonic and one not; you can see that only with the first one can you assign a unique x (index) to each y = f(x) (coordinate value).

[Figure: a monotonic function, which is invertible, and a non-monotonic function, which is not]

OK, but does monotonicity apply to interval coordinates? No, because interpolation doesn't apply, at least not in any simple way: there are two values to each coordinate, so there is no natural ordering defined on intervals. However, uniqueness can still be used to find the index that corresponds to a given time interval. Unique but non-monotonic values have previously been used to describe nominal coordinates, i.e. ones with names instead of values, used for classification like "forest", "savanna", "tundra", etc. It doesn't make sense to interpolate between nominal coordinates, nor between interval coordinates.
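
Here is a sketch of that uniqueness idea in plain Java (not the CDM's actual implementation): since intervals have no ordering, binary search is out, but a hash map still inverts the coordinate function:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IntervalLookup {
    // a time interval in hours from the run time; records compare by value,
    // so they work as hash keys
    record TimeInterval(double start, double end) {}

    public static void main(String[] args) {
        // a few of the Convective_precipitation intervals, in their
        // original non-monotonic order
        List<TimeInterval> coord = List.of(
                new TimeInterval(0, 1), new TimeInterval(0, 2),
                new TimeInterval(1, 2), new TimeInterval(0, 3),
                new TimeInterval(2, 3));

        // uniqueness is all we need: map each interval to its index
        Map<TimeInterval, Integer> index = new HashMap<>();
        for (int i = 0; i < coord.size(); i++)
            index.put(coord.get(i), i);

        System.out.println(index.get(new TimeInterval(0, 3)));  // prints 3
    }
}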

Rather than saying that coordinate variables must be monotonic (strictly speaking, "strictly monotonic"), one could say that a coordinate variable, considered as a function from index space to coordinate space, should be invertible. For one-dimensional, single-valued coordinate variables, monotonicity is necessary and sufficient for invertibility. For nominal and interval coordinates, only uniqueness is possible. What constraint is needed for two-dimensional coordinates to be invertible, e.g. curvilinear coordinate systems like lat(x,y), lon(x,y)?

The intuitive answer is that if you connect your lat(x,y), lon(x,y) points in a mesh and plot it, then no matter how curvy the lines get, as long as they don't cross, you have an invertible coordinate system. That's the analogue of monotonicity in two dimensions:

[Figure: a curvilinear mesh whose grid lines bend but never cross]
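
One plausible numerical version of that test (an illustrative sketch, an assumption about how one might check it, not something the CDM is claimed to do) is to verify that the Jacobian determinant of the mapping keeps the same sign in every grid cell; a sign change or a zero means the mesh folds back on itself:

public class MeshCheck {
    // returns true if the lat/lon mesh never folds, i.e. the Jacobian
    // determinant of (x,y) -> (lon,lat) keeps one sign in every cell
    static boolean noFolds(double[][] lat, double[][] lon) {
        Boolean positive = null;
        for (int j = 0; j + 1 < lat.length; j++) {
            for (int i = 0; i + 1 < lat[0].length; i++) {
                // forward differences approximate the Jacobian in this cell
                double det = (lon[j][i + 1] - lon[j][i]) * (lat[j + 1][i] - lat[j][i])
                           - (lat[j][i + 1] - lat[j][i]) * (lon[j + 1][i] - lon[j][i]);
                if (det == 0) return false;                    // degenerate cell
                if (positive == null) positive = det > 0;
                else if (positive != (det > 0)) return false;  // mesh crosses itself
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] lat = {{0, 0, 0}, {1, 1, 1}, {2, 2, 2}};
        double[][] lon = {{0, 1, 2}, {0, 1, 2}, {0, 1, 2}};
        System.out.println(noFolds(lat, lon));  // true: a regular, uncrossed grid
    }
}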

The moral of the story is that lots of datasets have interval coordinates, especially for time and vertical dimensions. Monotonic values should be used when possible, but they can't always be. Technically, one can name the coordinate something other than its dimension (in CF parlance, this makes it an auxiliary coordinate), in case you want to keep your datasets on a strict CF kosher diet. The CDM does a few special things with coordinate variables, but mostly it doesn't care whether a coordinate is an auxiliary coordinate variable or a real coordinate variable. So this is a good solution for the CF-compliance enthusiast.

Since you can't always assume monotonicity, applications will have to continue to get more sophisticated in handling data. And remember, if you are going to Use Data to Make Important Conclusions, you should understand the data from the source and check that your software gives results consistent with the source. If you see any problems, report them to the community and the software developers, So Things Get Better Eventually. May God bless our buggy little corner of the universe.



