
Indexed data access and coordinate contract violations

10 May 2011

A netCDF variable is a multidimensional data array with attributes. The shape of the array is specified by dimensions that are shared among the variables in the file, e.g.:

   float data(time=40654, lat=360, lon=720);

The standard netCDF API allows the user to read a rectangular, strided subset of the data. With the netcdf-3 file format, a knowledgeable user can easily predict the cost of data access, and the library can efficiently read the data from disk. The netcdf-4 file format is more complicated, but with more effort the user can again predict the relative costs of different kinds of data access requests.

The addition of coordinate information adds a lot of functionality, because users are almost never interested in the value of the (i,j,k) element per se, but really want to get the value at a specific coordinate: in this example, a specific time and location on the earth:

   float data(time=40654, lat=360, lon=720);
   float time(time);
   float lat(lat);
   float lon(lon);

Coordinate values create a kind of implicit contract, in a subtle way that I will explain. A typical use case might be that a visualization application reads and displays the list of time coordinates, the user selects one, the application determines which index the coordinate corresponds to, and then makes a data request on the data variable at that index. This is called index-based data access. The contract is that the data that comes back corresponds to the selected time coordinate; that is, the mapping between coordinates and indices does not change. As long as the netCDF file doesn't change, all of this is simple and the contract trivially holds. (If you are someone who simultaneously reads and writes to a netCDF file, you may stop reading now; this article is not for you.)
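The index-based use case above can be sketched in a few lines of Python. This is only an illustration: the lists stand in for the coordinate and data variables, and the helper names `index_of` and `read_by_index` are hypothetical, not part of any netCDF API.

```python
# Hypothetical in-memory stand-in for a netCDF file; not a real netCDF API.
time_coords = [0.0, 1.0, 2.0, 3.0]       # values of the time(time) coordinate variable
data = [[10.0], [11.0], [12.0], [13.0]]  # data(time, ...) reduced to one value per time step

def index_of(coords, value):
    """Map a coordinate value chosen by the user to its array index."""
    return coords.index(value)

def read_by_index(var, i):
    """Index-based request: only the index is sent to the data layer."""
    return var[i]

selected = 2.0                        # the user picks a time from the displayed list
i = index_of(time_coords, selected)   # the application resolves it to an index
slice_at_selected = read_by_index(data, i)
```

The contract is the unstated assumption that `time_coords` and `data` have not changed between the call to `index_of` and the call to `read_by_index`.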

There are two new features of netCDF that might complicate this simple contract. The first is remote data access, for example using the OPeNDAP protocol. Remote access is now built into the standard netCDF C library (it's been in the netCDF Java library for a long time), and you can "open" a remote dataset by giving an OPeNDAP URL to the open call. Then you can read the data almost completely oblivious as to whether you are reading a local or remote file. I say almost because the performance is different: each call is a separate "round trip" to the server, which costs a certain amount of time known as latency. So you don't want to make a zillion little data requests; you want to make large enough requests so that the latency is a minor part of the data access cost. Local file access will be much more forgiving of a zillion little data requests, as long as they are close together in the underlying file storage.
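A rough cost model makes the latency argument concrete. The sketch below assumes total time is simply (number of requests × latency) plus (bytes ÷ bandwidth); the 50 ms latency and 10 MB/s bandwidth are made-up numbers for illustration, not measurements of any server.

```python
def transfer_time(n_requests, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Simplified model: every request pays one round trip, plus raw transfer time."""
    return n_requests * latency_s + total_bytes / bandwidth_bytes_per_s

# One time step of float data(time, lat=360, lon=720): 4 bytes per value.
slice_bytes = 360 * 720 * 4

LATENCY_S = 0.05     # assumed round-trip latency
BANDWIDTH = 10e6     # assumed bandwidth in bytes per second

one_big = transfer_time(1, slice_bytes, LATENCY_S, BANDWIDTH)       # whole slice in one call
row_by_row = transfer_time(360, slice_bytes, LATENCY_S, BANDWIDTH)  # one call per latitude row
```

Under these assumptions the row-by-row strategy is dominated almost entirely by latency, while the single request is dominated by transfer time.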

The coordinate contract means that after you open the remote dataset, the dataset must not change in a way that changes the relationship of coordinates to indices. But because the data is remote, you don't know for sure it hasn't changed. I know of more than one data provider who does things like rewrite a file in place, or replace it without warning. I'm not sure if that practice might cause a coordinate contract violation to occur, and I'm not sure if the data providers know either. The right thing for the server to do is either to prevent a dataset from changing while a remote user has it open, or to detect such changes and force the remote user to re-open the dataset. The TDS server has experimented with this capability, but it's not completely implemented and operational.

Note that both strategies violate the stateless nature of HTTP, where each call to the server is supposed to be completely independent of any other call. OPeNDAP is also designed as a stateless protocol on top of HTTP. In reality, many important services that use HTTP are not stateless, for example single sign-on and web shopping carts. These keep state on the server, typically by getting the client to return a session cookie to track which calls belong to which session. The TDS does a similar thing, including for OPeNDAP data requests.

Let us suppose that the server has read-only files and these are never rewritten. Then are we safe from coordinate contract violations? This brings us to the second feature that causes complications, namely file aggregation. This is currently available in the netCDF-Java library, and not the netCDF C library, but other servers like GDS and Hyrax are adding similar features. Aggregation allows collections of files to be seen as a single dataset. It's a very important capability because it reduces the complexity that the user sees, as well as hiding the specific way that the data provider has partitioned her data into files.

There are two sources of problems with aggregations. The first is homogeneity constraints on the set of files. Aggregation software makes certain assumptions about the ways that all of the files in the aggregation are the same. If one of the files violates an assumption, then the server can return incorrect data for the aggregated dataset, which often looks like a coordinate contract violation that requires an expert user to detect. This, however, is really a software defect, and in time, aggregation software will improve to detect violations of homogeneity constraints by carefully processing the files before putting them online.
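As a minimal sketch of what such a pre-flight check might look like, the Python below compares the non-aggregated dimension lengths of each file against the first file. This is an assumption about one plausible homogeneity constraint, not a description of what any aggregation software actually checks; the dictionaries and file names are hypothetical stand-ins for file metadata.

```python
def check_homogeneous(files, agg_dim="time"):
    """Return names of files whose non-aggregated dimensions differ from the first file."""
    def fixed_dims(f):
        # Dimensions other than the aggregation dimension are assumed to match exactly.
        return {name: size for name, size in f["dims"].items() if name != agg_dim}

    reference = fixed_dims(files[0])
    return [f["name"] for f in files[1:] if fixed_dims(f) != reference]

# Hypothetical metadata for three files that are meant to be joined along time.
files = [
    {"name": "day1.nc", "dims": {"time": 24, "lat": 360, "lon": 720}},
    {"name": "day2.nc", "dims": {"time": 24, "lat": 360, "lon": 720}},
    {"name": "day3.nc", "dims": {"time": 12, "lat": 180, "lon": 360}},
]
bad = check_homogeneous(files)
```

A real check would presumably also compare coordinate values, variable names, types, and units; dimension lengths are just the cheapest thing to verify.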

A more difficult situation comes from rolling archives of data, such as on Unidata's motherlode server. Currently we keep the most recent 45 days of data from the IDD. Every 15 minutes, any new data that has arrived is appended to the aggregated dataset. Once a day, the oldest day is deleted. Now consider a remote user who opens a dataset and retrieves the coordinate information, and displays it to the user. Often, there is new data that the user doesn't know about, unless she rereads the coordinates, essentially re-opening the dataset. This is not a violation, since for all the data that she knows about, the coordinates and indices correctly match. But once a day, the oldest day is discarded, and essentially the data indices shift down one day's worth of coordinate indices. So if we have hourly data, a time coordinate T, which corresponded to index i, now corresponds to i-24. There's nothing in the netCDF index-based API that would notify the user that this has happened.
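The index shift is easy to reproduce with a toy archive. The sketch below assumes hourly data and models only the daily deletion (the 15-minute appends are left out, since they do not disturb existing indices); the lists are hypothetical stand-ins for the aggregated time coordinate and data variable.

```python
HOURS_PER_DAY = 24

# Hourly time coordinates for a 45-day archive, in hours since an arbitrary origin.
coords = list(range(45 * HOURS_PER_DAY))
data = [f"field@{t}h" for t in coords]

# The client opens the dataset and caches the coordinate-to-index mapping.
T = 100
i = coords.index(T)

# Once a day the server discards the oldest day of data.
coords = coords[HOURS_PER_DAY:]
data = data[HOURS_PER_DAY:]

stale = data[i]              # the client's cached index now returns data for a different time
new_index = coords.index(T)  # where T actually lives after the shift
```

The request with the cached index still succeeds, which is exactly the problem: the client gets valid-looking data for the wrong time, and `new_index` is `i - 24`.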

(OPeNDAP actually has a reasonable solution to this problem, if the data is represented as a Grid. One still makes the request in index space, but one gets both the data and the corresponding coordinates (map vectors in the OPeNDAP Grid object) back. So an application can at least check that what it gets back corresponds to the coordinates that it was expecting. There are some difficulties mapping the OPeNDAP model in a general way to the netCDF model that make this not a completely satisfactory solution.)
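The client-side check that a Grid-style response allows might look like the following sketch. The function `read_grid` is a hypothetical stand-in for an index-based request that returns the map value together with the data; it is not an OPeNDAP client call.

```python
def read_grid(coords, data, i):
    """Index-based request that returns the coordinate (map) value alongside the data."""
    return coords[i], data[i]

def checked_read(coords, data, i, expected_coord):
    """Fail loudly if the index no longer corresponds to the coordinate the client expected."""
    got_coord, values = read_grid(coords, data, i)
    if got_coord != expected_coord:
        raise LookupError(f"index {i} now maps to {got_coord}, expected {expected_coord}")
    return values

coords = [0, 1, 2, 3, 4, 5]
data = ["d0", "d1", "d2", "d3", "d4", "d5"]
ok = checked_read(coords, data, 3, expected_coord=3)

# After the server drops the two oldest entries, the same index is detected as stale.
shifted_coords, shifted_data = coords[2:], data[2:]
try:
    checked_read(shifted_coords, shifted_data, 3, expected_coord=3)
    detected = False
except LookupError:
    detected = True
```

This does not prevent the shift, but it turns a silent wrong answer into an error the application can act on.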

Essentially we are back to the scenario of a file getting modified as it's being read. Only instead of blaming some data provider who doesn't understand why they shouldn't do that, we have an unavoidable reason why it happens. And it happens constantly, not occasionally as the result of some rare combination of events.

The same solutions are possible - 1) prevent the dataset from changing as long as there is someone actively using it, or 2) detect when it happens and return an error which forces the user to re-open the dataset. Both require tracking user state (which user has what dataset open?), but I don't have any philosophical problem with that, other than adding complexity to the server.
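Option 2 could be sketched with a generation counter on the server side. Everything here is hypothetical (the `Dataset` class, the handle dictionary, `StaleDatasetError`) and is meant only to show the shape of the idea, not how the TDS does or would implement it.

```python
class StaleDatasetError(Exception):
    """Raised when a handle was opened before an index-shifting change."""

class Dataset:
    def __init__(self, coords, data):
        self.coords, self.data = list(coords), list(data)
        self.generation = 0  # bumped whenever existing indices are invalidated

    def open(self):
        # The handle is the per-user state the server must track (or hand to the client).
        return {"generation": self.generation, "coords": list(self.coords)}

    def drop_oldest(self, n):
        del self.coords[:n]
        del self.data[:n]
        self.generation += 1

    def read(self, handle, i):
        if handle["generation"] != self.generation:
            raise StaleDatasetError("dataset changed; re-open to refresh coordinates")
        return self.data[i]

ds = Dataset(range(6), ["d0", "d1", "d2", "d3", "d4", "d5"])
handle = ds.open()
first = ds.read(handle, 3)

ds.drop_oldest(2)
try:
    ds.read(handle, 3)
    rejected = False
except StaleDatasetError:
    rejected = True

handle = ds.open()  # re-opening picks up the new coordinate-to-index mapping
again = ds.read(handle, handle["coords"].index(3))
```

Note that in this sketch only deletions bump the generation; appending new data leaves existing indices valid, so it would not need to invalidate open handles.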

Another possible solution is to request the data using coordinate values, rather than indices. This is how the OGC protocols work, as well as some of our own experimental protocols. If the user asks for a time slice at T, and if that data for T has been deleted, then you just return some version of the 404 response. If the user asks for all the data between T1 and T2, the server returns whatever data it has that intersects that range. The application still discovers what times are available, and allows the user to choose from that list of coordinates, but there's no contract that the data is really there. With indexed data access, the contract has to exist because there's no other way to ask for the data. In database terminology, one has exposed the physical schema to the user, and so reorganizing the data on the server becomes very difficult.
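A coordinate-based interface might look roughly like this sketch. The function names `read_at` and `read_range` and the `NotFound` exception are hypothetical; they are not taken from the OGC protocols or from any existing netCDF API.

```python
class NotFound(Exception):
    """Stands in for 'some version of the 404 response'."""

def read_at(coords, data, t):
    """Return the data at coordinate t, or signal that it is no longer available."""
    try:
        i = coords.index(t)  # the index is resolved on the server, at request time
    except ValueError:
        raise NotFound(t) from None
    return data[i]

def read_range(coords, data, t1, t2):
    """Return whatever (coordinate, data) pairs intersect the closed range [t1, t2]."""
    return [(t, v) for t, v in zip(coords, data) if t1 <= t <= t2]

# Server state after the oldest entries (t = 0 and t = 1) have been deleted.
coords = [2, 3, 4, 5]
data = ["d2", "d3", "d4", "d5"]

hit = read_at(coords, data, 3)
try:
    read_at(coords, data, 0)
    missing = False
except NotFound:
    missing = True
partial = read_range(coords, data, 0, 3)
```

Because the index lookup happens inside the request, the server is free to reorganize its storage between calls; the client never holds an index that can go stale.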

The downside to a non-index approach is that one needs a different data access API, i.e. not the current index-based netCDF API. We will keep indexed data access for legacy client applications, and it will be used at the lowest level of a server, but we need to use a new paradigm, coordinate-based data access, for new applications. We can't forever extend applications by adding new functionality like reading remote, aggregated datasets, and have everything work perfectly.

If we are going to break with the ancient ways, handed down from our distant Fortran ancestors, we should think hard about what to replace them with. I'll try to write up my thoughts soon - I would welcome hearing yours.
