Re: Seeking .nc advice for sequence data

Hi Bob:

We've been working on these kinds of problems, and thinking about whether and how 
to use sequences. So I'm just going to work through your use case below:

Bob Simons wrote:
I am trying to figure out how to store lots of sequence-like data in .nc files for efficient access via OPeNDAP. In particular, I am trying to determine if actual OPeNDAP Sequences (Structures with an unlimited dimension in the .nc file) are not appropriate for our purposes.

Yes, I could store the data in a file on the computer where the program needing access is running, and not have to access it via OPeNDAP, so that network transmission time would be minimized. But this project is partly an experiment in dealing with remotely accessed data. So I am trying to design a solution where the data is accessed from another computer via OPeNDAP.

Here's an example. Let's say I want to store all NDBC buoy data in a .nc file. There are over 100 buoys. For each buoy, there are readings for some time period (e.g., just 1989, or from 1990 to the present). The readings are an hour apart. Several variables (e.g., WindSpeed and WindDirection) are measured at each time point. Since we work with real-time data, I plan to update this file frequently (every day, but ideally every hour).

How large do you expect the file to get (total number of readings)?

reading == record == one structure in the sequence.


The problem is, I need to have *quick* access via OPeNDAP:
* Across all buoys at a specific time point, e.g., What is the wind speed at all buoys at 2004-12-14T09:00Z?
* Or, for all time points available, what is the wind speed, for example, at a specific buoy?

Do you need other queries, like "find all readings with wind speed > 30 mph"?


Regarding the first requirement, from what I understand, if I use sequences, there is no way to get the data for a given time point without reading either the whole file up to that time point, or without reading a whole variable. Either of which would seem to take too long if I want the values for 100 buoys (given that I am using OPeNDAP to connect to a remote computer and want the response quickly for my CoastWatch Browser program, which graphs the data for on-line users who want a quick response).

I'm not sure how your specific DODS server works, but there's a good chance that 
the server has to read the entire file to answer your query. We need to find 
that out if we are going to figure out how to scale this.

In the OPeNDAP world, you can in fact put a CE (constraint expression) on the sequence, e.g. your 
first query would be something like "time = 2004-12-14T09:00Z", and your second 
"buoy=2309" (I'll have to check the exact syntax). Now, we don't yet properly support that 
in the nj22 library, but I think it may not be that hard to do. The hard problem is probably on the 
server, if it has to read the entire file to answer it.
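
As a rough sketch of what those two requests might look like on the wire: in DAP2 the constraint expression goes after the "?" in the URL, as a comma-separated projection followed by "&"-separated selection clauses. The server URL, the sequence name "obs", and the field names below are made up for illustration, and I am assuming time is stored as a string. How a given server expects time values to be quoted is something you would need to check.

```python
from urllib.parse import quote

def build_ce_url(base_url, projection, selections):
    """Build a DAP2 data-request URL of the form base.dods?proj1,proj2&sel1&sel2."""
    ce = ",".join(projection)
    for clause in selections:
        ce += "&" + clause
    # Leave the CE punctuation readable; percent-encode the rest (e.g. double quotes).
    return base_url + ".dods?" + quote(ce, safe=",&=<>!:")

# Hypothetical dataset URL and variable names, for illustration only.
BASE = "http://example.org/dods/ndbc.nc"

# Query 1: wind speed at all buoys for one time point.
url_by_time = build_ce_url(
    BASE, ["obs.buoy", "obs.windSpeed"], ['obs.time="2004-12-14T09:00Z"']
)

# Query 2: the full wind speed time series for one buoy.
url_by_buoy = build_ce_url(BASE, ["obs.time", "obs.windSpeed"], ["obs.buoy=2309"])

print(url_by_time)
print(url_by_buoy)
```

Either way the client only receives the matching records. Whether the server can find them without scanning the whole file is a separate question.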



Since the time range of available data for each buoy varies greatly, it seems grossly wasteful of space to have a common Time dimension for all buoys. Doing so would probably force me over the 2GB file size, which is generally trouble. So I am thinking about either:
* A time dimension for each buoy (e.g., time14978 for buoy 14978) and several variables which use that dimension to store the data for that buoy (e.g., windSpeed14978, windDirection14978, etc.). This setup would be replicated for each buoy.
* Or, a Group for each buoy, again with a time dimension and several variables in each group to store the data for each buoy. (If this is a new .nc feature, does OPeNDAP deal with this yet?)
* Or, an ArrayObject.1D of variables, each element of which is an ArrayObject.1D of the variables for a given buoy. (I'm not sure if this can be done.)
* Or, an ArrayObject.2D of variables, with buoys as one dimension and the various variables (e.g., WindSpeed, WindDirection) on the other dimension. (I'm not sure if this can be done.)

Our current thinking on how to write netCDF files for "observation data" is 
written up at:

http://www.unidata.ucar.edu/software/netcdf-java/formats/UnidataObsConvention.html

In particular, appending records using backwards-linked lists seems like a good 
solution, and it's what we are currently doing with the real-time METAR data on 
motherlode.
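
To make the backwards-linked list idea concrete, here is a minimal in-memory sketch. It is not the convention itself: the class, the field names ("prev", "last_record") and the -1 sentinel are illustrative assumptions, and a real file would keep these as integer variables along the record dimension. Every new reading is appended at the end, stores the index of the previous reading from the same station, and the station remembers its newest reading.

```python
MISSING = -1  # assumed sentinel meaning "no earlier record for this station"

class ObsFile:
    """Toy stand-in for a record-oriented file with per-station back links."""

    def __init__(self):
        self.records = []      # one entry per reading, in arrival order
        self.last_record = {}  # station id -> index of its newest record

    def append(self, station, time, wind_speed):
        """Append a reading; existing records are never rewritten."""
        prev = self.last_record.get(station, MISSING)
        self.records.append(
            {"station": station, "time": time, "wind_speed": wind_speed, "prev": prev}
        )
        self.last_record[station] = len(self.records) - 1

    def station_history(self, station):
        """Yield one station's readings, newest first, by following prev links."""
        i = self.last_record.get(station, MISSING)
        while i != MISSING:
            rec = self.records[i]
            yield rec
            i = rec["prev"]

obs = ObsFile()
obs.append("46011", "2004-12-14T08:00Z", 7.2)
obs.append("46023", "2004-12-14T08:00Z", 9.1)
obs.append("46011", "2004-12-14T09:00Z", 6.8)
for rec in obs.station_history("46011"):
    print(rec["time"], rec["wind_speed"])
```

Reading one station's time series touches only that station's records, and appending new data only ever writes at the end of the file.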



I plan to solve the updating problem by leaving rows of missing values at the end of the data for each active buoy. As new data comes in, I will replace the missing values with actual data. Then, I only have to rewrite the file (to add more rows of missing values) once in a while, not every time.

The above approach, if it works for you, probably obviates this.


Which approach sounds best? Is there another approach? Do you have any advice?

Are sequences the wrong way to go? Of course, that could change if one could efficiently access specific ranges from variables in a Sequence/Structure. But it is my understanding that that is not currently possible.

The DAP 2 spec currently does not allow this. But the whole point of sequences 
is to allow you to subset using a query (I think it's called a selection), and 
then only return the data needed, so you don't need index subsetting.
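
Conceptually, a selection behaves like the filter below. This is only a sketch of the semantics, not how any particular server implements it; the function and field names are hypothetical. It also counts how many records a naive server would have to scan, which is where the scaling question comes in.

```python
import operator

# Relational operators a selection clause can use (the regex match "=~" is omitted here).
OPS = {
    "=": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

def select(records, field, op, value, projection):
    """Return (matching records reduced to the projected fields, records scanned)."""
    compare = OPS[op]
    scanned = 0
    matches = []
    for rec in records:  # naive server: looks at every record once
        scanned += 1
        if compare(rec[field], value):
            matches.append({name: rec[name] for name in projection})
    return matches, scanned

records = [
    {"buoy": 2309, "time": "2004-12-14T08:00Z", "wind_speed": 12.0},
    {"buoy": 2310, "time": "2004-12-14T08:00Z", "wind_speed": 31.5},
    {"buoy": 2309, "time": "2004-12-14T09:00Z", "wind_speed": 33.0},
]
matches, scanned = select(records, "wind_speed", ">", 30.0, ["buoy", "time"])
print(matches, scanned)
```

The client receives only the matches, so network traffic stays small, but without some kind of index the server still pays for the full scan.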


Although I gave this specific example, we store a lot of sequence-like data where I work. Whatever .nc file structure is appropriate for the buoys will likely be appropriate for much of this other data. So I want to get it right.

Right now, I'd say that it depends on what server you are using. Sequences are 
elegant, but they are a different animal from the indexed access that is the bread 
and butter of netCDF files.

The critical things to answer first:
 1. How many records will you serve? What about in the future?
 2. What queries do you need to support?
 3. What response time is acceptable?
 4. What clients do you want to support? Just your own, or more general?
 5. What server do you want to use? Does it matter?