Re: Seeking .nc advice for sequence data

To: netcdf-java@xxxxxxxxxxxxxxxx
Subject: Re: Seeking .nc advice for sequence data
From: "Bob Simons" <Bob.Simons@xxxxxxxx>
Date: Thu, 22 Dec 2005 09:22:57 -0800



John Caron wrote:

Hi Bob:
We've been working on these kinds of problems, and thinking about how/whether to use sequences or not. So I'm just going to work through your use case below:
Bob Simons wrote:
I am trying to figure out how to store lots of sequence-like data in .nc files for efficient access via OPeNDAP. In particular, I am trying to determine whether actual OPeNDAP Sequences (Structures with an unlimited dimension in the .nc file) are appropriate for our purposes.
Yes, I could store the data in a file on the computer where the program needing access is running, and not have to access it via OPeNDAP, so that network transmission time would be minimized. But this project is partly an experiment in dealing with remotely accessed data. So I am trying to design a solution where the data is accessed from another computer via OPeNDAP.
Here's an example. Let's say I want to store all NDBC buoy data in a .nc file. There are over 100 buoys. For each buoy, there are readings for some time period (e.g., just 1989, or from 1990 to the present). The readings are an hour apart. Several variables (e.g., WindSpeed and WindDirection) are measured at each time point. Since we work with real-time data, I plan to update this file frequently (every day, but ideally every hour).
How large do you expect the file to get (total number of readings)?

reading == record == one structure in the sequence.


Approximately:

There are 400 buoys * ~8 years of data * 8760 hours/year = ~28,000,000 records.

Given the great variation in time ranges for each buoy, I will probably arrange it as 400 sequences (one per buoy), with an average of 8 * 8760 = 70,080 records per sequence.


Each record has 1 double and 15 floats; hence
  68 bytes/record * 28,000,000 records = ~1.9GB
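
The estimate above can be checked with a short Python sketch. The struct format string is only an illustration of "1 double and 15 floats" packed without padding; the actual variable layout in the .nc file is not specified here, and reading "2GB" as 2**31 bytes is an assumption.

```python
import struct

BUOYS = 400
YEARS = 8                 # approximate average span per buoy
HOURS_PER_YEAR = 8760     # one reading per hour

# Illustrative record layout: 1 double + 15 floats, no padding (assumption).
RECORD_FORMAT = ">d15f"
record_bytes = struct.calcsize(RECORD_FORMAT)    # 8 + 15 * 4

records_per_buoy = YEARS * HOURS_PER_YEAR
total_records = BUOYS * records_per_buoy
total_bytes = total_records * record_bytes

TWO_GB = 2 ** 31          # assumed meaning of the "2GB" limit

print("bytes/record:   ", record_bytes)
print("records/buoy:   ", records_per_buoy)
print("total records:  ", total_records)
print("total bytes:    ", total_bytes)
print("fits under 2GB: ", total_bytes < TWO_GB)
```

With these inputs the total comes to a little over 1.9 billion bytes, which matches the ~1.9GB figure.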

The problem is, I need to have *quick* access via OPeNDAP:
* Across all buoys at a specific time point, e.g., What is the wind speed at all buoys at 2004-12-14T09:00Z?
* Or, for all time points available, what is the wind speed, for example, at a specific buoy?
Do you need other queries, like "find all readings with wind speed > 30 mph"?


In general, no.

The most common variant of the first request above is: restrict the request to all buoys in a rectangular geographic region, .... But I can separately manage and subset the geographic locations of the buoys, if needed.

Regarding the first requirement, from what I understand, if I use sequences, there is no way to get the data for a given time point without reading either the whole file up to that time point or a whole variable. Either of these would seem to take too long if I want the values for 100 buoys (given that I am using OPeNDAP to connect to a remote computer and want the response quickly for my CoastWatch Browser program, which graphs the data for on-line users who want a quick response).
I'm not sure how your specific DODS server works, but there's a good chance that the server has to read the entire file to answer your query. We need to find that out, if we are going to figure out how to scale this.

I definitely want to avoid having the server read the entire file for each query. That's why I ask (below) about avoiding structures and just using lots of variables.

In the OPeNDAP world, you can in fact put a CE (constraint expression) on the sequence, e.g., your first query would be something like "time=2004-12-14T09:00Z", and your second "buoy=2309" (I'll have to check the exact syntax). Now we don't yet properly support that in the nj22 library, but I think it may not be that hard to do. The hard problem is probably on the server, if it has to read the entire file to answer it.

I'm not sure why you say it has to read the entire file. If I set up the file in certain ways, can't that be avoided?

Since the time range of available data for each buoy varies greatly, it seems grossly wasteful of space to have a common Time dimension for all buoys. Doing so would probably force me over the 2GB file size, which is generally trouble. So I am thinking about either:
* A time dimension for each buoy (e.g., time14978 for buoy 14978) and several variables which use that dimension to store the data for that buoy (e.g., windSpeed14978, windDirection14978, etc.). This setup would be replicated for each buoy.

I am leaning toward this. It is easy to understand. It doesn't use any special features of .nc or OPeNDAP, so should work with different servers and clients.

* Or, a Group for each buoy, again with a time dimension and several variables in each group to store the data for each buoy. (If this is a new .nc feature, does OPeNDAP deal with this yet?)
* Or, an ArrayObject.1D of variables, each element of which is an ArrayObject.1D of the variables for a given buoy. (I'm not sure if this can be done.)
* Or, an ArrayObject.2D of variables, with buoys as one dimension and the various variables (e.g., WindSpeed, WindDirection) on the other dimension. (I'm not sure if this can be done.)
Our current thinking on how to write netcdf files for "observation data" is written up at:
http://www.unidata.ucar.edu/software/netcdf-java/formats/UnidataObsConvention.html
In particular, appending records using backwards-linked lists seems like a good solution, and it's what we are currently doing with the realtime metar data on motherlode.

I really don't like linked lists. They force each query to go through all the rows (and read all of the data for each row).

I think separate variables are the way to go. They can be made expandable in the way that Java's ArrayList is expandable: have a backing array, and keep track of how many elements are currently in use (size). If you need more capacity, make a new larger array and copy the values to it. Then you can get random access to any value in any variable. Further, if you need to do constraints, you only need to read the constraint variables, and even then you can minimize the reads. For example, if I sort a buoy's records by time and have a query like "time >= t1 && time <= t2 && windSpeed > 30", I don't even have to read the windSpeed variable until I find a record in the correct time range. And I never have to read the other variables until I know the constraint expression is satisfied.
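
The idea can be sketched in memory with Python's standard library. This is only an illustration of the technique, not a netCDF implementation: the class name GrowableColumn, the function query, and the NaN fill value are hypothetical, and a real file would grow by rewriting variables rather than Python arrays.

```python
from array import array
from bisect import bisect_left, bisect_right


class GrowableColumn:
    """One variable: a backing array plus a count of elements in use."""

    def __init__(self, typecode, capacity=4, fill=float("nan")):
        self.fill = fill
        self.data = array(typecode, [fill] * capacity)  # padded with missing values
        self.size = 0                                   # elements actually in use

    def append(self, value):
        if self.size == len(self.data):
            # Out of capacity: make a new, larger array and copy the values over.
            new_capacity = max(1, 2 * len(self.data))
            bigger = array(self.data.typecode, [self.fill] * new_capacity)
            bigger[:self.size] = self.data
            self.data = bigger
        self.data[self.size] = value
        self.size += 1

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.data[i]


def query(time, wind_speed, t1, t2, min_speed):
    """Rows with t1 <= time <= t2 and windSpeed > min_speed.

    Assumes the time column is sorted, so the time range is found by binary
    search and windSpeed is read only for rows inside that range.
    """
    lo = bisect_left(time.data, t1, 0, time.size)
    hi = bisect_right(time.data, t2, 0, time.size)
    return [i for i in range(lo, hi) if wind_speed[i] > min_speed]


# Hypothetical hourly readings for one buoy: (seconds, wind speed).
time_col = GrowableColumn("d")
speed_col = GrowableColumn("f")
for t, s in [(0, 10), (3600, 35), (7200, 20), (10800, 40), (14400, 50)]:
    time_col.append(t)
    speed_col.append(s)

print(query(time_col, speed_col, 3600, 10800, 30))
```

Only the time column is searched, and other columns would be read solely for the row indices that query returns.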

In fact, part of my raising this question was to try to figure out why/when structures/sequences are a good approach. I feel like I'm missing something. It looks like they are implemented as linked lists. If so, they don't look like a good data structure to me, because they force all file accesses to go through the whole file. They are efficient when appending data (which you do infrequently and when you care less about speed), but inefficient for searches (both sequential searches of one or a few variables, or random access to any datum) (which you do frequently and when you really care about speed). And there are other data structures (in the style of Java's ArrayList) which are efficient for writing (random access or appending) and reading (sequential or random access). Comments?

I plan to solve the updating problem by leaving rows of missing values at the end of the data for each active buoy. As new data comes in, I will replace the missing values with actual data. Then, I only have to rewrite the file (to add more rows of missing values) once in a while, not every time.
the above approach, if it works for you, probably obviates this.
Which approach sounds best? Is there another approach? Do you have any advice?
Are sequences the wrong way to go? Of course, that could change if one could efficiently access specific ranges from variables in a Sequence/Structure. But it is my understanding that that is not currently possible.
The DAP 2 spec currently does not allow this. But the whole point of sequences is to allow you to subset using a query (I think it's called a selection), and then only return the data needed, so you don't need index subsetting.


But if the server has to go through the whole file, it will never be fast.

Although I gave this specific example, we store a lot of sequence-like data where I work. Whatever .nc file structure is appropriate for the buoys will likely be appropriate for much of this other data. So I want to get it right.
Right now, I'd say that it depends on what server you are using. Sequences are elegant, but they are a different animal from indexed access that is the bread and butter of netcdf files.
The critical things to answer first:
 1. How many records will you serve? What about in the future?


Approximately:
400 buoys * ~8 years of data * 8760 hours/year = ~28,000,000 records
Each record has 1 double and 15 floats; hence
  68 bytes/record * 28,000,000 records = ~1.9GB

More than half of the buoys are active so it will grow by about:
 300 buoys * 8760 hours/year = 2,628,000 records / year
 (almost 200MB/year)

Given that it is close to 2GB, I may separate it into a file for inactive buoys and a file for active buoys.
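
A quick Python check of the growth figures, and of how long a single file would stay under the limit. Treating "2GB" as 2**31 bytes is an assumption, as is counting every active buoy as reporting every hour.

```python
RECORD_BYTES = 68            # 1 double + 15 floats
HOURS_PER_YEAR = 8760
ACTIVE_BUOYS = 300

# Current size, from the estimate of 400 buoys * ~8 years of hourly data.
current_bytes = 400 * 8 * HOURS_PER_YEAR * RECORD_BYTES

new_records_per_year = ACTIVE_BUOYS * HOURS_PER_YEAR
growth_bytes_per_year = new_records_per_year * RECORD_BYTES

TWO_GB = 2 ** 31             # assumed meaning of the "2GB" limit
years_until_limit = (TWO_GB - current_bytes) / growth_bytes_per_year

print("new records/year:", new_records_per_year)
print("growth MB/year:  ", round(growth_bytes_per_year / 1e6, 1))
print("years until 2GB: ", round(years_until_limit, 2))
```

Under these assumptions a single file would cross the limit in well under two years.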

 2. What queries do you need to support?
 3. What response time is acceptable?

I would like 1 second search time on the server + whatever the network transmission time is for OPeNDAP to send me the results. Note that the results are often/usually < 100 KB of data. I am willing to do a lot to get that response time, e.g., store the buoy locations and time ranges in memory. Buoy readings are every hour, but with gaps. So perhaps I would constrain the times for each buoy's reading to be regularly spaced (e.g., missing data would appear as rows of missing values in the file), so that I can very quickly calculate the relevant row(s) of data based on time constraints.
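
With regularly spaced rows, the row lookup reduces to integer arithmetic. Here is a minimal Python sketch, assuming each buoy's first reading time and row count are held in memory; the function name row_range and the example start date are hypothetical.

```python
from datetime import datetime, timezone

HOUR = 3600  # seconds between consecutive rows


def row_range(start_epoch, n_rows, t1, t2, step=HOUR):
    """Return (first, last_exclusive) row indices with t1 <= time <= t2.

    Row i is assumed to hold the reading at start_epoch + i * step; gaps in
    the data are stored as rows of missing values, so no search is needed.
    """
    first = -((start_epoch - t1) // step)       # ceiling of (t1 - start) / step
    last = (t2 - start_epoch) // step + 1       # floor of (t2 - start) / step, plus 1
    first = min(n_rows, max(0, first))
    last = min(n_rows, max(first, last))
    return first, last


def epoch(year, month, day, hour=0):
    """Whole seconds since 1970-01-01T00:00Z for a UTC date and hour."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())


# Hypothetical buoy whose first row is 2004-01-01T00:00Z, with 10000 rows.
start = epoch(2004, 1, 1)
t = epoch(2004, 12, 14, 9)
print(row_range(start, 10000, t, t))
```

A request for a single time point yields a one-row range, and a request entirely outside the buoy's time span yields an empty range, so the server can skip that buoy without reading any data.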

 4. What clients do you want to support? Just your own, or more general?

I guess I only care about my client. But it seems like if I do this right, it will be useful to any client that works with a given server. For me, OPeNDAP is here now and available for no effort on my part. So, any OPeNDAP client can use the .nc file. Presumably, other servers (LAS, THREDDS) could use the file, too, in the future.

 5. What server do you want to use? Does it matter?

It doesn't matter to me, except for ease of use. So I'm strongly inclined to use one of the OPeNDAP servers which is already administered here (by someone else). I'll make the file. They'll serve it.



Sincerely,

Bob Simons
Satellite Data Product Manager
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
(831)658-3205
bob.simons@xxxxxxxx
<>< <>< <>< <>< <>< <>< <>< <>< <><

Follow-Ups:
- Re: Seeking .nc advice for sequence data
  - From: John Caron

References:
- Seeking .nc advice for sequence data
  - From: Bob Simons
- Re: Seeking .nc advice for sequence data
  - From: John Caron
