Re: Seeking .nc advice for sequence data



Gerry Creager N5JXS wrote:
OK, so I'm catching up.

Bob, what you describe is very much what I do with a relational database (spatially aware). Admittedly most of the sites we work with are fixed geographically, but the time series "stuff" is consistent with what you're looking to do. In fact, we're now working with a lot of the buoys...

We maintain a station file with geoposition data (spatially indexed, too), and a data table (or 'n' data tables in the normalization we're attempting now). The two tables are linked by an indexing relation.

I've been able to request (quickly) all site data within a rectangle defined in lat/lon and for a time series, ordered sequentially by date/time stamp.

Is that what you're trying to achieve? I'm not aware of open source tools that will take, say, a PostGIS dataset and turn it into a netCDF file. I suspect that can be engineered, though.

Am I going down the line you're interested in? If so, I can expand a bit. We've been doing this for surface met data for a while.

It sounds like we are doing similar things, but you are querying a relational database and I want to query OPeNDAP (which will get the data from a .nc file). I want to use OPeNDAP because I want to access the data from other computers.

The big question for me has boiled down to: are NetCDF Structures (in the specific sense of the word) only accessible by reading structure-by-structure through the file (because they are stored as a linked list), or can they be accessed individually in an *efficient* way (e.g., give me structure number (a.k.a. record number) 4537 from the file)? I understand that OPeNDAP may be able to give me structure #4537 either way, but I want to know if OPeNDAP can do it *efficiently* (or if it has to read through records #0-4536 to get to #4537).

Stated another way: does using Structures preclude *efficient* indexed access to individual Structures in a file?

Roy Mendelssohn pointed me to this clue: the Unidata Observation Dataset Conventions document indicates that Structures can be stored as a linked list or as a contiguous list. But it is unclear whether that section of the document is relevant (is it just about dealing with multiple parents?), and in any case I don't understand how to set up the .nc file to use one approach or the other. (I understand linked lists; I just don't understand how you are suggesting that I implement them.) It is also unclear to me whether the contiguous list really stores all of a given parent's Structures together on disk, allowing fast, indexed access to each individual Structure, or whether it just stores all the Structures for each parent together (a contiguous list where each element points to a linked list).
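To make concrete what I understand so far, here is a toy Python sketch (not the netCDF API; the records, stations, and field names are all made up) of the two layouts as I picture them:

```python
# Toy contrast of the two layouts (plain Python, not netCDF code).

# Backwards-linked list: each record stores the index of the previous
# record for the same station; -1 terminates the chain.
records = [
    {"station": "A", "value": 1.0, "prev": -1},
    {"station": "B", "value": 2.0, "prev": -1},
    {"station": "A", "value": 3.0, "prev": 0},
    {"station": "A", "value": 4.0, "prev": 2},
]
last_record = {"A": 3, "B": 1}  # per-station head of each chain

def series_linked(station):
    """Walk the chain backwards: O(chain length) reads, no random access."""
    out, i = [], last_record[station]
    while i != -1:
        out.append(records[i]["value"])
        i = records[i]["prev"]
    return out[::-1]

# Contiguous list: each station's records sit together on disk, so record
# k of a station is a single index computation away.
contiguous = {"A": [1.0, 3.0, 4.0], "B": [2.0]}

def record_contiguous(station, k):
    """O(1) random access to one record of one station."""
    return contiguous[station][k]
```

In the linked-list layout, getting record #4537 for a station means following 4537 pointers; in the contiguous layout it is one index computation.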

I'm still confused.


gerry

John Caron wrote:



Bob Simons wrote:



John Caron wrote:

Hi Bob:

We've been working on these kinds of problems, and thinking about how/whether to use sequences. So I'm just going to work through your use case below:

Bob Simons wrote:

I am trying to figure out how to store lots of sequence-like data in .nc files for efficient access via OPeNDAP. In particular, I am trying to determine whether actual OPeNDAP Sequences (Structures with an unlimited dimension in the .nc file) are appropriate for our purposes.

Yes, I could store the data in a file on the computer where the program needing access is running, and not have to access it via OPeNDAP, so that network transmission time would be minimized. But this project is partly an experiment in dealing with remotely accessed data. So I am trying to design a solution where the data is accessed from another computer via OPeNDAP.

Here's an example. Let's say I want to store all NDBC buoy data in a .nc file. There are over 100 buoys. For each buoy, there are readings for some time period (e.g., just 1989, or from 1990 to the present). The readings are an hour apart. Several variables (e.g., WindSpeed and WindDirection) are measured at each time point. Since we work with real-time data, I plan to update this file frequently (every day, but ideally every hour).





How large do you expect the file to get (total number of readings)?

reading == record == one structure in the sequence.




Approximately:
There are 400 buoys * ~8 years of data * 8760 hours/year = ~28,000,000 records.

Given the great variation in the time ranges for each buoy, I will probably arrange it as 400 sequences (one per buoy), with an average of 8 * 8760 = 70,080 records per sequence.

Each record has 1 double and 15 floats; hence
  68 bytes/record * 28,000,000 records = ~1.9GB
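Spelled out (assuming 8-byte doubles and 4-byte floats):

```python
# Back-of-the-envelope size estimate for the proposed file.
buoys = 400
years = 8
hours_per_year = 8760

records = buoys * years * hours_per_year   # about 28 million
bytes_per_record = 8 + 15 * 4              # 1 double + 15 floats = 68 bytes
total_bytes = records * bytes_per_record   # about 1.9 GB
```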



The problem is, I need to have *quick* access via OPeNDAP:
* Across all buoys at a specific time point, e.g., what is the wind speed at all buoys at 2004-12-14T09:00Z?
* Or, for all time points available, what is the wind speed at a specific buoy?





Do you need other queries, like "find all readings with wind speed > 30 mph"?




In general, no.

The most common variant of the first request above is: restrict the request to all buoys in a rectangular geographic region, .... But I can separately manage and subset the geographic locations of the buoys, if needed.
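Since the station table is small, the region subset could live entirely in memory. A toy sketch (the buoy IDs and positions here are illustrative, not real NDBC metadata):

```python
# In-memory station table: buoy ID -> (lat, lon). Positions are illustrative.
stations = {
    "41001": (34.7, -72.7),
    "46042": (36.8, -122.4),
    "46059": (38.0, -130.0),
}

def buoys_in_box(lat_min, lat_max, lon_min, lon_max):
    """Return the IDs of buoys inside the lat/lon rectangle."""
    return sorted(s for s, (lat, lon) in stations.items()
                  if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max)
```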



Regarding the first requirement: from what I understand, if I use sequences, there is no way to get the data for a given time point without reading either the whole file up to that time point or a whole variable. Either would seem to take too long when I want the values for 100 buoys, given that I am using OPeNDAP to connect to a remote computer and want a quick response for my CoastWatch Browser program, which graphs the data for on-line users.





I'm not sure how your specific DODS server works, but there's a good chance that the server has to read the entire file to answer your query. We need to find that out if we are going to figure out how to scale this.




I definitely want to avoid having the server read the entire file for each query. That's why I ask (below) about avoiding structures and just using lots of variables.


In the OPeNDAP world, you can in fact put a CE (constraint expression) on the sequence; e.g., your first query would be something like "time = 2004-12-14T09:00Z", and your second "buoy=2309" (I'll have to check the exact syntax). We don't yet properly support that in the nj22 library, but I think it may not be that hard to do. The hard problem is probably on the server, if it has to read the entire file to answer the query.




I'm not sure why you say it has to read the entire file. If I set up the file in certain ways, can't that be avoided?



it depends on the server.





Since the time range of available data for each buoy varies greatly, it seems grossly wasteful of space to have a common Time dimension for all buoys. Doing so would probably force me over the 2GB file size, which is generally trouble. So I am thinking about either:

* A time dimension for each buoy (e.g., time14978 for buoy 14978) and several variables which use that dimension to store the data for that buoy (e.g., windSpeed14978, windDirection14978, etc.). This setup would be replicated for each buoy.




I am leaning toward this. It is easy to understand. It doesn't use any special features of .nc or OPeNDAP, so it should work with different servers and clients.

* Or, a Group for each buoy, again with a time dimension and several variables in each group to store the data for each buoy. (If this is a new .nc feature, does OPeNDAP deal with it yet?)
* Or, an ArrayObject.1D of variables, each element of which is an ArrayObject.1D of the variables for a given buoy. (I'm not sure if this can be done.)
* Or, an ArrayObject.2D of variables, with buoys as one dimension and the various variables (e.g., WindSpeed, WindDirection) on the other dimension. (I'm not sure if this can be done.)





Our current thinking on how to write netcdf files for "observation data" is written up at:

http://www.unidata.ucar.edu/software/netcdf-java/formats/UnidataObsConvention.html

In particular, appending records using backwards-linked lists seems like a good solution, and it's what we are currently doing with the realtime metar data on motherlode.




I really don't like linked lists. They force each query to go through all the rows (and read all of the data for each row).

I think separate variables are the way to go. They can be made expandable the way Java's ArrayList is expandable: keep a backing array, and track how many elements are currently in use (the size). If you need more capacity, allocate a new, larger array and copy the values into it. Then you get random access to any value in any variable. Further, if you need to evaluate constraints, you only need to read the constraint variables, and even then you can minimize the reads. For example, if I sort a buoy's records by time and have a query like "time >= t1 && time <= t2 && windSpeed > 30", I don't even have to read the windSpeed variable until I find a record in the correct time range. And I never have to read the other variables until I know the constraint expression is satisfied.
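A toy Python sketch of that layout and query order (plain Python lists standing in for netCDF variables; the classes, names, and data are made up):

```python
import bisect
import math

class GrowableVar:
    """ArrayList-style variable: a backing array plus a count of elements in use."""
    def __init__(self, capacity=4):
        self.data = [math.nan] * capacity
        self.size = 0

    def append(self, v):
        if self.size == len(self.data):
            # Out of capacity: allocate a larger backing array and copy.
            self.data = self.data + [math.nan] * len(self.data)
        self.data[self.size] = v
        self.size += 1

    def __getitem__(self, i):
        if i >= self.size:
            raise IndexError(i)
        return self.data[i]  # O(1) random access

# One variable per quantity; records sorted by time (seconds).
time_var, wind_var = GrowableVar(), GrowableVar()
for t, w in [(0, 10.0), (3600, 35.0), (7200, 12.0), (10800, 40.0)]:
    time_var.append(t)
    wind_var.append(w)

def query(t1, t2, min_wind):
    """Evaluate "time >= t1 && time <= t2 && windSpeed > min_wind":
    binary-search the sorted time variable first, then read windSpeed
    only within that row range."""
    lo = bisect.bisect_left(time_var.data, t1, 0, time_var.size)
    hi = bisect.bisect_right(time_var.data, t2, 0, time_var.size)
    return [i for i in range(lo, hi) if wind_var[i] > min_wind]
```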

In fact, part of my reason for raising this question was to figure out why/when structures/sequences are a good approach. I feel like I'm missing something. It looks like they are implemented as linked lists. If so, they don't look like a good data structure to me, because they force all file accesses to go through the whole file. They are efficient for appending data (which you do infrequently and when speed matters less), but inefficient for searches (both sequential searches of one or a few variables, and random access to any datum), which you do frequently and when speed really matters. And there are other data structures (in the style of Java's ArrayList) which are efficient for both writing (random access or appending) and reading (sequential or random access). Comments?



I'm a bit confused: are we talking about the server or the client?

I'm assuming that you want to write a netCDF file that an OPeNDAP server can serve? I don't know of any that would automatically serve sequences. We are looking at adding that to the THREDDS Data Server (TDS), but haven't yet. NetCDF-3 files can only be expanded along one dimension. NetCDF-4 won't have that limitation, but it isn't ready yet. In the UnidataObsConvention, we write along the record dimension as the data comes in; there's a separate linked list for each station. As long as you keep the files moderately small, this isn't a bad solution for modest datasets. We may add an external indexing capability in the TDS for large datasets.

To optimize the search, I would probably not use the linked list, but the "contiguous list" option (see UnidataObsConvention.html). You definitely want to use the record dimension, by the way; without it, access could be a factor of 100 slower. You could have the current data come into a daily file (using linked lists?), then add the daily file to the archive, with periodic rewriting of the file. But none of these are options unless the server can deal with them. So these are just some ideas that I am considering for the TDS server.

I'm afraid I've run out of time to go into more depth. We'll have to continue this after the holidays.
Happy Holidays!






I plan to solve the updating problem by leaving rows of missing values at the end of the data for each active buoy. As new data comes in, I will replace the missing values with actual data. Then, I only have to rewrite the file (to add more rows of missing values) once in a while, not every time.
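A sketch of that update scheme (toy numbers; one variable for one buoy; the fill value and slack size are illustrative):

```python
import math

FILL = math.nan   # stand-in for the netCDF _FillValue
SLACK = 24        # e.g., a day's worth of hourly slots added per rewrite

# Existing data plus preallocated rows of missing values.
values = [1.0, 2.0] + [FILL] * SLACK
n_used = 2

def record(v):
    """Overwrite the next missing-value slot in place; rewrite the
    'file' (here, the list) only when the slack is exhausted."""
    global values, n_used
    if n_used == len(values):
        values = values + [FILL] * SLACK   # the occasional rewrite
    values[n_used] = v
    n_used += 1

record(3.0)   # fills the first missing-value row; no rewrite needed
```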





the above approach, if it works for you, probably obviates this.


Which approach sounds best? Is there another approach? Do you have any advice?

Are sequences the wrong way to go? Of course, that could change if one could efficiently access specific ranges of variables in a Sequence/Structure. But it is my understanding that this is not currently possible.





The DAP 2 spec currently does not allow this. But the whole point of sequences is to allow you to subset using a query (I think it's called a selection) and then only return the data needed, so you don't need index subsetting.




But if the server has to go through the whole file, it will never be fast.



Although I gave this specific example, we store a lot of sequence-like data where I work. Whatever .nc file structure is appropriate for the buoys will likely be appropriate for much of this other data. So I want to get it right.





Right now, I'd say that it depends on what server you are using. Sequences are elegant, but they are a different animal from the indexed access that is the bread and butter of netCDF files.

The critical things to answer first:
 1. How many records will you serve? What about in the future?




Approximately:
400 buoys * ~8 years of data * 8760 hours/year = ~28,000,000 records
Each record has 1 double and 15 floats; hence
  68 bytes/record * 28,000,000 records = ~1.9GB

More than half of the buoys are active so it will grow by about:
 300 buoys * 8760 hours/year = 2,628,000 records / year
 (almost 200MB/year)

Given that it is close to 2GB, I may separate it into a file for inactive buoys and a file for active buoys.

 2. What queries do you need to support?
 3. What response time is acceptable ?




I would like 1 second search time on the server, plus whatever the network transmission time is for OPeNDAP to send me the results. Note that the results are often/usually < 100 KB of data. I am willing to do a lot to get that response time, e.g., store the buoy locations and time ranges in memory. Buoy readings are every hour, but with gaps. So perhaps I would constrain each buoy's readings to be regularly spaced (missing data would appear as rows of missing values in the file), so that I can very quickly calculate the relevant row(s) of data from the time constraints.
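With regularly spaced times, the row calculation becomes trivial. A sketch (the base time and step here are illustrative):

```python
# Map a time constraint directly to row numbers: no scan of the time
# variable is needed if readings are forced onto a regular hourly grid.
BASE = 0      # time of this buoy's first reading (seconds)
STEP = 3600   # one hour between rows

def rows_for(t1, t2):
    """Inclusive row range covering times t1..t2 for this buoy."""
    return (t1 - BASE) // STEP, (t2 - BASE) // STEP
```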

4. What clients do you want to support? Just your own, or more general?




I guess I only care about my client. But it seems like if I do this right, it will be useful to any client that works with a given server. For me, OPeNDAP is here now and available for no effort on my part. So any OPeNDAP client can use the .nc file. Presumably, other servers (LAS, THREDDS) could use the file, too, in the future.

 5. What server do you want to use? Does it matter?




It doesn't matter to me, except for ease of use. So I'm strongly inclined to use one of the OPeNDAP servers which is already administered here (by someone else). I'll make the file. They'll serve it.


Sincerely,

Bob Simons
Satellite Data Product Manager
Environmental Research Division
NOAA Southwest Fisheries Science Center
1352 Lighthouse Ave
Pacific Grove, CA 93950-2079
(831)658-3205
bob.simons@xxxxxxxx
<>< <>< <>< <>< <>< <>< <>< <>< <><




