Re: [netcdf-java] point data

  • To: Lauren E Hay <lhay@xxxxxxxx>
  • Subject: Re: [netcdf-java] point data
  • From: Roland Viger <rviger@xxxxxxxx>
  • Date: Tue, 26 Jan 2010 19:19:09 -0000
Hi John,

I'll try to add a bit to Lauren's response. Hopefully the others will make 
sure I'm not mangling the technology or vision on this. So, yes on all 
three types of queries (including Lauren's additional one), but it might 
be the case that Lauren's case #3 (with a time period) is the only one 
that needs to be supported if we use a metadatabase to answer the rest of 
the questions (location, period of record, data quality, etc) before 
querying the actual data store. We might need to think about this on our 
end a little more.

As for web services, we're expecting to serve all this through THREDDS 
or direct NetCDF reads. As for clients, our first focus will be the ones 
we build ourselves. Homemade web page interfaces and Java applications are 
the most important for the short term. Access to the data from other data 
servers (other instances of THREDDS, RAMADDA, or non-Unidata products like 
ERDDAP) is also on the horizon, but not really in the initial development 
cycle (Nate, Steve, do you agree with this?).  These other data servers 
may or may not be local to the data. We're open to suggestions--as Lauren 
said, we're just expecting to use OPeNDAP and/or direct reads with 
Java-NetCDF. If there are libraries or classes that we should be lifting 
out of IDV or otherwise leveraging, we would be very interested to hear 
about this. We have not really investigated the Unidata display and 
analysis offerings at all. 

I think the integer w/float offset plan sounds good as long as we can 
return the data in the original floating point form. Doesn't seem like 
carrying that transformation out is a big deal. Could it be embedded in 
the creation/streaming of the original NetCDF file that gets returned? 
Would be nice to avoid writing one NetCDF file, reading it, and then 
writing out/streaming the real result. 

Separating current and archived data might be a help, although our data 
set is updated only every couple of months. The "current" thing is not all 
that dynamic for us. Using this idea to break the history into 
conveniently sized blocks optimized for access should probably be our 
focus. Chunking might be good since some of our data sets go back a lot of 
years. I take it that NcML would be used to stitch the chunks together as a 
single, temporally continuous virtual file. 
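As a rough illustration of the stitching idea, the sketch below generates an NcML document that joins time-chunked files along an existing time dimension using a joinExisting aggregation. The helper name and the chunk file names are hypothetical, and it assumes every chunk holds the same variables, shares a dimension named "time", and is listed in time order.

```python
import xml.etree.ElementTree as ET

# Namespace used by NcML 2.2 documents.
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def build_join_existing_ncml(chunk_files, dim_name="time"):
    """Return NcML text that joins chunk_files along dim_name.

    Hypothetical helper: chunk_files is assumed to be in time order.
    """
    root = ET.Element("netcdf", {"xmlns": NCML_NS})
    agg = ET.SubElement(
        root, "aggregation", {"dimName": dim_name, "type": "joinExisting"}
    )
    for path in chunk_files:
        # One nested <netcdf> element per chunk file.
        ET.SubElement(agg, "netcdf", {"location": path})
    return ET.tostring(root, encoding="unicode")


# Hypothetical chunk files, one per decade.
ncml = build_join_existing_ncml(
    ["obs_1980s.nc", "obs_1990s.nc", "obs_2000s.nc"]
)
print(ncml)
```

The resulting text could be saved as a .ncml file and opened in place of the individual chunks. Whether this performs well for our station data is something we would still need to test.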

Part of our question about the arrangements of files is that we've 
normally had the full history of each station in a separate file. We weren't 
sure yet how to use NcML to stitch these together. Rich says you've 
figured out how to do this spatial kind of stitching. We didn't know if 
this was the most efficient or whether to simply regenerate the NetCDF 
files according to other dimensions/variables. Not sure if we're closer to 
answering your question on this. Please let us know.

Roland



From:
Lauren E Hay/WRD/USGS/DOI
To:
John Caron <caron@xxxxxxxxxxxxxxxx>
Cc:
Steven Markstrom <markstro@xxxxxxxx>, netcdf-java 
<netcdf-java@xxxxxxxxxxxxxxxx>, Nate Booth <nlbooth@xxxxxxxx>, Rich 
Signell <rsignell@xxxxxxxx>, Roland Viger <rviger@xxxxxxxx>
Date:
01/25/2010 02:35 PM
Subject:
Re: [netcdf-java] point data




John 
Below are the answers to your questions -- let me know if it's not enough 
info. 
Lauren 
======================================
Lauren E. Hay, Ph.D.            Tel:    (303) 236-7279
U.S. Geological Survey          Fax:  (303) 236-5034
Box 25046, MS 412, DFC      Email: lhay@xxxxxxxx
Lakewood, CO 80225
====================================== 


From: 
John Caron <caron@xxxxxxxxxxxxxxxx> 
To: 
Rich Signell <rsignell@xxxxxxxx> 
Cc: 
netcdf-java <netcdf-java@xxxxxxxxxxxxxxxx>, Roland Viger 
<rviger@xxxxxxxx>, Steven Markstrom <markstro@xxxxxxxx>, Lauren E Hay 
<lhay@xxxxxxxx>, Nate Booth <nlbooth@xxxxxxxx> 
Date: 
01/25/2010 10:21 AM 
Subject: 
Re: [netcdf-java] point data





Hi Rich and all:

This is an interesting challenge: getting good read response on such 
large datasets. 

First, you have to decide what kinds of queries you want to support and 
what kind of response time is needed.  I have generally used the 
assumption that the common queries that you want to optimize are:
 1) get data over a time range for all stations in a lat/lon box.
 2) get data for a single station over a time range, or for all time.
 3) get data with a specified list of stations.
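To make the three query types concrete, here is a minimal in-memory sketch. The record layout and function names are hypothetical and are not part of netcdf-java or THREDDS. It only shows what each query selects, not how to make it fast on large files.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Obs:
    """One observation at one station (hypothetical record layout)."""
    station: str
    lat: float
    lon: float
    time: datetime
    value: float


def in_time_range(obs, start, end):
    # A bound of None means "unbounded", i.e. all time on that side.
    return (start is None or obs.time >= start) and (
        end is None or obs.time <= end
    )


def query_bbox(records, start, end, lat_min, lat_max, lon_min, lon_max):
    """Query 1: time range, all stations inside a lat/lon box."""
    return [
        r for r in records
        if in_time_range(r, start, end)
        and lat_min <= r.lat <= lat_max
        and lon_min <= r.lon <= lon_max
    ]


def query_station(records, station, start=None, end=None):
    """Query 2: a single station over a time range, or for all time."""
    return [
        r for r in records
        if r.station == station and in_time_range(r, start, end)
    ]


def query_station_list(records, stations, start=None, end=None):
    """Query 3: a specified list of stations."""
    wanted = set(stations)
    return [
        r for r in records
        if r.station in wanted and in_time_range(r, start, end)
    ]


# Made-up sample data for illustration only.
records = [
    Obs("A", 40.0, -105.0, datetime(2000, 1, 1), 1.5),
    Obs("A", 40.0, -105.0, datetime(2005, 1, 1), 2.5),
    Obs("B", 45.0, -110.0, datetime(2000, 1, 1), 3.5),
    Obs("C", 30.0, -90.0, datetime(2000, 1, 1), 4.5),
]
```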


Usually I would break the data into multiple files based on time range, 
aiming for a file size of 50-500 MB. I also use a different format for 
current vs archived data, so that the current dataset can be added to 
dynamically, while the archived data is rewritten (once) for speed of 
retrieval. 

Again, it all depends on what queries you want to optimize, so I'll wait for 
your thoughts on that. 
We ran into this problem in the past so we made a separate file for each 
station and each variable. Is there a problem with having too many files? 
Can we have a file by year that only contains stations with data for that 
year? Or -- if we don't care how many files -- 1 file for each station for 
each variable for each year. It does not matter to me. The current project 
will have data that has a set time period. We hope to use this structure 
for other projects that will have file updates as new data is collected. 

Another question is what clients need to access this data. Are you writing 
your own web service, do you just want remote access from IDV, or ?? 
We anticipate that our web services will use the OPeNDAP API. I'm not the 
person to answer this one. 


I would think that if we're careful, we can get netcdf-4 sizes that are 
similar to compressed text, but we'll have to experiment. The data appears 
to be integer or float with a fixed dynamic range, which is amenable to 
storing as an integer with scale/offset. Integer data compresses much 
better than floating point due to the noise in the low bits of the 
mantissa. So one task you should get started on is to examine each field 
and decide its data type. If floating point, decide on its range and the 
number of significant bits.
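For illustration, a minimal sketch of the scale/offset idea follows. It maps a fixed floating point range onto signed 16-bit integer codes and recovers the values with unpacked = packed * scale + offset, which is the relation conventionally stored in the scale_factor and add_offset attributes. The function names, the 16-bit width, and the sample range are assumptions chosen for the example.

```python
def choose_packing(vmin, vmax, nbits=16):
    """Return (scale, offset) mapping [vmin, vmax] onto signed nbits codes.

    The most negative code is left unused so it can serve as a
    missing-value flag (an assumption of this sketch).
    """
    n_intervals = 2 ** nbits - 2
    scale = (vmax - vmin) / n_intervals
    offset = (vmax + vmin) / 2.0
    return scale, offset


def pack(value, scale, offset):
    """Convert a float to its integer code."""
    return int(round((value - offset) / scale))


def unpack(code, scale, offset):
    """Recover the (approximate) original float from its integer code."""
    return code * scale + offset


# Hypothetical field: temperature known to lie between -50 and 60.
scale, offset = choose_packing(-50.0, 60.0)
values = [-12.3, 0.0, 25.7]
codes = [pack(v, scale, offset) for v in values]
restored = [unpack(c, scale, offset) for c in codes]
```

The round trip is lossy: each restored value can differ from the original by up to half of scale, which is why the range and the number of significant bits need to be decided per field.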


