Re: Dealing with large archives
- To: Tennessee Leeuwenburg <address@hidden>
- Subject: Re: Dealing with large archives
- From: Ethan Davis <address@hidden>
- Date: Thu, 03 Feb 2005 18:21:33 -0700
Tennessee Leeuwenburg wrote:
I am trying to work out how to structure my data by date. I will have
a number of data sets (NWP Models) which will get updated daily, or
even multiple times per day. Quite quickly I will reach the point
where I will have hundreds of data sets published. Even a week's worth
of data at 2 per day across 3 sources is 42 data sets.
I have two tasks - one would be to automate the updating of the
configuration files so that new data sets get incorporated as they
become available, and the other would be structuring the data pages in
a sensible way for users to access.
The THREDDS catalog generation tool can automate the generation of
catalogs, but it does not generate aggregation server config files. More
precisely, it can generate the parts that aren't aggregations, i.e., the
plain THREDDS catalog parts of the config file. I've always wanted to
extend it to handle the aggregation part of the aggServer config but
have never gotten around to doing so.
We're currently working on the next release of the THREDDS server. The
OPeNDAP netCDF server side of that should be quite a bit easier to
configure (e.g., give it a directory and it serves all the files in that
directory that match a certain pattern). The configuration for the
aggregation part of the server is still up in the air but it will very
likely be different from the current configuration syntax. This should
get ironed out in the next 3-6 months. In the meantime, you might take
a look at the catalog generator
and see if that helps any.
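
As a rough illustration of the "give it a directory and a pattern" idea,
here is a minimal Python sketch that scans a directory and groups the
matching files by date. It is not THREDDS code. The file naming
convention (source_YYYYMMDD_HH.nc) and the names NAME_PATTERN and
scan_directory are assumptions made up for this example.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical naming convention: <source>_<YYYYMMDD>_<HH>.nc
NAME_PATTERN = re.compile(
    r"^(?P<source>[A-Za-z0-9]+)_(?P<date>\d{8})_(?P<run>\d{2})\.nc$"
)


def scan_directory(directory):
    """Group the files in `directory` that match NAME_PATTERN by date.

    Returns a dict mapping a YYYYMMDD string to a list of
    (source, run, filename) tuples. Non-matching files are ignored.
    """
    by_date = defaultdict(list)
    for path in sorted(Path(directory).iterdir()):
        match = NAME_PATTERN.match(path.name)
        if path.is_file() and match:
            by_date[match["date"]].append(
                (match["source"], match["run"], path.name)
            )
    return dict(by_date)
```

Rerunning a scan like this whenever new files arrive would give a
per-date listing that a script could turn into catalog entries, so the
config does not need to be edited by hand for each new model run.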
I was wondering what practices people might have adopted or found
successful in the past with regard to handling large amounts of data?
Have people typically arranged archive data as aggregations, or linked
to archive catalogs from the top-level catalog? What have people found
For some of our large and/or rapidly changing data collections, we have
set up a data collection subsetting capability. Basically, we have a
document that defines the set of allowed subsetting queries for that
collection and a service that responds to those queries, generally with
a THREDDS catalog of the requested subset. This is pretty alpha stuff
and we haven't really advertised it much, but we find it useful.
Some rough documentation on this is available at
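
To make the subsetting idea a bit more concrete, here is a minimal
Python sketch of the general pattern: a description of which queries are
allowed, plus a function that validates a query and returns the matching
subset. This is not our actual service or its query syntax. The source
names, the record layout, and the function select_subset are all
hypothetical, and a real service would render the result as a THREDDS
catalog rather than return a list.

```python
from datetime import date

# Hypothetical set of sources this collection allows queries on.
ALLOWED_SOURCES = {"modelA", "modelB", "modelC"}


def select_subset(datasets, source, start, end):
    """Return the datasets from `source` dated between start and end.

    `datasets` is a list of dicts with "name", "source", and "date"
    (a datetime.date) keys. Both date bounds are inclusive. Queries
    outside the allowed set are rejected with ValueError.
    """
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"source not allowed: {source!r}")
    if start > end:
        raise ValueError("start date must not be after end date")
    return [
        d for d in datasets
        if d["source"] == source and start <= d["date"] <= end
    ]
```

The point of the design is that users ask for a date range and get back
only the matching entries, so the top-level catalog does not have to
list every data set in the archive.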
Ethan R. Davis                  Telephone: (303) 497-8155
Software Engineer               Fax:       (303) 497-8690
UCAR Unidata Program Center     E-mail:    address@hidden
P.O. Box 3000
Boulder, CO 80307-3000          http://www.unidata.ucar.edu/
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available
through the web. If you do not want to have your interactions made
available in this way, you must let us know in each email you send to us.