Re: Dealing with large archives



Tennessee,
At NOAA's National Climatic Data Center, our NOMADS project (URL below) deals with terabytes of data. We use the GrADS Data Server (and OPeNDAP), wgrib, and other routines to organize and index our data by day (hour/forecast projection) and by model as it comes in (about 150k grids/day), since we archive all of it. You may contact Dan Swank here at NCDC to discuss some specifics; he is cc'ed. Feel free to navigate the site for some additional info regarding our organization...

Glenn
http://nomads.ncdc.noaa.gov/data-access.html


James Gallagher wrote:

Tennessee,

Did you ever get a reply (besides this one :-)?

James

On Jan 31, 2005, at 6:01 PM, Tennessee Leeuwenburg wrote:

Hi guys,

Firstly :

I have "solved" the problem with the bad characters. The problem is that the NetCDF reader that thredds uses itself makes use of the "urlPath" specification when coming back with the DDS and DAS. As such, if I use the "=" character (among others) in the urlPath (even if it's in the path rather than the simple filename), it gets inserted into the DDS/DAS by the NetCDF reader, which causes errors down the track in the parser.

I have worked around the problem by having a separate internalService for each dataset. The "base" section can contain the illegal characters without polluting the DDS/DAS of files read by the NetCDF reader. For the moment this is fine, but it is less than ideal. I may return to it after dealing with more pressing issues. In future I will look at escaping or otherwise encoding the illegal characters, but with those techniques it's tricky to be sure that you've covered all of the cases.
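
The escaping idea mentioned above could look roughly like the following. This is only a minimal sketch using percent-encoding from the Python standard library; the helper names and the example path are hypothetical, and whether the DDS/DAS parser accepts the "%" character in place of "=" is an assumption that has not been checked.

```python
from urllib.parse import quote, unquote


def encode_url_path(path: str) -> str:
    """Percent-encode a urlPath, keeping "/" as the path separator.

    Characters such as "=" become "%3D"; letters, digits and
    "_", ".", "-", "~" are left unchanged.
    """
    return quote(path, safe="/")


def decode_url_path(encoded: str) -> str:
    """Reverse encode_url_path to recover the original path."""
    return unquote(encoded)


# Hypothetical path containing the troublesome "=" character.
original = "models/run=00/gfs.nc"
encoded = encode_url_path(original)
restored = decode_url_path(encoded)
```

Because the encoding is reversible, the original path can always be recovered on the server side, which avoids having to enumerate every illegal character by hand.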

Maybe once everything goes XML the problem will simply disappear, and I can just wait it out :)

Secondly :

I am trying to work out how to structure my data by date. I will have a number of data sets (NWP Models) which will get updated daily, or even multiple times per day. Quite quickly I will reach the point where I will have hundreds of data sets published. Even a week's worth of data at 2 per day across 3 sources is 42 data sets.

I have two tasks - one would be to automate the updating of the configuration files so that new data sets get incorporated as they become available, and the other would be structuring the data pages in a sensible way for users to access.
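
As a rough illustration of the first task, the sketch below regenerates a nested, date-ordered catalog fragment from a list of model runs. It is a sketch under stated assumptions, not a definitive thredds configuration: the "dataset" element with "name" and "urlPath" attributes follows the terms used in this thread, while the source names, the file naming scheme, and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
import xml.etree.ElementTree as ET


def build_catalog(runs):
    """Group (source, run_time) pairs by day and build nested dataset elements."""
    by_day = defaultdict(list)
    for source, run_time in runs:
        by_day[run_time.date()].append((source, run_time))

    root = ET.Element("dataset", name="NWP archive")
    # Newest day first, so recent runs appear at the top of the page.
    for day in sorted(by_day, reverse=True):
        day_el = ET.SubElement(root, "dataset", name=day.isoformat())
        for source, run_time in sorted(by_day[day]):
            stamp = run_time.strftime("%Y%m%d_%H")
            ET.SubElement(
                day_el,
                "dataset",
                name=f"{source} {run_time:%H}Z",
                urlPath=f"{source}/{stamp}.nc",  # hypothetical file layout
            )
    return root


# Hypothetical runs: two sources, some with two runs per day.
runs = [
    ("modelA", datetime(2005, 1, 31, 0)),
    ("modelA", datetime(2005, 1, 31, 12)),
    ("modelB", datetime(2005, 1, 30, 0)),
]
catalog = build_catalog(runs)
xml_text = ET.tostring(catalog, encoding="unicode")
```

Running a script like this whenever new data arrives would keep the configuration in step with the archive, and the per-day grouping gives users one level of navigation before they reach individual runs.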

I was wondering what practices people might have adopted or found successful in the past with regard to handling large amounts of data? Have people typically arranged archive data as aggregations, or linked to archive catalogs from the top-level catalog? What have people found best?

Cheers,
-Tennessee

--
James Gallagher                jgallagher at opendap.org
OPeNDAP, Inc                   406.783.8663


--
Glenn K. Rutledge
Meteorologist / Physical Scientist
National Oceanic and Atmospheric Administration
National Climatic Data Center
151 Patton Ave
Asheville, North Carolina 28801
(828) 271-4097



NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.