hundreds of thousands of files will be an interesting test of the TDS (maybe too interesting, like it may not work!).
in principle we can aggregate GRIB files just fine; in practice there are some problems. Current aggregation assumes that the files are completely homogeneous: each has exactly the same variables and coordinate systems. in practice GRIB files often have missing records, which breaks that assumption.
we are working on this problem in a new "forecast model run" aggregation, that will tolerate missing records. We hope to have something to try by end of July. How homogeneous do you think your archive is?
with such a large number of files, you probably don't want to use a scan element (too slow). It may be best to explicitly list all the files in the aggregation (the catalog would just point to the ncml document). How often are files added to the archive?
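One way to build such an explicit file list is to generate the NcML document with a small script rather than by hand. The following is only a sketch under assumptions: the function name `build_ncml`, the `.grb` suffix, and `ncoords="1"` (one time coordinate per file, as the example file in this thread suggests) are illustrative choices, not anything the TDS requires.

```python
import os
from xml.sax.saxutils import quoteattr

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def build_ncml(root, suffix=".grb", dim_name="time"):
    """Return NcML text for a joinExisting aggregation that lists
    every file under `root` whose name ends with `suffix`."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit YYYYMM/YYYYMMDD directories in name order
        for name in sorted(filenames):
            if name.endswith(suffix):
                paths.append(os.path.join(dirpath, name))

    lines = [
        '<netcdf xmlns="%s">' % NCML_NS,
        '  <aggregation dimName=%s type="joinExisting">' % quoteattr(dim_name),
    ]
    for path in paths:
        # ncoords="1" is an assumption: one time coordinate per file.
        lines.append('    <netcdf location=%s ncoords="1"/>' % quoteattr(path))
    lines.append("  </aggregation>")
    lines.append("</netcdf>")
    return "\n".join(lines)
```

Rerunning the script whenever files are added keeps the list current. Whether an aggregation of this size performs acceptably is a separate question that the script does not answer.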
for the random example i looked at (http://nomads.ncdc.noaa.gov/data/narr/200605/20060520/narr-a_221_20060520_0900_000.grb)
it looks like each file has only one time coordinate (?)
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="file://test/temperature/jan.nc" ncoords="1"/>
  </aggregation>
</netcdf>
let me run a few tests to see how this works here.
The NARR is a reanalysis, so it doesn't have forecast times. It would be a simple 03 hr chain (00 hr fct time) spanning 26 years.
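Since the homogeneity of the archive came up, a quick check for holes in the 3-hourly chain can be sketched as follows. It assumes every file follows the naming pattern of the one example quoted in this thread (narr-a_221_YYYYMMDD_HHMM_000.grb); the regular expression and function names are hypothetical. It only detects missing files, not missing records inside a file.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern, generalized from the single example file name in this thread.
NAME_RE = re.compile(r"narr-a_221_(\d{8})_(\d{4})_000\.grb$")


def analysis_time(filename):
    """Parse the analysis time from a file name, or return None if it does not match."""
    match = NAME_RE.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M")


def missing_times(filenames, step=timedelta(hours=3)):
    """Return the expected times between the first and last file that have no file."""
    present = {t for t in map(analysis_time, filenames) if t is not None}
    if not present:
        return []
    first, last = min(present), max(present)
    missing = []
    current = first
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing
```

Feeding it the names from a directory listing gives a list of the time steps that an aggregation would have to tolerate or skip.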
See an existing GDS subset aggregation: http://nomads.ncdc.noaa.gov:9091/dods/NCEP_NARR_DAILY/narr-a_221_tmpprs.subset.info This will give a sense for the nature of the beast.
The directory structure is set up as such: http://nomads.ncdc.noaa.gov/data/narr/
Here's the TDS aggregation I set up while experimenting yesterday, on an unrelated dataset:
<dataset name="OceanWinds Test Daily Aggregation" ID="test/dailyagg" urlPath="test/agg">
  <serviceName>allTest</serviceName>
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <aggregation dimName="time" type="joinNew">
      <variableAgg name="wind" />
      <scan dateFormatMark="#yyyyMMdd" location="/eclipse1a/ftp/pub/seawinds/SI/daily/netcdf/1980s/" suffix=".nc" />
      <scan dateFormatMark="#yyyyMMdd" location="/eclipse1a/ftp/pub/seawinds/SI/daily/netcdf/1990s/" suffix=".nc" />
      <scan dateFormatMark="#yyyyMMdd" location="/eclipse1a/ftp/pub/seawinds/SI/daily/netcdf/2000s/" suffix=".nc" />
    </aggregation>
    <variable name="time" orgName="time">
      <attribute name="long_name" value="Days"/>
      <attribute name="units" value="days since 1987-07-09" />
    </variable>
  </netcdf>
</dataset>
Would this automatically detect that the source data were GRIB rather than NetCDF? Also, it seems like you need to set the <scan> on each individual directory. Doing so the way NARR is set up would create one chunky configuration file. Is there any way to have this scan a pattern (YYYYMM/YYYYMMDD) of directories?
I understand GRIB requires a certain amount of "supplemented" metadata for compliance. Where do you enter this?
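If one <scan> per directory does turn out to be necessary, the elements could at least be generated rather than typed. The sketch below assumes a YYYYMM/YYYYMMDD tree like the one described for NARR and a `.grb` suffix; `day_directories` and `scan_elements` are hypothetical helper names, and the output would still have to be pasted into an aggregation element.

```python
import os
import re
from xml.sax.saxutils import quoteattr

MONTH_RE = re.compile(r"^\d{6}$")  # YYYYMM
DAY_RE = re.compile(r"^\d{8}$")    # YYYYMMDD


def day_directories(root):
    """Return the YYYYMM/YYYYMMDD directories under `root`, in name order."""
    days = []
    for month in sorted(os.listdir(root)):
        month_path = os.path.join(root, month)
        if not (MONTH_RE.match(month) and os.path.isdir(month_path)):
            continue
        for day in sorted(os.listdir(month_path)):
            day_path = os.path.join(month_path, day)
            # A day directory must belong to its parent month.
            if DAY_RE.match(day) and day.startswith(month) and os.path.isdir(day_path):
                days.append(day_path)
    return days


def scan_elements(root, suffix=".grb"):
    """Return one <scan> element (as text) per day directory."""
    return [
        "<scan location=%s suffix=%s />" % (quoteattr(path + "/"), quoteattr(suffix))
        for path in day_directories(root)
    ]
```

For 26 years of daily directories this still yields thousands of scan elements, so it addresses the typing burden but not the size of the configuration file.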
Ethan Davis wrote the following on 6/13/2006 1:21 PM:
Aggregation should work the same for GRIB as for netCDF files. The issue would be how your GRIB files are structured and how you want to aggregate them. Our GRIB files each contain one full model run (all parameters, all forecast times). We haven't tried aggregating beyond that.
We have started tracking what is available for the NCEP models on our server. This is from the TDS 3.8 announcement (with links updated):
We also are now tracking detailed inventory of NCEP model output, eg:
These are all linked from the "collection dataset" pages; For
choose the top "CONUS_12_km" link, then choose "Available Inventory" Documentation.
One idea for this work is to eventually provide access to alternate datasets, for instance, a dataset that contains all the 3hr forecast times from the different runs, or one that contains all the 12Z valid times from the different runs. Tracking these detailed inventories is just the first step, but aggregation and alternate groupings of the data are pretty interesting to think about.
How are your GRIB files structured and what kind of aggregation were you thinking about?
I've been tinkering with the TDS aggregation capabilities and they work quite well for NetCDF data; however, I can't seem to find anything in the docs regarding aggregating GRIB. We want to get the NARR dataset, which we have here at NCDC-NOMADS, on the TDS. It consists of hundreds of thousands of 50 MB+ GRIB files in a YYYYMM/YYYYMMDD tree. Just scouting for a quick answer here: Is aggregating the NARR GRIB feasible with the current release of TDS? If so, do any docs exist which could give me a starting point? Converting it to NetCDF will not be possible (volume).
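To make the two alternate groupings concrete, here is a small sketch of how inventory entries might be selected. It assumes each entry is a (run time, forecast hour) pair; that representation and the function names are illustrative only and do not reflect how the TDS inventory is actually stored.

```python
from datetime import datetime, timedelta


def valid_time(run_time, forecast_hour):
    """Valid time is the model run time plus the forecast offset."""
    return run_time + timedelta(hours=forecast_hour)


def constant_offset(records, forecast_hour):
    """All entries with the same forecast offset, e.g. every 3 hr forecast."""
    return sorted(r for r in records if r[1] == forecast_hour)


def constant_valid_hour(records, hour_utc):
    """All entries whose valid time falls on the given UTC hour, e.g. 12Z."""
    selected = [r for r in records if valid_time(*r).hour == hour_utc]
    # Order by valid time first, then by run time.
    return sorted(selected, key=lambda r: (valid_time(*r), r[0]))
```

With two runs at 00Z and 12Z on the same day, the 12Z valid-time group contains both the 12 hr forecast from the 00Z run and the 0 hr analysis from the 12Z run.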
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.