Re: [thredds] GRIB collection keeps re-scanning, data extraction extremely slow

Hi Hein:

The first time that TDS 4.3 reads a GRIB file, it has to create index files (*.gbx9 and *.ncx). That will be very slow the first time, then much faster after that. Check to see if that's what's happening: wait until all the index files are created, and then see how fast things are.
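One quick way to check whether indexing is still in progress is to compare the number of GRIB files with the number of .gbx9 sidecars written next to them. A minimal sketch, assuming the default behaviour of writing indices alongside the data files (the function name and directory layout here are illustrative):

```shell
# Sketch: report whether every GRIB2 file under a directory tree has a
# .gbx9 index sidecar yet. Assumes indices are written next to the data
# files (the default); the function name is just for illustration.
check_indexing() {
    dir="$1"
    grib_count=$(find "$dir" -name '*.grib2' | wc -l | tr -d ' ')
    gbx9_count=$(find "$dir" -name '*.gbx9' | wc -l | tr -d ' ')
    echo "grib=$grib_count gbx9=$gbx9_count"
    # succeed only once there is an index for every data file
    [ "$gbx9_count" -ge "$grib_count" ]
}
```

For example, `check_indexing /output/operational/atmosphere/ncep/gefs/1.0deg`; once the two counts match, re-test the access speed.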

John

On 8/29/2013 6:52 AM, Hein Zelle wrote:
Dear all,

coming back to my question on NCEP grib aggregation a while back:
John Caron wrote:

What you are seeing is the limitations of aggregations. In this
case, there are 3 different time coordinates in the collection, but
NcML aggregation can only aggregate on one of them.  You want to use
feature collections instead. Replace your entire <dataset> element
with something like:

<featureCollection name="myCollectionName" featureType="GRIB" path="grib/NCEP/GFS/etc">
  <collection spec="/pub/data/nccf/com/gfs/prod/gfs.2013060600/gfs\.t00z\.pgrb2f..$"
              dateFormatMark="#prod/gfs.#yyyyMMddHH" />
</featureCollection>

I've finally implemented the GRIB feature collection, and it seems to
be working.  I can access my grib files (although indexing of the full
NCEP ensemble takes a while!) and the data comes out OK (a time series
for a single location, for each ensemble member).

I'm experiencing a problem though: data extraction is extremely slow.
I'm comparing to my old situation where I listed all ensemble member
grib files in a single ncml file.  A data extraction for one location
(all members, for a single 15 day forecast) took 5 minutes in thredds
4.1 with this system.

In the new situation (thredds 4.3), I use a grib feature collection
sorted by directory (forecast cycle).  After half an hour, the
extraction is still not done.  In the featureCollectionScan.log I can
see that thredds keeps scanning all folders:

[2013-08-29T12:40:28.786+0000] INFO thredds.inventory.MFileCollectionManager: 2013082618 : was scanned MCollection{name='2013082618', dirName='/output/operational/atmosphere/ncep/gefs/1.0deg/2013082618', wantSubdirs=true, ff=WildcardMatchOnPath{wildcard=null regexp=gefs\..*\.f.*\.grib2$}}

It does this for ALL forecast cycle folders (I have about 20), even
though I am accessing only the 2013082900 directory.  Could anyone
give me tips on how to prevent thredds from continuously re-scanning
the whole directory structure with grib files?

Current setup:

<featureCollection name="gefs_col" featureType="GRIB" path="ncep/gefs/1.0deg">
  <!-- be specific here with the file selector, other grib2 files may be hanging around in the tree -->
  <collection spec="/output/operational/atmosphere/ncep/gefs/1.0deg/**/gefs\..*\.f.*\.grib2$"
              dateFormatMark="#0deg/#yyyyMMddHH"
              timePartition="directory"
              name="gefs_col_unique" />
  <update startup="true" trigger="allow"/>
</featureCollection>


This organizes the data the way I want: I get a single URL per cycle:
      .../thredds/dodsC/ncep/gefs/1.0deg/2013082900/best
      .../thredds/dodsC/ncep/gefs/1.0deg/2013082800/best
      .../thredds/dodsC/ncep/gefs/1.0deg/2013082700/best

The data comes out the way I want, but as mentioned above it's
_extremely_ slow, likely due to re-scanning of the disk structure.
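When timing these extractions, it can help to take the client out of the loop and hit the server directly, for instance through the NetCDF Subset Service if it is enabled. A sketch that just builds the request URL (the host, dataset path, variable name, and coordinates are placeholders; check the dataset's NCSS form for the real variable names):

```shell
# Sketch: build an NCSS point-extraction URL so a request can be timed
# with curl. All arguments are placeholders to adapt to your own server.
ncss_point_url() {
    server="$1"; dataset="$2"; var="$3"; lat="$4"; lon="$5"
    echo "${server}/thredds/ncss/${dataset}?var=${var}&latitude=${lat}&longitude=${lon}&accept=csv"
}

# Example (not run here):
#   time curl -s -o /dev/null "$(ncss_point_url http://localhost:8080 \
#       ncep/gefs/1.0deg/2013082900/best Temperature_height_above_ground 52.0 4.8)"
```

Timing the same point extraction against the old NcML aggregation and the new feature collection makes the comparison concrete.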

I don't really need automatic updating; a manual trigger when a new
forecast is downloaded would be fine too.  I would prefer thredds to
scan and index the grib files only once, upon a manual trigger.
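For what it's worth, with trigger="allow" in the <update> element, a rescan can be requested remotely on an admin endpoint rather than letting the server rescan on its own schedule. The exact URL and the required authentication vary by TDS version, so treat the path below as an assumption to verify against the documentation for your release:

```shell
# Sketch (assumption -- verify the endpoint against your TDS version's
# docs): build the URL for remotely triggering a rescan of a collection
# whose <update> element has trigger="allow". The path is hypothetical.
trigger_url() {
    server="$1"; collection="$2"
    echo "${server}/thredds/admin/collection/trigger?collection=${collection}"
}

# Example (not run here; typically requires an authenticated admin user):
#   curl -u admin "$(trigger_url http://localhost:8080 gefs_col)"
```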

Any hints on how to improve this?


Kind regards,
      Hein Zelle



Date: Thu, 6 Jun 2013 11:30:36 +0200
From: Hein Zelle <hein.zelle@xxxxxxxxxxxxx>
To: thredds@xxxxxxxxxxxxxxxx
Subject: Re: [thredds] aggregating GFS data, problem with accumulated

Dear John,

attached to this email is a complete NcML file that we place next to
the data files.  The data files themselves are too big to upload, but
they are standard GFS GRIB2 files; you can find them at

ftp://ftpprd.ncep.noaa.gov/pub/data/nccf/com/gfs/prod/gfs.2013060600

(that's for this morning, modify the date as needed)
The files are the GRIB2 files at 0.5 degree resolution, e.g. gfs.t00z.pgrb2bf30
(about 50 MB each).

The previous snippet of NcML I sent should also work; you'll have to modify
the paths to the correct folder, of course.  A variable to check is, for
example:

Total_precipitation_surface_3_Hour_Accumulation

These should have multiple time steps, but I get only 1 time step (the
first, for the +03 forecast). The +00 analysis doesn't contain the
precipitation fields.  Any variable with an accumulation or averaging
interval exhibits the problem.


Kind regards,
      Hein Zelle






