Re: [thredds] joinExisting and FMRC aggregation performance

  • To: "Signell, Richard" <rsignell@xxxxxxxx>
  • Subject: Re: [thredds] joinExisting and FMRC aggregation performance
  • From: Roy Mendelssohn - NOAA Federal <roy.mendelssohn@xxxxxxxx>
  • Date: Sat, 14 Mar 2015 11:06:23 -0700
We have some aggregations with more files than what you mention.  The problem 
is probably when a new file or files are added, so that the aggregation has to 
be updated.  That doesn't happen until there is a data request, so the first 
data request after an update takes a long time (some of ours take minutes to 
reaggregate).  So the trick is to have the HF radar folks make the first data 
request themselves after each update.

We do something like this by running NetCheck (freely available at 
http://coastwatch.pfeg.noaa.gov/coastwatch/NetCheck.html).  Files that update 
frequently get checked frequently.  Sometimes a request comes in before 
NetCheck does its check, but that is rare on our servers.  Alternatively, you 
can have whatever script produces the update file then make a request itself.  
As I said, if you make the request on the local machine using a "localhost" 
address, the request will not time out and the (re)aggregation will complete.
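A minimal sketch of this warm-up trick as a cron entry. The dataset path, port, and schedule are hypothetical; the point is only that the request originates from localhost, so it will not time out while the aggregation rebuilds:

```shell
# Hypothetical crontab entry: ten minutes past each hour, after new
# hourly files land, request the aggregation's DAP metadata (.das)
# from localhost so the TDS rebuilds the aggregation before any
# outside user hits it.
10 * * * * curl -s -o /dev/null "http://localhost:8080/thredds/dodsC/HFRADAR/agg.das"
```

Any request against the aggregated dataset forces the (re)aggregation; fetching the small .das response avoids pulling data.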

For example, 
http://oceanview.pfeg.noaa.gov/thredds/dodsC/Model/FNMOC/6hr_pressure.html has 
over 70,000 files. Of course, now that I have linked it, as soon as you go to 
it the response will take forever! But I just tested it and it was pretty 
quick.

The FMRC, as it is being implemented, speeds this all up by creating index 
files and then running a separate process (the TDM) to update the aggregations 
without waiting for a data request.

-Roy



On Mar 14, 2015, at 10:39 AM, John Caron <caron@xxxxxxxx> wrote:

> I find it amazing that things work on that large of an NcML or even FMRC 
> collection. Just goes to show what I know. Anyway, I'm about to embark on 
> studying where the bottlenecks are. 
> 
> The code isn't so much poorly written as it simply wasn't designed with high 
> scalability in mind. The solution is to write persistent "index" files so 
> that, once indexed, the logical "collection datasets" can be accessed very 
> quickly. I'm going to take what I have been doing for GRIB and apply it to 
> netCDF, and to GRID data in general.
> 
> An NcML aggregation like a joinExisting may be specified inside the catalog 
> config or outside in a separate NcML file and referenced in a dataset or 
> datasetScan. In both cases, nothing is done until it is requested by a user. 
> At that point, if the dataset has already been constructed, is in the TDS 
> cache, and doesn't need updating, then it's fast. 
> 
> A featureCollection has a new set of functionality to update the dataset in 
> the background. FMRC does some extra "persistent caching" (making some of the 
> info persist between TDS restarts).  Still not enough, but better than NcML. 
> GRIB collections now do this well. However, if the collection is changing, a 
> separate process (TDM) will handle updating and notifying the TDS. That keeps 
> the code from getting too complex and greatly simplifies getting the object 
> caching right.
> 
> Read-optimized netcdf-4 files are an elegant solution indeed. Dave, maybe 
> sometime you could share your workflow in some place we could link to in our 
> documentation?
> 
> 
> 
> On Sat, Mar 14, 2015 at 10:47 AM, Signell, Richard <rsignell@xxxxxxxx> wrote:
> John,
> 
> > NcML Aggregations should only be used for small collections of files ( a few
> > dozen?) , because they are created on-the-fly.
> 
> The HFRADAR data is using a joinExisting aggregation in a THREDDS
> catalog.   Is that what you are calling NcML aggregation?
> I was thinking that NcML aggregation referred to the practice of
> writing an NcML file and dropping that into a folder along with the
> data files where it can be picked up by a DatasetScan.
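For reference, a joinExisting aggregation of the kind described above, placed inline in the catalog config, looks roughly like this (the dataset name, urlPath, scan location, and recheck interval are made up for illustration):

```xml
<dataset name="HF Radar hourly aggregation" ID="hfradar_agg" urlPath="HFRADAR/agg">
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <!-- Join the files along their existing time dimension;
         rescan the directory for new files every 15 minutes. -->
    <aggregation dimName="time" type="joinExisting" recheckEvery="15 min">
      <scan location="/data/hfradar/" suffix=".nc" />
    </aggregation>
  </netcdf>
</dataset>
```

The same `<netcdf>` element could instead live in a standalone .ncml file picked up by a datasetScan; either way, as noted above, the aggregation is not built until a request arrives.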
> 
> > FMRC does a better job of
> > caching information so things go quicker. It handles the case of a single
> > time dimension as a special case of a Forecast model collection. However,
> > they too are limited in how much they will scale up, (< 100 ?)
> >
> > So how many files and variables are in the HF Radar collection?
> 
> There are currently 27,986 NetCDF files in the aggregation, each with
> a single time record containing the HF radar data for the hour.    It
> seems that the FMRC is handling this just fine, with reliable WMS
> response times of about one second.
> 
> As Dave Blodgett points out, a better approach here might be to
> periodically combine a bunch of these hourly files into, say, monthly
> files, which would result in higher performance, less utilization of
> disk space, and quicker aggregation.
> 
> I still don't understand what is happening with the joinExisting
> aggregation, however -- why it periodically (but not regularly) takes
> 50 seconds or more to respond.
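The monthly-combining idea mentioned above can be sketched in two steps: group the hourly files by month, then concatenate each batch (for example with NCO's ncrcat) into a single monthly file. A small sketch of the grouping step, assuming a hypothetical `hfradar_YYYYMMDDHH.nc` filename pattern:

```python
from collections import defaultdict

def group_hourly_by_month(filenames):
    """Group hourly files (hypothetical hfradar_YYYYMMDDHH.nc naming)
    into per-month batches ready for concatenation."""
    batches = defaultdict(list)
    for name in sorted(filenames):
        # "hfradar_2015031410.nc" -> timestamp "2015031410" -> month "201503"
        stamp = name.split("_")[1]
        batches[stamp[:6]].append(name)
    return dict(batches)

files = ["hfradar_2015022823.nc", "hfradar_2015030100.nc", "hfradar_2015030101.nc"]
for month, batch in group_hourly_by_month(files).items():
    # Each batch would then be concatenated, e.g.:
    #   ncrcat <batch...> hfradar_<month>.nc
    print(month, batch)
```

Fewer, larger files mean fewer datasets for the aggregation to scan and open, which is where the speedup comes from.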
> 
> --
> Dr. Richard P. Signell   (508) 457-2229
> USGS, 384 Woods Hole Rd.
> Woods Hole, MA 02543-1598
> 
> _______________________________________________
> thredds mailing list
> thredds@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit: 
> http://www.unidata.ucar.edu/mailing_lists/

**********************
"The contents of this message do not reflect any position of the U.S. 
Government or NOAA."
**********************
Roy Mendelssohn
Supervisory Operations Research Analyst
NOAA/NMFS
Environmental Research Division
Southwest Fisheries Science Center
***Note new address and phone***
110 Shaffer Road
Santa Cruz, CA 95060
Phone: (831)-420-3666
Fax: (831) 420-3980
e-mail: Roy.Mendelssohn@xxxxxxxx www: http://www.pfeg.noaa.gov/

"Old age and treachery will overcome youth and skill."
"From those who have been given much, much will be expected" 
"the arc of the moral universe is long, but it bends toward justice" -MLK Jr.


