Re: [thredds] joinExisting and FMRC aggregation performance

  • To: "Signell, Richard" <rsignell@xxxxxxxx>
  • Subject: Re: [thredds] joinExisting and FMRC aggregation performance
  • From: John Caron <caron@xxxxxxxx>
  • Date: Sat, 14 Mar 2015 11:39:33 -0600
I find it amazing that things work on such a large NcML or even FMRC
collection. Just goes to show what I know. Anyway, I'm about to embark on
studying where the bottlenecks are.

The code isn't so much poorly written as it simply wasn't designed with high
scalability in mind. The solution is to write persistent "index" files so
that, once indexed, the logical "collection datasets" can be very quickly
accessed. I'm going to take what I have been doing in GRIB and apply it to
netCDF, and GRID data in general.
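
To make the "index file" idea concrete, here is a minimal sketch in Python.
It is not TDS or GRIB collection code. The function names, the JSON layout,
and the read_metadata callback are all hypothetical, and it assumes that a
file's size and modification time are enough to tell whether it has changed.

```python
import json
import os
from pathlib import Path


def scan_collection(data_dir, suffix=".nc"):
    """Return {filename: [size, mtime]} for every matching file (cheap)."""
    entries = {}
    for path in sorted(Path(data_dir).glob(f"*{suffix}")):
        stat = path.stat()
        entries[path.name] = [stat.st_size, stat.st_mtime]
    return entries


def load_or_build_index(data_dir, index_path, read_metadata):
    """Reuse the saved index where possible; open only new or changed files.

    read_metadata(path) is a hypothetical callback standing in for the
    expensive step of opening a file and reading its coordinates.
    Returns (index, number_of_files_opened).
    """
    current = scan_collection(data_dir)
    try:
        with open(index_path) as f:
            saved = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        saved = {}  # no usable index yet: everything gets indexed

    index = {}
    opened = 0
    for name, fingerprint in current.items():
        old = saved.get(name)
        if old is not None and old["fingerprint"] == fingerprint:
            index[name] = old  # unchanged: no need to open the file
        else:
            index[name] = {
                "fingerprint": fingerprint,
                "metadata": read_metadata(os.path.join(data_dir, name)),
            }
            opened += 1

    # Persist the index so it survives a restart.
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index, opened
```

With something like this, the first pass over a collection opens every file,
but later passes (including after a restart) only stat the files and open the
ones that are new or modified.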

An NcML aggregation like a joinExisting may be specified inside the catalog
config or outside in a separate NcML file and referenced in a dataset or
datasetScan. In both cases, nothing is done until it is requested by a
user. At that point, if the dataset has already been constructed and is in
the TDS cache, and doesn't need updating, then it's fast.
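
The request path described above can be sketched roughly as follows. This is
an illustration only, not the TDS cache implementation: the class name and the
build and needs_update callbacks are hypothetical stand-ins for constructing
the aggregation and for checking whether it is out of date.

```python
class LazyDatasetCache:
    """Build a dataset on first request and reuse it until it goes stale."""

    def __init__(self, build, needs_update):
        self._build = build                # expensive: scans and opens files
        self._needs_update = needs_update  # cheap check, e.g. compare mtimes
        self._cache = {}

    def get(self, name):
        dataset = self._cache.get(name)
        if dataset is None or self._needs_update(name, dataset):
            # Slow path: the requesting user waits while the dataset is built.
            dataset = self._build(name)
            self._cache[name] = dataset
        return dataset
```

In this sketch nothing happens until get() is called, a cached and current
dataset is returned immediately, and whichever request finds the dataset
missing or stale pays the full construction cost.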

A featureCollection has a new set of functionality to update the dataset in
the background. FMRC does some extra "persistent caching" (making some of the
info persist between TDS restarts). Still not enough, but better than
NcML. GRIB collections now do this well. However, if the collection is
changing, a separate process (TDM) will handle updating and notifying the
TDS. That keeps the code from getting too complex and greatly simplifies
getting the object caching right.

Read-optimized netcdf-4 files are an elegant solution indeed. Dave, maybe
sometime you could share your workflow in some place we could link to in
our documentation?



On Sat, Mar 14, 2015 at 10:47 AM, Signell, Richard <rsignell@xxxxxxxx>
wrote:

> John,
>
> > NcML Aggregations should only be used for small collections of files (a few
> > dozen?), because they are created on-the-fly.
>
> The HFRADAR data is using a joinExisting aggregation in a THREDDS
> catalog.   Is that what you are calling NcML aggregation?
> I was thinking that NcML aggregation referred to the practice of
> writing an NcML file and dropping that into a folder along with the
> data files where it can be picked up by a DatasetScan.
>
> > FMRC does a better job of
> > caching information so things go quicker. It handles the case of a single
> > time dimension as a special case of a Forecast model collection. However,
> > they too are limited in how much they will scale up, (< 100 ?)
> >
> > So how many files and variables are in the HF Radar collection?
>
> There are currently 27,986 NetCDF files in the aggregation, each with
> a single time record containing the HF radar data for the hour.    It
> seems that the FMRC is handling this just fine, with reliable WMS
> response times of about one second.
>
> As Dave Blodgett points out, a better approach here might be to
> periodically combine a bunch of these hourly files into, say, monthly
> files, which would result in higher performance, less utilization of
> disk space, and quicker aggregation.
>
> I still don't understand what is happening with the joinExisting
> aggregation, however -- why it periodically (but not regularly) takes
> 50 seconds or more to respond.
>
> --
> Dr. Richard P. Signell   (508) 457-2229
> USGS, 384 Woods Hole Rd.
> Woods Hole, MA 02543-1598
>