[thredds] Fwd: NCML dataset aggregation caching issues

Tiago,

The email below describes the issues Jordan and I encountered.  We ended up 
using the construct in "Attempt #2" and still need to rename the generated 
cache files, because the file name used for cache reads is not the same as the 
one used for cache writes (so the cache read always misses).  To make the 
caching work with this construct, one needs to either hit the leaf 
aggregations separately (to generate correctly named cache files) *or* hit 
the root aggregation and then rename the cache files.
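
For the rename workaround, here is a minimal Python sketch, not a tested 
tool.  It assumes the naming pattern reported in the quoted message below 
(cache file written as <prefix>#<name>, but read as <name>).  The function 
names are hypothetical, and the cache directory depends on your own THREDDS 
configuration.

```python
from pathlib import Path


def read_name_for(written_name):
    """Return the name the cache read is reported to look for, or None.

    Assumption: writes use '<prefix>#<name>' and reads use '<name>'.
    Names without '#', or ending in '#null' (the Attempt #1 pattern),
    are left alone.
    """
    _prefix, sep, name = written_name.partition("#")
    if not sep or name in ("", "null"):
        return None
    return name


def rename_cache_files(cache_dir, dry_run=True):
    """Rename '<prefix>#<name>' files in cache_dir to '<name>'.

    Returns a list of (old_name, new_name) pairs.  With dry_run=True
    nothing is changed on disk, so the plan can be inspected first.
    """
    renamed = []
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        target_name = read_name_for(path.name)
        if target_name is None:
            continue
        target = path.with_name(target_name)
        if not dry_run:
            # replace() overwrites any stale file already at the target name.
            path.replace(target)
        renamed.append((path.name, target.name))
    return renamed


# Example (hypothetical path; substitute your aggregation cache directory):
# for old, new in rename_cache_files("/path/to/agg/cache", dry_run=True):
#     print(old, "->", new)
```

Running it once with dry_run=True shows which files would be renamed before 
anything is touched.  The rename has to be repeated whenever the cache files 
are regenerated.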

Tom

> 
> Background:
> 
> We are attempting to host some datasets representing daily climate 
> predictions.  We have datasets from 4 models each with the output of 2 
> scenarios with outputs for 3 variables (precip, temp_min and temp_max).  Each 
> variable is contained in a single file with data for a 10 year period.  All 
> data sets are using the same time units, days since 12-31-1959.  Each 
> scenario contains ~120 GB of data stored in NetCDF 3 format using a CF 
> gridded dataset convention.  We've been tasked with presenting the data for 
> each model/scenario pair as a single dataset.
> 
> For clarity, here is how we are representing this data  on disk (1 model and 
> 1 scenario, all variables and time periods):
> 
> model1.scenario1/
>       precip/
>               model1.scenario1.precip.1960.1969.nc
>               model1.scenario1.precip.1970.1979.nc
>               ...
>               model1.scenario1.precip.2090.2099.nc
>       temp_min/
>               model1.scenario1.temp_min.1960.1969.nc
>               model1.scenario1.temp_min.1970.1979.nc
>               ...
>               model1.scenario1.temp_min.2090.2099.nc
>       temp_max/
>               model1.scenario1.temp_max.1960.1969.nc
>               model1.scenario1.temp_max.1970.1979.nc
>               ...
>               model1.scenario1.temp_max.2090.2099.nc
> 
> 
> 
> Attempt #1:
> 
> Description:  Aggregate as a single NcML file with internal nested 
> aggregations, let's assume the file's path is 
> /data/model1.scenario1.internal.ncml
> 
> ===  /data/model1.scenario1.internal.ncml ===
> <?xml version="1.0" encoding="UTF-8"?>
> <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
>       <aggregation type="union">
>               <netcdf>
>                       <aggregation type="joinExisting" dimName="time">
>                               <scan location="model1.scenario1/precip/" suffix=".nc"/>
>                       </aggregation>
>               </netcdf>
>               <netcdf>
>                       <aggregation type="joinExisting" dimName="time">
>                               <scan location="model1.scenario1/temp_min/" suffix=".nc"/>
>                       </aggregation>
>               </netcdf>
>               <netcdf>
>                       <aggregation type="joinExisting" dimName="time">
>                               <scan location="model1.scenario1/temp_max/" suffix=".nc"/>
>                       </aggregation>
>               </netcdf>
>       </aggregation>
> </netcdf>
> ======
> 
> Observation:  No caching.  The same behavior exists whether using <scan /> or 
> multiple explicit <netcdf location="..." /> elements.
> 
> Investigation:  Each nested/leaf aggregation (for precip, temp_min or 
> temp_max) results in a cache read attempt on a cache file named 
> file-data-model1.scenario1.internal.ncml#null.  If this file exists, it is 
> most likely unusable because the netcdf cache dataset ids will not match.  
> This happens because every nested/leaf cache uses the same file name.  As 
> each nested/leaf aggregation is processed, it overwrites the cached result of 
> the prior nested/leaf aggregation ('temp_max' overwrites 'temp_min', which 
> overwrites 'precip').
> 
> 
> 
> Attempt #2:
> 
> Description:  Aggregate with multiple NcML files with nested aggregations 
> contained in separate files.
> 
> === /data/model1.scenario.external.ncml ===
> <?xml version="1.0" encoding="UTF-8"?>
> <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
>       <aggregation type="union">
>               <netcdf id="precip" location="model1.scenario.precip.ncml"/>
>               <netcdf id="temp_min" location="model1.scenario.temp_min.ncml"/>
>               <netcdf id="temp_max" location="model1.scenario.temp_max.ncml"/>
>       </aggregation>
> </netcdf>
> === /data/model1.scenario.precip.ncml ===
> <?xml version="1.0" encoding="UTF-8"?>
> <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
>       <aggregation type="joinExisting" dimName="time">
>               <scan location="model1.scenario1/precip/" suffix=".nc"/>
>       </aggregation>
> </netcdf>
> ======
> 
> Observation:  No caching.
> 
> Investigation:  The caching code attempts to read 
> file-data-model1.scenario1.precip.ncml but then writes to 
> file-data-model1.scenario1#file-data-model1.scenario1.precip.ncml.  Somehow 
> the cache name of the instance changes between 
> AggregationExisting.persistRead() and persistWrite().  The same object 
> instance is used for both calls, but somewhere the cache name 
> changes...  You can rename the created cache files or generate the cache by 
> hitting the external NcML files individually.  Unless you catch this, your 
> performance will suffer...
> 
> 
> Tom Kunicki
> Center for Integrated Data Analytics
> U.S. Geological Survey
> 8505 Research Way
> Middleton, WI  53562
> 
> 
>