Re: [thredds] cache options for large aggregated datasets

To: thredds@xxxxxxxxxxxxxxxx
Subject: Re: [thredds] cache options for large aggregated datasets
From: tnb@xxxxxxxxxxxxxxxx
Date: Mon, 02 May 2011 15:05:03 -0300

Quoting thredds-request@xxxxxxxxxxxxxxxx:

Send thredds mailing list submissions to
        thredds@xxxxxxxxxxxxxxxx

To subscribe or unsubscribe via the World Wide Web, visit
        http://mailman.unidata.ucar.edu/mailman/listinfo/thredds
or, via email, send a message with subject or body 'help' to
        thredds-request@xxxxxxxxxxxxxxxx

You can reach the person managing the list at
        thredds-owner@xxxxxxxxxxxxxxxx

When replying, please edit your Subject line so it is more specific
than "Re: Contents of thredds digest..."


thredds mailing list
thredds@xxxxxxxxxxxxxxxx

For list information or to unsubscribe, visit: http://www.unidata.ucar.edu/mailing_lists/


Today's Topics:

   1. Re: thredds Digest, Vol 27, Issue 33 (tnb@xxxxxxxxxxxxxxxx)
   2. Fwd: NCML dataset aggregation caching issues (Tom Kunicki)


----------------------------------------------------------------------

Message: 1
Date: Thu, 28 Apr 2011 19:19:27 -0300
From: tnb@xxxxxxxxxxxxxxxx
To: thredds@xxxxxxxxxxxxxxxx
Subject: Re: [thredds] thredds Digest, Vol 27, Issue 33
Message-ID: <20110428191927.16675lhs237nt9f4@xxxxxxxxxxxxxxxxxxxxxxxx>
Content-Type: text/plain; charset=ISO-8859-1; DelSp="Yes";
        format="flowed"


Jordan,
when you say to create separate ncml files for each joinExisting
aggregation, do you do this just to get the cache files generated by
thredds, and afterwards merge them manually into one cache file?
Is that right?

Thanks again!

------------------------------

Message: 3
Date: Mon, 25 Apr 2011 21:22:54 -0500
From: Jordan Walker <jiwalker@xxxxxxxx>
To: thredds@xxxxxxxxxxxxxxxx
Subject: Re: [thredds] cache options for large aggregated datasets
Message-ID: <4DB62C7E.6050609@xxxxxxxx>
Content-Type: text/plain; charset="iso-8859-1"

Tiago,
We've noticed this same problem with a couple of our datasets.  The
problem lies in the fact that you are doing a union of several
joinExisting aggregations.  When you run the aggregation you will get an
aggregation file for the union, but if you look at it, the variables
will match the last joinExisting aggregation in that union (it may look
like $ncml_file#null).  What we have done to fix this is create separate
ncml files for each joinExisting aggregation, and a single ncml file for
the union.  If possible, run the individual aggregations on their own to
generate the cache files; otherwise, have the union run, copy the cache
file to match what it should be for each joinExisting, and alter the
contents to match that aggregation.

This is not an ideal solution, so I welcome other suggestions that solve
it better.  But this is a solution that will bridge the gap until this
type of aggregation is better supported.
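
A minimal sketch of that layout, applied to the two variables from the
NCEP II catalog entry quoted further down in this digest, might look like
the following.  The .ncml file names and the /data/REANALISE/NCEP/ncml/
directory are hypothetical, chosen only for illustration; the element
names and the data file paths are the ones that appear in this thread.

```xml
<!-- File 1 (hypothetical path): /data/REANALISE/NCEP/ncml/air.ncml
     Kept in its own file so it can be opened on its own to generate
     its cache file. -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting">
    <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1979.nc"/>
    <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1980.nc"/>
    <!-- remaining years up to air.2009.nc are listed the same way -->
  </aggregation>
</netcdf>

<!-- File 2 (hypothetical path): /data/REANALISE/NCEP/ncml/ncep2.union.ncml
     The union only points at the per-variable files; hgt.ncml is assumed
     to be built like air.ncml above. -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <netcdf location="/data/REANALISE/NCEP/ncml/air.ncml"/>
    <netcdf location="/data/REANALISE/NCEP/ncml/hgt.ncml"/>
  </aggregation>
</netcdf>
```

Each of the two parts is meant to be saved as a separate file.  Opening
air.ncml and hgt.ncml individually before the union is what would let
each joinExisting write its own cache file, as described above.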

--
Jordan Walker
Center for Integrated Data Analytics
U.S. Geological Survey
8505 Research Way
Middleton, WI  53562
jiwalker@xxxxxxxx
http://cida.usgs.gov <http://cida.usgs.gov/>

On 04/25/2011 06:18 PM, tnb@xxxxxxxxxxxxxxxx wrote:

Today's Topics:

   1. Re: cache options for large aggregated datasets (John Caron)


----------------------------------------------------------------------

Message: 1
Date: Fri, 22 Apr 2011 13:30:31 -0600
From: John Caron <caron@xxxxxxxxxxxxxxxx>
To: thredds@xxxxxxxxxxxxxxxx
Subject: Re: [thredds] cache options for large aggregated datasets
Message-ID: <4DB1D757.80700@xxxxxxxxxxxxxxxx>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

On 4/22/2011 11:21 AM, tnb@xxxxxxxxxxxxxxxx wrote:

Hi everybody!

I have installed Tomcat 6 and THREDDS 4.2, and everything is working
fine. I just have some questions about performance when accessing a
large aggregated dataset.

I am serving some aggregated data (about 38 GB), and when I try to
access the Dataset Access Form from the catalog, THREDDS takes too long
to show me the page.


can you send the aggregation element?


hi john!

I will send just a small part of it (because it has too many lines), with
only two variables; the full aggregation has a lot more.

    <dataset name="NCEP II - test" ID="ncep2-test" urlPath="reanalise/ncep2.nc">
    <serviceName>odap</serviceName>
    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <aggregation type="union">
      <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation dimName="time" type="joinExisting">
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1979.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1980.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1981.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1982.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1983.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1984.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1985.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1986.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1987.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1988.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1989.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1990.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1991.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1992.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1993.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1994.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1995.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1996.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1997.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1998.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.1999.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2000.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2001.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2002.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2003.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2004.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2005.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2006.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2007.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2008.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/air.2009.nc"/>
    </aggregation>
    </netcdf>
      <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
      <aggregation dimName="time" type="joinExisting">
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1979.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1980.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1981.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1982.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1983.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1984.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1985.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1986.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1987.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1988.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1989.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1990.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1991.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1992.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1993.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1994.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1995.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1996.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1997.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1998.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.1999.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2000.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2001.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2002.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2003.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2004.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2005.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2006.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2007.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2008.nc"/>
      <netcdf location="/data/REANALISE/NCEP/NCEP-II/hgt.2009.nc"/>
    </aggregation>
    </netcdf>
     .
     .
     .


I tried enabling all the cache options (NetcdfFileCache, AggregationCache
and NetcdfDatasetCache), but it still takes a long time (about
3 minutes) to open the Data Access Form and consequently to access the
data.


does that happen only the first time? what about the second time ?


Once I open the Data Access Form, the second time is fast, but after
some time (some hours), if I try to access the same dataset again, it
takes as long to open as it did the first time.


Is this behavior normal with this size of aggregate dataset?

This is my cache options in threddsConfig.xml

...
<AggregationCache>
<scour>-1 hours</scour>


why -1 ?


I put -1 because my aggregations never change, so I don't want the
cache files to be deleted.

From the THREDDS documentation
(http://www.unidata.ucar.edu/projects/THREDDS/tech/tds4.2/reference/ThreddsConfigXMLFile.html)
"Every scour amount of time, any item that hasnt been changed since
maxAge time will be deleted. Set scour to -1 to not scour if you have
aggregations that never change. Otherwise, make maxAge longer than the
longest time between changes. Basically, you dont want to remove
active aggregations."


Are the cache file names OK? Does the #null at the end mean anything?

<maxAge>30 days</maxAge>
</AggregationCache>

<NetcdfFileCache>
<minFiles>200</minFiles>
<maxFiles>400</maxFiles>
<scour>30 min</scour>
</NetcdfFileCache>

<NetcdfDatasetCache>
<minFiles>100</minFiles>
<maxFiles>200</maxFiles>
<scour>30 min</scour>
</NetcdfDatasetCache>
...

Did I forget some additional setup option?

Also, I have noticed that in the directory
$TOMCAT/content/thredds/cache/agg, the cache file names
end with #null (e.g. reanalise-ncep1.nc#null). Did something go wrong
during the creation of the cache?


Thanks for your attention!


Tiago Bomventi



End of thredds Digest, Vol 27, Issue 29
***************************************




thanks again!

Tiago Bomventi



End of thredds Digest, Vol 27, Issue 33
***************************************







------------------------------

Message: 2
Date: Thu, 28 Apr 2011 21:52:24 -0500
From: Tom Kunicki <tkunicki@xxxxxxxx>
To: thredds@xxxxxxxxxxxxxxxx
Subject: [thredds] Fwd: NCML dataset aggregation caching issues
Message-ID: <399C9FB4-9A8B-487E-AA26-BF741362CAA8@xxxxxxxx>
Content-Type: text/plain; charset="us-ascii"

Tiago,

The email below describes the issues Jordan and I encountered. We
ended up using the construct in "Attempt #2" and still need to
rename the generated cache files, as the file name used for cache
reads is not the same as the one used for cache writes (so the cache
read will always miss). To make the caching work with this construct
one needs to hit the leaf aggregations separately (to generate the
correctly named cache files) *or* hit the root aggregation and rename
the cache files.

Tom

Background:
We are attempting to host some datasets representing daily climate
predictions. We have datasets from 4 models each with the output
of 2 scenarios with outputs for 3 variables (precip, temp_min and
temp_max). Each variable is contained in a single file with data
for a 10 year period. All data sets are using the same time units,
days since 12-31-1959. Each scenario contains ~120 GB of data
stored in NetCDF 3 format using a CF gridded dataset convention.
We've been tasked with presenting the data for each model/scenario
pair as a single dataset.

For clarity, here is how we are representing this data on disk (1
model and 1 scenario, all variables and time periods):
model1.scenario1/
        precip/
                model1.scenario1.precip.1960.1969.nc
                model1.scenario1.precip.1970.1979.nc
                ...
                model1.scenario1.precip.2090.2099.nc
        temp_min/
                model1.scenario1.temp_min.1960.1969.nc
                model1.scenario1.temp_min.1970.1979.nc
                ...
                model1.scenario1.temp_min.2090.2099.nc
        temp_max/
                model1.scenario1.temp_max.1960.1969.nc
                model1.scenario1.temp_max.1970.1979.nc
                ...
                model1.scenario1.temp_max.2090.2099.nc



Attempt #1:
Description: Aggregate as a single NcML file with internal nested
aggregations, let's assume the file's path is
/data/model1.scenario1.internal.ncml
===  /data/model1.scenario1.internal.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
        <aggregation type="union">
                <netcdf>
                        <aggregation type="joinExisting" dimName="time">
                                <scan location="model1.scenario1/precip/" suffix=".nc"/>
                        </aggregation>
                </netcdf>
                <netcdf>
                        <aggregation type="joinExisting" dimName="time">
                                <scan location="model1.scenario1/temp_min/" suffix=".nc"/>
                        </aggregation>
                </netcdf>
                <netcdf>
                        <aggregation type="joinExisting" dimName="time">
                                <scan location="model1.scenario1/temp_max/" suffix=".nc"/>
                        </aggregation>
                </netcdf>
        </aggregation>
</netcdf>
======
Observation: No caching. The same behavior exists whether using
<scan /> or multiple explicit <netcdf location="..." /> elements.
Investigation: Each nested/leaf aggregation (for precip, temp_min
or temp_max) results in a cache read attempt on a cache file named
file-data-model1.scenario1.internal.ncml#null. If this file exists
it is most likely unusable as the netcdf cache dataset ids will not
match. This is a result of each nested/leaf cache utilizing the
same file name. As each nested/leaf aggregation is processed it
overwrites the cached result of the prior nested/leaf aggregation
('temp_max' overwrites 'temp_min' overwrites 'precip').
Attempt #2:
Description: Aggregate with multiple NcML files with nested
aggregations contained in separate files.
=== /data/model1.scenario.external.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
        <aggregation type="union">
                <netcdf id="precip" location="model1.scenario.precip.ncml"/>
                <netcdf id="temp_min" location="model1.scenario.temp_min.ncml"/>
                <netcdf id="temp_max" location="model1.scenario.temp_max.ncml"/>
        </aggregation>
</netcdf>
=== /data/model1.scenario.precip.ncml ===
<?xml version="1.0" encoding="UTF-8"?>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
        <aggregation type="joinExisting" dimName="time">
                <scan location="model1.scenario1/precip/" suffix=".nc"/>
        </aggregation>
</netcdf>
======

Observation:  No caching.
Investigation: The caching code attempts to read
file-data-model1.scenario1.precip.ncml but then writes to
file-data-model1.scenario1#file-data-model1.scenario1.precip.ncml.
Somehow the cache name of the instance changes in-between
AggregationExisting.persistRead() and persistWrite(). The same
object instance is used for both these calls, but somewhere the
cache name changes... You can rename the created cache files or
generate a cache by hitting the external ncmls individually; unless
you catch this, your performance will suffer...
Tom Kunicki
Center for Integrated Data Analytics
U.S. Geological Survey
8505 Research Way
Middleton, WI  53562



Tom and Jordan,
I will study your solution!

Thanks for the help!

cheers!

Tiago Bomventi
