Re: [thredds] Aggregating Large NetCDF Datasets with Restricted Access

  • To: Kevin Manross <manross@xxxxxxxx>
  • Subject: Re: [thredds] Aggregating Large NetCDF Datasets with Restricted Access
  • From: "Antonio S. Cofiño" <antonio.cofino@xxxxxxxxx>
  • Date: Fri, 20 Dec 2013 21:59:19 +0100
Yes,

because the JoinExisting aggregation needs to open each file to determine its time coordinate values. As you mention, the trade-off is that the NcML can become large, and you may run into issues when it has too many elements.

The other solution is the one mentioned in the Aggregation tutorial: if there is exactly one time slice in each file of the JoinExisting aggregation, and you are using a scan element to dynamically scan the files in a directory, then you can use the dateFormatMark attribute to derive the date from the filename (see the sketch below).

But it depends on the layout of your dataset.
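
For illustration only, a minimal sketch of that approach, assuming monthly files with exactly one time slice each; the location and the dateFormatMark pattern are guesses based on your example filenames (u10_RCPP_2004_11.nc), not a tested configuration:

  <aggregation dimName="time" type="joinExisting">
    <!-- time coordinate is parsed from the filename; pattern assumes names like u10_RCPP_2004_11.nc -->
    <scan location="/glade/p/rda/data/ds601.0/RCPP/1995_2005/u10/"
          suffix=".nc" dateFormatMark="u10_RCPP_#yyyy_MM"/>
  </aggregation>

With dateFormatMark the time coordinate comes from the filename, so the aggregation does not need to open every file just to discover the coordinate values.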

Regards

Antonio

--
Antonio S. Cofiño
Grupo de Meteorología de Santander
Dep. de Matemática Aplicada y
       Ciencias de la Computación
Universidad de Cantabria
http://www.meteo.unican.es


On Friday, 20 December 2013 21:47:41, Kevin Manross wrote:

Thanks Antonio!

I'll definitely give this idea a shot.

Is there any performance hit if I list several thousand files in the
catalog (as opposed to scanning the directory)?

Thanks again!

-kevin.

On 12/20/13 1:34 PM, "Antonio S. Cofiño" wrote:
Kevin,

To improve the JoinExisting aggregation you can replace the inner
scan element with explicit netcdf elements, one for each file you want
to aggregate, and add the ncoords or coordValue attribute to each
netcdf element, as explained in the "Defining coordinates on
a JoinExisting aggregation" section of the Aggregation document:
http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/v2.2/Aggregation.html
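
For example, a minimal sketch (the file names and ncoords values are only illustrative, not taken from your dataset):

  <aggregation dimName="time" type="joinExisting">
    <!-- ncoords values are placeholders: the number of time steps in each file -->
    <netcdf location="/data/u10/u10_RCPP_2004_11.nc" ncoords="720"/>
    <netcdf location="/data/u10/u10_RCPP_2004_12.nc" ncoords="744"/>
  </aggregation>

ncoords gives the number of time steps in each file (alternatively, coordValue lists the coordinate values themselves), so the aggregation does not have to open every file just to size the time dimension.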


Also be sure the aggregation cache is configured in the TDS configuration (threddsConfig.xml):
http://www.unidata.ucar.edu/software/thredds/current/tds/tds4.3/reference/ThreddsConfigXMLFile.html#AggregationCache
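
For reference, the relevant block in threddsConfig.xml looks roughly like this (the directory and ages below are placeholders to adjust for your installation):

  <AggregationCache>
    <!-- placeholder directory; point this at persistent disk -->
    <dir>/data/content/thredds/cache/agg/</dir>
    <scour>24 hours</scour>
    <maxAge>90 days</maxAge>
  </AggregationCache>

With the cache enabled the TDS persists the coordinate information it reads for each aggregation, instead of rescanning everything on every request.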


I hope this helps.

Regards

Antonio



--
Antonio S. Cofiño
Grupo de Meteorología de Santander
Dep. de Matemática Aplicada y
        Ciencias de la Computación
Universidad de Cantabria
http://www.meteo.unican.es

On 20/12/2013 19:28, Kevin Manross wrote:

Seasons Greetings!

I really wish we didn't have these restrictions on the data, but that's
what I'm dealing with, so please bear with me.

We have some large (33 TB, 840 GB, etc.) netCDF datasets that I am
trying to aggregate.  Many are in a "time series" layout (i.e., a single
parameter grid spread across many time steps [files], such as
u10/u10_RCPP_2004_11.nc, u10/u10_RCPP_2004_12.nc, etc.).

I initially tried a large nested aggregation such as:

<dataset name="ds601.0-Agg"
ID="ds601.0-AGG"
      & nbsp;&nbs p; urlPath="ds601.0/10/best"
harvest="true">
<metadata inherited="true">
<serviceName>all</serviceName>
<dataFormat>NetCDF</dataFormat>
<dataType>GRID</dataType>
</metadata>
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2";>
<!--attribute name="title" type="string" value="20th Century
Simulation Yearly Timeseries-Parameter Aggregations"/-->
<aggregation type="Union">
<netcdf>
<aggregation dimName="time" type="joinExisting">
<scan location="/glade/p/rda/data/ds601.0/RCPP/1995_2005/glw/"
suffix=".nc" subdirs="true"/>
             </aggregation>
</netcdf>
<netcdf>
<aggregation dimName="time" type="joinExisting">
  &n bsp; ; <scan
location="/glade/p/rda/data/ds601.0/RCPP/1995_2005/graupel/"
suffix=".nc" subdirs="true"/>
</aggregation>
</netcdf>
<netcdf>
<aggregation dimName="time" type="joinExisting">
<scan location="/glade/p/rda/data/ds601.0/RCPP/1995_200 5/olr/"
suffix=".nc" subdirs="true"/>
</aggregation>
</netcdf>
<netcdf>
          &nb sp; <aggregation dimName="time" type="joinExisting">
<scan location="/glade/p/rda/data/ds601.0/RCPP/1995_2005/psfc/"
suffix=".nc" subdirs="true"/>
</aggregation>
</netcdf>
                      ...
                      ...
                      ...
</aggregation>
</netcdf>
</dataset>


This takes a long time to build the cache file, and upon each
revisit it goes through the process of rebuilding the file.
Honestly, it is unusable this way from a user standpoint. However,
everything works with the restrictions I have set up via the Tomcat
DataSourceRealm and webapps/thredds/WEB-INF/web.xml.
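
For context, the restriction is a standard servlet security-constraint in web.xml, authenticated against the DataSourceRealm; the url-pattern and role name below are placeholders rather than my exact settings:

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>restricted aggregations</web-resource-name>
      <!-- placeholder pattern covering the restricted dataset URLs -->
      <url-pattern>/dodsC/aggregations/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <!-- placeholder role defined in the DataSourceRealm -->
      <role-name>restrictedDatasetUser</role-name>
    </auth-constraint>
  </security-constraint>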

Mike McDonald had a really slick way to aggregate and cache the
parameter timeseries files, and then build the union on demand (see
his response to the thread '"Too Many Open Files" Error. Dataset too
big?' on 28 October 2013).  So, using his example, I reformatted my
catalog as follows:

       <dataset name="Full Aggregation of ds601.0"
         ID="ds601.0-AGG"
         urlPath="aggregations/ds601.0/10/best"
         harvest="true">
         <serviceName>all</serviceName>
         <netcdf
xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2";>
           <aggregation type="Union">
               <netcdf
location="dods://localhost:8080/thredds/dodsC/internal/ds601.0/101/glw"/>
               <netcdf
location="dods://localhost:8080/thredds/dodsC/internal/ds601.0/102/graupel"/>
               <netcdf
location="dods://localhost:8080/thredds/dodsC/internal/ds601.0/103/olr"/>
               <netcdf
location="dods://localhost:8080/thredds/dodsC/internal/ds601.0/104/psfc"/>
                ...
                ...
                ...
           </aggregation>
         </netcdf>
       </dataset>

       <dataset name="internal/ds601.0 Aggregation (glw)"
         ID="internal/ds601.0/101/glw"
         urlPath="internal/ds601.0/101/glw"

         harvest="true">
         <serviceName>all</serviceName>
         <netcdf
xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2";>
             <aggregation dimName="time" type="joinExisting">
                <scan
location="/data/glade/p/rda/data/ds601.0/RCPP/1995_2005/glw/"
suffix=".nc" subdirs="true"/>
             </aggregation>
           </netcdf>
       </dataset>


       <dataset name="internal/ds601.0 Aggregation (graupel)"
         ID="internal/ds601.0/102/graupel"
         urlPath="internal/ds601.0/102/graupel"

         harvest="true">
         <serviceName>all</serviceName>
         <netcdf
xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2";>
             <aggregation dimName="time" type="joinExisting">
                <scan
location="/data/glade/p/rda/data/ds601.0/RCPP/1995_2005/graupel/"
suffix=".nc" subdirs="true"/>
             </aggregation>
           </netcdf>
       </dataset>

       <dataset name="internal/ds601.0 Aggregation (olr)"
         ID="internal/ds601.0/103/olr"
         urlPath="internal/ds601.0/103/olr"

         harvest="true">
         <serviceName>all</serviceName>
         <netcdf
xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2";>
             <aggregation dimName="time" type="joinExisting">
                <scan
location="/data/glade/p/rda/data/ds601.0/RCPP/1995_2005/olr/"
suffix=".nc" subdirs="true"/>
             </aggregation>
           </netcdf>
       </dataset>


       <dataset name="internal/ds601.0 Aggregation (psfc)"
         ID="internal/ds601.0/104/psfc"
         urlPath="internal/ds601.0/104/psfc"

         harvest="true">
         <serviceName>all</serviceName>
         <netcdf
xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2";>
             <aggregation dimName="time" type="joinExisting">
                <scan
location="/data/glade/p/rda/data/ds601.0/RCPP/1995_2005/psfc/"
suffix=".nc" subdirs="true"/>
             </aggregation>
           </netcdf>
       </dataset>

        ...
        ...
        ...

This sped things up immensely and the server is very responsive;
however, I can't seem to get the authorization to work with the
internal Union aggregation.

I have attempted a number of things, such as:

+ Following section "2. Restrict by Dataset using TDS Catalog" of
https://www.unidata.ucar.edu/software/thredds/current/tds/reference/RestrictedAccess.html
  for each joinExisting aggregation (see the sketch after this list)

+ Adding a valid username/password to the URL in the netcdf location
  value of the Union call:

  <aggregation type="Union">
    <netcdf location="dods://USERNAME:PASSWORD@localhost:8080/thredds/dodsC/internal/ds601.0/101/glw"/>

+ trying the above with an http:// protocol
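
The sketch mentioned in the first item, restricting each joinExisting dataset with the restrictAccess attribute described in the RestrictedAccess document (the role name here is just a placeholder, and the rest of the dataset element is as shown above):

  <!-- role name is a placeholder defined in my realm -->
  <dataset name="internal/ds601.0 Aggregation (glw)"
           ID="internal/ds601.0/101/glw"
           urlPath="internal/ds601.0/101/glw"
           restrictAccess="restrictedDatasetUser"
           harvest="true">
    ...
  </dataset>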

The only thing that seems to work is to leave the joinExisting
aggregations unrestricted, but keep the restriction on the Union
aggregation.


I would like to do any of the following:

1) Hide the joinExisting aggregations (links) from the web browser

2) Since the joinExisting aggregations are only needed to populate
the Union aggregation "internally" to the TDS, somehow relax the
restrictions when they are called from within the TDS on localhost

3) Somehow authorize the joinExisting aggregations within the Union
aggregation

4) Hear of an alternative way to efficiently aggregate the
timeseries parameters and then combine those aggregated timeseries.

If this simply can't be done, that is also helpful information,
and I'll leave the aggregated timeseries (joinExisting) unrestricted.

-kevin.

--
Kevin Manross
NCAR/CISL/Data Support Section
Phone: (303)-497-1218
Email: manross@xxxxxxxx
Web: http://rda.ucar.edu


_______________________________________________
thredds mailing list
thredds@xxxxxxxxxxxxxxxx
For list information or to unsubscribe,  visit:
http://www.unidata.ucar.edu/mailing_lists/


--
Kevin Manross
NCAR/CISL/Data Support Section
Phone: (303)-497-1218
Email: manross@xxxxxxxx
Web: http://rda.ucar.edu


