Re: [thredds] How to create / load aggregation cache

Hi Jordi,

Please find some comments and questions inline.



On 19/06/2023 16:51, Jordi Domingo Ballesta wrote:
Dear TDS team,

I would like to know if it is possible to (pre-)create the aggregation cache and make thredds load it, in order to speed up the first time a dataset is requested.

To give a bit of context, our situation is the following:
- We have a big archive of 265TB of data and 5 million files, distributed in 1000 datasets (aprox).
Quite a challenge!!

Is this all in just one catalog with 1000 datasets? If so, that is a big catalog for both the server and the client to parse. If not, what is the granularity of your catalogs: how many catalogs, how many datasets per catalog, and how many files per dataset/aggregation?

- These datasets are in NetCDF format (mostly v4, some v3).
OK
- We run TDS version 5.4.
OK
- We configured thredds to provide access to them via "http" and "odap" services, both directly (with "datasetScan") and as aggregated datasets.
Only the DAP service (i.e. odap) is appropriate for aggregations. Both HTTP and DAP are possible for plain files.
- The configuration needs to be updated regularly (at least every day) as new files come in while others are deleted.
- We have serious performance issues with access to aggregated datasets, especially the first time they are accessed.
yes!! this can be a challenge
In order to improve that, we tried configuring the catalogs with an explicit list of files for each dataset, including the "ncoords" attribute, or even the "coordValue" attribute with the time value of each file (they are joinExisting aggregations based on the time dimension). That improved the performance of the first access substantially, but the duration is still not "acceptable" to the users.

This is the best first option.

What does "not acceptable" mean ... or more exactly, what would be "acceptable" ...

This first approach should be enough, since you are using NetCDF-3/4, but I'm curious how many files are in each "dataset".
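For reference, the explicit-list variant described above looks roughly like this inside a catalog dataset element (a sketch only: the dataset name, paths and coordinate values are made up, and coordValue must be given in the units of the time coordinate stored in the files):

```xml
<!-- Sketch of an explicit joinExisting aggregation in a TDS catalog.
     Paths and coordValue values are hypothetical examples. -->
<dataset name="Example aggregation" ID="agg/example" urlPath="agg/example">
  <serviceName>odap</serviceName>
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
    <aggregation dimName="time" type="joinExisting">
      <!-- ncoords tells TDS how many time steps each file holds, so it
           does not need to open the file on the first request -->
      <netcdf location="/data/example/f_20230601.nc" ncoords="1" coordValue="0"/>
      <netcdf location="/data/example/f_20230602.nc" ncoords="1" coordValue="24"/>
    </aggregation>
  </netcdf>
</dataset>
```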


I tried to pre-create the cache files in thredds/cache/aggNew/ directory with the same content as when they are created by thredds, but it seems that thredds is ignoring them when loading, and just recreating its own version again. I also noticed that the cache database in thredds/cache/catalog/ directory plays a role as well, but I do not understand the relation between that and the aggregation cache files.

This could be a good approach, but it's tricky ... I have never managed to make it work, as you have just learnt.
Anyway, do you recommend any practice to improve the performance of thredds the first time a dataset is accessed? Maybe throwing a one-time request for the time variable at each dataset, in order to force thredds to create and load the cache?
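Such a warm-up pass could be as simple as the sketch below, which prints one OPeNDAP request for the time variable of each aggregated dataset (pipe the output to sh to actually run it). The server URL and dataset paths are placeholders, not real endpoints:

```shell
# Sketch of a one-time cache warm-up: print a curl command that fetches
# only the time variable of each aggregation via OPeNDAP.
# SERVER and the dataset paths are hypothetical placeholders.
SERVER="https://tds.example.org/thredds/dodsC"
for ds in agg/datasetA agg/datasetB; do
  echo curl -sf "${SERVER}/${ds}.ascii?time"
done
```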

Because you want to roll up the datasets, I would follow a different approach: create independent NcML files containing the aggregations (with a datasetScan inside, for example, or an explicit list of files), separate from the catalog file, and pre-fill all metadata and coordinates. From the catalog files you then only have to refer to those NcML files, or add a datasetScan. In my experience this is the best approach, because you will have better "control" over what happens with the aggregations.
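A sketch of that layout (file paths, dataset names and the recheck interval are hypothetical): one standalone NcML file per dataset holding the aggregation, and the catalog only pointing at it.

```xml
<!-- /data/ncml/datasetA.ncml : standalone aggregation, independent of
     the catalog (paths and recheck interval are hypothetical) -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="time" type="joinExisting" recheckEvery="1 day">
    <scan location="/data/archive/datasetA/" suffix=".nc" subdirs="true"/>
  </aggregation>
</netcdf>

<!-- In the catalog, the dataset then just refers to that NcML file -->
<dataset name="Dataset A" ID="agg/datasetA" urlPath="agg/datasetA">
  <serviceName>odap</serviceName>
  <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
          location="/data/ncml/datasetA.ncml"/>
</dataset>
```

Pre-filling metadata and coordinates then just means adding the corresponding &lt;variable&gt; and &lt;attribute&gt; elements to the standalone NcML file, so TDS does not have to open the underlying files to discover them.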

Please let me know if you need some clarifications.

Antonio

--
Antonio S. Cofiño
Instituto de Física de Cantabria (IFCA, CSIC-UC)
Consejo Superior de Investigaciones Científicas
http://antonio.cofino.es
#PublicMoneyPublicCode
#DocumentFreedomDay






Your help is much appreciated. Many thanks!

Kind regards,

*Jordi Domingo*
Senior software engineer
Lobelia Earth, S.L.

_______________________________________________
NOTE: All exchanges posted to Unidata maintained email lists are
recorded in the Unidata inquiry tracking system and made publicly
available through the web.  Users who post to any of the lists we
maintain are reminded to remove any personal information that they
do not want to be made public.


thredds mailing list
thredds@xxxxxxxxxxxxxxxx
For list information or to unsubscribe, visit: https://www.unidata.ucar.edu/mailing_lists/
