Re: [thredds] enrich thredds xml catalog using external tool

  • To: "Antonio S. Cofino" <cofinoa@xxxxxxxxx>
  • Subject: Re: [thredds] enrich thredds xml catalog using external tool
  • From: Chiara Scaini <saetachiara@xxxxxxxxx>
  • Date: Fri, 20 Jul 2018 11:02:02 +0200
Hi Antonio,
actually, I think it might be a good idea to write something similar to
"<addTimeCoverage datasetNameMatchPattern" that lets me add different
metadata based on a regex on the filename (which contains the date).
That would allow me to use the datasetScan (which is more reliable than
writing my own XML catalog) and enrich specific entries based on date.
Where is the source code for addTimeCoverage and addDatasetSize? I
looked for it on GitHub but could not find it.
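
To make the idea above concrete, here is a minimal Python sketch of the
matching logic only, outside TDS. It is not the addTimeCoverage
implementation. The pattern, the function name metadata_for and the
archived_before cutoff are all hypothetical, chosen to fit file names like
wrfout_d03_2018-03-23_00:00:00.

```python
import re
from datetime import datetime

# Hypothetical pattern for names such as wrfout_d03_2018-03-23_00:00:00
NAME_PATTERN = re.compile(
    r"wrfout_d(?P<domain>\d{2})_(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}:\d{2}:\d{2})$"
)


def metadata_for(filename, archived_before):
    """Return extra metadata for one file name, or None if the name does not match."""
    m = NAME_PATTERN.search(filename)
    if m is None:
        return None
    # Rebuild the timestamp from the two named groups captured by the regex.
    start = datetime.strptime(f"{m['date']} {m['time']}", "%Y-%m-%d %H:%M:%S")
    return {
        "domain": int(m["domain"]),
        "start": start.isoformat(),
        # Assumption: files older than the cutoff count as archived.
        "archived": start < archived_before,
    }
```

Each returned dictionary could then be turned into whatever XML nodes the
catalog entry needs.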

Also, I checked the thredds cache and it's empty. Do you know if the final
catalog resulting from the datasetScan is stored somewhere on the server? I
could *wget* it, but since the files are nested, I would never get the
complete catalog tree... If I had the complete catalog I could modify it
and add, for example, the harvesting attribute based on the current date.
The result would be something like this (2 nested folders and the data, and
the harvesting flag added by a python script):

   <dataset name="WRF 2018" ID="testWRF/2018"><metadata
inherited="false"><keyword>Parent</keyword></metadata><metadata
inherited="true"><serviceName>all</serviceName><dataType>GRID</dataType><documentation
type="summary">This is a summary for my test ARPA catalog for WRF runs.
Runs are made at 12Z and 00Z, with analysis and forecasts every 6 hours
out to 60 hours. Horizontal = 93 by 65 points, resolution 81.27 km,
LambertConformal projection. Vertical = 1000 to 100 hPa pressure
levels.</documentation><keyword>WRF
outputs</keyword><geospatialCoverage><northsouth><start>25.0</start><size>35.0</size><units>degrees_north</units></northsouth><eastwest><start>-20.0</start><size>50.0</size><units>degrees_east</units></eastwest><updown><start>0.0</start><size>0.0</size><units>km</units></updown></geospatialCoverage><timeCoverage><end>present</end><duration>5
years</duration></timeCoverage><variables vocabulary="GRIB-1"/><variables
vocabulary=""><variable name="Z_sfc" vocabulary_name="Geopotential H"
units="gp m">Geopotential height, gpm</variable></variables></metadata>
   <dataset name="WRF 2018-03-19T00:00:00"
ID="testWRF/2018/20180319_00"><metadata
inherited="false"><keyword>Parent</keyword></metadata>
      <dataset name="WRF Domain-03 2018-03-23T00:00:00"
ID="testWRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00"
urlPath="WRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00"><dataSize
units="Mbytes">137.2</dataSize><date
type="modified">2018-06-28T10:27:07Z</date><timeCoverage><start>2018-03-23T00:00:00</start><duration>6
hours</duration></timeCoverage><keyword>Children</keyword></dataset>
      <dataset name="WRF Domain-03 2018-03-20T18:00:00"
ID="testWRF/2018/20180319_00/wrfout_d03_2018-03-20_18:00:00"
urlPath="WRF/2018/20180319_00/wrfout_d03_2018-03-20_18:00:00"><dataSize
units="Mbytes">137.2</dataSize><date
type="modified">2018-06-28T10:27:13Z</date><timeCoverage><start>2018-03-20T18:00:00</start><duration>6
hours</duration></timeCoverage><keyword>Children</keyword></dataset>
      <dataset name="WRF Domain-02 2018-03-20T00:00:00"
ID="testWRF/2018/20180319_00/wrfout_d02_2018-03-20_00:00:00"
urlPath="WRF/2018/20180319_00/wrfout_d02_2018-03-20_00:00:00"><dataSize
units="Mbytes">472.4</dataSize><date
type="modified">2018-06-28T10:27:01Z</date><timeCoverage><start>2018-03-20T00:00:00</start><duration>6
hours</duration></timeCoverage><keyword>Children</keyword></dataset>
      <dataset name="WRF Domain-01 2018-03-23T00:00:00"
ID="testWRF/2018/20180319_00/wrfout_d01_2018-03-23_00:00:00"
urlPath="WRF/2018/20180319_00/wrfout_d01_2018-03-23_00:00:00"><dataSize
units="Mbytes">101.9</dataSize><date
type="modified">2018-06-28T10:26:57Z</date><timeCoverage><start>2018-03-23T00:00:00</start><duration>6
hours</duration></timeCoverage><keyword>Children</keyword></dataset>
      <dataset name="WRF Domain-01 2018-03-20T00:00:00"
ID="testWRF/2018/20180319_00/wrfout_d01_2018-03-20_00:00:00"
urlPath="WRF/2018/20180319_00/wrfout_d01_2018-03-20_00:00:00"><dataSize
units="Mbytes">101.9</dataSize><date
type="modified">2018-06-28T10:27:10Z</date><timeCoverage><start>2018-03-20T00:00:00</start><duration>6
hours</duration></timeCoverage><keyword>Children</keyword>
*<harvest>true</harvest>*</dataset>
   </dataset>
   </dataset>
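
On the wget concern: as far as I understand, a datasetScan serves each
subdirectory as its own catalog, linked through catalogRef elements, so the
whole tree could be collected by following those links. The sketch below
does that and then appends the harvest element as in the sample above. It
is only an illustration under assumptions: the names crawl, tag_harvest,
http_fetch and should_harvest are hypothetical, it uses the standard
library's xml.etree.ElementTree rather than lxml, and it assumes the
timestamp sits at the end of urlPath as in my files.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime
from urllib.parse import urljoin
from urllib.request import urlopen

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
# Timestamp at the end of urlPath, e.g. ...wrfout_d01_2018-03-20_00:00:00
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})_(\d{2}:\d{2}:\d{2})$")


def local(tag):
    """Strip the XML namespace so the code works with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def http_fetch(url):
    """Default fetcher: download one catalog document as bytes."""
    with urlopen(url) as resp:
        return resp.read()


def crawl(url, fetch=http_fetch, seen=None):
    """Yield (url, root) for a catalog and every catalog reachable via catalogRef."""
    seen = set() if seen is None else seen
    if url in seen:
        return
    seen.add(url)
    root = ET.fromstring(fetch(url))
    yield url, root
    for el in root.iter():
        href = el.get(XLINK_HREF)
        if local(el.tag) == "catalogRef" and href:
            # hrefs are resolved relative to the catalog that contains them
            yield from crawl(urljoin(url, href), fetch, seen)


def tag_harvest(root, should_harvest):
    """Append <harvest>true</harvest> to datasets whose timestamp is selected."""
    count = 0
    # Copy the element list first, because the loop adds new children.
    for el in list(root.iter()):
        path = el.get("urlPath")
        if local(el.tag) != "dataset" or path is None:
            continue
        m = STAMP.search(path)
        if m is None:
            continue
        when = datetime.strptime(" ".join(m.groups()), "%Y-%m-%d %H:%M:%S")
        if should_harvest(when):
            # Reuse the dataset's namespace prefix for the new element.
            prefix = el.tag[: -len("dataset")]
            ET.SubElement(el, prefix + "harvest").text = "true"
            count += 1
    return count
```

A run would loop over crawl(start_url), call tag_harvest(root, rule) on each
root with a rule built from the current date, and write the result out with
ET.tostring(root). Whether TDS or Geonetwork accept a harvest child element
in this form is something I have not verified.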

Many thanks,
Chiara

On 19 July 2018 at 15:51, Chiara Scaini <saetachiara@xxxxxxxxx> wrote:

> Hi Antonio, thanks for answering!
> The easiest thing for me would be using a python script that reads the
> data from the database and modifies the xml (ex. using lxml library,
> https://lxml.de/).
>
> Would namespaces be used similarly to this simple example? I just added a
> test node 'mycustomfield' to a thredds catalog dataset entry.
> <catalog version="1.0.1"><service ..... />
> <dataset name="WRF Domain-03 2018-03-23T00:00:00"
> ID="testWRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00"
> urlPath="WRF/2018/20180319_00/wrfout_d03_2018-03-23_00:00:00"><dataSize
> units="Mbytes">137.2</dataSize><date type="modified">2018-06-28T10:27:07Z</date>
> <timeCoverage><start>2018-03-23T00:00:00</start><duration>6
> hours</duration></timeCoverage><myns:mycustomfield xmlns:myns="myurl">My
> custom stuff</myns:mycustomfield></dataset>
> </catalog>
>
> Regarding the catalog: I can't modify single datasets in the catalog.xml
> because it only contains the datasetscan. Some metadata can be added for
> all entries at the datasetscan level, but others are specific (ex. if the
> specific file was archived or not, and when). Is it possible to enable
> something that writes the catalog in a temp file of some kind? What do you
> mean by: "catalog entries generated by datasetScan are created in-memory
> and they are cached/persisted (???) In specific storage format."? If it's
> cached, I should be able to retrieve it somehow.
>
> As for the dynamic catalog, the documentation says *'Dynamic catalogs are
> generated by DatasetScan
> <https://www.unidata.ucar.edu/software/thredds/current/tds/reference/DatasetScan.html>
> elements, at the time the user request is made. These catalogs are not
> cached'* so, if I understood correctly, I can't create a text file out of
> it and modify it.
>
>
> Thanks,
> Chiara
>
> On 19 July 2018 at 14:47, Antonio S. Cofino <cofinoa@xxxxxxxxx> wrote:
>
>> Hola Chiara,
>>
>> To "enrich" the TDS catalog you can use XML namespaces [1]. That allows
>> you to use more than one XML schema.
>>
>> In fact, the TDS already does this for the datasetScan, using the XLink
>> schema to point to the directories being scanned.
>>
>> With respect to the datasetScan feature to create a proxy of an absolute
>> latest atomic dataset, it would require creating a new datasetScan
>> element.
>>
>> Meanwhile you can create/modify the catalogs using an external tool.
>>
>> I would recommend using a tool/library that is XML "aware" to guarantee
>> well-formed and semantically correct XML documents, but other tools
>> would also fit your purpose.
>>
>> Take into account that catalog entries generated by datasetScan are
>> created in-memory and they are cached/persisted (???) In specific storage
>> format.
>>
>> One interesting feature of TDS 5.0 is dynamic catalogs, similar to a
>> catalogScan. That version has not been officially released yet, but the
>> current beta already implements them.
>>
>> Antonio S. Cofino
>>
>>
>> [1] https://www.w3schools.com/xml/xml_namespaces.asp
>>
>>
>>
>> On 19 Jul 2018 12:37, "Chiara Scaini" <saetachiara@xxxxxxxxx> wrote:
>>
>> Hi all, I'm setting up a thredds catalog to be used by Geonetwork.
>>
>> The catalog contains meteorological data, but will be enriched by other
>> data sources (ex. a table containing the list of records that were moved to
>> a backup facility and are no longer available on disk, or a table
>> containing pictures related to the files).
>>
>> Is it possible to enrich the xml file with other data (ex. inserting xml
>> nodes directly into the file) without breaking thredds functionalities?
>> What strategy do you recommend (ex. a bash script to modify the xml,
>> or...?).
>>
>> Note that I'm using a <datasetScan> to recursively get all items in a
>> nested folder structure, so I would like to modify the 'real' xml catalog
>> that contains all the nodes (some information should be inserted at the
>> container level, others at the data level).
>>
>> Many thanks,
>> Chiara
>>
>>
>>
>> --
>> Chiara Scaini
>>
>
>
> --
> Chiara Scaini
>



-- 
Chiara Scaini