Re: [thredds] Collection Metadata from Granules

Hi Ted,

Sorry, I'm going to ask some pretty dumb questions here, because I'm still rather fuzzy on THREDDS metadata, ISO, and metadata stuff in general...

I'm also loosely involved in a project that hopes to crawl datasets
through OPeNDAP and then recreate THREDDS catalogs with metadata
information (hence my initial question about running THREDDS over OPeNDAP servers). The end goal is to allow us to express our entire digital library fully in ISO 19115. There's already been some interaction between Simon and Jason :)


Content - My experience (could easily be incorrect) is that the THREDDS community has really focused on “use” metadata which tends to be relatively sparse (most importantly) and generally more customized. This reflects the emergence of THREDDS from the scientific community which traditionally shares that focus. As a result, I expect that the threddsmetadata elements exist only in a small minority of catalogs.

Yep, I hear you :) I am in the process of moving across to THREDDS as our main OPeNDAP server, so hopefully I can inject some more metadata into those datasets!

On the topic of generating catalogs from the filesystem... Simon - I'm not entirely sure how an external application like GeoNetwork is going to create configuration catalogs for THREDDS: the "location" attribute has to point to the actual files themselves, which means knowing how those files are organised in the underlying filesystem... Although, it is only a *single* attribute...
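
For reference, this is roughly the shape of the datasetScan element I mean (the name, path, location, and service name here are all invented):

    <datasetScan name="Argo profiles" path="argo" location="/data/argo/">
      <!-- location must be a real directory on the server's own
           filesystem, which an external application can't know -->
      <metadata inherited="true">
        <serviceName>odap</serviceName>
      </metadata>
      <filter>
        <include wildcard="*.nc"/>
      </filter>
    </datasetScan>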

This situation is exacerbated by the evolution of THREDDS towards auto-generation of catalogs from file systems. I’m fairly sure that this process does not involve opening the files (for performance reasons), so metadata that might be in those files is generally not incorporated into the catalog. I suspect that hand-hewn catalogs with lots of metadata are rare. BTW - I suspect that the same obvious (over-)generalization applies to the files that underlie most of these catalogs (again, I have no real quantitative evidence for this). There are a few groups out there creating netCDF files with really high-quality metadata content, and that number may be growing, but it is still small.

Indeed.  From experience, most data providers are keen to get the data
out there, and making datasets compliant with any metadata convention is
seen as a blocker rather than an enabler (unfortunately), so a lot of information is left out of the initial package.

This reflects the fact that most creators and users of these files understand them pretty well and can generally use them successfully with information gleaned from conversations or scientific papers and presentations. The focus on high-quality standard metadata generally comes more from archives and groups interested in the preservation of understanding. This is a different group.


Yes, again you've hit the nail on the head.

Identifiers/Collisions - The idea of unique identifiers exists in a couple of the metadata standards we are generally thinking about (DIF, FGDC-RSE, ISO) but it could easily be stretched in this situation. For example, if we agree that 1) there is a complete collection of metadata objects associated with a dataset and identified by an identifier and 2) that the content of some subset of those objects is included in a THREDDS Catalog, how is the identifier for that subset related to the identifier of the complete set? How might collisions between these identifiers be handled during the harvest process?


If I understand correctly - the problem is that these identifiers are not always generated by some central repository, but rather by the data providers/repository providers. Just looking at the OAI spec again - it says that "Individual communities may develop community-specific URI schemes for coordinated use across repositories". So does anyone know if such a thing exists?
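
For what it's worth, the OAI identifier scheme gives records the form oai:{namespace-identifier}:{local-identifier}, so a complete collection and the subset carried in a THREDDS catalog could at least be related lexically - something like (repository name and ids invented):

    oai:example.org:argo                 <- the complete metadata collection
    oai:example.org:argo/thredds-subset  <- the subset included in the catalog

That doesn't solve the collision problem by itself, but at least a harvester could tell the two apart.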

The collision question also comes up if we consider where the metadata are coming from. Slide 6 in the presentation I sent out yesterday shows metadata emerging from a Collection Metadata Repository and being written into the granules. In our architecture, this is also the source of metadata that might be written into a THREDDS Catalog and harvested to GeoNetwork, Geospatial One-Stop, GEOSS, GCMD, ... . It is also the authoritative source for complete up-to-date information pointed to by the URLs in the granules. Harvesters need to identify and manage these potential collisions. This seems potentially very messy to me.


I'm looking at slide 6 at the moment and have a question... How does it deal with datasets that are continually updated? For example, we update the Argo dataset on a weekly basis through rsync, so the NcML files will need to be updated as well. Furthermore, this will introduce a lag between the content of the file and the NcML file. I'm more in favour of generating the data-dependent figures from the file itself... (bad for performance, but at least the metadata will always be relevant.)
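
To make the lag concrete, the sort of NcML I have in mind looks like this (file name and values invented):

    <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
            location="argo_latest.nc">
      <!-- static metadata: safe to keep in the NcML -->
      <attribute name="title" value="Argo profile data"/>
      <!-- data-dependent metadata: goes stale after the weekly rsync
           unless the NcML is regenerated -->
      <attribute name="time_coverage_end" value="2009-06-15"/>
    </netcdf>

The second attribute is the kind of thing I'd rather compute from the file itself.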

Mappings - This problem really boils down to the flatness inherent in the netCDF attribute model which has a parameter-value rather than an XML-DTD or schema history. This model only includes one level of containers so the only way to associate multiple attributes is to hold them together with a variable (I’m pretty sure on this but am not as experienced with netCDF as many others on this list).

I've had trouble getting my head around this problem too! The THREDDS catalog (as opposed to the metadata in netCDF files) is a bit richer, since metadata can be inherited by nested datasets, so it should solve *some* of the flatness issues.
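
As a sketch (names and paths invented), a catalog fragment like:

    <dataset name="Argo">
      <metadata inherited="true">
        <creator>
          <name>ARCS Data Services</name>
          <contact url="http://www.arcs.org.au/" email="argo@example.org"/>
        </creator>
      </metadata>
      <dataset name="Argo 2009-06" urlPath="argo/200906.nc"/>
      <dataset name="Argo 2009-05" urlPath="argo/200905.nc"/>
    </dataset>

gives every nested dataset the creator block - one level of grouping and reuse that the flat attribute model can't express.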

As I said yesterday, we created the NcML from FGDC using an XSLT, so we know the mapping we used. It is fairly easy to reflect that mapping in the names of the elements from FGDC because the structure of FGDC is simple and fairly flat. This is not true for ISO, as you know. There are many embedded objects with potentially repeating objects in them. Of course, you could name them with xpaths but this seems like a difficult path to go down, particularly when you have a link to a complete, valid, and up-to-date record available.

Consider the relatively clean and important problem of OnlineResources. In THREDDS land these are strings (like fgdc_metadata_link) whose functions are known by convention: this one is a link to a complete fgdc record. Most of our datasets have multiple OnlineResources with complete ISO descriptions (linkage, name, description, ...). Writing those into a netCDF file without losing critical information does not seem straightforward to me. This is a really simple case. I don’t even want to think about writing ISO Citations into netCDF or THREDDS!
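
(Just to check I follow: on the netCDF side this is a single flat string, e.g. in NcML, with an invented URL:

    <attribute name="fgdc_metadata_link"
               value="http://example.org/metadata/argo.fgdc.xml"/>

whereas the ISO 19139 equivalent is a structured object with separately tagged parts:

    <gmd:CI_OnlineResource>
      <gmd:linkage>
        <gmd:URL>http://example.org/metadata/argo.iso.xml</gmd:URL>
      </gmd:linkage>
      <gmd:name>
        <gco:CharacterString>Argo ISO record</gco:CharacterString>
      </gmd:name>
      <gmd:description>
        <gco:CharacterString>Complete ISO 19115 description</gco:CharacterString>
      </gmd:description>
    </gmd:CI_OnlineResource>

so flattening the latter into attribute strings loses the structure.)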

Anyway, long-winded as I promised. All things considered I think the approach I suggested provides maximum bang for the buck! We will work on adding the ISO links...


Sorry for being a bit slow here... So are you creating the ISO documents separately, and then referencing them using something like iso_metadata_link in a netCDF file's attributes? How are the ISO documents generated in the first place (is this the netCDF writer?), and where will they be hosted? I suppose those files could also be hosted by the THREDDS server if we just configure the HTTP service for them? I do like how clean this approach is: simply adding a new attribute to the metadata. It sounds very achievable!
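
If I've got that right, then on our side it might be as little as this (the attribute name just mirrors your fgdc_metadata_link convention; paths and URLs are invented):

    <!-- in the dataset's NcML: one extra global attribute -->
    <attribute name="iso_metadata_link"
               value="http://example.org/iso/argo.iso.xml"/>

    <!-- in the THREDDS config catalog: an HTTPServer service so the
         ISO documents can be served as plain files -->
    <service name="file" serviceType="HTTPServer"
             base="/thredds/fileServer/"/>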

Cheers,

-Pauline.

--
Pauline Mak

ARCS Data Services
Ph: (03) 6226 7518
Email: pauline.mak@xxxxxxxxxxx
Jabber: pauline.mak@xxxxxxxxxxx
http://www.arcs.org.au/

TPAC
Email: pauline.mak@xxxxxxxxxxx
http://www.tpac.org.au/





