Catalog Notes


Notes for follow-on to 1.0



For 0.7

Simplify service/access model?

Does this relate to the whole service typing issue? Probably, since it would help to know what information is needed at the service level and at the dataset (access) level before figuring out how to encode it.

What we want to support:

  1. Many datasets reference one service/access.
  2. One or more datasets have the same set of services where the ending of the URL is the same for each service.
  3. A dataset has multiple service/access methods, but the URLs end with different strings.
  4. A dataset has a set of services/accesses that no other dataset in the catalog has. (The dataset contains its own set of access elements.)
  5. Other cases?
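Cases 1 to 4 can be sketched as a small data model. This is only an illustration, not a proposed encoding: the class names (Service, Access, Dataset), the fields base and url_path, and the example.org URLs are all assumptions made for the sketch. It also assumes that a URL is formed by appending the dataset's path to the service base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Service:
    name: str
    protocol: str  # illustrative label only, e.g. "DODS" or "FTP"
    base: str      # assumed URL prefix shared by datasets using this service


@dataclass
class Access:
    service: Service
    url_path: str  # URL ending specific to this one access method


@dataclass
class Dataset:
    name: str
    # Cases 1 and 2: reference shared services; one URL ending serves them all.
    services: List[Service] = field(default_factory=list)
    url_path: Optional[str] = None
    # Cases 3 and 4: the dataset carries its own access elements.
    accesses: List[Access] = field(default_factory=list)

    def access_urls(self) -> Dict[str, str]:
        """Map service name to URL, assuming URL = service base + path."""
        urls: Dict[str, str] = {}
        if self.url_path is not None:
            for service in self.services:
                urls[service.name] = service.base + self.url_path
        for access in self.accesses:
            urls[access.service.name] = access.service.base + access.url_path
        return urls


# Hypothetical services and datasets, for illustration only.
dods = Service("dods", "DODS", "http://example.org/dods/")
ftp = Service("ftp", "FTP", "ftp://example.org/pub/")

# Case 2: same set of services, same URL ending for each.
shared = Dataset("model run", services=[dods, ftp], url_path="model/run1.nc")

# Case 3: several access methods whose URLs end differently.
mixed = Dataset(
    "radar",
    accesses=[Access(dods, "radar/img.nc"), Access(ftp, "radar/img.gz")],
)
```

In this sketch the common cases need only a service reference and one path, while the access list is used only when a dataset really differs from its neighbours.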

Can we develop a simpler model for 0.7 that keeps the simple/common things simple but allows for more complex cases?

Information represented by 0.6 service/access:

How about for 0.7 we have services and compound services but a dataset has an access method only when it contains an access element?

Benno's suggestions

http://iridl.ldeo.columbia.edu/dochelp/topics/MIRROR/suggestions.html

Dataset names

Benno: Allow specification of both a name and a long_name. Most of the THREDDS catalogs being generated are ill-suited to language-based clients. I of course have rewritten my code to handle the awkward names, but it seems pretty silly seeing as all the data providers have site-unique short names in the first place. For a language-based client, there is more to it than that. There is the long_name that gets displayed on output; the short name that is used to refer to that dataset in expressions and commands, or as part of a much longer name that includes the names at higher levels of the tree (e.g. sst is fully referred to as IRIDL SOURCES AC smoothed sst, while its long_name is sea surface temperature); and the URL that is used by the software to access the data.
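The three roles Benno describes (short name, long_name, URL) can be sketched as follows. The class NamedDataset and its methods are hypothetical. The sketch also assumes that each space-separated token in "IRIDL SOURCES AC smoothed sst" is the short name of one level of the tree, which the suggestion implies but does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NamedDataset:
    name: str                        # site-unique short name, used in expressions
    long_name: Optional[str] = None  # human-readable text shown on output
    url: Optional[str] = None        # used by software to access the data
    parent: Optional["NamedDataset"] = None

    def full_name(self) -> str:
        """Join the short names from the root of the tree down to this dataset."""
        parts = []
        node: Optional["NamedDataset"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " ".join(reversed(parts))

    def label(self) -> str:
        """Text to display: the long_name if given, otherwise the short name."""
        return self.long_name or self.name


# Build the hierarchy from the example (levels assumed, see above).
node: Optional[NamedDataset] = None
for short in ["IRIDL", "SOURCES", "AC", "smoothed"]:
    node = NamedDataset(short, parent=node)
sst = NamedDataset("sst", long_name="sea surface temperature", parent=node)
```

Here sst.full_name() yields the name a language-based client would use in commands, and sst.label() yields the text a browser would display.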

Coherency

Jeff: For example, say Raj is putting together a catalog that holds data for labs in a met. class. One of the entries is a set of 5 radar images. Each of those images is a URL, but Raj wants to define the set of 5 as a cohesive set. He does not want the students to see all 5 time URLs in the catalog chooser; rather, he wants just one link to appear.

John: In a catalog, we started off having "collections", but the distinction between collection and dataset was blurry, so we decided to just call everything a dataset. A dataset can have a URL, can have nested datasets, or both. The idea is that it would have its own URL if it was "cohesive", but that requires support on the data server.

One assumes that datasets are collected together for some good reason, but it's not known what that reason is. So what exactly would "CompositeDataset" mean? I would be more inclined to be more specific, like "CompositeDataset_TimeSeries" etc.

Given such a collection, can a client figure out what to do with it? The DODS Aggregation Server (AS) faced a similar question. Your use case probably corresponds to this AS use case:

The JoinNew aggregation type "joins" variables along a new dimension. The dimension and a coordinate variable is created and values for the coordinates are specified in the aggregation element.

The other AS use cases are "Union" and "JoinExisting". The main issue with JoinNew is identifying the coordinate values of the new dimension, i.e. how do you know what the time value is for each URL? The AS just makes the server configurer specify them explicitly; one could do more elegant things, especially if you can rely on identifying a time coordinate variable. That is, however, "service protocol specific" from the catalog POV.

So a CompositeDataset_TimeSeries tag could be all that a smart client needs to do the right thing, and it is certainly a common case. We could possibly add a tag to identify the time coordinate or the variable with the time coordinate in it. It may not be possible to be more precise about what the right thing is, except in a data model / protocol dependent way.
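To make the JoinNew idea concrete, here is a minimal sketch of what a client might do with such a time series collection. It is not the Aggregation Server's implementation: the function join_new, the loader argument read, the result layout, and the example URLs and time values are all assumptions. As in the AS, the coordinate value for each member is given explicitly rather than discovered.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Grid = List[List[float]]  # stand-in for one 2-D radar image


def join_new(
    members: Sequence[Tuple[str, str]],
    read: Callable[[str], Grid],
    dim_name: str = "time",
) -> Dict[str, object]:
    """Stack one grid per member URL along a new outer dimension.

    members: (coordinate value, URL) pairs, specified explicitly.
    read: hypothetical loader that returns the grid behind a URL.
    """
    if not members:
        raise ValueError("an aggregation needs at least one member")
    coords = [coord for coord, _ in members]
    grids = [read(url) for _, url in members]
    # Every member must have the same shape, or the join is not well defined.
    shapes = {(len(g), len(g[0]) if g else 0) for g in grids}
    if len(shapes) != 1:
        raise ValueError("all members must have the same shape to be joined")
    return {
        "dims": (dim_name, "y", "x"),
        "coords": {dim_name: coords},  # the new coordinate variable
        "data": grids,                 # indexed as data[time][y][x]
    }


def fake_read(url: str) -> Grid:
    # Hypothetical loader: returns a constant 2x2 image instead of fetching.
    return [[0.0, 0.0], [0.0, 0.0]]


# Five radar images, each with an explicitly assigned (made-up) time value.
members = [
    (f"2003-01-01T0{h}:00Z", f"http://example.org/radar/img{h}.nc")
    for h in range(5)
]
series = join_new(members, fake_read)
```

The point of the sketch is that a client needs little beyond the member URLs and one coordinate value per member. How those values are obtained is the protocol-dependent part.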

What other kinds of coherency might there be?

Ethan: Jeff and I were just talking about this in the hallway. The similarity between a cohesive dataset collection and the agg types came to mind for me as well. A few use cases that come to mind:

To me, the Agg "Union", "JoinNew", and "JoinExisting" types describe the syntax of how to make things cohesive, whereas the "CompositeDataset_*" types are the semantics of what the cohesive whole means.

It is perhaps too soon to think much about how to encode this in a THREDDS catalog, but my initial thought is to encode it as a kind of proxy service/access. The service type could be "CompositeDataset_*", with the "Union|Join*" info in a property element or something similar, but with no URL information.

 

Other


URL Construction

Catalogs have to unambiguously specify a dataset. This means that there must be enough info and a set of clear rules on how to access the dataset. These will be specific to each service-protocol.
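One way to read this requirement is as a table of per-protocol rules, where resolving a dataset fails loudly if no rule exists or if information is missing. The sketch below is only an illustration. The registry URL_RULES, the function names, and the concatenation rule are assumptions, and the protocol names are placeholders rather than a statement of how any real protocol builds its URLs.

```python
from typing import Callable, Dict

UrlRule = Callable[[str, str], str]


def concat_rule(base: str, url_path: str) -> str:
    """Hypothetical rule: the URL is the service base followed by the path."""
    return base + url_path


# Hypothetical registry: one explicit rule per service protocol.
URL_RULES: Dict[str, UrlRule] = {
    "DODS": concat_rule,
    "FTP": concat_rule,
}


def resolve_url(protocol: str, base: str, url_path: str) -> str:
    """Build a dataset URL, refusing to guess when the catalog is ambiguous."""
    if protocol not in URL_RULES:
        raise KeyError(f"no URL construction rule for protocol {protocol!r}")
    if not base or not url_path:
        raise ValueError("both a service base and a dataset path are required")
    return URL_RULES[protocol](base, url_path)
```

Raising an error instead of falling back to a default keeps the "unambiguous" property visible: a catalog either gives enough information for its protocol's rule or is rejected.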