Catalog Notes
Notes for follow-on to 1.0
- Add datasetRef element that references a dataset element in
an external catalog. Similar to catalogRef but uses XPath (just ID part
of XPath?) to reference a dataset element in the catalog, e.g.,
http://server/catalog.xml#myFavDataset.
- Make the vocabulary_name and units attributes of the variable
element children of the variable element and allow the vocabulary
attribute on both new elements. Possibly override/inherit the
vocabulary attribute in the variables element. Could also allow
multiple vocabulary_name elements so that dataset variable could be
mapped to multiple controlled vocabularies in one place. This was
discussed on the thredds email list around 1 October 2004 (Subject:
v1.0 and vocabularies):http://www.unidata.ucar.edu/support/help/MailArchives/thredds/msg00336.html
- Think about adding a type to catalogRef so that we can point to other types of catalogs (from JeffMc)
- Add expire and other cacheing type info for datasets as well as catalogs
- Rethink how the date/range and dateEnumTypes in the XSD are defined/allowed.
Some of the types are appropriate for dates and others for date ranges.
For 0.7
- Rethink the <catalog><dataset/></catalog> structure and it's interaction with catalogRef elements. Spec says
"The
dataset chooser software should seamlessly present a catalogRef to the
user, for example by eliminating the referenced catalog's top-level
dataset in its presentation of the catalog when its name matches the
title of the catalogRef title attribute."
But the TDV seems to always remove the top-level dataset even if names
don't match (but changes from the "catalogRef.xlink:title" to the
"dataset.name"). What are we really trying to model with these
structures?
- Rethink service type model. Look more at ADEPT access model.
-
EMSL seems like it fits as both a metadata record and as a dataset
service type. I don't understand this fully but isn't an EMSL client
able to access the referenced dataset if it has the EMSL file?
- Rethink data type model.
-
How indicate to a metadata harvesting tool (say OAI server) that a
particular metadata record should be harvested by a particular digital
library. Also may need to know that DLs ID for that metadata record.
- Resolver service (DQC server?): need serviceType="Resolver".
- Figure out how to deal with DTD/Schemas for InvCatalog and for multiple metadata types at same time.
- [Actually
for Java API] rethink implementation and how API exposes impl in terms
of going both from the XML to an object model as well as from an object
model to XML.
Simplify service/access model?
Does this relate to the whole service typing issue? Probably since
it would be nice to know what info is needed at the service level and
at the dataset(access) level before figuring out how to encode it.
What we want to support:
- Many datasets reference one service/access.
- Done
in 0.6: Use serviceName to reference a service element and append
urlPath to services base URL. The value of serviceName comes from
closest self/ancestor that has a serviceName.
- For 0.7: ???
- One or more datasets have the same set of services where the ending of the URL is the same for each service.
- Done in 0.6: Reference a service of type "Compound" (but urlPath has to be the same for all sub-services)
- For 0.7: ???
- A dataset has multiple service/access methods but the URLs end with different strings
- Done
in 0.6: Each child access element represents a service/access for the
parent dataset. An access element can reference an existing service
element or can provide a fully qualified URL and a serviceType.
- For 0.7: ???
- A dataset has a set of services/accesses that no other dataset in the catalog has. (contain set of access elements)
- Done in 0.6: Either create and reference a compound service or contain a set of child access elements.
- For 0.7: ???
- Other cases?
- Done in 0.6: ???
- For 0.7: ???
Can we develop a simpler model for 0.7 that keeps the simple/common things simple but allows for more complex cases?
Information represented by 0.6 service/access:
- Service type (@serviceType from service or access)
- Base URL (@base from service [ unless access gives absolute URL])
- String to add to base URL (@urlPath from dataset or access)
- Suffix to be appended to base URL ?before or after the urlPath? (@suffix from service)
- Which service/access a dataset uses
- dataset
references a service with @serviceName (use value of @serviceName from
the closest self/ancestor dataset that has a @serviceName
- access element references a service with @serviceName
How about for 0.7 we have services and compound services but a
dataset has an access method only when it contains an access element?
Benno's suggestions
http://iridl.ldeo.columbia.edu/dochelp/topics/MIRROR/suggestions.html
Dataset names
Benno: Allow specification of both a name and a long_name. Most of the
THREDDS catalogs being generated are ill-suited to language-based clients. I
of course have
rewritten my code to handle the awkward names, but it seems pretty silly seeing
as all the data providers have site-unique short names in the first place.For
a language-based client, there is more to it than that. There is the long_name
that gets displayed on output, the short name that is used to refer to that
dataset in expressions and commands or as part of a much longer name that includes
the names at higher levels of the tree, (e.g. sst is fully referend to as IRIDL
SOURCES AC smoothed sst while its long_name is sea surface temperature), and
the URL is used by the software to access the data.
Coherency
Jeff: For example, say Raj is putting together a catalog that holds
data for labs in a met. class.One of the entries is a set of 5 radar images.
Each of those images is a url but Raj wants to define the set of 5 as a cohesive
set. He does not want the students to see all 5 time urls in the catalog chooser,
rather he just wants to see a link.
John: In a catalog, we started off having "collections", but
the distinction betweeen collection and dataset was blurry, so we decided to
just call everything a dataset. A dataset can have a URL, can have nested datasets,
or both. The idea is that it would have its own URL if it was "cohesive",
but that requires support on the data server.
One assumes that datasets are collected together for some good reason, but
its not known what that reason is. So what exactly would "CompositeDataset"
mean? I would be more inclinde to be more specific, like "CompositeDataset_TimeSeries"
etc.
Given such a collection, can a client figure out what to do with it? The DODS
Aggregation Server (AS) faced a similar question. Your use case probably corresponds
to this AS use case:
The JoinNew aggregation type "joins" variables along a new
dimension. The dimension and a coordinate variable is created and values for
the coordinates are specified in the aggregation element.
The other AS use cases are "Union" and "JoinExisting".
The main issue of JoinNew is to identify the coordinate variables of the new
dimensions, ie how do you know what the time value is for each URL? The AS just
makes the server configurer explicitly specify them; one could do more elegent
things, esp if you can rely on identifying a time coord variable. That is however,
"service protocol specific" from the catalog POV.
So a CompositeDataset_TimeSeries tag could be all that a smart client needs
to do the right thing, and it is certainly a common case. We could possibly
add a tag to identify the time coordinate or the variable with the time coordinate
in it. It may not be possible to be more precise about what the right thing
is, except in a data model / protocol dependent way.
What other kinds of coherency might there be?
Ethan: Jeff and I were just talking about this in the hallway. The similarity
between a cohesive dataset collection and the agg types came to mind for me
as well. A few
use cases that come to mind:
- CompositeDataset_TimeSeries (or _Series, where the series can be along non-time
axes): a series of items (e.g., points, grids, images) monotonic on some axes.
Example: latest 5 radar images from single station
- CompositeDataset_Station (or _Point, where the points could be on 1D, 2D,
3D,...): a set of point items in some space Example: all profiler stations
at one time
- CompositeDataset_Field: a set of fields that occupy same space Example:
Rolands single parameter datasets all on same grid
To me the Agg "Union", "JoinNew", and "JoinExisting"
describe the syntax of how to make things cohesive where the "CompositeDataset_*"
types are the semantics of what the cohesive whole means.
Perhaps too soon to think much about how to encode this in a THREDDS catalog
but my initial thought is to encode it as a kind of proxy service/access. The
service type could be "CompositeDataset_*" with the "Union|Join*"
info in a property element or something but no URL information.
Other
- XML character encoding. Benno is using ê and é
What is the right way to handle this?
- use the predefined "entity references" for the following
chars:
< <
> >
" "
' '
& &
- use the numeric
code for other special chars
- DataType optional, what should TDV do? Benno not using.
- Client handling of multiple services for the same dataset. Do you
present a choice to the user? Then you need a way to distinguish the choices
to present to the user. Do we just use the ServiceType, or should we add a
human-readable name to the service for display to the user? How does
DODS deal with this vis-a-vis translator services?
- Clients are not able to deal with all ServiceTypes. We should provide
the functionality that a client specifies what ServiceTypes it can handle,
and the choice selector should eliminate the ones that it cannot. Looking
ahead to a Catalog Server, this evolves into a filtering operation on ServiceType.
URL Construction
Catalogs have to unambiguously specify a dataset. This means that there
must be enough info and a set of clear rules on how to access the dataset.
These will be specific to each service-protocol.
- DODS: construct the dataset URL
- url = serverBase + datasetPath + (".dds" | ".das")
- But if you want to use a constraint expression (CE), you need a more
sophisticated rule:
- 1) url = serverBase + datasetPath
- 2) if "?" exists, then insert (".dds" | ".das") just before the
"?" else append it
- Proposal: add a "suffix" attribute to the service element: url = serverBase
+ datasetPath + serverSuffix + (".dds>" | ".das")
- ADDE: construct the dataset URL
- url = serverBase + "/imagedata?" + datasetPath
- optionally use "accessPath" info from the datasetDesc: url =
serverBase + "/imagedata?" + datasetPath + accessPath1 + accessPath2 + ...
- NetCDF: construct the dataset URL, no mods are needed
- url = serverBase + datasetPath
- Others TBD