
Re: THREDDS/DLESE Connections slides





Peter Cornillon wrote:

Just to make sure i understand your terminology:

files = physical files


YUP


datasets = logical files we want the user to see


I don't think about datasets in a file concept. It could be a group of
files, a single file,... I guess that the reason that I don't think about it that way is that the data need not be in digital form to be
grouped in a data set. Beach profiles that have been collected over
the past 50 years and consist of pages of numbers (monthly values of
depth below mean low water at specified distances from a marker in a given direction) would qualify. I suppose that your definition is correct from a computer perspective; I just don't think of it that way.


ok, i didn't really mean to use the word "file". how about:

"a dataset is a logical grouping of data, associated in some meaningful way from the user's perspective."

In a DODS server, a dataset is something you can get a DAS and DAP from.

in THREDDS, a "collection" is a collection of datasets, for which the above definition also works just fine. so what's the difference between a dataset and a collection? this is the same issue that Benno has pointed out: in his DODS server, there is no distinction between collections and datasets, because the server seamlessly moves between collections, physical files, and the fields in the files, presenting a uniform API of datasets with their DAP and DAS.

(I am not going to try to answer the question of what's the difference between a catalog and a collection yet; hopefully others might have some ideas)

in THREDDS, a dataset has a URI, and is the smallest choosable thing in the catalog. our goal as middleware is to present the list of dataset choices to the user very quickly, without having to actually contact the server. once the user selects a dataset, then the user can expect some delay while a connection is made to the server, and the "real" dataset metadata is collected. This implies that the catalog metadata may not be exactly right at all times (e.g., the list of available times of the dataset), which makes life easier for implementors.
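
As a rough sketch of that split (not actual THREDDS code), the catalog can answer "what can I choose?" from local entries and only call out to the server when a dataset is selected. The names CatalogDataset, Catalog, fake_server and the example.invalid URI are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CatalogDataset:
    """Smallest choosable thing in the catalog: a name plus a URI."""
    name: str
    uri: str
    # Catalog-side metadata: cheap to read, but possibly out of date.
    catalog_metadata: Dict[str, object] = field(default_factory=dict)


class Catalog:
    def __init__(self, datasets: List[CatalogDataset],
                 fetch_metadata: Callable[[str], Dict[str, object]]):
        self._datasets = {d.uri: d for d in datasets}
        self._fetch = fetch_metadata  # stands in for the slow server round trip
        self.server_calls = 0

    def list_choices(self) -> List[str]:
        """Fast path: answered from the catalog alone, no server contact."""
        return [d.name for d in self._datasets.values()]

    def select(self, uri: str) -> Dict[str, object]:
        """Slow path: contact the server for the 'real' metadata."""
        if uri not in self._datasets:
            raise KeyError(uri)
        self.server_calls += 1
        return self._fetch(uri)


def fake_server(uri: str) -> Dict[str, object]:
    # Hypothetical server reply; it knows about one more time step
    # than the catalog entry below does.
    return {"uri": uri, "times": ["t0", "t1", "t2"]}


catalog = Catalog(
    [CatalogDataset("SST weekly", "http://example.invalid/sst",
                    {"times": ["t0", "t1"]})],
    fake_server,
)
```

Here list_choices() never touches fake_server, and the stale "times" list in the catalog entry is tolerated until select() returns the authoritative one.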




inventory = listing of datasets


No, a listing of datasets is what I refer to as a directory (not a
directory on a computer). The GCMD is an example of same. An
inventory is a listing of elements in a data set, it could be a
list of times for satellite images in an archive along with the physical location of the data (tape C18341 on a rack, or N861230147.hat in a computer directory on my machine) or a list
of times and locations of each XBT in an XBT archive.
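
A small sketch of the directory/inventory distinction as described above, with the caveat that every name and every timestamp in it is invented for illustration (only the two location strings come from the text): the directory lists data sets, while each inventory lists the elements inside one data set along with where they live.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class InventoryEntry:
    time: str      # ISO 8601 timestamp; the values below are invented
    location: str  # where the element physically lives


# Directory: a listing of data sets (what the GCMD provides).
directory: List[str] = ["satellite image archive"]

# Inventory: a listing of the elements inside one data set.
inventories: Dict[str, List[InventoryEntry]] = {
    "satellite image archive": [
        InventoryEntry("1986-12-30T14:07", "tape C18341"),
        InventoryEntry("1986-12-30T15:49", "N861230147.hat"),
    ],
}


def elements_between(data_set: str, start: str, end: str) -> List[InventoryEntry]:
    """Return the inventory entries of one data set within [start, end].

    ISO 8601 strings of equal precision sort chronologically, so plain
    string comparison is enough for this sketch.
    """
    return [e for e in inventories[data_set] if start <= e.time <= end]
```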


so is an inventory an internal thing that the server uses to construct the datasets that are visible to the outside world?



question:
what does it mean to "group files into data sets"? like the agg server?


One might say that all images in this projection, from this satellite,
processed this way form a data set. Or one could say that all images in
this projection, from this suite of satellites processed this way form a data set. Or... This is the trouble with data sets, different
people call different groupings of the data a data set. This caused
a lot of blood letting between NASA and NOAA a number of years back.
The idea is NOT to call every granule or every file in the system a
data set, you know the difference between lumpers and splitters. In
order for us to make progress, we have to back off a bit and look at
the big picture, grouping things into data sets allows us to do that.
This is exactly the problem that the DODS crawler has. When it crawls
a site such as our satellite archive, it ends up with thousands of
entries and the system or the person viewing the results struggles with a data overload, more information than s/he/it (humm... have
to be careful with these gender neutral versions) wants or needs to
locate the group of files that define the object of interest. Given
that there is no precise definition for how to group files into a
data set, I think that we can reduce the amount of information that
we have to deal with to a reasonable view of all the data on the
system without losing much if anything. The crawler is likely to group
the files slightly differently in some cases than the human would, but
one could probably discover this pretty quickly and steer the crawler
if necessary.
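
One possible way a crawler could lump files into data sets, offered only as a sketch and not as how the DODS crawler actually works: treat file names that differ only in their numeric fields as members of the same data set. The directory names and the functions dataset_key and group_into_datasets are hypothetical; a person could steer the grouping by passing a different key function.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def dataset_key(path: str) -> str:
    """Replace every run of digits with '#', so that files differing
    only in numeric fields (dates, sequence numbers) share one key."""
    return re.sub(r"\d+", "#", path)


def group_into_datasets(
    paths: Iterable[str],
    key: Callable[[str], str] = dataset_key,
) -> Dict[str, List[str]]:
    """Group crawled file paths into candidate data sets by key."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return dict(groups)


# Hypothetical crawl result: five files, but only two data sets.
paths = [
    "sat/N861230147.hat",
    "sat/N861231152.hat",
    "sat/N870101149.hat",
    "xbt/cast0001.dat",
    "xbt/cast0002.dat",
]
groups = group_into_datasets(paths)
```

With thousands of granules per archive, the user would then see one entry per key instead of one entry per file.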


ok, this seems to be similar to the "collections" vs "datasets" issue above. I think i need to hear Steve's tech presentation before I can understand this any deeper.



Generating "inventories of granules in data sets" makes sense in the context of
an agg server, but is there also meaning to it in the context of a normal DODS
server?


Not sure exactly what you mean here. We have file servers which are inventories of granules in data sets. Actually the terminology is a
bit loose here also. The server in this case is a DODS FreeForm server.
It serves a table that contains a list of URLs with the characteristic(s)
that differentiate one URL from another, time in the case of our satellite
archives.


i think some of the problem is that i think of DODS narrowly as a specific client/server protocol, and you include services and extensions that have been built with or use that protocol.