Peter Cornillon wrote:
Just to make sure i understand your terminology:
files = physical files
datasets = logical files we want the user to see
I don't think about datasets in a file concept. It could be a group of
files, a single file,... I guess that the reason that I don't think
about it that way is that the data need not be in digital form to be
grouped in a data set. Beach profiles that have been collected over
the past 50 years and consist of pages of numbers - monthly values of
depth below mean low water at specified distances from a marker in a
given direction would qualify. I suppose that your definition is
correct from a computer perspective; I just don't think of it that way.
ok, i didn't really mean to use the word "file". how about:
"a dataset is a logical grouping of data, associated in some meaningful way from
the user's perspective."
In a DODS server, a dataset is something you can get a DAS and DDS from.
in THREDDS, a "collection" is a collection of datasets, for which the above
definition also works just fine. so what's the difference between a dataset and a
collection? this is the same issue that Benno has pointed out: in his DODS
server, there is no distinction between collections and datasets, because the
server seamlessly moves between collections, physical files, and the fields in
the files, presenting a uniform API of datasets with their DDS and DAS.
(I am not going to try to answer the question of what's the difference between a
catalog and a collection yet; hopefully others might have some ideas)
in THREDDS, a dataset has a URI, and is the smallest choosable thing in the
catalog. our goal as middleware is to present the list of dataset choices to the
user very quickly, without having to actually contact the server. once the user
selects a dataset, then the user can expect some delay while a connection is
made to the server, and the "real" dataset metadata is collected. This implies
that the catalog metadata may not be exactly right at all times (e.g. the list of
available times of the dataset), which makes life easier for implementors.
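
To make the idea above concrete, here is a rough Python sketch, not actual THREDDS code. The class names, the `select` helper, and the `fetch` callback are all made up for illustration. It assumes a catalog is just nested collections of datasets: listing the choices touches only cached catalog metadata, and the server is contacted only once a dataset is selected.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Dataset:
    """Smallest choosable thing in the catalog; identified by a URI."""
    uri: str
    name: str
    # Cached catalog metadata; allowed to be slightly stale.
    catalog_metadata: Dict[str, object] = field(default_factory=dict)


@dataclass
class Collection:
    """A grouping of datasets and/or nested collections (hypothetical model)."""
    name: str
    datasets: List[Dataset] = field(default_factory=list)
    collections: List["Collection"] = field(default_factory=list)

    def list_choices(self) -> List[str]:
        """Return every dataset URI without contacting any server."""
        uris = [d.uri for d in self.datasets]
        for sub in self.collections:
            uris.extend(sub.list_choices())
        return uris


def select(dataset: Dataset,
           fetch: Callable[[str], Dict[str, object]]) -> Dict[str, object]:
    """Contact the server only now; its answer overrides the cached values."""
    real_metadata = fetch(dataset.uri)
    return {**dataset.catalog_metadata, **real_metadata}
```

The `fetch` argument stands in for whatever actually talks to the server, so the slow step stays out of the listing path.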
inventory = listing of datasets
No, a listing of datasets is what I refer to as a directory (not a
directory on a computer). The GCMD is an example of same. An
inventory is a listing of elements in a data set, it could be a
list of times for satellite images in an archive along with the
physical location of the data (tape C18341 on a rack, or
N861230147.hat in a computer directory on my machine) or a list
of times and locations of each XBT in an XBT archive.
so is an inventory an internal thing that the server uses to construct the
datasets that are visible to the outside world?
what does it mean to "group files into data sets"? like the agg server?
One might say that all images in this projection, from this satellite,
processed this way form a data set. Or one could say that all images in
this projection, from this suite of satellites processed this way
form a data set. Or... This is the trouble with data sets, different
people call different groupings of the data a data set. This caused
a lot of bloodletting between NASA and NOAA a number of years back.
The idea is NOT to call every granule or every file in the system a
data set; you know the difference between lumpers and splitters. In
order for us to make progress, we have to back off a bit and look at
the big picture; grouping things into data sets allows us to do that.
This is exactly the problem that the DODS crawler has. When it crawls
a site such as our satellite archive, it ends up with thousands of
entries and the system or the person viewing the results struggles
with a data overload, more information than s/he/it (hmm... have
to be careful with these gender neutral versions) wants or needs to
locate the group of files that define the object of interest. Given
that there is no precise definition for how to group files into a
data set, I think that we can reduce the amount of information that
we have to deal with to a reasonable view of all the data on the
system without losing much if anything. The crawler is likely to group
the files slightly differently in some cases than the human would, but
one could probably discover this pretty quickly and steer the crawler
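
As a rough illustration of the grouping described above (a sketch only, not how the DODS crawler actually works), granules can be lumped into data sets by whichever characteristics one chooses to treat as defining. The record fields and file names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Granule = Dict[str, str]

# Hypothetical crawl results: one record per file (granule).
GRANULES: List[Granule] = [
    {"file": "img001", "satellite": "sat-A", "projection": "proj-1", "processing": "v1"},
    {"file": "img002", "satellite": "sat-A", "projection": "proj-1", "processing": "v1"},
    {"file": "img003", "satellite": "sat-B", "projection": "proj-1", "processing": "v1"},
]


def group_granules(granules: Sequence[Granule],
                   keys: Sequence[str]) -> Dict[Tuple[str, ...], List[str]]:
    """Group file names by the values of the chosen key fields."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for granule in granules:
        groups[tuple(granule[k] for k in keys)].append(granule["file"])
    return dict(groups)


# Splitter's view: one data set per satellite.
per_satellite = group_granules(GRANULES, ["projection", "satellite", "processing"])
# Lumper's view: the whole suite of satellites is one data set.
per_suite = group_granules(GRANULES, ["projection", "processing"])
```

Changing the list of keys is the "steering": the same three files come out as two data sets under the first grouping and as one under the second.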
ok, this seems to be similar to the "collections" vs "datasets" issue above. I
think i need to hear Steve's tech presentation before I can understand this any
Generating "inventories of granules in data sets" makes sense in the context of
an agg server, but is there also meaning to it in the context of a normal DODS
Not sure exactly what you mean here. We have file servers which are
inventories of granules in data sets. Actually the terminology is a
bit loose here also. The server in this case is a DODS FreeForm server.
It serves a table that contains a list of URLs with the characteristic(s)
that differentiate one URL from another, time in the case of our satellite
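
A file server of the kind described above can be pictured as a small table keyed by time. The Python sketch below is only an illustration under that assumption; it is not the FreeForm server's actual format, and the row type, function name, and URLs are invented.

```python
from datetime import datetime
from typing import List, NamedTuple, Sequence


class InventoryRow(NamedTuple):
    """One granule: the characteristic that distinguishes it, plus its URL."""
    time: datetime
    url: str


def urls_between(rows: Sequence[InventoryRow],
                 start: datetime, end: datetime) -> List[str]:
    """Return the URLs of granules with start <= time <= end, in time order."""
    ordered = sorted(rows, key=lambda row: row.time)
    return [row.url for row in ordered if start <= row.time <= end]


# Hypothetical inventory for one data set.
INVENTORY = [
    InventoryRow(datetime(1986, 12, 30), "http://example.invalid/granule-b"),
    InventoryRow(datetime(1986, 12, 29), "http://example.invalid/granule-a"),
    InventoryRow(datetime(1987, 1, 2), "http://example.invalid/granule-c"),
]

december = urls_between(INVENTORY, datetime(1986, 12, 1), datetime(1986, 12, 31))
```

The data set is the table as a whole; the inventory is its rows, each pointing at one granule.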
i think some of the problem is that i think of DODS narrowly as a specific
client/server protocol, and you include services and extensions that have been
built with or use that protocol.