John Caron wrote:
> Peter Cornillon wrote:
> >>we expect that data holdings can be divided into two categories. 1) sites in
> >>which the monitoring (e.g. crawling) can be done occasionally (once a day,
> >>once an hour, once a week?), and the impact of the crawling is therefore
> >>minimal. 2) real-time sites that have constantly changing data. For these, we
> >>probably need a different strategy, and we are considering instrumenting the
> >>LDM as one possible solution.
> > But in sites that are being continuously updated, it seems to me
> > that you need a local inventory, a file or some other way of
> > keeping track of the contents of a data set. This is our notion
> > of a file server or your configuration file in the Aggregation
> > Server. This is the thing that you want to discover when searching
> > for data sets, not all of the files (or granules or whatever) in
> > the data set. This is what we are wrestling with in the crawler that
> > we are looking at. In particular, I have asked Steve to look at
> > ways of having the crawler group files into data sets automatically
> > and then to reference the inventory for the data set rather than
> > the entire data set and to make the crawler capable of updating
> > the inventory.
> Just to make sure I understand your terminology:
> files = physical files
> datasets = logical files we want the user to see
I don't think about datasets in terms of files. A data set could be a group
of files, a single file,... I guess that the reason that I don't think
about it that way is that the data need not be in digital form to be
grouped in a data set. Beach profiles that have been collected over
the past 50 years and consist of pages of numbers - monthly values of
depth below mean low water at specified distances from a marker in a
given direction would qualify. I suppose that your definition is
correct from a computer perspective; I just don't think of it that way.
> inventory = listing of datasets
No, a listing of datasets is what I refer to as a directory (not a
directory on a computer). The GCMD is an example of same. An
inventory is a listing of elements in a data set; it could be a
list of times for satellite images in an archive along with the
physical location of the data (tape C18341 on a rack, or
N861230147.hat in a computer directory on my machine) or a list
of times and locations of each XBT in an XBT archive.
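
To make the inventory idea concrete, here is a minimal sketch in Python of
what one might look like as a data structure. The class names, the field
names and the times in the usage note are all hypothetical; only the two
location strings come from the examples above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class InventoryEntry:
    """One element (granule) of a data set: when it was taken, where it lives."""
    time: datetime
    location: str  # e.g. a tape label or a file name


@dataclass
class Inventory:
    """A listing of the elements in ONE data set (not a listing of data sets)."""
    dataset: str
    entries: list = field(default_factory=list)

    def add(self, time, location):
        """Record one more element of the data set."""
        self.entries.append(InventoryEntry(time, location))

    def between(self, start, end):
        """Return entries with start <= time < end, in time order."""
        hits = [e for e in self.entries if start <= e.time < end]
        return sorted(hits, key=lambda e: e.time)
```

One would then call something like inv.add(datetime(1986, 5, 3), "tape C18341")
for each image; the date here is made up for illustration.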
> granule = ??
This word is starting to lose its meaning. It used to refer to
the smallest physically distinct element in a dataset. A CDROM
is a good example. You don't get part of the CD. A file on a
system is another example. Admittedly you could break the file
up (and we do in DODS), but most people would still see a file
as a granule. It made sense in the old days when things were moved
around as files on tapes. It referred to the smallest thing that
was readily subsettable.
> what does it mean to "group files into data sets"? like the agg server?
One might say that all images in this projection, from this satellite,
processed this way form a data set. Or one could say that all images in
this projection, from this suite of satellites processed this way
form a data set. Or... This is the trouble with data sets, different
people call different groupings of the data a data set. This caused
a lot of blood letting between NASA and NOAA a number of years back.
The idea is NOT to call every granule or every file in the system a
data set, you know the difference between lumpers and splitters. In
order for us to make progress, we have to back off a bit and look at
the big picture, grouping things into data sets allows us to do that.
This is exactly the problem that the DODS crawler has. When it crawls
a site such as our satellite archive, it ends up with thousands of
entries and the system or the person viewing the results struggles
with a data overload, more information than s/he/it (humm... have
to be careful with these gender neutral versions) wants or needs to
locate the group of files that define the object of interest. Given
that there is no precise definition for how to group files into a
data set, I think that we can reduce the amount of information that
we have to deal with to a reasonable view of all the data on the
system without losing much if anything. The crawler is likely to group
the files slightly differently in some cases than the human would, but
one could probably discover this pretty quickly and steer the crawler
God, that Cornillon, what a windbag. Sowwy.
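
As a rough illustration of automatic grouping, here is a minimal Python
sketch. It is not what the DODS crawler does; the rule it uses (files whose
names differ only in their digits belong to the same data set) is purely an
assumption for the example, as are the function names and the directory
names in the usage note.

```python
import re
from collections import defaultdict


def grouping_key(path):
    """Replace each run of digits with '#', so that names differing only
    in their numbers (times, sequence counters) share one key."""
    return re.sub(r"\d+", "#", path)


def group_into_datasets(paths, key=grouping_key):
    """Group file paths into candidate data sets.

    Passing a different `key` function is one way a person could steer
    the grouping when the default rule lumps or splits badly.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[key(path)].append(path)
    return dict(groups)
```

Fed "sst/N861230147.hat", "sst/N861240203.hat" and "xbt/cast0001.dat", this
gives two groups rather than three entries, which is the reduction being
argued for above.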
> > Our hope is that the crawler would work locally
> > building the inventory locally and could be made to run as often
> > as you like. However, the inventory need not reside at the site
> > containing the actual data and the crawler could be run from a
> > remote site as our prototype does. The point here is that there
> > are two types of crawlers generating two types of lists, one
> > that generates inventories of granules in data sets (generally
> > locally and can be run as often as you like) and the other generating
> > inventories of data sets - directories (generally run remotely
> > less often). Finally, I note that the inventory could be generated
> > in other ways, for example every time a granule is added to a
> > data set, the inventory could automatically be updated. I really
> > see the inventory issue as a local process. What is strange is
> > the number of data sets that we encounter that do not have a
> > formal inventory and this is what gives rise to this problem.
> Some possible terminology clarifications:
> We have been using the word "crawler" to mean a process that gets all of its
> information from the web/DODS server. So it can't see local disk files, but can
> be run remotely.
> A process that must run locally, and can have access to whatever files exist,
> we have been calling a "scanner" as in disk scanner.
OK, I like this terminology. Please replace crawler with scanner in what
I just wrote above and I will try to use this terminology in the future.
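
For what it is worth, a scanner in this sense could be as simple as the
Python sketch below. It is only an illustration under stated assumptions:
the inventory is a plain dictionary from file path to modification time,
and the function name is hypothetical. A real one would also need to
notice files that have been removed or changed.

```python
import os


def scan(root, inventory):
    """Walk the local directory tree under `root` and add any file that is
    not yet in `inventory` (a dict: path -> modification time).

    Returns the sorted list of newly added paths, so the scanner can be
    run as often as you like and only reports what is new.
    """
    added = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path not in inventory:
                inventory[path] = os.path.getmtime(path)
                added.append(path)
    return sorted(added)
```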
> Generating "inventories of granules in data sets" makes sense in the context of
> an agg server, but is there also meaning to it in the context of a normal DODS
Not sure exactly what you mean here. We have file servers which are
inventories of granules in data sets. Actually the terminology is a
bit loose here also. The server in this case is a DODS FreeForm server.
It serves a table that contains a list of URLs with the characteristic(s)
that differentiate one URL from another, time in the case of our satellite
Graduate School of Oceanography - Telephone: (401) 874-6283
University of Rhode Island - FAX: (401) 874-6728
Narragansett RI 02882 USA - Internet: pcornillon@xxxxxxxxxxx