
Re: THREDDS/DLESE Connections slides

Peter Cornillon wrote:

We expect that data holdings can be divided into two categories: 1) sites in
which the monitoring (e.g., crawling) can be done occasionally (once a day, once
an hour, once a week?), so the impact of the crawling is minimal; 2) real-time
sites that have constantly changing data. For these, we probably need a
different strategy, and we are considering instrumenting the LDM as one
possible solution.

But in sites that are being continuously updated, it seems to me
that you need a local inventory, a file or some other way of
keeping track of the contents of a data set. This is our notion
of a file server, or your configuration file in the Aggregation
Server. This is the thing that you want to discover when searching
for data sets, not all of the files (or granules or whatever) in
the data set. This is what we are wrestling with in the crawler
that we are looking at. In particular, I have asked Steve to look at
ways of having the crawler group files into data sets automatically
and then to reference the inventory for the data set rather than
the entire data set, and to make the crawler capable of updating
the inventory.
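
As a purely illustrative sketch of what "group files into data sets automatically" could mean, the snippet below groups file names by a key obtained by masking every run of digits. The heuristic, the function names (`dataset_key`, `group_into_datasets`) and the sample file names are all assumptions made up for this example; nothing here describes how the crawler under discussion actually works.

```python
import re
from collections import defaultdict

# Hypothetical heuristic: runs of digits (dates, sequence numbers) are assumed
# to be the only part of a name that varies between granules of one data set.
_DIGITS = re.compile(r"\d+")


def dataset_key(filename):
    """Replace each run of digits with '#' to get a grouping key."""
    return _DIGITS.sub("#", filename)


def group_into_datasets(filenames):
    """Map each grouping key to the sorted list of files that share it."""
    datasets = defaultdict(list)
    for name in filenames:
        datasets[dataset_key(name)].append(name)
    return {key: sorted(names) for key, names in datasets.items()}


# Made-up file names, for illustration only.
files = ["sst_19990101.nc", "sst_19990102.nc", "wind_19990101.nc"]
groups = group_into_datasets(files)
for key, members in sorted(groups.items()):
    print(key, len(members))
```

A search service would then advertise one entry per key (the data set) and point at its inventory, rather than listing every file.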

Just to make sure I understand your terminology:

files = physical files
datasets = logical files we want the user to see
inventory = listing of datasets
granule = ??

What does it mean to "group files into data sets"? Like the agg server does?

Our hope is that the crawler would work locally,
building the inventory locally, and could be made to run as often
as you like. However, the inventory need not reside at the site
containing the actual data, and the crawler could be run from a
remote site, as our prototype does. The point here is that there
are two types of crawlers generating two types of lists: one
that generates inventories of granules in data sets (generally
run locally, as often as you like) and the other generating
inventories of data sets, i.e. directories (generally run remotely,
less often). Finally, I note that the inventory could be generated
in other ways; for example, every time a granule is added to a
data set, the inventory could automatically be updated. I really
see the inventory issue as a local process. What is strange is
the number of data sets that we encounter that do not have a
formal inventory, and this is what gives rise to this problem.
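
A minimal sketch of the "local process" described above, assuming the inventory is simply a JSON file listing the granules found in one directory. The file layout, the JSON format and the names `scan_granules` and `update_inventory` are hypothetical, chosen only for illustration. The inventory is deliberately kept outside the data directory so that it is not listed as a granule itself.

```python
import json
import os


def scan_granules(data_dir):
    """Return {filename: {"size", "mtime"}} for regular files in data_dir."""
    entries = {}
    for entry in os.scandir(data_dir):
        if entry.is_file():
            info = entry.stat()
            entries[entry.name] = {"size": info.st_size, "mtime": info.st_mtime}
    return entries


def update_inventory(data_dir, inventory_path):
    """Rescan data_dir, rewrite the inventory, return (added, removed) names."""
    try:
        with open(inventory_path) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}  # first run: no inventory exists yet
    new = scan_granules(data_dir)
    with open(inventory_path, "w") as f:
        json.dump(new, f, indent=2, sort_keys=True)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    return added, removed
```

Such a function could be run on a schedule, or called from whatever process adds a granule to the data set, which corresponds to the two update styles mentioned above.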

Some possible terminology clarifications:

We have been using the word "crawler" to mean a process that gets all of its information from the web/DODS server. So it can't see local disk files, but can be run remotely.

A process that must run locally, and can have access to whatever files exist, we have been calling a "scanner", as in disk scanner.
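
To make the distinction concrete, here is a rough sketch under stated assumptions: the "crawler" side only sees what a server publishes (here, links in an HTML listing supplied as a string), while the "scanner" side reads the local disk directly. The HTML snippet is invented, and no particular DODS server response format is implied; `crawl_listing` and `scan_directory` are hypothetical names.

```python
import os
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_listing(html_text):
    """Crawler view: only what the server's listing page exposes."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


def scan_directory(path):
    """Scanner view: every entry on the local disk, published or not."""
    return sorted(os.listdir(path))


# Invented listing page, for illustration only.
listing = '<a href="sst_19990101.nc">a</a> <a href="sst_19990102.nc">b</a>'
print(crawl_listing(listing))
```

In a real crawler the listing text would come from an HTTP request to the remote server; the scanner needs no network access but must run where the files live.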

Generating "inventories of granules in data sets" makes sense in the context of an agg server, but is there also meaning to it in the context of a normal DODS server?
