John Caron wrote:
> Peter Cornillon wrote:
> > We have built a prototype crawler that crawls a DODS site given the URL
> > for the site and finds all DODS files at the site. The problem is that
> > it has no way at present of differentiating between files in a data set
> > and the data set itself. At our site (a satellite archive) there are
> > currently in excess of 50,000 files and will soon be in excess of 100,000.
> > This makes sorting out the information returned by the crawler difficult
> > at best. (In situ archives can have 100,000s to millions of files - one
> > per xbt depending on the organization of the site.) Steve Hankin's group
> > is working on adding the ability to group files into data sets. I believe
> > that he is working with the GCMD on this.
> good point - it's not enough to discover URLs, one needs to know what they
> are. When there are a million files, the problem is not ignorable.
> >>The situation with ADDE servers is somewhat different. You can (more or
> >>less) query the server to find out what's available, but this collection of
> >>information takes a while (eg 8 minutes for complete image data on unidata's
> >>ADDE server), too long for interactive (eg MetApp) access.
> > But you still have to know the URL for the server itself. I assume that
> > there is more than one server? If that is the case there needs to be a
> > high level list somewhere of server sites. This high level list could
> > just as well be a list of data set URLs (where there might be a number
> > at a given site - back to the DODS data set list).
> yes, you need the initial server URL, and the "cataloger" needs to maintain a
> list of the servers it wants to catalog. So adding some extra info like root
> directories, or file filters etc, is not that much more to maintain.
> > Is your concern
> > currency at the directory? This is the issue that we hope to address
> > with the crawler. In fact, we hope to take the crawler one step farther
> > by adding a web page in the htdocs directory that says "I'm a DODS
> > server, here I am". A crawler can then not only crawl a given site
> > but when combined with a network crawler, crawl the entire network.
> > Well, not really, the way I see such a harvester is that it would use
> > existing repositories (dogpile, yahoo,...) to find server sites and
> > then direct the site crawler to crawl the sites.
> we expect that data holdings can be divided into two categories. 1) sites in
> which the monitoring (eg crawling) can be done occasionally (once a day, once
> an hour, once a week?), and the impact of the crawling is therefore minimal. 2)
> real-time sites that have constantly changing data. For these, we probably need
> a different strategy, and we are considering instrumenting the LDM as one
> possible solution.
But at sites that are being continuously updated, it seems to me
that you need a local inventory - a file or some other way of
keeping track of the contents of a data set. This is our notion
of a file server, or your configuration file in the Aggregation
Server. This is the thing that you want to discover when searching
for data sets, not all of the files (or granules or whatever) in
the data set, and it is what we are wrestling with in the crawler
that we are looking at.

In particular, I have asked Steve to look at ways of having the
crawler group files into data sets automatically, to reference the
inventory for a data set rather than the entire data set, and to
make the crawler capable of updating the inventory. Our hope is
that the crawler would build the inventory locally and could be
made to run as often as you like. However, the inventory need not
reside at the site containing the actual data, and the crawler
could be run from a remote site, as our prototype does.

The point here is that there are two types of crawlers generating
two types of lists: one that generates inventories of granules in
data sets (generally run locally, as often as you like) and one
that generates inventories of data sets - directories (generally
run remotely, less often). Finally, I note that the inventory could
be generated in other ways; for example, every time a granule is
added to a data set, the inventory could automatically be updated.
I really see the inventory issue as a local process. What is strange
is the number of data sets that we encounter that do not have a
formal inventory, and this is what gives rise to the problem.
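The automatic grouping step could be sketched along these lines. Everything below is hypothetical illustration - the digit-wildcard heuristic, the example URLs, and the function name are my own assumptions, not the actual crawler's logic; a real site would likely need a configurable filter instead:

```python
import re
from collections import defaultdict

def group_into_datasets(file_urls):
    """Group granule URLs into candidate data sets.

    Heuristic (assumed): granules of one data set live in the same
    directory and have names that differ only in runs of digits
    (dates, sequence numbers).  Collapsing digits to '#' yields a
    pattern that serves as the data set key.
    """
    datasets = defaultdict(list)
    for url in file_urls:
        directory, _, filename = url.rpartition("/")
        pattern = re.sub(r"\d+", "#", filename)  # e.g. sst.1999001.nc -> sst.#.nc
        datasets[(directory, pattern)].append(url)
    return dict(datasets)

# Hypothetical example: two data sets discovered from four granules.
urls = [
    "http://example.edu/dods/sst/sst.1999001.nc",
    "http://example.edu/dods/sst/sst.1999002.nc",
    "http://example.edu/dods/chl/chl.1999001.nc",
]
for (directory, pattern), granules in sorted(group_into_datasets(urls).items()):
    print(directory, pattern, len(granules))
```

A catalog built this way would list one entry per (directory, pattern) key - the inventory - instead of one entry per granule, and re-running the function on a fresh crawl updates the inventory.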
> we are not thinking about finding all possible datasets, just ones whose sites
> want to be part of the THREDDS network.
Graduate School of Oceanography - Telephone: (401) 874-6283
University of Rhode Island - FAX: (401) 874-6728
Narragansett RI 02882 USA - Internet: pcornillon@xxxxxxxxxxx