Peter Cornillon wrote:
We have built a prototype crawler that crawls a DODS site, given the URL
for the site, and finds all DODS files at the site. The problem is that
it at present has no way of differentiating between the files in a data set
and the data set itself. At our site (a satellite archive) there are
currently in excess of 50,000 files, and there will soon be in excess of 100,000.
This makes sorting out the information returned by the crawler difficult
at best. (In situ archives can have 100,000s to millions of files - one
per XBT, depending on the organization of the site.) Steve Hankin's group
is working on adding the ability to group files into data sets. I believe
that he is working with the GCMD on this.
Good point - it's not enough to discover URLs; one needs to know what they mean.
When there are a million files, the problem is not ignorable.
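One crude way to approach the file-vs-data-set problem is to group discovered file URLs by their parent directory. This is only a sketch of that heuristic - the example URLs and the assumption that one directory roughly equals one data set are invented for illustration, not part of any DODS convention:

```python
# Sketch: group crawled file URLs by parent directory as a crude
# proxy for "data set" membership. URLs below are invented examples.
from urllib.parse import urlparse
from collections import defaultdict
import posixpath

def group_by_directory(urls):
    """Group file URLs by parent directory path."""
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        groups[posixpath.dirname(path)].append(url)
    return dict(groups)

crawled = [
    "http://example.edu/dods/sst/1999/day001.hdf",
    "http://example.edu/dods/sst/1999/day002.hdf",
    "http://example.edu/dods/xbt/cast0001.nc",
]
# Yields two groups: one for the satellite series, one for the XBT casts.
```

A real cataloger would need more than directory structure (metadata, naming conventions), but even this reduces 50,000 file URLs to a short list of candidate data sets.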
The situation with ADDE servers is somewhat different. You can (more or
less) query the server to find out what's available, but this collection of
information takes a while (e.g. 8 minutes for complete image data on Unidata's
ADDE server) - too long for interactive (e.g. MetApp) access.
But you still have to know the URL for the server itself. I assume that
there is more than one server? If that is the case, there needs to be a
high-level list somewhere of server sites. This high-level list could
just as well be a list of data set URLs (where there might be a number
at a given site - back to the DODS data set list).
Yes, you need the initial server URL, and the "cataloger" needs to maintain a
list of the servers it wants to catalog. So adding some extra info, like root
directories or file filters, is not that much more to maintain.
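A minimal sketch of what such a maintained server list might look like, with the extra per-server info mentioned above. The entry fields, URLs, and glob-style filters are all assumptions for illustration:

```python
# Sketch: a cataloger's server list, each entry carrying a root
# directory and a file filter. All names and URLs are invented.
import fnmatch

SERVERS = [
    {"url": "http://dods.example.edu", "root": "/data/sst", "filter": "*.hdf"},
    {"url": "http://adde.example.edu", "root": "/images",   "filter": "*"},
]

def wanted(entry, path):
    """True if a discovered path is under this server's root and matches its filter."""
    return path.startswith(entry["root"]) and fnmatch.fnmatch(path, entry["filter"])
```

The point is just that the incremental maintenance cost is small: a root directory and a filter pattern per server, on top of the URL you already have to track.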
Is your concern
currency at the directory level? This is the issue that we hope to address
with the crawler. In fact, we hope to take the crawler one step farther
by adding a web page in the htdocs directory that says "I'm a DODS
server, here I am". A crawler can then not only crawl a given site
but, when combined with a network crawler, crawl the entire network.
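The self-identification idea could be as simple as checking a fetched page for a marker string. This is a hypothetical sketch - the page location and marker text below are invented, not an existing DODS convention:

```python
# Sketch: detect the proposed "I'm a DODS server" page. The well-known
# page name and marker string are assumptions for illustration only.
WELL_KNOWN_PAGE = "/htdocs/dods-server.html"   # hypothetical location
MARKER = "I'm a DODS server"

def is_dods_server(page_html):
    """True if the fetched page declares the site to be a DODS server."""
    return MARKER in page_html

sample = "<html><body>I'm a DODS server, here I am</body></html>"
```

A network crawler that already fetches pages site-by-site could run this check cheaply on each candidate site before handing it to the DODS site crawler.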
Well, not really; the way I see such a harvester is that it would use
existing repositories (Dogpile, Yahoo, ...) to find server sites and
then direct the site crawler to crawl those sites.
We expect that data holdings can be divided into two categories: 1) sites on
which the monitoring (e.g. crawling) can be done occasionally (once an hour, once a
day, once a week?), so that the impact of the crawling is minimal; and 2)
real-time sites that have constantly changing data. For these, we probably need
a different strategy, and we are considering instrumenting the LDM as one possibility.
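The two-category policy could be expressed as a simple scheduling rule: archives get an occasional re-crawl, while real-time sites are never polled and instead push updates (e.g. via the LDM). The category names and interval below are invented for illustration:

```python
# Sketch: polling policy for the two site categories described above.
# Interval values and category names are assumptions, not a specification.
RECRAWL_SECONDS = {
    "archive": 24 * 3600,   # occasional re-crawl, e.g. once a day
    "realtime": None,       # None = do not poll; expect pushed updates
}

def next_crawl_delay(site_kind):
    """Seconds until the next crawl, or None if the site should push instead."""
    return RECRAWL_SECONDS[site_kind]
```

The design point is that polling frequency is a property of the site, kept alongside the server-list entry, rather than a global crawler setting.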
We are not thinking about finding all possible datasets, just the ones whose sites
want to be part of the THREDDS network.