Peter Cornillon wrote:
We have built a prototype crawler that, given the URL for a DODS site, crawls the site and finds all DODS files there. The problem is that it has no way at present of differentiating between the files in a data set and the data set itself. At our site (a satellite archive) there are currently in excess of 50,000 files, and there will soon be in excess of 100,000. This makes sorting out the information returned by the crawler difficult at best. (In situ archives can have hundreds of thousands to millions of files, one per XBT, depending on the organization of the site.) Steve Hankin's group is working on adding the ability to group files into data sets. I believe that he is working with the GCMD on this.
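To make the file-versus-data-set problem concrete, here is a minimal sketch of one possible grouping heuristic: treat every file that shares a parent directory as belonging to one candidate data set. This is only an illustrative assumption, not the method used by the prototype crawler or by Steve Hankin's group; the function name `group_by_directory` and the example URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def group_by_directory(file_urls):
    """Group file URLs into candidate data sets keyed by parent directory.

    Assumption (not from the original message): files that live in the
    same directory belong to the same data set.
    """
    groups = defaultdict(list)
    for url in file_urls:
        parts = urlsplit(url)
        # Drop the file name and keep the directory part of the path.
        parent = posixpath.dirname(parts.path)
        key = f"{parts.scheme}://{parts.netloc}{parent}"
        groups[key].append(url)
    return dict(groups)


# Hypothetical crawler output: three files in two directories.
urls = [
    "http://example.org/dods/sst/1999/f001.hdf",
    "http://example.org/dods/sst/1999/f002.hdf",
    "http://example.org/dods/xbt/cast17.nc",
]
datasets = group_by_directory(urls)
```

With a grouping like this, a crawl that returns 50,000 file URLs could be reported as a much shorter list of candidate data sets. A real archive might need a different key (for example, a file-name pattern) if one directory holds several data sets.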
The situation with ADDE servers is somewhat different. You can (more or less) query the server to find out what's available, but collecting this information takes a while (e.g., 8 minutes for complete image data on Unidata's ADDE server), which is too long for interactive (e.g., MetApp) access.
But you still have to know the URL for the server itself. I assume that there is more than one server? If that is the case, there needs to be a high-level list of server sites somewhere. This high-level list could just as well be a list of data set URLs (where there might be a number at a given site, which brings us back to the DODS data set list).
Is your concern keeping the directory current? This is the issue that we hope to address with the crawler. In fact, we hope to take the crawler one step further by adding a web page in the htdocs directory that says "I'm a DODS server, here I am". A crawler can then not only crawl a given site but, when combined with a network crawler, crawl the entire network. Well, not really: the way I see such a harvester, it would use existing repositories (dogpile, yahoo, ...) to find server sites and then direct the site crawler to crawl those sites.
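The two-stage harvester described above might be sketched roughly as follows: take a list of candidate sites (which would come from an existing repository or search engine), check each one for the "I'm a DODS server" marker page, and hand only the confirmed sites to the site crawler. Everything named here is an assumption for illustration: the marker path `/dods_server.html`, the marker text being matched, and the `fetch` and `crawl_site` callables, which stand in for a real HTTP client and for the prototype crawler.

```python
def harvest(candidate_sites, fetch, crawl_site, marker_path="/dods_server.html"):
    """Crawl only those candidate sites that advertise themselves as DODS servers.

    fetch(url) returns the page text or raises OSError; crawl_site(base_url)
    returns the list of files found at that site. Both are hypothetical
    placeholders supplied by the caller.
    """
    results = {}
    for site in candidate_sites:
        base = site.rstrip("/")
        try:
            page = fetch(base + marker_path)
        except OSError:
            # No marker page reachable: skip this site.
            continue
        if page is None or "DODS server" not in page:
            # A page exists but does not identify a DODS server.
            continue
        results[base] = crawl_site(base)
    return results


# Stand-ins for the network and the site crawler, for demonstration only.
_pages = {"http://a.example/dods_server.html": "I'm a DODS server, here I am"}


def fake_fetch(url):
    if url in _pages:
        return _pages[url]
    raise OSError("not found")


def fake_crawl(base):
    return [base + "/data/f1.nc"]


found = harvest(["http://a.example/", "http://b.example"], fake_fetch, fake_crawl)
```

Passing `fetch` and `crawl_site` in as arguments keeps the sketch independent of any particular HTTP library and of how the site crawler itself is implemented.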
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.