Peter Cornillon wrote:
We have built a prototype crawler that crawls a DODS site, given the URL
for the site, and finds all DODS files at the site. The problem is that
it at present has no way of differentiating between the files in a data set
and the data set itself. At our site (a satellite archive) there are
currently in excess of 50,000 files, and there will soon be in excess of 100,000.
This makes sorting out the information returned by the crawler difficult
at best. (In situ archives can have 100,000s to millions of files - one
per XBT, depending on the organization of the site.) Steve Hankin's group
is working on adding the ability to group files into data sets. I believe
that he is working with the GCMD on this.
Good point - it's not enough to discover URLs; one needs to know what they mean.
When there are a million files, the problem is not ignorable.
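One crude way to approach the file-vs-data-set problem is to group discovered file URLs by their parent directory. This is only a sketch of that heuristic - the example URLs and the assumption that one directory roughly equals one data set are invented for illustration, not part of any DODS convention:

```python
# Sketch: group crawled file URLs by parent directory as a crude
# proxy for "data set" membership. URLs below are invented examples.
from urllib.parse import urlparse
from collections import defaultdict
import posixpath

def group_by_directory(urls):
    """Group file URLs by parent directory path."""
    groups = defaultdict(list)
    for url in urls:
        path = urlparse(url).path
        groups[posixpath.dirname(path)].append(url)
    return dict(groups)

crawled = [
    "http://example.edu/dods/sst/1999/day001.hdf",
    "http://example.edu/dods/sst/1999/day002.hdf",
    "http://example.edu/dods/xbt/cast0001.nc",
]
# Yields two groups: one for the satellite series, one for the XBT casts.
```

A real cataloger would need more than directory structure (metadata, naming conventions), but even this reduces 50,000 file URLs to a short list of candidate data sets.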
The situation with ADDE servers is somewhat different. You can (more or
less) query the server to find out what's available, but this collection of
information takes a while (e.g. 8 minutes for complete image data on Unidata's
ADDE server) - too long for interactive (e.g. MetApp) access.
But you still have to know the URL for the server itself. I assume that
there is more than one server? If that is the case, there needs to be a
high-level list somewhere of server sites. This high-level list could
just as well be a list of data set URLs (where there might be a number
at a given site - back to the DODS data set list).
Yes, you need the initial server URL, and the "cataloger" needs to maintain a
list of the servers it wants to catalog. So adding some extra info, like root
directories or file filters, is not that much more to maintain.
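A minimal sketch of what such a maintained server list might look like, with the extra per-server info mentioned above. The entry fields, URLs, and glob-style filters are all assumptions for illustration:

```python
# Sketch: a cataloger's server list, each entry carrying a root
# directory and a file filter. All names and URLs are invented.
import fnmatch

SERVERS = [
    {"url": "http://dods.example.edu", "root": "/data/sst", "filter": "*.hdf"},
    {"url": "http://adde.example.edu", "root": "/images",   "filter": "*"},
]

def wanted(entry, path):
    """True if a discovered path is under this server's root and matches its filter."""
    return path.startswith(entry["root"]) and fnmatch.fnmatch(path, entry["filter"])
```

The point is just that the incremental maintenance cost is small: a root directory and a filter pattern per server, on top of the URL you already have to track.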
Is your concern
currency at the directory level? This is the issue that we hope to address
with the crawler. In fact, we hope to take the crawler one step farther
by adding a web page in the htdocs directory that says "I'm a DODS
server, here I am". A crawler can then not only crawl a given site
but, when combined with a network crawler, crawl the entire network.
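The self-identification idea could be as simple as checking a fetched page for a marker string. This is a hypothetical sketch - the page location and marker text below are invented, not an existing DODS convention:

```python
# Sketch: detect the proposed "I'm a DODS server" page. The well-known
# page name and marker string are assumptions for illustration only.
WELL_KNOWN_PAGE = "/htdocs/dods-server.html"   # hypothetical location
MARKER = "I'm a DODS server"

def is_dods_server(page_html):
    """True if the fetched page declares the site to be a DODS server."""
    return MARKER in page_html

sample = "<html><body>I'm a DODS server, here I am</body></html>"
```

A network crawler that already fetches pages site-by-site could run this check cheaply on each candidate site before handing it to the DODS site crawler.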
Well, not really; the way I see such a harvester is that it would use
existing repositories (Dogpile, Yahoo, ...) to find server sites and
then direct the site crawler to crawl those sites.
We expect that data holdings can be divided into two categories: 1) sites on
which the monitoring (e.g. crawling) can be done occasionally (once an hour, once a
day, once a week?), so that the impact of the crawling is minimal; and 2)
real-time sites that have constantly changing data. For these, we probably need
a different strategy, and we are considering instrumenting the LDM as one possibility.
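The two-category policy could be expressed as a simple scheduling rule: archives get an occasional re-crawl, while real-time sites are never polled and instead push updates (e.g. via the LDM). The category names and interval below are invented for illustration:

```python
# Sketch: polling policy for the two site categories described above.
# Interval values and category names are assumptions, not a specification.
RECRAWL_SECONDS = {
    "archive": 24 * 3600,   # occasional re-crawl, e.g. once a day
    "realtime": None,       # None = do not poll; expect pushed updates
}

def next_crawl_delay(site_kind):
    """Seconds until the next crawl, or None if the site should push instead."""
    return RECRAWL_SECONDS[site_kind]
```

The design point is that polling frequency is a property of the site, kept alongside the server-list entry, rather than a global crawler setting.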
We are not thinking about finding all possible datasets, just the ones whose sites
want to be part of the THREDDS network.