John Caron wrote:
>
> Peter Cornillon wrote:
> >
> > We have built a prototype crawler that crawls a DODS site given the URL
> > for the site and finds all DODS files at the site. The problem is that
> > it has no way at present of differentiating between files in a data set
> > and the data set itself. At our site (a satellite archive) there are
> > currently in excess of 50,000 files and will soon be in excess of 100,000.
> > This makes sorting out the information returned by the crawler difficult
> > at best. (In situ archives can have 100,000s to millions of files - one
> > per xbt, depending on the organization of the site.) Steve Hankin's group
> > is working on adding the ability to group files into data sets. I believe
> > that he is working with the GCMD on this.
>
> Good point - it's not enough to discover URLs, one needs to know what they
> mean. When there's a million files, the problem is not ignorable.
>
> >> The situation with ADDE servers is somewhat different. You can (more or
> >> less) query the server to find out what's available, but this collection of
> >> information takes a while (e.g. 8 minutes for complete image data on Unidata's
> >> ADDE server), too long for interactive (e.g. MetApp) access.
> >
> > But you still have to know the URL for the server itself. I assume that
> > there is more than one server? If that is the case, there needs to be a
> > high-level list somewhere of server sites. This high-level list could
> > just as well be a list of data set URLs (where there might be a number
> > at a given site - back to the DODS data set list).
>
> Yes, you need the initial server URL, and the "cataloger" needs to maintain a
> list of the servers it wants to catalog. So adding some extra info like root
> directories, or file filters, etc., is not that much more to maintain.
>
> > Is your concern currency of the directory? This is the issue that we
> > hope to address with the crawler.
> > In fact, we hope to take the crawler one step farther by adding a web
> > page in the htdocs directory that says "I'm a DODS server, here I am".
> > A crawler can then not only crawl a given site but, when combined with
> > a network crawler, crawl the entire network. Well, not really; the way
> > I see such a harvester is that it would use existing repositories
> > (Dogpile, Yahoo, ...) to find server sites and then direct the site
> > crawler to crawl the sites.
>
> We expect that data holdings can be divided into two categories: 1) sites in
> which the monitoring (e.g. crawling) can be done occasionally (once a day,
> once an hour, once a week?), and the impact of the crawling is therefore
> minimal; 2) real-time sites that have constantly changing data. For these,
> we probably need a different strategy, and we are considering instrumenting
> the LDM as one possible solution.

But in sites that are being continuously updated, it seems to me that you
need a local inventory - a file or some other way of keeping track of the
contents of a data set. This is our notion of a file server, or your
configuration file in the Aggregation Server. This is the thing that you want
to discover when searching for data sets, not all of the files (or granules,
or whatever) in the data set. This is what we are wrestling with in the
crawler that we are looking at. In particular, I have asked Steve to look at
ways of having the crawler group files into data sets automatically, then to
reference the inventory for the data set rather than the entire data set, and
to make the crawler capable of updating the inventory. Our hope is that the
crawler would build the inventory locally and could be made to run as often
as you like. However, the inventory need not reside at the site containing
the actual data, and the crawler could be run from a remote site, as our
prototype does.
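The automatic grouping step described above could be approached heuristically: granules of one data set typically share a directory and a filename pattern, differing only in a date or sequence number. A minimal sketch of that idea follows; the URLs, the digit-collapsing rule, and all function names here are hypothetical illustrations, not part of any actual DODS crawler or server.

```python
import re
from collections import defaultdict

def dataset_key(url):
    """Collapse the varying (date/sequence) part of a granule URL so that
    all granules of one data set map to the same key.  Heuristic assumption:
    runs of digits in the filename are the per-granule part."""
    directory, _, name = url.rpartition("/")
    pattern = re.sub(r"\d+", "#", name)  # e.g. sst.2001120.hdf -> sst.#.hdf
    return directory + "/" + pattern

def build_inventory(granule_urls):
    """Group a flat list of granule URLs into candidate data sets,
    keyed by directory plus collapsed filename pattern."""
    inventory = defaultdict(list)
    for url in granule_urls:
        inventory[dataset_key(url)].append(url)
    return dict(inventory)

# Hypothetical granule URLs for illustration only.
urls = [
    "http://example.edu/dods/avhrr/sst.2001120.hdf",
    "http://example.edu/dods/avhrr/sst.2001121.hdf",
    "http://example.edu/dods/xbt/cast04512.nc",
]
inv = build_inventory(urls)
# inv now maps two candidate data sets to their granule lists.
```

A searcher would then be pointed at the two inventory keys rather than at every granule; rerunning the crawler (locally or remotely) and appending new URLs to the matching key is what keeps the inventory current.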
The point here is that there are two types of crawlers generating two types
of lists: one that generates inventories of granules in data sets (generally
run locally, as often as you like), and the other generating inventories of
data sets - directories (generally run remotely, less often). Finally, I note
that the inventory could be generated in other ways; for example, every time
a granule is added to a data set, the inventory could automatically be
updated. I really see the inventory issue as a local process. What is strange
is the number of data sets that we encounter that do not have a formal
inventory, and this is what gives rise to this problem.

> We are not thinking about finding all possible datasets, just ones whose
> sites want to be part of the THREDDS network.

--
Peter Cornillon
Graduate School of Oceanography  -  Telephone: (401) 874-6283
University of Rhode Island       -  FAX: (401) 874-6728
Narragansett RI 02882 USA        -  Internet: address@hidden
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.