
20051208: LDM on ensemble.ecmwf.int (cont.)



>From:  David Ian Brown <address@hidden>
>Organization:  UCAR/Unidata
>Keywords:  200512081747.jB8HlM7s017330

Dave, et al.,

>I tried reconfiguring the ldmd.conf file on dataportal with
>the request line:
>request SPARE ".*" ensemble.ecmwf.int primary
>in place of
>request SPARE ".*" teaccess.ecmwf.int primary
>
>but the log now has many lines similar to  the following:
>
>Dec 08 17:30:45 dataportal ensemble[28056] ERROR: Terminating due to  
>LDM failure; Couldn't get IP address of host ensemble.ecmwf.int
>  -dave

This appears to be a DNS issue:  dataportal cannot resolve the name
ensemble.ecmwf.int to an IP address.
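
One way to confirm this outside of the LDM is to ask the system resolver
for the name directly.  What follows is a minimal sketch in Python, not
part of the LDM; the function name resolve() is made up for illustration,
and it only reports what the local resolver returns on the machine where
it is run.

```python
import socket
import sys


def resolve(hostname):
    """Return the sorted IP addresses for hostname, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The resolver could not map the name to an address.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.  Deduplicate with a set.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Host names are taken from the command line, e.g.
    #   python resolve.py ensemble.ecmwf.int teaccess.ecmwf.int
    for host in sys.argv[1:]:
        addrs = resolve(host)
        print(host, "->", ", ".join(addrs) if addrs else "no address found")
```

If the old name resolves on dataportal and the new one does not, the
problem lies with name service rather than with the LDM configuration.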

>Also I'd like a bit more advice on how to proceed with testing.
>So far I have not actually saved any data. I assume the queue just
>overwrites the oldest data,

Yes, the LDM queue module will delete the oldest products in the queue
to make room for new ones.  The age of the oldest queue product can be
seen using pqmon:

<as 'ldm'>
pqmon -l-

Here is an example from one of our NOAAPORT ingest machines:

Dec 08 17:52:50 pqmon NOTE: Starting Up (14912)
Dec 08 17:52:50 pqmon NOTE: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
Dec 08 17:52:50 pqmon NOTE: 108726     1  135413   999632480    165656        6     78483    369056 6220
Dec 08 17:52:50 pqmon NOTE: Exiting

The last listed value is the age of the oldest product in the queue in
seconds.
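
If you want to track the age over time, the listing can be parsed with a
few lines of script.  The sketch below is only an illustration and
assumes the output layout shown in the example above (a header record
beginning with "nprods" followed by one record of values); the function
name parse_pqmon() is hypothetical, and other LDM versions may print
different columns.

```python
SAMPLE = """\
Dec 08 17:52:50 pqmon NOTE: Starting Up (14912)
Dec 08 17:52:50 pqmon NOTE: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
Dec 08 17:52:50 pqmon NOTE: 108726     1  135413   999632480    165656        6     78483    369056 6220
Dec 08 17:52:50 pqmon NOTE: Exiting
"""


def parse_pqmon(text):
    """Return a dict of column name -> integer value, or None if not found."""
    records = []
    for line in text.splitlines():
        if "NOTE:" in line:
            # A new log record: keep the tokens after the "NOTE:" tag.
            records.append(line.split("NOTE:", 1)[1].split())
        elif records and line.strip():
            # A wrapped continuation line: append to the previous record.
            records[-1].extend(line.split())
    # The values record is the one immediately after the header record.
    for header, values in zip(records, records[1:]):
        if header and header[0] == "nprods" and len(header) == len(values):
            return dict(zip(header, (int(v) for v in values)))
    return None


if __name__ == "__main__":
    stats = parse_pqmon(SAMPLE)
    print("oldest product age: %d s (%.1f hours)" % (stats["age"], stats["age"] / 3600.0))
```

For the example above the age is 6220 seconds, or roughly 1.7 hours of
data resident in the queue.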

>and that for now we are just testing to see
>if the data can be transferred quickly enough between sites.

Yes.  This is the first step.  One word of advice:  we may find that we
need to split the single data request into several disjoint ones, each
of which is carried over its own TCP connection.  This technique helps
mitigate the backoff feature of the current implementations of TCP.
If/when TCP gets updated with fast TCP, this feed splitting should no
longer be necessary.
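
For a split to be safe, every product must match exactly one of the
request patterns.  The sketch below is a rough way to check a candidate
set of patterns against sample product IDs before editing ldmd.conf.
Everything specific in it is an assumption for illustration: the product
IDs are invented, the split on the final digit only works if the real
IDs end in a well-distributed digit, and Python's re module is used as a
stand-in for the regular expressions the LDM itself evaluates.  The
generated lines simply follow the form of the request line quoted above.

```python
import re


def coverage(patterns, product_ids):
    """Map each product ID to the list of patterns that match it."""
    compiled = [(p, re.compile(p)) for p in patterns]
    return {pid: [p for p, rx in compiled if rx.search(pid)] for pid in product_ids}


def is_partition(patterns, product_ids):
    """True if every product ID is matched by exactly one pattern."""
    return all(len(hits) == 1 for hits in coverage(patterns, product_ids).values())


def request_lines(feedtype, patterns, host):
    """Build request lines in the same form as the one quoted in this thread."""
    return ['request %s "%s" %s primary' % (feedtype, p, host) for p in patterns]


if __name__ == "__main__":
    # Hypothetical product IDs and a hypothetical two-way split.
    sample_ids = ["ens.member.001", "ens.member.007", "ens.member.015"]
    patterns = ["[0-4]$", "[5-9]$"]
    if is_partition(patterns, sample_ids):
        for line in request_lines("SPARE", patterns, "ensemble.ecmwf.int"):
            print(line)
    else:
        print("patterns overlap or miss some products:")
        print(coverage(patterns, sample_ids))
```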

>I see the statistics for dataportal are now visible at
>http://www.unidata.ucar.edu/software/idd/rtstats/siteindex.php?dataportal.ucar.edu

Yes.

>The graphs seem to indicate that data transfer to dataportal stopped
>yesterday at around 0600 hours. Did something happen externally or has
>something gone wrong with the dataportal ldm that I need to attend to?

All:  there have been multiple reports to Unidata User Support of LDMs
stopping yesterday at 5Z.  I experienced this also on a dual Xeon
EM64T, Fedora Core 4 64-bit machine running LDM-6.4.2 in my office.  I
was able to restart the LDM after deleting and remaking the LDM queue
twice.  My gut feeling is that the assertion failure that was
reported:

Dec 08 05:24:22 ldm thelma[8869]: assertion "n > 0" failed: file "pq.c", line 2187

was somehow related to the time (!?).  If this hunch is true, it seems
to me that one should be able to restart the LDM without deleting and
remaking the queue.  Anyone who sees this problem listed in their LDM
log file: please report the failure to Unidata User Support
<address@hidden>.  Thanks!

Our LDM developer, Steve Emmerson, will be looking at this failure when
he returns from the AGU meeting.

Just so you know, this is the first time we have seen this failure
in any LDM-6 installation.

Cheers,

Tom
--
+-----------------------------------------------------------------------------+
* Tom Yoksas                                             UCAR Unidata Program *
* (303) 497-8642 (last resort)                                  P.O. Box 3000 *
* address@hidden                                   Boulder, CO 80307 *
* Unidata WWW Service                             http://www.unidata.ucar.edu/*
+-----------------------------------------------------------------------------+