
[LDM #JIG-686458]: ldm 6.4.7.1 restart problem



Art,

> I've been having occasional problems with our LDM (6.4.7.1) "resetting"
> and instead of resuming data collection at the time it left off, it goes
> back as far as the "-m" setting in the upstream queue.  Below is an
> example from yesterday with iddrs3 being the downstream node and
> idd-ingest being the upstream:
> 
> Jan 14 16:29:24 idd-ingest iddrs3.meteo.psu.edu(feed)[19268] ERROR:
> Couldn't flush connection; nullproc_6() failure to iddrs3.meteo.psu.edu:
> RPC: Timed out
> Jan 14 16:29:46 idd-ingest iddrs3.meteo.psu.edu[19885] WARN:
> findTimeEntryWithOffset(): Target data-product with given metadata not
> found in time-map near its creation-time (20070114162655.320)

Data-products are indexed in the product-queue according to their
insertion-time and data-products have their creation-time as part
of their metadata.  As a consequence, it's very important that the
clocks on the various LDM machines be correct.  For example, if the
data-product that host iddrs3 last received from host idd-ingest
had a creation-time that was later than the time (according to the
system clock on idd-ingest) when it was inserted into idd-ingest's
product-queue, then the LDM on idd-ingest will be unable to find
that last, successfully-transmitted data-product.
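As a rough illustration of that failure condition, here is a minimal Python sketch. It is not LDM code: the helper names are hypothetical, the timestamp layout is taken from the log lines in this message, and the insertion-time shown is an invented value used only to demonstrate the comparison.

```python
from datetime import datetime, timezone


def parse_ldm_time(stamp: str) -> datetime:
    """Parse a timestamp such as '20070114162655.320' (assumed to be UTC)."""
    parsed = datetime.strptime(stamp, "%Y%m%d%H%M%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def created_after_insertion(creation: str, insertion: str) -> bool:
    """Return True when a product claims to be created after it was inserted.

    That ordering is only possible if the two clocks disagree, and it is
    the case in which a search near the creation-time would miss the product.
    """
    return parse_ldm_time(creation) > parse_ldm_time(insertion)


if __name__ == "__main__":
    creation = "20070114162655.320"   # creation-time from the WARN line above
    insertion = "20070114162650.000"  # hypothetical insertion-time on idd-ingest
    if created_after_insertion(creation, insertion):
        print("creation-time is later than insertion-time; check the clocks")
```

If a check like this ever reports True for real products, the clocks on the machines involved are the first thing to look at.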

Also, is the product-queue on idd-ingest large enough?  What's the
mean age of the oldest data-product?  What's the minimum age of the
oldest data-product?  (Use pqmon(1) to discover this.)

> Jan 14 16:29:46 idd-ingest iddrs3.meteo.psu.edu[19885] NOTE: Data-product
> with signature e364b6103d1b4037e788f32c5516c86b wasn't found in
> product-queue
> Jan 14 16:29:46 idd-ingest iddrs3.meteo.psu.edu(feed)[19885] NOTE:
> Starting Up(6.4.7.1/6): 20070114102944.933 TS_ENDT {{ANY,  ".*"}},
> SIG=e364b6103d1b4037e788f32c5516c86b, Primary
> Jan 14 16:29:46 idd-ingest iddrs3.meteo.psu.edu(feed)[19885] NOTE: topo:
> iddrs3.meteo.psu.edu {{ANY, (.*)}}
> 
> 
> Jan 14 16:29:34 iddrs3 idd-ingest.meteo.psu.edu[3470] ERROR: readtcp():
> EOF on socket 4
> Jan 14 16:29:44 iddrs3 idd-ingest.meteo.psu.edu[3470] ERROR:
> one_svc_run(): RPC layer closed connection
> Jan 14 16:29:44 iddrs3 idd-ingest.meteo.psu.edu[3470] ERROR: Disconnecting
> due to LDM failure; Connection to upstream LDM closed
> Jan 14 16:29:44 iddrs3 idd-ingest.meteo.psu.edu[3470] NOTE: LDM-6 desired
> product-class: 20070114102944.933 TS_ENDT {{ANY,  ".*"},{NONE,
> "SIG=e364b6103d1b4037e788f32c5516c86b"}}

The LDM on iddrs3 is asking for data that was created about 6 hours ago
(the request was logged at 16:29:44 and starts at 10:29:44).
Is the maximum acceptable latency on iddrs3 really 6 hours?
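The size of that gap can be checked directly from the two timestamps in the NOTE line. The following Python sketch is only an illustration: the function name is hypothetical, and it assumes that the syslog time is in UTC and that the year is 2007, which syslog does not record but the product-class timestamp does.

```python
from datetime import datetime


def request_lookback_hours(log_time: str, from_time: str) -> float:
    """Hours between when a request was logged and the start time it asks for.

    log_time  -- syslog time with the year prepended, e.g. '2007 Jan 14 16:29:44'
    from_time -- product-class start time, e.g. '20070114102944.933'
    Both are assumed to be in the same time zone (UTC).
    """
    logged = datetime.strptime(log_time, "%Y %b %d %H:%M:%S")
    requested = datetime.strptime(from_time, "%Y%m%d%H%M%S.%f")
    return (logged - requested).total_seconds() / 3600.0


if __name__ == "__main__":
    hours = request_lookback_hours("2007 Jan 14 16:29:44", "20070114102944.933")
    print(f"request reaches back {hours:.2f} hours")
```

Comparing that figure against the maximum latency configured for the downstream LDM shows whether the request is falling back to the configured limit.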

> Jan 14 16:29:45 iddrs3 idd-ingest.meteo.psu.edu[3470] NOTE: Upstream LDM-6
> on idd-ingest.meteo.psu.edu is willing to be a primary feeder
> 
> Any ideas on what might be causing this?
> 
> Thanks.
> 
> Art
> 
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email:  address@hidden, phone:  814-863-1563

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: JIG-686458
Department: Support LDM
Priority: Normal
Status: On Hold