
[LDM #VAP-368514]: Primary/alternate switching mixes in old data with new



Art,

> I'm having trouble understanding this.  If the controlling factor is a
> combination of the local queue size and the requested offset "from" time,
> why do I only see the problem when switching to idd.cise-nsf.gov?
> Wouldn't you expect to see the jump switching to iddrs3 as well?  Also,
> the jump doesn't occur every time but rather seems to occur randomly...
> what would produce such an inconsistent result?  Does the remote queue
> size come into play here?

In a distributed, asynchronous system, things can get complicated very quickly. 
 Figuring out all the details can take time.

My hypothesis for the large latency spikes is that the upstream LDM at Cise-Nsf 
that's feeding Iddrs3 is working on the oldest products in its queue rather 
than the newest (the cause of this might be the poor connection between PSU 
and NSF).  As a consequence, when the corresponding downstream LDM on Iddrs3 
decides to switch to primary-mode, its request in the new connection specifies 
a last-received data-product that doesn't exist in the Cise-Nsf queue.  The new 
Cise-Nsf upstream LDM therefore falls back to the "from" time in the request.  
Because this time is older than the oldest product in the Cise-Nsf queue, 
the upstream LDM starts sending products beginning with the oldest -- and those 
products are no longer in the Iddrs3 queue.  Hence, the large latency spikes.
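
To make the hypothesis concrete, here is a minimal Python sketch of how an 
upstream might pick its starting point.  It is only a toy model, not the LDM 
source: the queue is assumed to be a list of (insertion_time, signature) pairs 
sorted oldest first, and the name start_index() is hypothetical.

```python
from bisect import bisect_left


def start_index(queue, last_signature, from_time):
    """Return the index of the first product the upstream would send.

    queue          -- list of (insertion_time, signature), oldest first
                      (a toy stand-in for the upstream product-queue)
    last_signature -- signature of the last product the downstream received
    from_time      -- fallback "from" time carried in the request
    """
    # Preferred case: the last-received product is still in the upstream
    # queue, so sending resumes with the product right after it.
    for i, (_, signature) in enumerate(queue):
        if signature == last_signature:
            return i + 1

    # Fallback: the signature is unknown, so start at the first product
    # whose insertion time is not earlier than the "from" time.
    times = [t for t, _ in queue]
    return bisect_left(times, from_time)
```

With a queue of [(100, "a"), (200, "b"), (300, "c")], a request whose signature 
is unknown and whose "from" time is 50 starts at index 0, i.e., with the oldest 
product in the queue, which is the behavior I'm hypothesizing.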

The solution is to ensure that the "from" time in a data request is never older 
than the oldest product in the local queue; otherwise, duplicate product 
rejection can't occur.
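
As a rough illustration of that rule (a sketch only, with hypothetical names 
and times expressed as seconds since the epoch, so larger means newer):

```python
def effective_from_time(now, max_offset, oldest_local_time):
    """Clamp the requested "from" time to the oldest product held locally.

    now               -- current time, in seconds since the epoch
    max_offset        -- configured offset (seconds) for the request
    oldest_local_time -- insertion time of the oldest product in the
                         local (downstream) product-queue
    """
    requested = now - max_offset
    # Never ask for data older than what the local queue can still
    # recognize as a duplicate.
    return max(requested, oldest_local_time)
```

For example, with now = 10000, an offset of 3600 seconds, and an oldest local 
product inserted at 8000, the request would use 8000 rather than 6400.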

> The queue size is 8 GB on both iddrs3 and idd-ingest.  I monitor the
> oldest products on our ls3 system which also has an 8 GB queue and it gets
> down to around 2000 seconds, so as you say, raising the queue size would
> be the only option.  However, I only have 8 GB of memory in iddrs3 and
> raising the queue size would make me concerned about the potential for a
> thrashing situation should requests be made for older products on disk.

We've always recommended that the LDM computer have sufficient physical memory 
to hold the product-queue in memory.  Recently, however, we've had success with 
queues that are significantly larger than physical memory.  Your mileage may 
vary (there's no way to tell at present).

If you have gnuplot(1) installed, then the "ldmadmin addmetrics" and "ldmadmin 
plotmetrics" commands are a good way of monitoring the age of the oldest 
product in the queue.

> I've changed the offset time to 2000 seconds temporarily on iddrs3 as a
> test to see if the problem goes away.  Is there any way to fix this in the
> LDM so it's not sensitive to the oldest queue boundary, or is the
> algorithm too complex?

Other than the solution I've mentioned, there's no easy fix for this.  The 
downstream site must have a record of received data-products and, right now, 
that record is the product-queue.  I'll think about some more complicated 
solutions, of course, but I can't say when (or even if) they'll happen.

> Thanks...
> 
> Art
> 
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email:  address@hidden, phone:  814-863-1563

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: VAP-368514
Department: Support LDM
Priority: Normal
Status: Closed