
[LDM #MYM-282372]: ldm 6.13.10



Hi Steve,

> We've been seeing sporadic problems with what appears to be a pqact
> instance in ldm 6.13.10: Background:
> 
> 1. We have 2 machines running LDM: an Ubuntu 8.04 machine with 8 GB of
> memory running a 2 GB queue in memory, and an Ubuntu 16.04 machine with
> 8 GB of memory running a 4 GB queue on an SSD.
> 
> 2. Both machines run 5 pqact instances; 3 processing data from a NOAAPort
> dish, 2 processing data from the internet. Both machines exhibit the
> same problem. The pqact instance that processes the NOAAPort dish NEXRAD
> data seems to hang.
> 
> 3. Symptoms:
> A "ps" shows the NEXRAD pqact instance running.
> A "top" shows no activity with that process. The other pqact processes
> do show up in top output.
> The ldmd.log file shows NEXRAD3 feedtype data being processed, but the
> files are not written to disk.
> A "pqmon" run shows the "age" field increasing to very large numbers as
> if no products were being processed, but products in other
> feedtypes/pqact-instances are being processed normally. Only the NEXRAD
> data is not writing to disk.
> 
> 4. Investigation weirdness:
> "ldmadmin stop" fails to stop all processes and hangs. Interrupting it
> with Ctrl-C shows that all pqact instances are killed, and all but one of
> the ldmd instances are killed. A manual "kill -9" dispatches the last
> ldmd process.
> 
> While the "hang" is in progress, I've done a "kill -9" on the unresponsive
> pqact instance while running a "pqmon -i 5" in another terminal. As
> soon as that pqact instance is killed, succeeding pqmon output shows the
> "age" field dropping immediately from 35,000+ to a more normal 800-1100
> seconds for the Ubuntu 16.04 machine (4 GB queue). A subsequent manual
> start of the pqact instance results in NEXRAD data being written to
> disk normally. This occurs *without* touching the queue in any way. A
> subsequent ldmadmin stop/start sequence works fine, with *all* pqact
> instances writing data to disk normally.
> 
> The queue appears to be fine on both machines, as a killed then restarted
> NEXRAD pqact instance works fine... until it hangs again. The hangs
> are sporadic in timing, and only the NEXRAD pqact hangs. We did not see
> these problems with LDM 6.13.5 on either machine.
> 
> Any ideas or suggestions?

Your description is consistent with a latent bug that we discovered in the 
logging module, which can sporadically cause a thread to deadlock with itself.
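
For anyone unfamiliar with the failure mode, here is a minimal sketch of how a 
thread can deadlock with itself. This is not the LDM logging code: the names 
log_mutex, log_lock_init(), and relock_from_same_thread() are hypothetical, 
and the sketch assumes a POSIX threads environment. It uses an error-checking 
mutex so that the second lock attempt by the same thread returns EDEADLK 
instead of hanging, which is what would happen with a PTHREAD_MUTEX_NORMAL 
mutex.

```c
#define _XOPEN_SOURCE 700   /* expose pthread_mutexattr_settype() */

#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a logging module's lock; not LDM code. */
static pthread_mutex_t log_mutex;

/*
 * Initializes the lock as an error-checking mutex so that a relock by the
 * owning thread is reported rather than blocking.
 * Returns 0 on success or a pthread error code on failure.
 */
static int log_lock_init(void)
{
    pthread_mutexattr_t attr;
    int status = pthread_mutexattr_init(&attr);

    if (status != 0)
        return status;

    status = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (status == 0)
        status = pthread_mutex_init(&log_mutex, &attr);

    (void)pthread_mutexattr_destroy(&attr);
    return status;
}

/*
 * Simulates a logging call that is re-entered by the same thread while the
 * lock is already held (for example, from a nested call or a handler).
 * Returns the status of the second lock attempt.
 */
static int relock_from_same_thread(void)
{
    int status = pthread_mutex_lock(&log_mutex);    /* first acquisition */

    if (status != 0)
        return status;

    /*
     * Second acquisition by the owner. An error-checking mutex returns
     * EDEADLK here; a PTHREAD_MUTEX_NORMAL mutex would block forever.
     */
    int second = pthread_mutex_lock(&log_mutex);

    (void)pthread_mutex_unlock(&log_mutex);         /* release first hold */
    return second;
}
```

Calling log_lock_init() and then relock_from_same_thread() from a small test 
program (compiled with -pthread) should yield EDEADLK. A process stuck this 
way still appears in "ps" but uses no CPU, which matches the symptoms you 
describe.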

The solution is to install LDM version 6.13.11, which should be released today.

Alternatively, you could install the release candidate right away: 
<ftp://ftp.unidata.ucar.edu/pub/ldm/beta/ldm-6.13.11.61.tar.gz>. The official 
release will be based on it.

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: MYM-282372
Department: Support LDM
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.