
[LDM #WPM-702818]: pqact errors in LDM 6.7.0



Justin,

> 1) Our decoders share the same core as the GEMPAK decoders but ours
> write output in BUFR format.
> 
> 2) The decoders are writing data to a gpfs file system.
> 
> 3) On the specific node that LDM is running on we also run our data
> transfer software named DBNet, it basically is responsible for sending
> out all of the operational models out. This is a node on our
> supercomputer that is running all of the major models (NAM, GFS, etc).
> Currently we are running it in parallel since it is the next generation
> that we will be transitioning to.

Is it possible that the other work this node does is so resource-intensive 
that the pqact(1) process or the decoder processes are starved of resources 
and, hence, slow?

> 4) When we have stopped LDM we have also stopped all other user run
> processes on the node, this has cleared whatever was the hung processes.

The LDM error messages are consistent with a "hung process" hypothesis only if 
the decoders are the hung processes.  Otherwise, the messages are consistent 
with some other process or processes consuming so many resources (CPU, memory, 
I/O) as to slow down the decoders to an unacceptable level.
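
One rough way to tell the two cases apart is to look at the scheduler state of 
pqact(1) and the decoders during a slowdown. The following is only a sketch 
and rests on assumptions that are not established in this thread: that the 
node exposes a Linux-style /proc file system, and that the decoders can be 
found by command name (only "pqact" is listed here; the decoder names would 
have to be added). A process that stays in state "D" is blocked in 
uninterruptible I/O, which would point toward a hang (for example, on the 
file system), whereas processes that are runnable or sleeping but making slow 
progress would point toward resource starvation.

```python
import os


def parse_stat(stat_text):
    """Return (command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')'.
    left, right = stat_text.rsplit(")", 1)
    command = left.split("(", 1)[1]
    state = right.split()[0]
    return command, state


def classify(state):
    """Map a one-letter process state to a rough interpretation."""
    if state == "D":
        return "blocked in uninterruptible I/O (possibly hung)"
    if state in ("R", "S"):
        return "runnable or sleeping (possibly slow rather than hung)"
    if state == "Z":
        return "zombie (exited but not yet reaped by its parent)"
    return "other"


def scan(names=("pqact",), proc="/proc"):
    """List (pid, command, state, interpretation) for matching processes.

    The default names tuple is an assumption; add the decoder names.
    Note that the kernel truncates the command name in this file.
    """
    try:
        entries = os.listdir(proc)
    except OSError:
        return []  # no /proc here, so nothing can be reported
    results = []
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as stat_file:
                command, state = parse_stat(stat_file.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable; skip it
        if command in names:
            results.append((int(entry), command, state, classify(state)))
    return results


if __name__ == "__main__":
    for pid, command, state, verdict in scan():
        print(pid, command, state, verdict)
```

A single sample proves little. Running this every few seconds while the 
errors are being logged, and seeing whether the same processes remain in 
state "D", would be more informative.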

> IBM spent quite a bit of time working with us today and examining the
> state of the system during one of these slowdowns, I'm looking forward
> to see what they come back with.
> 
> Thanks for thinking on this.
> 
> Justin


Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WPM-702818
Department: Support LDM
Priority: Normal
Status: Closed