[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Tue, 18 Oct 2005 16:38:00 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA/NWS/FSL
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

> Our NEXRAD2 feed stopped and we had our sys admins execute truss on the 
> rpc.ldmd process; below is the log for ~25 mins.

I assume that when you say "our NEXRAD2 feed" that you mean the UPSTREAM
LDM process that's sending NEXRAD2 data-products.  Please tell me if
this assumption is incorrect.

> ***While truss was attached to the "hung" NEXRAD2 feed we executed a 
> kill -USR2 on the rpc.ldmd process, you can see where it caught it but 
> returned to the sleeping state.
> ----
> thread_waitlock_local(0x0000000000000000) (sleeping...)
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x0000000000000000) (sleeping...)
> thread_waitlock_local(0x0000000000000000)       = 1
>     Received signal #31, SIGUSR2 [caught]
> _getpid()                                       = 1228948
> send(2, 0x0FFFFFFFFFFF6C58, 63, 0)              = 63
> ksetcontext_sigreturn(0x0FFFFFFFFFFF7680, 0x0000000000000000, 
> 0x00000001100652B0, 0x800000000000D0B2, 0x0000000000000000, 
> 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)
>     Received signal #19, SIGCONT [caught]
> thread_waitlock_local(0x09001000A024F2F8) (sleeping...)

The above is consistent with the upstream LDM not seeing any
data-products that match its selection-criteria.  In that situation,
the upstream LDM receives a SIGCONT when a data-product is added to the
product-queue, wakes up, checks if any of the recent entries in the
product-queue match its selection-criteria, and then goes back to sleep
if none do.

Because you indicated that data-products were added to the product-queue
that do match the selection-criteria for that upstream LDM, I can only
conclude that the selection-criteria are being corrupted, somehow.

Would you be willing to modify the LDM source-code and then rebuild and
reinstall it with debugging and assertions enabled?

> Something else that may be of interest, we noticed that after the feed 
> stopped there was a defunct process with the PPID listed as the PID of 
> our NEXRAD2 feed (output from ps -ef for the PID 1228948):
>  
> dbndev  532636 1228948   0                  0:00 <defunct>
> dbndev 1228948 1028176   0   Oct 13      - 24:21 rpc.ldmd -v -q 
> /usr/ldm/data/ldm.pq /usr/ldm/etc/ldmd.conf
> 
> Any ideas?

This is extremely puzzling because upstream LDM processes don't call
fork(2) -- so they can't have child processes.

grep(1) the LDM logfiles to verify that the PID is that of an upstream
LDM, e.g.

    fgrep '[1228948]' `ls -rt logs/ldmd.log*`

> Thanks,
> 
> Jusitn

Regards,
Steve Emmerson