
Re: "pbuf_flush: time elapsed" problem



Justin,

>Date: Wed, 19 Oct 2005 07:44:56 -0400
>From: Justin Cooke <address@hidden>
>Organization: NOAA/NWS/FSL
>To: Steve Emmerson <address@hidden>
>Subject: Re: "pbuf_flush: time elapsed" problem

The above message contained the following:

[snip]
> Yes, I'm talking about the upstream LDM process.

[snip]
> > Would you be willing to modify the LDM source-code and then rebuild and
> > reinstall it with debugging and assertions enabled?
> 
> Yes we would

Good.  I'll let you know what to do.

> >> Something else that may be of interest, we noticed that after the feed
> >> stopped there was a defunct process with the PPID listed as the PID of
> >> our NEXRAD2 feed (output from ps -ef for the PID 1228948):
> >>
> >> dbndev  532636 1228948   0                  0:00 <defunct>
> >> dbndev 1228948 1028176   0   Oct 13      - 24:21 rpc.ldmd -v -q 
> >> /usr/ldm/data/ldm.pq /usr/ldm/etc/ldmd.conf
> >>
> >> Any ideas?
> >
> > This is extremely puzzling because upstream LDM processes don't call
> > fork(2) -- so they can't have child processes.
> >
> > grep(1) the LDM logfiles to verify that the PID is that of an upstream
> > LDM, e.g.
> >
> >     fgrep '[1228948]' `ls -rt logs/ldmd.log*`
> >
> 
> Here is some output from the grep:
> 
> Oct 18 14:15:01 b2n1 140.90.85.102[1228948] ERROR: Terminating due to LDM 
> failure; Connection to upstream LDM closed
> Oct 18 14:15:01 b2n1 140.90.85.102[1228948] NOTE: LDM-6 desired 
> product-class: 20051018141401.214 TS_ENDT {{NEXRAD2,  ".*"},{NONE,  
> "SIG=a239ff9ff6fa47cb8ab19f7c5e476ae1"}}
> Oct 18 14:16:17 b2n1 140.90.85.102[1228948] ERROR: Terminating due to LDM 
> failure; Couldn't connect to LDM on 140.90.85.102 using either port 388 or 
> portmapper; : RPC: Remote system error - A remote host did not respond within 
> the timeout period.
> Oct 18 14:16:18 b2n1 140.90.85.102[1228948] NOTE: LDM-6 desired 
> product-class: 20051018141401.214 TS_ENDT {{NEXRAD2,  ".*"},{NONE,  
> "SIG=a239ff9ff6fa47cb8ab19f7c5e476ae1"}}
> Oct 18 14:16:18 b2n1 140.90.85.102[1228948] NOTE: Product reclassification by 
> upstream LDM: 20051018141401.214 TS_ENDT {{NEXRAD2, ".*"},{NONE,  
> "SIG=a239ff9ff6fa47cb8ab19f7c5e476ae1"}} -> 20051018141401.214 TS_ENDT 
> {{NEXRAD2,  ".*"}}
> Oct 18 14:16:18 b2n1 140.90.85.102[1228948] NOTE: Upstream LDM-6 on 
> 140.90.85.102 is willing to be a primary feeder
> Oct 18 14:54:28 b2n1 140.90.85.102[1228948] NOTE: Going verbose
> Oct 18 14:54:29 b2n1 140.90.85.102[1228948] INFO:     9699 20051018145340.836 
> NEXRAD2 382027  L2-BZIP2/KBMX/20051018145001/382/27
[snip]

The above messages indicate, conclusively, that process 1228948 was
a downstream LDM and not an upstream LDM.  This is equally puzzling
because downstream LDMs don't call fork() either -- and so can't have
child processes.

More relevant, however, is your suggestion that process 1228948 was an
upstream LDM when it clearly wasn't.  Would you please explain this
discrepancy?

> The LDM system that feeds us is restarted twice a day, that's why there 
> is a connection failure ~14:15.  At 14:54 I sent the 1228948 process a 
> USR2 to go into verbose mode, once data stopped being received by the 
> upstream LDM we attached truss.
> 
> Again, this only seems to happen when the upstream ldm is in verbose 
> mode.  This process ran for 5 days in silent mode with no problems but 
> stopped after 3 hours once it was put into verbose.

Hmm... That information might help.  I'll need to know, however, whether
to look at the upstream or downstream LDM code.

> Thanks for continuing to look at this,

Thank you for bringing this up and continuing to work with me.

> Justin

Regards,
Steve Emmerson