
Broken pipe (was: hung downstream LDM)



Justin,

Does the decoder

    bin/decod_dcrast

close its standard input stream?  If so, what criteria does it use to
make that decision?

Does the same decoder terminate after completely reading a data-product
on its standard input stream but before it reads an end-of-file on that
stream?
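
For reference, the two cases asked about above can be reproduced outside
the LDM. What follows is a minimal sketch in Python, assuming a POSIX
system. It is not the LDM's code and not the decoder's; the helper name
`feed` and the three stand-in child programs are hypothetical. A child
that reads to end-of-file lets the writer finish, while a child that
exits early, or stops reading after part of the data, leaves the writer
with a "Broken pipe" (EPIPE) error.

```python
import subprocess
import sys

# Stand-in "decoders" (hypothetical), each a one-line Python child process.
READS_TO_EOF = "import sys; sys.stdin.buffer.read()"      # consumes everything
EXITS_EARLY = "import sys; sys.exit(0)"                   # never reads stdin
READS_PART_THEN_EXITS = "import sys; sys.stdin.buffer.read(10)"


def feed(child_code: str, payload: bytes) -> str:
    """Pipe payload to a child's stdin; return 'ok' or 'broken pipe'."""
    child = subprocess.Popen(
        [sys.executable, "-c", child_code], stdin=subprocess.PIPE
    )
    outcome = "ok"
    try:
        child.stdin.write(payload)
        child.stdin.flush()
    except BrokenPipeError:
        # The read end of the pipe was closed before all data was written.
        outcome = "broken pipe"
    try:
        child.stdin.close()
    except BrokenPipeError:
        outcome = "broken pipe"
    child.wait()
    return outcome


if __name__ == "__main__":
    # The payload is larger than a typical kernel pipe buffer, so the
    # write cannot complete unless the child keeps reading.
    data = b"x" * (1 << 20)
    for name, code in [
        ("reads to EOF", READS_TO_EOF),
        ("exits early", EXITS_EARLY),
        ("reads part, then exits", READS_PART_THEN_EXITS),
    ]:
        print(name, "->", feed(code, data))
```

The payload size matters: a small write can succeed into the kernel's
pipe buffer even when the child never reads it, so the error may only
surface on a later write.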

Regards,
Steve Emmerson

------- Original Message

Date:    Mon, 14 Nov 2005 10:01:49 -0500
From:    Justin Cooke <address@hidden>
To:      Steve Emmerson <steve@unidata.ucar.edu>
cc:      address@hidden, Paula Freeman <address@hidden>
Subject: Re: hung downstream LDM

Hi Steve,
 
Steve Emmerson wrote:

>>Well we are now at day 6 and the NEXRAD2 feed is still going!
>>
>>I'm not sure at what time mark we call this a success but it looks 
>>really good so far.
>>    
>>
>
>You said previously that the time-to-failure ranged from 6 to 50 hours.
>Six days is 144 hours, so I'm quite happy with the results.
>  
>

Well we are now at 12 days (288 hours) without any stoppage of the 
NEXRAD2 feed :).

But we are seeing a few odd things in our ldmd.log: we are getting 
several "broken_pipe" errors, which then cause a write error.  I passed 
this along to our main decoder developer (Jeff Ator) and here is his 
response:
---
Hmm, this is interesting, and it looks like it's something peculiar to 
this new LDM 6.4.3.0 that went in on 11/2/05, because these same 
"Broken_pipe" messages aren't showing up in the LDM log files prior to 
that date (i.e. ldmd.log.15, ldmd.log.16, ..., ldmd.log.20).  What's 
really interesting is that these messages correspond exactly to when a 
decoder starts up, i.e. when a bulletin comes in for a particular 
decoder that isn't already running, and therefore pqact needs to start 
it up by forking a child process.  In other words, the first bulletin 
that causes the new decoder process to start up is also the same one 
that is generating the "Broken_pipe" message in the logs, and it's 
happening for all of the decoders!  Now this isn't a problem per se, 
because these same bulletins are actually getting into the respective 
decoders, as I could confirm within the actual decoder logs themselves 
(perhaps this is due to the "pipe_prodput: trying again" completing 
successfully, as you pointed out below(?)).  Either way, it's a bit 
unnerving, not to mention misleading, to suddenly be seeing all of these 
"Broken_pipe" messages, especially since they weren't occurring prior to 
the installation of the new LDM 6.4.3.0 build.
----

Here are a few of the log file entries:
---
Nov 14 05:03:15 b2n1 pqact[434350] INFO:                pipe: 
bin/decod_dcrast -v 2  -t 600 -d 
/dcomdev/us007003/decoder_logs/decod_dcrast.log  
/dcomdev/us007003/bufrtab.FSL_RAST   /dcomdev/us007003/bufrtab.002
Nov 14 05:03:15 b2n1 pqact[434350] ERROR: pbuf_flush (27) write: Broken pipe
Nov 14 05:03:15 b2n1 pqact[434350] ERROR: pipe_put: 
bin/decod_dcrast-v2-t600-d/dcomdev/us007003/decoder_logs/decod_dcrast.log/dcomd
ev/us007003/bufrtab.FSL_RAST/dcomdev/us007003/bufrtab.002 
write error
Nov 14 05:03:15 b2n1 pqact[434350] ERROR: pipe_prodput: trying again
...
Nov 14 05:13:42 b2n1 pqact[434350] INFO:                pipe: 
bin/decod_dccgrd -v 2 -t 300 -d 
/dcomdev/us007003/decoder_logs/decod_dccgrd.log   
/dcomdev/us007003/bufrtab.001        tables/stns/cg.tbl
Nov 14 05:13:42 b2n1 pqact[434350] ERROR: pbuf_flush (36) write: Broken pipe
Nov 14 05:13:42 b2n1 pqact[434350] ERROR: pipe_put: 
bin/decod_dccgrd-v2-t300-d/dcomdev/us007003/decoder_logs/decod_dccgrd.log/dcomd
ev/us007003/bufrtab.001tables/stns/cg.tbl 
write error
Nov 14 05:13:42 b2n1 pqact[434350] ERROR: pipe_prodput: trying again
---
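
The sequence in these entries (a write fails with "Broken pipe", then
"pipe_prodput: trying again") looks like a write-then-retry pattern.
Here is a minimal sketch of that pattern in Python, assuming a POSIX
system. It is not pqact's actual C implementation, and the class
`PipedDecoder` with its methods is hypothetical. On a broken pipe it
discards the dead child, starts a fresh one, and retries the write once.

```python
import subprocess
import sys


class PipedDecoder:
    """Keep one child process alive and pipe products to its stdin."""

    def __init__(self, argv):
        self.argv = argv
        self.child = None
        self.restarts = 0  # how many times a dead child was replaced

    def _spawn(self):
        self.child = subprocess.Popen(self.argv, stdin=subprocess.PIPE)

    def _discard(self):
        try:
            self.child.stdin.close()
        except BrokenPipeError:
            pass  # unflushed data for a dead child is dropped
        self.child.wait()
        self.child = None

    def put(self, product: bytes) -> bool:
        """Write one product; on a broken pipe, respawn and retry once."""
        for _attempt in range(2):
            if self.child is None:
                self._spawn()
            try:
                self.child.stdin.write(product)
                self.child.stdin.flush()
                return True
            except BrokenPipeError:
                self._discard()
                self.restarts += 1
        return False

    def close(self):
        if self.child is not None:
            self._discard()


if __name__ == "__main__":
    # Stand-in decoder (hypothetical): reads 4 bytes, then exits
    # without waiting for end-of-file.
    decoder = PipedDecoder(
        [sys.executable, "-c", "import sys; sys.stdin.buffer.read(4)"]
    )
    print(decoder.put(b"abcd"))
    decoder.child.wait()          # let the first child exit
    print(decoder.put(b"efgh"))   # broken pipe, then a successful retry
    print("restarts:", decoder.restarts)
    decoder.close()
```

In this sketch the retried product does reach the new child, which would
match the observation that the bulletins still arrive in the decoders
despite the logged error.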

I just stumbled on this when doing a grep for "error" in ldmd.log even 
though we've had no reported problems.

Any ideas?

Thanks again for all the attention you have given this,

Justin

------- End of Original Message