
Re: 20020424: 20020419:RPC runaway RECLASS & h_clnt_call pqmon



Hi Kevin, 

comments embedded below..

-Jeff
____________________________                  _____________________
Jeff Weber                                    address@hidden
Unidata Support                               PH:303-497-8676 
NWS-COMET Case Study Library                  FX:303-497-8690
University Corp for Atmospheric Research      3300 Mitchell Ln
http://www.unidata.ucar.edu/staff/jweber      Boulder,Co 80307-3000
________________________________________      ______________________

On Wed, 24 Apr 2002, Unidata Support wrote:

> 
> Kevin,
> 
> Regarding the del_oldest message. This is a bug in the LDM 5.1.2.
> It is caused by a slow downstream feed maintaining a lock on your 
> data queue so that new data cannot be inserted. This bug was fixed in 5.1.3 
> and later. This bug will cause you to lose incoming data.
> 
> Steve Chiswell
> Unidata User Support
> 
> 
> 
> 
> >From: "Kevin R. Tyle" <address@hidden>
> >Organization: UCAR/Unidata
> >Keywords: 200204241534.g3OFYWa03571
> 
> >Hi Jeff,
> >
> >We're running LDM 5.1.2 on Solaris 2.6 (SPARC).
> >
> >We haven't had any complaints from URI or NYCC, but I will contact
> >them to follow up.
> >
> >Meanwhile, more strange entries in our log that we see from time to time:
> >looks like some problem with the product queue, associated with
> >our NMC2 feeds:
> >
> >Apr 24 09:20:23 redwood nmc2_3[19191]: FEEDME(nmc2_3): OK
> >Apr 24 09:20:23 redwood nmc2_0[19188]: FEEDME(nmc2_0): OK
> >Apr 24 09:20:23 redwood nmc2_1[19189]: run_requester: 20020424091825.606
> >TS_ENDT {{NMC2,  "(.status|.*MT.mrf|.*MT.e
> >nsg).*[16]$"}}
> >Apr 24 09:20:23 redwood nmc2_1[19189]: FEEDME(nmc2_1): OK
> >Apr 24 09:20:27 redwood nmc2_3[19191]: pq_del_oldest: conflict on
> >734297512
> >Apr 24 09:20:27 redwood nmc2_3[19191]: comings: pqe_new: Resource
> >temporarily unavailable
> >Apr 24 09:20:27 redwood nmc2_3[19191]:        :
> >aebe6038a6265b9f150007fa34419767    89695 20020424091829.399    NMC
> >2 073  /afs/.nwstg.
> >Apr 24 09:20:27 redwood nmc2_3[19191]: Connection reset by peer
> >Apr 24 09:20:27 redwood nmc2_3[19191]: Disconnect
> >Apr 24 09:20:28 redwood nmc2_0[19188]: pq_del_oldest: conflict on
> >734297512
> >Apr 24 09:20:28 redwood nmc2_0[19188]: comings: pqe_new: Resource
> >temporarily unavailable
> >Apr 24 09:20:28 redwood nmc2_0[19188]:        :
> >dedc129db615131ee7b1b395762b2657    89695 20020424091819.588    NMC
> >2 140  /afs/.nwstg.
> >Apr 24 09:20:28 redwood nmc2_0[19188]: Connection reset by peer
> >Apr 24 09:20:28 redwood nmc2_0[19188]: Disconnect
> >Apr 24 09:20:28 redwood nmc2_1[19189]: pq_del_oldest: conflict on
> >734297512
> >Apr 24 09:20:28 redwood nmc2_1[19189]: comings: pqe_new: Resource
> >temporarily unavailable
> >Apr 24 09:20:28 redwood nmc2_1[19189]:        :
> >113f837e7a1db8593952513ba53e60dc    89695 20020424091828.213    NMC
> >2 056  /afs/.nwstg.
> >Apr 24 09:20:28 redwood nmc2_1[19189]: Connection reset by peer
> >Apr 24 09:20:28 redwood nmc2_1[19189]: Disconnect
> >Apr 24 09:20:28 redwood nmc2_2[19190]: pq_del_oldest: conflict on
> >734297512
> >Apr 24 09:20:28 redwood nmc2_2[19190]: comings: pqe_new: Resource
> >temporarily unavailable
> >Apr 24 09:20:28 redwood nmc2_2[19190]:        :
> >ed92cb930d1c9032c0964c941502c77b    89695 20020424091828.571    NMC
> >2 062  /afs/.nwstg.
> >Apr 24 09:20:28 redwood nmc2_2[19190]: Connection reset by peer
> >Apr 24 09:20:28 redwood nmc2_2[19190]: Disconnect
> >Apr 24 09:20:30 redwood nmc2_4[19192]: pq_del_oldest: conflict on
> >718812184
> >Apr 24 09:20:30 redwood nmc2_4[19192]: comings: pqe_new: Resource
> >temporarily unavailable
> >Apr 24 09:20:30 redwood nmc2_4[19192]:        :
> >d2d080a32b0065f9916cb53559a73f17    89695 20020424091833.465    NMC
> >2 139  /afs/.nwstg.
> >
> >Also, what do these entries mean:
> >
> >Apr 24 08:43:40 redwood pqact[19184]: pbuf_flush 11: time elapsed
> >2.100649
> >Apr 24 08:43:45 redwood pqact[19184]: pbuf_flush 11: time elapsed
> >2.072335
> >Apr 24 08:43:55 redwood pqact[19184]: pbuf_flush 11: time elapsed
> >2.186310
> >Apr 24 08:44:01 redwood pqact[19184]: pbuf_flush 11: time elapsed
> >2.003807
> >Apr 24 08:44:06 redwood pqact[19184]: pbuf_flush 11: time elapsed
> >2.175205
> >
> >(many more of these follow)


Kevin, 

pqact keeps an output buffer for each action...

The number after pbuf_flush identifies the pipe that was "full" at the
time; if pqact is still acting on a product it will flush that buffer
before moving on.

The "pbuf_flush 11" entry means that the action (FILE, STDIO, PIPE, etc.)
could not write to its destination because the resource was unavailable
or blocked at that moment.  The log entry usually occurs when disk
activity is at its peak for some reason.  It could be the LDM doing many
disk writes or another process tying up the disk.  "time elapsed 2.175205"
means it took about 2.17 seconds to complete the flush.

If this occurs many times, you could start to lose data.  Then it's
time to determine the reason for the pbuf_flush entries.
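
If it helps to picture what is being timed, here is a rough C sketch (not
the actual LDM source; the flush_pipe name and the one-second threshold
are made up for illustration) of the kind of timed, blocking write behind
that log entry:

/*
 * Illustrative sketch only -- not the LDM's pbuf_flush implementation.
 * The idea: time how long it takes to push buffered bytes down a pipe
 * and complain when the write is slow.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

static ssize_t
flush_pipe(int fd, const char *buf, size_t len)   /* hypothetical helper */
{
    struct timeval start, stop;
    ssize_t        nwrote = 0;

    gettimeofday(&start, NULL);
    while ((size_t)nwrote < len) {
        ssize_t n = write(fd, buf + nwrote, len - nwrote);  /* may block */
        if (n < 0)
            return -1;      /* caller would log the errno, e.g. EAGAIN */
        nwrote += n;
    }
    gettimeofday(&stop, NULL);

    double elapsed = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_usec - start.tv_usec) / 1e6;
    if (elapsed > 1.0)      /* threshold chosen for illustration */
        fprintf(stderr, "pbuf_flush %d: time elapsed %.6f\n", fd, elapsed);

    return nwrote;
}

So a "time elapsed 2.1" line simply says the downstream end of that pipe
(or the disk behind it) took about two seconds to accept the data.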

and more below...



> >I'd guess the following explanations are possible:
> >
> >1) Product queue is filling up with stuff from the NMC2 feed
> >
> >2) Machine can't keep up with the data stream (it is a rather old
> >server)
> >
> >If we see output such as the above, does this indicate only a temporary
> >problem with the queue, or should we assume it is corrupted and
> >rebuild it?
> >
> >Here is output from pqmon: can you explain what we should look for here?
> >
> >Apr 24 15:31:26 pqmon: Starting Up (19755)
> >Apr 24 15:31:26 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
> >Apr 24 15:31:26 pqmon: 107050    38  112638   867681744    142556     1270     77168   1442584 22942
> >Apr 24 15:31:26 pqmon: Exiting


Here the most important variable is age, in seconds: 22942 seconds works
out to roughly 6.4 hours, so you have about 6-7 hours of data in your queue.

nprods=# of products in the queue
nfree=# of free regions available to write into
nempty=# of empty product slots
nbytes=approx queue size, slightly less than the configured size due to mapping
maxprods=max # of products you can hold assuming a ~4k slot
and so on...
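
As a quick back-of-the-envelope check (plain C, not a Unidata utility;
the numbers are simply copied from your pqmon output above), you can turn
the age and nbytes columns into hours of data held and a rough average
inflow rate:

/* Sketch only: interpret two pqmon columns.  Substitute your own values. */
#include <stdio.h>

int main(void)
{
    double age_seconds = 22942.0;      /* age of the oldest product in the queue */
    double nbytes      = 867681744.0;  /* bytes currently held */

    double hours       = age_seconds / 3600.0;
    double mb_held     = nbytes / (1024.0 * 1024.0);
    double mb_per_hour = mb_held / hours;   /* rough inflow while the queue cycles */

    printf("queue holds %.1f hours of data (%.0f MB, ~%.0f MB/hour)\n",
           hours, mb_held, mb_per_hour);
    return 0;
}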

For comparison, from a local machine here:

local/ldm% pqmon
Apr 24 21:26:59 pqmon: Starting Up (6419)
Apr 24 21:26:59 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
Apr 24 21:26:59 pqmon:  86214    69  401998  1806011280    194614     2809   293662   9029168 5392

Apr 24 21:26:59 pqmon: Exiting
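
Here the age of 5392 seconds works out to only about 1.5 hours in a
roughly 1.8 GB queue, so a larger queue does not automatically mean more
hours of retention; it depends on how much data you request.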



Thank you,

Jeff Weber
Unidata Support




> >
> >Thanks,
> >
> >Kevin
> >
> >
> >______________________________________________________________________
> >Kevin Tyle, Systems Administrator               **********************
> >Dept. of Earth & Atmospheric Sciences           address@hidden
> >University at Albany, ES-235                    518-442-4571 (voice)
> >1400 Washington Avenue                          518-442-5825 (fax)
> >Albany, NY 12222                                **********************
> >______________________________________________________________________
> >
> >On Mon, 22 Apr 2002, Jeff Weber wrote:
> >
> >> Kevin,
> >>
> >> Nice work isolating those RPC calls!
> >>
> >> I have no idea why they do not die...we will investigate.
> >>
> >> What is your OS and LDM version?
> >>
> >>
> >> Have URI or NYCC complained about the latencies in their feeds?
> >>
> >> It may be worth contacting them and determining what the status of their
> >> data needs are..With this type of latency it may be best for them to
> >> throttle back the feed and get what they want in a timely manner, or if
> >> they are using the HDS/HRS feed in a non-real time environment, this may
> >> be working for them (but I bet they are dropping products)...they may want
> >> to subset the HDS and MCIDAS feeds, as they are the most voluminous.
> >>
> >> Thanks for the follow up,
> >>
> >> -Jeff
> >
> 
>