
20020424: 20020419:RPC runaway RECLASS & h_clnt_call



Kevin,

Regarding the pq_del_oldest messages: this is a bug in LDM 5.1.2.
It is caused by a slow downstream feed holding a lock on your
product queue so that new data cannot be inserted. The bug was fixed
in LDM 5.1.3 and later releases. Be aware that this bug will cause you
to lose incoming data, so we recommend upgrading.
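To illustrate the mechanism, here is a rough sketch in C. This is only
an illustration of the general idea (advisory region locking on a shared
queue file), not the actual LDM source; the function and variable names
are made up:

/*
 * Illustration only -- not LDM code.  Before the space occupied by the
 * oldest product in the shared queue file can be reused for an incoming
 * product, the writer must obtain an exclusive lock on that region.  If
 * another process (for example, one feeding a slow downstream site)
 * still holds a lock there, fcntl() fails with EAGAIN ("Resource
 * temporarily unavailable") and the new product cannot be inserted --
 * the same pairing you see as "pq_del_oldest: conflict" followed by
 * "pqe_new: Resource temporarily unavailable" in your log.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>

static int
try_reclaim_oldest(int fd, off_t offset, off_t extent)
{
    struct flock lock = {0};    /* zero the fields we do not set */

    lock.l_type   = F_WRLCK;    /* exclusive: we want to overwrite it */
    lock.l_whence = SEEK_SET;
    lock.l_start  = offset;     /* where the oldest product lives     */
    lock.l_len    = extent;     /* how many bytes it occupies         */

    if (fcntl(fd, F_SETLK, &lock) == -1) {
        if (errno == EAGAIN || errno == EACCES)
            /* a slow reader still holds the region: no room is freed */
            fprintf(stderr, "del_oldest: conflict on %ld\n", (long)offset);
        return -1;
    }

    return 0;    /* region reclaimed; space is available for new data */
}

Upgrading to 5.1.3 or later is the actual fix; the sketch is only meant
to show why a slow receiver can block insertion.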

Steve Chiswell
Unidata User Support




>From: "Kevin R. Tyle" <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200204241534.g3OFYWa03571

>Hi Jeff,
>
>We're running LDM 5.1.2 on Solaris 2.6 (SPARC).
>
>We haven't had any complaints from URI or NYCC, but I will contact
>them to follow up.
>
>Meanwhile, here are more strange entries that we see in our log from time
>to time; they look like some problem with the product queue, associated
>with our NMC2 feeds:
>
>Apr 24 09:20:23 redwood nmc2_3[19191]: FEEDME(nmc2_3): OK
>Apr 24 09:20:23 redwood nmc2_0[19188]: FEEDME(nmc2_0): OK
>Apr 24 09:20:23 redwood nmc2_1[19189]: run_requester: 20020424091825.606
>TS_ENDT {{NMC2,  "(.status|.*MT.mrf|.*MT.ensg).*[16]$"}}
>Apr 24 09:20:23 redwood nmc2_1[19189]: FEEDME(nmc2_1): OK
>Apr 24 09:20:27 redwood nmc2_3[19191]: pq_del_oldest: conflict on 734297512
>Apr 24 09:20:27 redwood nmc2_3[19191]: comings: pqe_new: Resource temporarily unavailable
>Apr 24 09:20:27 redwood nmc2_3[19191]:        : aebe6038a6265b9f150007fa34419767    89695 20020424091829.399    NMC2 073  /afs/.nwstg.
>Apr 24 09:20:27 redwood nmc2_3[19191]: Connection reset by peer
>Apr 24 09:20:27 redwood nmc2_3[19191]: Disconnect
>Apr 24 09:20:28 redwood nmc2_0[19188]: pq_del_oldest: conflict on 734297512
>Apr 24 09:20:28 redwood nmc2_0[19188]: comings: pqe_new: Resource temporarily unavailable
>Apr 24 09:20:28 redwood nmc2_0[19188]:        : dedc129db615131ee7b1b395762b2657    89695 20020424091819.588    NMC2 140  /afs/.nwstg.
>Apr 24 09:20:28 redwood nmc2_0[19188]: Connection reset by peer
>Apr 24 09:20:28 redwood nmc2_0[19188]: Disconnect
>Apr 24 09:20:28 redwood nmc2_1[19189]: pq_del_oldest: conflict on 734297512
>Apr 24 09:20:28 redwood nmc2_1[19189]: comings: pqe_new: Resource temporarily unavailable
>Apr 24 09:20:28 redwood nmc2_1[19189]:        : 113f837e7a1db8593952513ba53e60dc    89695 20020424091828.213    NMC2 056  /afs/.nwstg.
>Apr 24 09:20:28 redwood nmc2_1[19189]: Connection reset by peer
>Apr 24 09:20:28 redwood nmc2_1[19189]: Disconnect
>Apr 24 09:20:28 redwood nmc2_2[19190]: pq_del_oldest: conflict on 734297512
>Apr 24 09:20:28 redwood nmc2_2[19190]: comings: pqe_new: Resource temporarily unavailable
>Apr 24 09:20:28 redwood nmc2_2[19190]:        : ed92cb930d1c9032c0964c941502c77b    89695 20020424091828.571    NMC2 062  /afs/.nwstg.
>Apr 24 09:20:28 redwood nmc2_2[19190]: Connection reset by peer
>Apr 24 09:20:28 redwood nmc2_2[19190]: Disconnect
>Apr 24 09:20:30 redwood nmc2_4[19192]: pq_del_oldest: conflict on 718812184
>Apr 24 09:20:30 redwood nmc2_4[19192]: comings: pqe_new: Resource temporarily unavailable
>Apr 24 09:20:30 redwood nmc2_4[19192]:        : d2d080a32b0065f9916cb53559a73f17    89695 20020424091833.465    NMC2 139  /afs/.nwstg.
>
>Also, what do these entries mean:
>
>Apr 24 08:43:40 redwood pqact[19184]: pbuf_flush 11: time elapsed 2.100649
>Apr 24 08:43:45 redwood pqact[19184]: pbuf_flush 11: time elapsed 2.072335
>Apr 24 08:43:55 redwood pqact[19184]: pbuf_flush 11: time elapsed 2.186310
>Apr 24 08:44:01 redwood pqact[19184]: pbuf_flush 11: time elapsed 2.003807
>Apr 24 08:44:06 redwood pqact[19184]: pbuf_flush 11: time elapsed 2.175205
>
>(many more of these follow)
>
>I'd guess the following explanations are possible:
>
>1) Product queue is filling up with stuff from the NMC2 feed
>
>2) Machine can't keep up with the data stream (it is a rather old
>server)
>
>If we see output such as the above, does this indicate only a temporary
>problem with the queue, or should we assume it is corrupted and
>rebuild it?
>
>Here is output from pqmon: can you explain what we should look for here?
>
>Apr 24 15:31:26 pqmon: Starting Up (19755)
>Apr 24 15:31:26 pqmon: nprods  nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext    age
>Apr 24 15:31:26 pqmon: 107050     38  112638   867681744    142556     1270     77168   1442584  22942
>Apr 24 15:31:26 pqmon: Exiting
>
>Thanks,
>
>Kevin
>
>
>______________________________________________________________________
>Kevin Tyle, Systems Administrator               **********************
>Dept. of Earth & Atmospheric Sciences           address@hidden
>University at Albany, ES-235                    518-442-4571 (voice)
>1400 Washington Avenue                          518-442-5825 (fax)
>Albany, NY 12222                                **********************
>______________________________________________________________________
>
>On Mon, 22 Apr 2002, Jeff Weber wrote:
>
>> Kevin,
>>
>> Nice work isolating those RPC calls!
>>
>> I have no idea why they do not die...we will investigate.
>>
>> What is your OS and LDM version?
>>
>>
>> Have URI or NYCC complained about the latencies in their feeds?
>>
>> It may be worth contacting them to determine what their data needs
>> are. With this type of latency it may be best for them to throttle
>> back the feed and get what they want in a timely manner; or, if they
>> are using the HDS/HRS feed in a non-real-time environment, this may
>> be working for them (but I bet they are dropping products). They may
>> want to subset the HDS and MCIDAS feeds, as those are the most
>> voluminous.
>>
>> Thanks for the follow up,
>>
>> -Jeff
>
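
Regarding Jeff's suggestion above about subsetting the HDS and MCIDAS
feeds: purely as an illustration (the hostname and the pattern below are
placeholders, not a recommendation for any particular product set), a
downstream site narrows a feed by tightening the extended regular
expression in its ldmd.conf request line, e.g.:

# full-volume request (everything in the HDS feed):
#request HDS   ".*"    your.upstream.host
#
# subset request: only products whose identifiers match the ERE
request  HDS   "^H"    your.upstream.host

A candidate pattern can be checked beforehand with notifyme, for
example "notifyme -vl- -h your.upstream.host -f HDS -p '^H' -o 3600",
which lists the matching products without inserting them into the queue.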