
20001029: LDM 5.1.2 on solarisx86 not letting products out of queue (fwd)




===============================================================================
Robb Kambic                                Unidata Program Center
Software Engineer III                      Univ. Corp for Atmospheric Research
address@hidden             WWW: http://www.unidata.ucar.edu/
===============================================================================

---------- Forwarded message ----------
Date: Sun, 29 Oct 2000 09:56:28 -0700
From: Tom Yoksas <address@hidden>
To: Russ Rew <address@hidden>
     address@hidden, address@hidden,
     address@hidden
Subject: 20001029: LDM 5.1.2 on solarisx86 not letting products out of queue

>From: Tom Yoksas <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200010291355.e9TDt4406706 LDM 5.1.2 queue pqact


Russ,

First, thanks for the quick reply!

re: disk read/write on uni9 (dual 800 MHz Pentium III with 108 GB
of 10,000 rpm SCSI disk) is pokey

Mike and I noticed this after he got Solaris x86 2.8 up on Friday
night.  The cause is apparently poor support in the driver for the new
SCSI interface that the machine has.  Bummer!  This has turned what should
be a screaming machine into a dog.

>> shemp, on the other hand, is always a couple to three hours behind
>> on products getting out of the LDM queue for decoding.  This is
>> especially true for the image products from the Unidata-Wisconsin (LDM
>> MCIDAS feed type) stream.

>I just spent some time trying to see if I could discover any symptoms
>of problems with shemp, but it looks like the LDM is running fine.

The processes may be running, but MCIDAS products go into the queue and
then take hours to get out, if they come out at all.  I just looked at the times for
decoded GOES-West IR images, and the last one that got decoded was at
8Z; it is now 16Z.  I went to check on uni9, but I can't even ping it
from home.  motherlode, on the other hand, is current.

The problem is not limited to the imagery from the MCIDAS feed.
Textual product decoding has been running several hours behind for the
past several days.  I think that Chiz ran into the same thing in his
workshop, but his note to Mike suspected that the data wasn't getting
to shemp from jackie on time.  My comparison of 'ldmadmin watch -f
MCIDAS' and a tail on the ldm-mcidas log file showed the products
arrive and then nothing happens for a _long_ (i.e., hours) time.
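
To be concrete about that comparison, here is roughly what I was doing
by eye, written out as a small Python sketch.  The file names and the
timestamp layout are made up for illustration only; they are not the
actual 'ldmadmin watch' or ldm-mcidas log formats.

# Sketch: compare when products arrived in the queue vs. when they were
# decoded.  Input formats and file names are assumed for illustration.
from datetime import datetime

def parse_times(path):
    """Read lines like 'PRODUCT_ID 20001029 1310' into {id: datetime}."""
    times = {}
    with open(path) as f:
        for line in f:
            prod_id, date, hhmm = line.split()
            times[prod_id] = datetime.strptime(date + hhmm, "%Y%m%d%H%M")
    return times

arrived = parse_times("arrival_times.txt")   # noted from watching the queue
decoded = parse_times("decode_times.txt")    # noted from the decoder log

for prod_id, t_arrive in sorted(arrived.items()):
    t_decode = decoded.get(prod_id)
    if t_decode is None:
        print(f"{prod_id}: arrived {t_arrive:%H:%M}Z, not decoded yet")
    else:
        lag = (t_decode - t_arrive).total_seconds() / 3600.0
        print(f"{prod_id}: arrived {t_arrive:%H:%M}Z, decoded {lag:.1f} h later")

Run against yesterday's MCIDAS products, a listing like that makes the
multi-hour gap between arrival and decoding obvious at a glance.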

>pqmon shows everything as expected with the queue algorithms.  Looking
>at shemp's pqbinstats files, the latencies for MCIDAS products all look
>small, with average latencies about 10 or 15 seconds and worst-case
>latency of 509 seconds.

Right, the products get there just fine and in a timely manner, but they
don't seem to get out of the queue and into pqact's hands.
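
Just so we're talking about the same quantity: by latency I take it we
mean the time the product was received minus the time stamped in the
product itself.  A quick way to summarize a dump of those pairs would be
something like the sketch below; the input format here is assumed for
illustration and is not the real pqbinstats layout.

# Sketch: latency = receipt time minus the product's own timestamp.
# The input format is assumed for illustration only.
from datetime import datetime

def to_dt(s):
    return datetime.strptime(s, "%Y%m%d%H%M%S")

latencies = []
with open("mcidas_times.txt") as f:        # lines: "<product_time> <receipt_time>"
    for line in f:
        prod_time, recv_time = line.split()
        latencies.append((to_dt(recv_time) - to_dt(prod_time)).total_seconds())

if latencies:
    print(f"{len(latencies)} products: average latency "
          f"{sum(latencies)/len(latencies):.0f} s, worst case {max(latencies):.0f} s")

Numbers like the 10 to 15 second averages you quoted say the products
reach the queue promptly; the hang-up has to be on the way out.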

>I couldn't see any obvious problems in the
>pnga2area decoder logs either (is that the decoder that seems to be
>falling behind?)  The pnga2area decoder usually finished within a few
>seconds of starting up.

The decoder seems to work fine once it gets fired up.  Also, it was
apparently working well until some time on Friday morning.

>The ldmd.log on shemp does show more of these sorts of messages than
>usual:  
>
> Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: pq_del_oldest: conflict on 1785472176
> Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: comings: pqe_new: Resource temporarily unavailable
> Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]:        : 68c5b47e4b3a9c49959e63a570bb6127    21422 20001029131013.897    NMC2 174  /u/ftp/ga
> Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: Connection reset by peer
> Oct 29 13:18:23 shemp.unidata.ucar.edu motherlode[7962]: Disconnect
>
>I don't recall seeing these at all during our earlier testing.  I
>think what this means is that a receiver process can't get a lock on
>the oldest product in the queue to delete it to make space for an
>incoming product, because some other process, probably a sender
>process, still has a lock on that region.  If a sender died while it
>still had a lock on a product, it would never release it, so this
>might be a symptom of that.  But later messages refer to a different
>region, so the lock must have gotten released.  This may be a red
>herring, but I'll try to look at it more carefully to see exactly why
>it is occurring.
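
Just to make sure I'm following the scenario you describe, here is a toy
model of that lock conflict in Python.  This is only an illustration,
not the actual product-queue code.

# Toy model of the pq_del_oldest conflict described above: one process
# holds a lock on the queue's oldest region while another process tries
# to reclaim that region for an incoming product.  Not LDM code.
import threading
import time

oldest_region = threading.Lock()   # stands in for the lock on the oldest product's region

def sender(hold_seconds):
    """A downstream feeder holding the region lock while it ships the product."""
    with oldest_region:
        time.sleep(hold_seconds)   # a sender that died here would never release the lock

def insert_new_product():
    """The ingest side trying to reclaim the oldest region for an incoming product."""
    if oldest_region.acquire(blocking=False):
        oldest_region.release()
        print("oldest region reclaimed; product inserted")
    else:
        print("pq_del_oldest: conflict -- resource temporarily unavailable")

threading.Thread(target=sender, args=(2.0,)).start()
time.sleep(0.1)            # let the sender grab the lock
insert_new_product()       # conflicts while the sender still holds the lock
time.sleep(2.5)
insert_new_product()       # succeeds once the lock has been released

If the sender never lets go (because it died or is hung behind a slow
connection), every attempted insertion looks like the conflict case,
which would match the messages above.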

Going with the concept of a slow feedee, I can offer the following:
I saw the machine navier from Penn State connecting to shemp.  I also
recall seeing a message that navier was on a network that was having
problems.  Perhaps the two go together?  If so, the next question is
why shemp is feeding navier.  The question after that is whether I can
shut that feed off and see if shemp returns to the land of the living.

Tom
--
+-----------------------------------------------------------------------------+
* Tom Yoksas                                             UCAR Unidata Program *
* (303) 497-8642 (last resort)                                  P.O. Box 3000 *
* address@hidden                                   Boulder, CO 80307 *
* Unidata WWW Service                             http://www.unidata.ucar.edu/*
+-----------------------------------------------------------------------------+