
20000522: gempak problem update



Bill,

It seems that your LDM server is probably receiving more data than its
decoders can keep up with. Your machine may be showing symptoms of
being IO bound.

When you started to receive the CONDUIT data, the amount of IO on your
system increased dramatically. Even without decoding, the data still
arrives in your product queue, and programs like pqexpire, which
maintains the LDM product queue, require a lot of IO resources
(typically pqexpire runs every 5 minutes to scour old data out of the
queue). Previously, your file corruptions may have been less frequent,
though you still may have been periodically pushing the system to its
limits.
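
If your LDM distribution includes the pqmon utility, it is a quick way
to watch how full the queue is getting (availability and options vary
by LDM version, so treat this as a sketch, not a recipe):

# Hypothetical check: report product queue utilization every 60
# seconds; a queue that is constantly full forces pqexpire and the
# decoders to compete for the same disk.
pqmon -i 60 -q ~ldm/data/ldm.pq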

When the pqact program has to send data to a decoder, it writes the
data to an open PIPE to that decoder. If the decoder cannot write the
data out to disk fast enough, the PIPE will fill up; at that point a
second PIPE will be opened, which causes a second instance of the
decoder to start and write to the same file as the first decoder,
corrupting the data file.
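
For reference, a pqact.conf PIPE action looks roughly like the
following. This is only a sketch: the feed type, pattern, decoder
arguments, and file paths are illustrative, not taken from your
configuration.

# Hypothetical pqact.conf entry piping GRIB products from the HRS
# feed to the GEMPAK dcgrib decoder. Fields are tab-separated and
# continuation lines begin with a tab.
HRS	^[YZ].[A-Z]
	PIPE	decoders/dcgrib
	-d data/gempak/logs/dcgrib.log
	data/gempak/model/YYYYMMDDHH_model.gem

Every product matching the pattern goes down the same PIPE, so a
single slow decoder can back up everything behind it.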

If you have the program "top" installed on your system, you can watch
the CPU usage, which is summarized on a line like:

CPU: 94.1% idle,  1.5% usr,  2.0% ker,  1.5% wait,  0.0% xbrk,  1.0% intr

When disk IO and/or swapping are taking a lot of time, you will see a
much higher value for "wait", possibly 60 to 90%. That means the
system is not doing useful work with its CPU power; instead, it is
sitting idle, waiting for data to be written to disk.
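
If top is not available, a System V style sar (such as the one on
IRIX) reports the same figure; a minimal check, assuming the standard
-u option, would be:

# Sample overall CPU usage every 5 seconds, 12 times (about a minute).
# The %wio column is the IO-wait percentage, corresponding to the
# "wait" field in the top output above.
sar -u 5 12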

Since pqexpire has to wake up and run through the entire queue every 5
minutes, it causes a lot of IO paging the queue through memory. One
temporary solution is to increase the interval at which pqexpire runs,
for example to 20 minutes. As the queue you create on your system
becomes bigger, it will take longer for pqexpire to run through it.
(LDM 5.1, which is being worked on currently, will eliminate pqexpire
since it is such a load; we will try to have that release out before
the fall workshops.) By increasing the -i interval to pqexpire, you
will need a little more queue space, but this is traded off against
less frequent pqexpire paging of the entire product queue. Be sure to
check your ldmd.log files for other evidence of an overloaded system,
such as "deleting oldest product" and "pbuf write" failures.
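
The pqexpire invocation lives in your ldmd.conf. Assuming the -i
argument is given in seconds, the 20 minute example above would look
something like this (the exact entry in your file may differ):

# Hypothetical ldmd.conf entry: run pqexpire every 1200 seconds
# (20 minutes) instead of the usual 300 (5 minutes).
exec	"pqexpire -i 1200"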

Since you said that you saw corruption even when CONDUIT had stopped
coming in, it still could be that the increased product queue size is
taxing your system. If you didn't increase your product queue for
CONDUIT, then you would probably see a lot of "deleting oldest product"
messages in your LDM logs. This would signal that the queue is too
small for the amount of data arriving, and that would really load up
the system. Another change in the data stream last week was the
extension of the AVN output on the NOAAPORT data stream out to 120
hours, which further increases the amount of data you receive.
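
A quick way to look for that evidence, assuming your logs live under
the LDM account's logs directory (adjust the path for your site):

# Count occurrences per log file; frequent "deleting oldest product"
# hits suggest the queue is too small, and "pbuf" write failures
# indicate decoders falling behind.
grep -c "deleting oldest product" ~ldm/logs/ldmd.log*
grep -c "pbuf" ~ldm/logs/ldmd.log*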

My best suggestion is to determine whether your system is in fact IO
bound. If you are still using an IRIX workstation, you can use the
"sar" command to check how busy the disks are, for example with
"sar -d 5 100". You might need to look at this at different times,
such as when pqexpire is actively running.
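
As a sketch, capturing a window long enough to include a pqexpire run
(the output file name is arbitrary):

# 100 samples at 5-second intervals is a little over 8 minutes,
# enough to span at least one pqexpire run at the default 5-minute
# interval. Watch the %busy column for the disk that holds the
# product queue and the decoded data files.
sar -d 5 100 > /tmp/sar_disk.out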

Steve Chiswell
Unidata User Support

>From: address@hidden (William Gallus)
>Organization: UCAR/Unidata
>Keywords: 200005222108.e4ML8PT07077

>Steve,
>
>I assume Don passed on my email last Friday to you.  After emailing
>him, we decided to try one more thing.  I mentioned that in the past,
>when our .gem files were corrupted, we had found that if we moved all
>of them to a separate subdirectory, the newer data coming in would
>NOT be corrupted.  We had tried this last Tuesday, but it didn't work.
>We realized on Friday that when we tried that fix, our CONDUIT data
>was continuing to come in.  Geff turned off the CONDUIT data on Thursday,
>thinking that was the problem, but that still did not fix the problem.
>We realized on Friday that maybe we should do both -- have CONDUIT
>turned off AND move all existing .gem files to an "old" subdirectory.
>After doing this, it appears our problem disappeared.
>
>Now.......this seems like a very crude band-aid solution.  I'd like
>to get at the root of the problem, especially since I would like to
>ingest the CONDUIT data.  
>
>It seems like the creation of new gempak files is dependent somewhat
>on older ones.  In other words, when we've had corrupted data, we've
>had to get rid of all existing gempak data to stop the corruption.
>Should that be the case?   What can we do to truly fix the problem?
>I can't recall the details of the other 2 or 3 times this has happened
>in the past 9 months or so, but this time around, it happened within
>a few days of getting the CONDUIT stream started, and it seemed to
>happen over the weekend when our CONDUIT data temporarily had
>stopped coming in.
>
>Bill
>**********************************************
>*  Bill Gallus                               *
>*  Asst. Prof. of Meteorology                *
>*  Dept. of Geological and Atmospheric Sci.  *
>*  3025 Agronomy Hall                        *
>*  Iowa State University                     *
>*  Ames, IA 50011                            *
>*  (phone) 515-294-2270                      *
>*  (fax)   515-294-2619                      *
>**********************************************
>