
20021126: dmraob.k and dmsyn.k hanging on weather2; LDM memory leak (cont.)



>From: Gilbert Sebenste <address@hidden>
>Organization: NIU
>Keywords: 200210050242.g952g0127088 McIDAS-XCD DMRAOB DMSYN

Gilbert,

>Thanks for all the hard work! What is the diagnosis?

My opinion is that there is something wrong with either GCC 3.2, the C
libraries that are on weather2, something else in RedHat 8, or something
in your system configuration on weather2.

I had 'top' running on your system and a Solaris x86 system here at the
UPC as the clock ticked past 0Z so I could monitor the size of XCD
data monitors.  As soon as surface METAR and synoptic/ship/buoy data for a
new day started to arrive, the XCD data monitors DMSFC and DMSYN on
weather2 both more than doubled in size.  This is the situation that would
cause DMSYN to go into an infinite loop.  At the same time and while
receiving the same data, our Solaris x86 system showed no change in
size of either of these data monitors.  The same version of McIDAS-X,
-XCD was built using gcc/g77 on both your and our systems, but the GCC
version we are using on our x86 box is 2.95.3.  GCC on weather2 is
3.2.
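
For anyone who wants to log the growth instead of watching 'top' by
hand, here is a minimal sketch.  It assumes a Linux-style text
/proc/<pid>/status (as on weather2) and would not work as written on
Solaris.  The function names are hypothetical and are not part of
McIDAS or the LDM.

```python
import re

def parse_vm_fields(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        m = re.match(r"(VmSize|VmRSS):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def sample(pid):
    """Read the current sizes for one process (assumes Linux /proc)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_fields(f.read())

def growth_factor(before, after, field="VmSize"):
    """Ratio of a size field between two samples, e.g. across 0Z."""
    return after[field] / before[field]
```

Taking one sample of a data monitor's pid shortly before 0Z and one
shortly after, a growth_factor above 2 would correspond to the "more
than doubled" behavior described above.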

>Also, I notice that 
>pqact is taking up 21 MB of RAM, so it still makes me wonder if 
>something fishy isn't happening there.

You have to be careful when interpreting the sizes of LDM processes.
The reason is that their size will reflect the memory mapped LDM
queue.  For instance, the size for pqact as indicated by 'top' on our
x86 system is:

26725 ldm        1  58    0 1940M  460M sleep  11:13  0.56% pqact

This reflects the fact that the queue on our machine is 2 GB.
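
The effect is easy to reproduce outside of the LDM.  The sketch below
is an illustration only, not LDM code: it maps a sparse temporary file
and reports the virtual size of the process before and after.  The
helper names are hypothetical, and the size readout assumes a
Linux-style /proc (it returns None elsewhere).

```python
import mmap
import tempfile

def vm_size_kb():
    """Virtual size of this process in kB from Linux /proc; None elsewhere."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmSize:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None

def map_file_of_size(nbytes):
    """Create a sparse temporary file of nbytes and memory-map all of it."""
    f = tempfile.TemporaryFile()
    f.truncate(nbytes)  # sparse: no data blocks are written yet
    return f, mmap.mmap(f.fileno(), nbytes)

if __name__ == "__main__":
    before = vm_size_kb()
    f, m = map_file_of_size(64 * 1024 * 1024)
    after = vm_size_kb()
    print("before:", before, "kB  after:", after, "kB")
    m.close()
    f.close()
```

The reported size jumps by roughly the size of the mapping even though
nothing was allocated with malloc, which is why a large size for pqact
is not by itself evidence of a leak.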

>Also, please note that I am running the non-bugfixed version of McIDAS on 
>weather.admin...and it doesn't hang there. However, I have 1 GB of memory 
>on that machine, vs. 500 MB on weather2. Maybe that could help provide a 
>clue?

Ah so...  This is probably telling us something very important.  I
talked with our system admin about what was happening on weather2, and
he mused that what we are seeing may be something that is isolated to
weather2 alone.  Your comment that unpatched -XCD on a different RH 8
system at NIU does not show hangs strongly suggests that there is
something fundamentally wrong with the OS installation on weather2.
Exactly what that may be I can't say.  It is "funny" (not ha ha) that
a routine trying to malloc a small (~ 82 KB) amount of memory on
weather2 sends the data monitor into an infinite loop even though
there is LOTS of swap space available ( > 0.5 GB).  This sort of
implies that there is something amiss with the swapping.  Again, what
it may be I can't say.

Each time dmsyn.k hung on weather2, an examination of the core
file produced by sending the process an ABRT signal ('kill -ABRT') showed
that the routine that was in a tight loop was one that organizes memory
on behalf of malloc.  That routine can be found in /lib/libc.so.6.  It
was weird that this routine would hang especially when there was ample
swap space on disk that could have been used to swap things out of
memory.  I was suspicious of some sort of memory starvation on weather2
quite some time ago.  If you will remember our phone conversation, I
asked if it would be possible to put more memory in weather2 to see if
that wouldn't solve your problem.

For the record: all of the memory leaks I found in McIDAS C routines
could not have amounted to the increase in executable sizes that I was
seeing when the data monitors went off to create an MD file for a new
day's data.  In fact, the amount of memory that was used for this task
is only about one tenth the amount that the executable would grow to.
This remains a mystery to me.

Tom

>From address@hidden Tue Nov 26 11:50:59 2002
>Subject: Re: 20021126: dmraob.k and dmsyn.k hanging on weather2; LDM memory 
>leak (cont.) 

re: opinion is there is something wrong with either GCC 3.2, the C
libraries that are on weather2, something else in RedHat 8, or something
in your system configuration on weather2.

>I suspect GCC. My Weather2 RedHat installation was done from scratch,
>unlike the others which have been updated since 8.0 (not a "clean 
>install"), as they say.  Weather3 is getting old and is starting to fail,
>but weather is running just fine.

re: same version of data monitors on other machines don't grow; you are
using GCC 3.2

>Right.
 
re: interpreting size of LDM programs

OK.
 
re: XCD running on weather with no meltdown may be problem with weather2

>Or that the memory leaks were never big enough to overwhelm weather, but 
>they did on half the memory size on weather2?

re: memory leaks weren't big enough to cause size increases seen for
data monitors

>Well, we'll keep monitoring. In any case, I'll upgrade GCC again when the 
>next patch comes out. Thanks for the hard work and the trouble...I assume 
>these patches will be in place for all future versions.