
20040623: LDM-6.0.14 assertion error under Fedora Core 2 (2.6 kernel)



>From: Unidata User Support <address@hidden>
>Organization: Unidata Program Center/UCAR
>Keywords: LDM Fedora Core 2 Linux

Hi Steve, David, and Waldenio,

David Garrana of UFRJ reported a problem he has been seeing on one of
his machines at UFRJ.  David's email is included -- with comments --
at the end of this message.

Here is the ldmd.log output from an ldm-6.0.14 system at UFRJ.  Several
notes follow:

- system running Fedora Core 2:

$ uname -a
Linux cardeal.meteoro.ufrj.br 2.6.6-1.435smp #1 SMP Mon Jun 14 09:31:56 EDT 2004 i686 athlon i386 GNU/Linux

- originally, the ~ldm/data directory was owned by ldm:opera, but the
  user:group of the 'ldm' account was ldm:ldm.  This indicates that the
  primary group for 'ldm' was different when ~ldm/data was created

- the permissions on 'ldm's HOME directory are 755 -- no group write
  permission.  I have to believe that this is important given the
  ownership on the ~ldm/data directory.

- LDM queue size was the default 400000000

- 'ldmadmin queuecheck' showed a difference of 4 between the number of
  products in the queue and the count recorded in the queue header (the
  header showed 4 more)
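The group mismatch noted above can be spotted with standard tools.  Here
is a minimal, self-contained sketch; it uses a temporary directory rather
than ~ldm/data, so the paths are illustrative only:

```shell
# Sketch: compare a directory's group with the current account's primary
# group.  On cardeal the interesting pair was ~ldm/data (ldm:opera) versus
# the 'ldm' account (ldm:ldm); a freshly created directory stands in here.
dir=$(mktemp -d)
dir_group=$(stat -c '%G' "$dir")   # group owning the directory (GNU stat)
user_group=$(id -gn)               # current account's primary group
if [ "$dir_group" != "$user_group" ]; then
    echo "mismatch: $dir has group $dir_group, primary group is $user_group"
else
    echo "groups agree: $user_group"
fi
rmdir "$dir"
```

A mismatch like the one on cardeal suggests the directory predates a
change in the account's primary group.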

Here is the output from one LDM log file on cardeal:

Jun 23 20:45:49 cardeal rpc.ldmd[17204]: Starting Up (version: 6.0.14; built: Jun 21 2004 14:28:21)
Jun 23 20:45:49 cardeal pqbinstats[17206]: Starting Up (17204)
Jun 23 20:45:49 cardeal solon[17209]: Starting Up(6.0.14): solon.meteoro.ufrj.br: TS_ZERO TS_ENDT {{IDS|DDPLUS,  ".*"}}
Jun 23 20:45:49 cardeal rtstats[17208]: Starting Up (17204)
Jun 23 20:45:49 cardeal pqact[17207]: Starting Up
Jun 23 20:45:49 cardeal solon[17209]: Desired product class: 20040623194549.337 TS_ENDT {{IDS|DDPLUS,  ".*"}}
Jun 23 20:45:49 cardeal solon[17209]: Connected to upstream LDM-6
Jun 23 20:45:49 cardeal solon[17209]: Upstream LDM is willing to feed
Jun 23 20:45:49 cardeal solon[17209]: assertion "rl->nempty <= RL_FREE_OVERHEAD" failed: file "pq.c", line 958
Jun 23 20:45:55 cardeal rpc.ldmd[17204]: child 17209 terminated by signal 6 
Jun 23 20:45:55 cardeal rpc.ldmd[17204]: Killing (SIGINT) process group 
Jun 23 20:45:55 cardeal rpc.ldmd[17204]: SIGINT 
Jun 23 20:45:55 cardeal pqbinstats[17206]: Interrupt 
Jun 23 20:45:55 cardeal pqbinstats[17206]: Exiting 
Jun 23 20:45:55 cardeal pqact[17207]: Interrupt 
Jun 23 20:45:55 cardeal pqact[17207]: Exiting 
Jun 23 20:45:55 cardeal rtstats[17208]: Interrupt 
Jun 23 20:45:55 cardeal rtstats[17208]: Exiting 
Jun 23 20:45:55 cardeal rpc.ldmd[17204]: Terminating process group 

What I did to try and get around this problem was:

- login as 'ldm'
- cd ~ldm
- ldmadmin stop           <- not needed; LDM was not running, but this
                             does the needed 'ldmadmin clean'
- ls -alt data/*          <- the only thing in the directory was the
                             LDM queue
- mkdir data              <- now owned by ldm:ldm
- ldmadmin mkqueue        <- 400000000 queue size
- ldmadmin start
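In script form, the sequence above looks like the following.  This is a
sketch only: it assumes a stock 'ldmadmin' on the PATH and the default
directory layout, and since the notes above do not show exactly how the
old data directory was set aside, 'rm -rf' stands in for whatever was
actually used:

```shell
#!/bin/sh
# Recovery sketch for the ownership problem, run as the 'ldm' user.
cd ~ldm || exit 1
ldmadmin stop       # also performs the needed 'ldmadmin clean'
ls -alt data/       # verify only the LDM queue is in the directory
rm -rf data         # the old directory was owned by ldm:opera
mkdir data          # recreated with the current ldm:ldm ownership
ldmadmin mkqueue    # rebuild the default 400000000-byte queue
ldmadmin start
```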

Relevant uncommented entries in ~ldm/etc/ldmd.conf are:

[ldm@cardeal ldm]$ grep -v ^# etc/ldmd.conf
exec    "pqbinstats"
exec    "pqact"
exec    "rtstats -h rtstats.unidata.ucar.edu"
request IDS|DDPLUS      ".*"     solon.meteoro.ufrj.br PRIMARY
allow   ANY
    ^((localhost|loopback)|(127\.0\.0\.1\.?$)|([a-z].*\.unidata\.ucar\.edu\.?$))
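The allow pattern above is an extended regular expression.  A quick way to
see which hosts it admits is to test candidate names against it with grep
(a sketch; ldmd's own matching is assumed here to behave like an ERE):

```shell
# Test hostnames against the allow pattern from ldmd.conf.
pattern='^((localhost|loopback)|(127\.0\.0\.1\.?$)|([a-z].*\.unidata\.ucar\.edu\.?$))'
for host in localhost 127.0.0.1 www.unidata.ucar.edu cardeal.meteoro.ufrj.br
do
    if echo "$host" | grep -qE "$pattern"; then
        echo "$host: allowed"
    else
        echo "$host: denied"
    fi
done
# The first three hosts match a branch of the pattern; cardeal does not.
```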

The log file above showed that the LDM exited virtually immediately
after startup.  After cleanup, the LDM has been running for 17 minutes
while I write this message.

>Date: Wed, 23 Jun 2004 18:06:06 -0300 (EST)
>From: David Garrana Coelho <address@hidden>
>To: Waldenio Almeida <address@hidden>
>Cc: Tom Yoksas <address@hidden>
>Subject: Solon unstable?

>Hi Tom,

>       I've been trying for quite some time to write this e-mail, but
>since it's a recurring problem, not a "halt everything" one, it kept
>getting delayed.

I know the feeling...

>       For some time now, solon occasionally denies connections to cptec
>and, it seems, to brisa and a new machine at UFRJ as well.

I just want to point out that the log output you sent does not show
that solon 'eventually denies' a connection to cardeal.  The problem
is that the rpc.ldmd process on the receiving side (cardeal) exits.
The sending rpc.ldmd sees this condition and exits -- with no error
status.  If there had been an error exit, the LDM on solon would
have exited.

Now, different log messages on solon _do_ show that connections get
denied:

 ...
logs/ldmd.log.12:Jun 16 14:20:14 solon rpc.ldmd[3743]: Denying connection from 200.9.149.140
logs/ldmd.log.2:Jun 23 20:00:06 solon rpc.ldmd[86790]: Denying connection from 150.163.141.196
 ...
logs/ldmd.log.9:Jun 18 12:05:07 solon rpc.ldmd[44018]: Denying connection from 150.163.146.249

These denials show that DNS was not available for those machines when
the data request was made.  Since you are dealing with a small number
of sites feeding off of solon currently, I decided to add allows for
these specific machines by IP address to the ~ldm/etc/ldmd.conf file:

# Allows by IP in case DNS is not working
# 150.163.141.196 -> moingobe.cptec.inpe.br
# 150.163.146.249 -> monhane.cptec.inpe.br
# 200.9.149.140   -> ?
allow   ANY     ^150\.163\.141\.196$
allow   ANY     ^150\.163\.146\.249$
allow   ANY     ^200\.9\.149\.140$

So, if DNS is not available, these machines will be allowed by their
IP.  If DNS is available, they will be allowed by their fully qualified
hostname.
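One detail worth noting in the entries above: the dots are escaped and the
patterns are anchored with '^' and '$'.  An unescaped dot matches any
character, so without the backslashes a lookalike string could slip
through.  A quick illustration (again assuming ERE matching):

```shell
# Escaped, anchored pattern versus a naive unescaped one.
escaped='^150\.163\.141\.196$'
unescaped='150.163.141.196'
echo "150.163.141.196" | grep -qE "$escaped"   && echo "real IP: match"
echo "150x163x141x196" | grep -qE "$escaped"   || echo "lookalike: no match"
# The unescaped dots match ANY character, so the lookalike gets through:
echo "150x163x141x196" | grep -qE "$unescaped" && echo "lookalike matches unescaped pattern"
```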

>I was (and am)
>trying to feed GTS data into (MM5) model initialization, and to do so I
>compiled and installed LDM 6.0.14 on cardeal.meteoro.ufrj.br.  Everything
>went well until last night, when a strange error appeared:

 -- log file output from cardeal included above --

>And the log on solon:
>
>Jun 23 20:45:49 solon cardeal(feed)[19141]: up6.c:331: Starting Up(6.0.14/6): 20040623194549.337 TS_ENDT {{IDS|DDPLUS,  ".*"}}
>Jun 23 20:45:49 solon cardeal(feed)[19141]: topo:  cardeal.meteoro.ufrj.br IDS|DDPLUS
>Jun 23 20:45:49 solon cardeal(feed)[19141]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe
>Jun 23 20:45:49 solon cardeal(feed)[19141]: up6.c:396: Product send failure: Input/output error
>Jun 23 20:45:55 solon rpc.ldmd[18937]: child 19141 exited with status 6

This log output shows that the process requesting data from solon
exited in the middle of a product.  The correct action at this point is
for the sending rpc.ldmd to exit, and it does.

>       Which in turn is pretty similar to the messages moingobe and
>master generate when they fail to access solon.

Since I have login access to solon, I could get on and look at the
~ldm/logs/ldmd.log* files.  I found multiple instances of problems
sending data to moingobe.  Here is one example:

ldmd.log.9:Jun 17 18:48:00 solon moingobe(feed)[124]: topo:  moingobe.cptec.inpe.br WMO
ldmd.log.9:Jun 17 18:48:01 solon moingobe(feed)[114]: up6.c:168: HEREIS: RPC: Unable to send; errno = Broken pipe

This is saying that the pipe between solon and moingobe went away, and
the feeding process on solon exited after it noticed this.  The most
likely cause is a dropped network connection between solon and
moingobe.  The good news is that a new requesting process will be
started on moingobe, and it will request all data in each datastream
that is newer than the last product it received.

>I tried delqueue and
>mkqueue, clean, stop, and start on both sides, to no effect. It simply
>(apparently) appeared out of thin air.

Since I was suspicious of the different ownership of the ~ldm/data
directory, I removed and remade the directory as 'ldm'.  The LDM has
now been running for an hour, and data continues to flow.

>Maybe it's related to the issues
>moingobe & Co. are having with solon, maybe not. Just extra input.

I believe that the two problems are not related since they have
different symptoms.

>       So, knowing you are very busy, can you please check the logs on
>solon and try to find out why moingobe is getting denied so often, and
>even local machines?

I did so right after seeing your note.  I will monitor the situation
on cardeal to see if data ingestion stops.

I must tell you that you are breaking new ground here by running the
LDM under Fedora Core 2.  We have one machine running this release, and
we just started testing under it on Monday.

>Waldenio and I discussed some possible causes, but
>none seems plausible so far. No network stoppages or configuration
>changes seem to be related to the issues.

OK.

>       BTW, you can log on to cardeal.meteoro.ufrj.br, account 'ldm' (same
>password as the other ldm accounts), to check if you need/wish.  It's a
>dual Athlon MP 2100+ running Fedora Core 2 (kernel 2.6), one of the nodes
>of our small cluster (which in turn is running South America MM5 model
>forecasts in real time, soon to be available to the public).

Being able to log on was very helpful, thanks.

By the way, in the future you should send LDM problems directly to our
Support box, address@hidden.  Since three of us read this
mailbox, you will be more likely to get a fast response than if you
send the note directly to me.

We should keep an eye on cardeal to see if the exiting problem
recurs.

>       Best regards,
>###########################################################
># David Garrana Coelho - address@hidden              #
># Pesquisador Associado ao                                #
># Laboratorio de Prognosticos em Mesoescala - UFRJ        #
># http://www.lpm.meteoro.ufrj.br                          #
># Meteorologista da Equipe do Alerta-Rio/GEORIO           #
># http://www2.rio.rj.gov.br/georio/site/alerta/alerta.htm #
>###########################################################

Cheers,

Tom
--
+-----------------------------------------------------------------------------+
* Tom Yoksas                                             UCAR Unidata Program *
* (303) 497-8642 (last resort)                                  P.O. Box 3000 *
* address@hidden                                   Boulder, CO 80307 *
* Unidata WWW Service                            http://www.unidata.ucar.edu/ *
+-----------------------------------------------------------------------------+