
20050407: LDM Error Messages in System Logs: What do they mean?



Angelo,

> >To: address@hidden
> >From: "angelo alvarez" <address@hidden>
> >Subject: LDM - LDM Error Messages in System Logs: What do they mean?
> >Organization: NPMOC/JTWC
> >Keywords: 200504072353.j37Nr0k4014186 LDM

The above message contained the following:

> Institution: NPMOC/JTWC
> Package Version:  6.0.15
> Operating System: Red Hat Linux 9.0
> Hardware Information: Dell Poweredge 1650
> Inquiry: Aloha.  I am seeing the following errors in our system logs:
> 
> Apr 07 15:04:31 wxmap2 oahu[542]: NOTICE: requester6.c:456; ldm_clnt.c:286: 
> nullproc_6 failure to oahu.npmoc.navy.mil; ldm_clnt.c:142: RPC: 
> Program/version mismatch; low version = 4, high version = 5

The above means that a downstream LDM on host "wxmap2" failed to
communicate with an upstream LDM on host "oahu" using the LDM-6 protocol
because the upstream LDM uses versions 4 and 5 of the protocol but not
version 6.  This can happen when the communications channel is initially
established and is not important because the downstream LDM-6 will
subsequently use the LDM-5 protocol to establish the connection.
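As a rough illustration of how to read that message, the following sketch pulls the supported version range out of the log text and picks the highest version both sides could use. It is not the LDM's own code. The helper names are hypothetical, and the real downstream LDM-6 simply retries with the LDM-5 protocol as described above.

```python
import re

# Hypothetical helpers for reading the log message; not part of the LDM package.
MISMATCH_RE = re.compile(r"low version = (\d+), high version = (\d+)")


def parse_version_range(message):
    """Return (low, high) from an RPC program/version mismatch message, or None."""
    match = MISMATCH_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def choose_fallback(preferred, low, high):
    """Highest upstream-supported version not above `preferred`, or None if no overlap."""
    if low > preferred:
        return None
    return min(preferred, high)


log_line = ("nullproc_6 failure to oahu.npmoc.navy.mil; ldm_clnt.c:142: RPC: "
            "Program/version mismatch; low version = 4, high version = 5")
low, high = parse_version_range(log_line)
print(choose_fallback(6, low, high))  # prints 5
```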

Because LDM-6 is so much more efficient than LDM-5, however, it would be
to your advantage if the LDM on host "oahu" were upgraded.

> Apr 07 19:19:47 wxmap2 pqact[2263]: assertion "tvp->tv_sec >= TS_ZERO.tv_sec 
> && tvp->tv_usec >= TS_ZERO.tv_usec && tvp->tv_sec <= TS_ENDT.tv_sec && 
> tvp->tv_usec <= TS_ENDT.tv_usec" failed: file "pq.c", line 4935

The above means that the pq_cset() function of the "pq" module was
called with an invalid timestamp argument.  I don't know why.  I would
like to see a stack-trace of the core file if that's possible.
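To make the failed condition easier to read, here is a small sketch of the same range check. The values used for TS_ZERO and TS_ENDT are assumptions made for illustration (zero, and the largest 32-bit signed second count with 999999 microseconds); the definitions in your copy of the LDM source are what actually apply.

```python
from collections import namedtuple

Timeval = namedtuple("Timeval", ["tv_sec", "tv_usec"])

# Assumed bounds, for illustration only; check the LDM source for the real values.
TS_ZERO = Timeval(0, 0)
TS_ENDT = Timeval(0x7FFFFFFF, 999999)


def is_valid_timestamp(tvp):
    """Mirror the quoted assertion: each field is checked against both bounds."""
    return (tvp.tv_sec >= TS_ZERO.tv_sec
            and tvp.tv_usec >= TS_ZERO.tv_usec
            and tvp.tv_sec <= TS_ENDT.tv_sec
            and tvp.tv_usec <= TS_ENDT.tv_usec)


print(is_valid_timestamp(Timeval(1112900000, 0)))  # an April 2005 time
print(is_valid_timestamp(Timeval(-1, 0)))          # negative seconds
print(is_valid_timestamp(Timeval(0, 1000000)))     # microseconds out of range
```

Under these assumed bounds, a negative seconds field or a microseconds field of a million or more would trip the assertion.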

Normally, the LDM package is built with assertions turned off.  I take
it you're using an older version of the package.  The latest version has
had several bugs in the "pq" module fixed.

> Apr 07 19:19:47 wxmap2 pqexpire[2265]: assertion "IsFree(rght)" failed: file 
> "pq.c", line 1791

That assertion is way down in the bowels of the "pq" module and it would
take me some time to understand it.  It doesn't seem to affect the
latest LDM.  Can I interest you in that?

The pqbinstats(1) program has given us problems in the past and, unless
you use the "ldmadmin dostats" command or the ldmprods(1) utility, it is
useless.  I strongly recommend removing or commenting-out its entry in
the LDM configuration-file, ldmd.conf.

You might have seen my advice to this effect a few weeks ago on the
ldm-users mailing-list.

> Apr 07 19:19:46 wxmap2 rpc.ldmd[2261]: Starting Up (version: 6.0.15; built: 
> Dec 22 2004 16:01:01)
> Apr 07 19:19:47 wxmap2 pqbinstats[2262]: Starting Up (2261)
> Apr 07 19:19:47 wxmap2 pqbinstats[2262]: assertion "tvp->tv_sec >= 
> TS_ZERO.tv_sec && tvp->tv_usec >= TS_ZERO.tv_usec && tvp->tv_sec <= 
> TS_ENDT.tv_sec && tvp->tv_usec <= TS_ENDT.tv_usec" failed: file "pq.c", line 
> 4935
> Apr 07 19:19:47 wxmap2 pqact[2263]: Starting Up
> Apr 07 19:19:47 wxmap2 pqact[2263]: assertion "tvp->tv_sec >= TS_ZERO.tv_sec 
> && tvp->tv_usec >= TS_ZERO.tv_usec && tvp->tv_sec <= TS_ENDT.tv_sec && 
> tvp->tv_usec <= TS_ENDT.tv_usec" failed: file "pq.c", line 4935
> Apr 07 19:19:47 wxmap2 pqexpire[2265]: Starting Up
> Apr 07 19:19:47 wxmap2 oahu[2267]: Starting Up(6.0.15): oahu.npmoc.navy.mil: 
> TS_ZERO TS_ENDT {{ANY,  ".*"}}
> Apr 07 19:19:47 wxmap2 oahu[2267]: assertion "tvp->tv_sec >= TS_ZERO.tv_sec 
> && tvp->tv_usec >= TS_ZERO.tv_usec && tvp->tv_sec <= TS_ENDT.tv_sec && 
> tvp->tv_usec <= TS_ENDT.tv_usec" failed: file "pq.c", line 4935
> Apr  7 09:19:47 wxmap2 su(pam_unix)[2215]: session closed for user ldm
> Apr 07 19:19:47 wxmap2 pqexpire[2265]: assertion "IsFree(rght)" failed: file 
> "pq.c", line 1791
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: child 2262 terminated by signal 6
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: Killing (SIGINT) process group
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: SIGINT
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: Terminating process group
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: child 2263 terminated by signal 6
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: Killing (SIGINT) process group
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: child 2265 terminated by signal 6
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: Killing (SIGINT) process group
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: child 2266 exited with status 127
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: child 2267 terminated by signal 6
> Apr 07 19:19:53 wxmap2 rpc.ldmd[2261]: Killing (SIGINT) process group

That assertion failure is showing up a lot.  At this point I suspect
that your LDM product-queue might be corrupt (i.e., in an inconsistent
state).  This can happen if the host system crashes.  Did it?
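One way to see how often each assertion fires is to tally the failures by source file and line. The sketch below does that for a few of the log lines quoted above, rejoined onto single lines. The regular expression and the function name are my own assumptions about the syslog layout, not anything supplied by the LDM package.

```python
import re
from collections import Counter

# Assumed layout: program[pid]: assertion "<expr>" failed: file "<file>", line <n>
ASSERT_RE = re.compile(
    r'(?P<prog>[^\s\[]+)\[(?P<pid>\d+)\]: assertion "(?P<expr>.+)" failed: '
    r'file "(?P<file>[^"]+)", line (?P<line>\d+)'
)

TS_EXPR = ("tvp->tv_sec >= TS_ZERO.tv_sec && tvp->tv_usec >= TS_ZERO.tv_usec && "
           "tvp->tv_sec <= TS_ENDT.tv_sec && tvp->tv_usec <= TS_ENDT.tv_usec")

sample = [
    f'Apr 07 19:19:47 wxmap2 pqbinstats[2262]: assertion "{TS_EXPR}" failed: file "pq.c", line 4935',
    f'Apr 07 19:19:47 wxmap2 pqact[2263]: assertion "{TS_EXPR}" failed: file "pq.c", line 4935',
    'Apr 07 19:19:47 wxmap2 pqexpire[2265]: Starting Up',
    f'Apr 07 19:19:47 wxmap2 oahu[2267]: assertion "{TS_EXPR}" failed: file "pq.c", line 4935',
    'Apr 07 19:19:47 wxmap2 pqexpire[2265]: assertion "IsFree(rght)" failed: file "pq.c", line 1791',
]


def count_assertion_sites(lines):
    """Count assertion failures keyed by (source file, line number)."""
    sites = Counter()
    for text in lines:
        match = ASSERT_RE.search(text)
        if match:
            sites[(match.group("file"), int(match.group("line")))] += 1
    return sites


for site, count in count_assertion_sites(sample).most_common():
    print(site, count)
```

Several different programs failing at the same line of "pq.c" points at something they share, the product-queue, rather than at any one program.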

I recommend recreating the product-queue:

    ldmadmin stop
    ldmadmin delqueue
    ldmadmin mkqueue -f
    ldmadmin start
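If you would rather script those four steps, a minimal sketch follows. It runs the commands in the order given and stops at the first one that fails, so the LDM is not restarted on top of a queue that could not be recreated. The function name is hypothetical, and it assumes that ldmadmin is on the PATH of the LDM user and exits with a non-zero status on failure.

```python
import subprocess

# The four commands, in the order given above.
RECREATE_STEPS = [
    ["ldmadmin", "stop"],
    ["ldmadmin", "delqueue"],
    ["ldmadmin", "mkqueue", "-f"],
    ["ldmadmin", "start"],
]


def recreate_queue(run=subprocess.run):
    """Run each step in order; an exception from `run` aborts the remaining steps."""
    for step in RECREATE_STEPS:
        # With check=True, subprocess.run raises CalledProcessError on a non-zero exit.
        run(step, check=True)
```

Calling recreate_queue() as the LDM user would run the sequence; the `run` parameter exists only so the sequence can be exercised without touching a real queue.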

To recapitulate my advice:

    1.  Upgrade the LDMs on wxmap2 and oahu to the latest version.

    2.  Remove the pqbinstats(1) entry from the LDM configuration-file.

    3.  Recreate the product-queue.

By the way, the "LDM Basics" webpages explain a lot of the LDM log
messages.  You can get to those pages from the LDM version-specific
hyperlink on the LDM homepage at

    http://www.unidata.ucar.edu/content/software/ldm

Regards,
Steve Emmerson
LDM Developer

> NOTE: All email exchanges with Unidata User Support are recorded in the
> Unidata inquiry tracking system and then made publicly available
> through the web.  If you do not want to have your interactions made
> available in this way, you must let us know in each email you send to us.