
20030130: LDM failure to start McIDAS-XCD



>From: William C Klein <address@hidden>
>Organization: Valparaiso
>Keywords: 200301301738.h0UHci615298 LDM McIDAS-XCD

Bill,

>I followed your steps, but I'm not seeing the monitors...
>
>[ aeolus : root : ~ ]
>[ 3 ] > su - ldm
>Sun Microsystems Inc.   SunOS 5.8       Generic Patch   October 2001
>You have new mail.
>
>[ aeolus : ldm : ~ ]
>[ 1 ] > which xcd_run
>/usr/local/ldm/decoders/xcd_run
>
>[ aeolus : ldm : ~ ]
>[ 2 ] > ldmadmin stop
>stopping the LDM server...
>LDM server stopped
>
>[ aeolus : ldm : ~ ]
>[ 3 ] > ps -u ldm
>   PID TTY      TIME CMD
> 27366 pts/5    0:00 tcsh
>
>[ aeolus : ldm : ~ ]
>[ 4 ] > ldmadmin start
>starting the LDM server...
>the LDM server has been started
>
>[ aeolus : ldm : ~ ]
>[ 5 ] > ps -u ldm
>   PID TTY      TIME CMD
> 27366 pts/5    0:00 tcsh
> 27598 ?        0:00 rpc.ldmd
> 27590 ?        0:00 pqact
> 27595 ?        0:01 rpc.ldmd
> 27591          0:00 <defunct>
> 27596 ?        0:01 rpc.ldmd
> 27593 ?        0:00 pqbinsta
> 27592 ?        0:00 pqact
> 27594 ?        0:00 rtstats
> 27589 ?        0:00 rpc.ldmd
> 27597 ?        0:00 rpc.ldmd
>
>[ aeolus : ldm : ~ ]
>[ 6 ] >

OK.  I logged onto aeolus and took a look around.  I see from the
XCD startup log, /home/mcidas/workdata/XCD_START.LOG, that something
has changed in the development environment on your system:

more XCD_START.LOG
Starting MONITOR at 03030.222311
ld.so.1: startxcd.k: fatal: relocation error: file startxcd.k: symbol __s_rsFe_pv: referenced symbol not found
Starting DDS at 03030.222315
ld.so.1: ingetext.k: fatal: relocation error: file ingetext.k: symbol __s_rsFe_pv: referenced symbol not found
 ...

The first entry shows that 'xcd_run MONITOR' was trying to start up the
McIDAS-XCD supervisor routine, startxcd.k, but ld.so.1 couldn't find
the symbol __s_rsFe_pv.  I suspect that the development environment/OS
was upgraded (with patches?) recently, and that some library is either
missing or is no longer compatible with the McIDAS executables.
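
One way to see which libraries or symbols the runtime linker cannot
resolve is to run 'ldd -r' against an executable and keep only the lines
that report something as not found.  A minimal sketch, assuming a
Bourne-compatible shell; the helper name 'unresolved' is hypothetical and
the exact wording of ldd's output may differ on your system:

```shell
#!/bin/sh
# Filter ldd output (read from stdin) down to the lines that report an
# unresolved library or symbol.  'unresolved' is a hypothetical helper
# name, not part of McIDAS or the LDM.
unresolved() {
    grep 'not found'
}
```

It would be used along the lines of
'ldd -r <path to startxcd.k> 2>&1 | unresolved', where the path is
wherever the McIDAS binaries are installed on aeolus.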

The next entry shows the same thing for the ingetext.k process.  This
gets started by an 'xcd_run' entry in ~ldm/etc/pqact.conf.  Since the
executable won't run, it didn't get started.  Since it was not running
when the next product came in, the pqact.conf entry for 'xcd_run' tried
to start it again.  This happened over and over again until your
system ran out of shared memory.  This is indicated by entries down
further in /home/mcidas/workdata/XCD_START.LOG:

Starting DDS at 03030.222353
ingetext.k: Cannot make positive UC: could not create 384300-byte shared memory segment
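
To get a quick feel for how many times each kind of failure occurred, the
startup log can be summarized with grep.  This is only a sketch, assuming
a Bourne-compatible shell; the function name 'summarize_xcd_log' is
hypothetical, and the patterns are taken from the log excerpts above:

```shell
#!/bin/sh
# Count runtime-linker failures and shared-memory failures in an XCD
# startup log.  $1 is the path to the log file.
summarize_xcd_log() {
    log=$1
    # grep -c prints the number of matching lines (0 if none match).
    reloc=`grep -c 'relocation error' "$log"`
    shm=`grep -c 'could not create .*shared memory' "$log"`
    echo "relocation_errors=$reloc shared_memory_errors=$shm"
}
```

For example: summarize_xcd_log /home/mcidas/workdata/XCD_START.LOG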

After seeing these errors, I decided to try to rebuild the McIDAS executables
to see if that would fix the problem.  I was unable to do this on aeolus
since the C and Fortran compilers were unable to contact the license manager.
Here is an example of the output while trying to compile one C routine:

[ aeolus : mcidas : ~/mcidas2002/src ]
[ 41 ] > cc -c mct.c

License Error : Cannot connect to the license server (neptune)..
        for product(Sun WorkShop Compiler C SPARC).
        (License server may not have been started)
Cannot connect to license server
 The server (lmgrd) has not been started yet, or
 the wrong port@host or license file is being used, or the
 port or hostname in the license file has been changed.
Feature:workshop.c.sparc
Server name:neptune
FLEXlm error:-15,12.  System Error: 146    Connection refused
cc: acomp failed for mct.c

So, your McIDAS executables are no longer executable on aeolus, and
new ones are not able to be built since the compilers can't connect to
the license manager.

Something was changed on your system.

After you straighten out your development environment, you need to do the
following:

<login as 'mcidas'>
cd mcidas2002/src
make all

make install.bin install.xcd bin

Then:

<login as 'ldm'>
ldmadmin stop
<wait to make sure all LDM processes have exited>
ldmadmin start

I left the LDM off since each time it is started, your system will run
out of shared memory because the McIDAS routines won't run.  Each time
this happens, all of the allocated and orphaned shared memory segments
need to be removed (using ipcrm) and the corresponding directories in
~ldm/.mctmp need to be fully removed.
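
The shared memory cleanup can be sketched as follows.  This assumes a
Bourne-compatible shell, Solaris-style 'ipcs -m' output (type, ID, key,
mode, owner, group), and that the orphaned segments are owned by 'ldm';
the function name 'shm_cleanup_cmds' is hypothetical.  It only prints
the ipcrm commands so that they can be reviewed before anything is
removed:

```shell
#!/bin/sh
# Read 'ipcs -m' output on stdin and print one 'ipcrm -m <id>' command
# for each shared memory segment owned by the user given as $1.
# Assumes Solaris-style columns: T ID KEY MODE OWNER GROUP.
shm_cleanup_cmds() {
    awk '$1 == "m" && $5 == owner { print "ipcrm -m " $2 }' owner="$1"
}
```

For example, 'ipcs -m | shm_cleanup_cmds ldm' lists the commands, and
piping that output to sh would run them once the list looks right.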

Tom