
20030409: ldm-6.0.10 issues under irix 6.5



Pete,

> To: address@hidden,
> To: address@hidden
> From: address@hidden (Pete Pokrandt)
> Subject: ldm-6.0.10 issues under irix 6.5
> Organization: University of Wisconsin

The above message contained the following:

> Built, and am running, ldm-6.0.10 on two linux machines (f5.aos.wisc.edu,
> and profhorn.aos.wisc.edu, mapmaker will be next), all good on that 
> front.
> 
> However, when I built on our SGI running irix 6.5.15m, using
> gcc compilers (freeware version 3.0.1), I get assertion failures 
> and core dumps. 

I'll try to duplicate your problem here on our IRIX 6.5 system.

> I'm going to have to fall back to 6.0.2 for now (which, after 
> core-dumping before I rebuilt the queue file the first time, 
> had been running ok.)
> 
> 
> Unidata support: here are some excerpts from the log files when 6.0.10
> crashes under irix.
> 
> This time, it just died, but did not dump core:
> 
> Apr 08 21:33:43 5Q:sunset zeus(feed)[1782]: topo:  zeus.lsc.vsc.edu DIFAX
> Apr 08 21:36:54 5Q:sunset kelvin[1991]: ldmprog_4: ldmping from kelvin.ca.uky.edu
> Apr 08 21:43:02 5Q:sunset rpc.ldmd[1697]: child 1710 terminated by signal 9

What was process 1710?
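
One way to answer that question is to search the full LDM log for
entries written by that process ID. The following is only a sketch: it
assumes the "program[pid]:" layout seen in the excerpt, and the helper
name who_was() is hypothetical, not part of the LDM package.

```python
import re

# Two entries taken from the log excerpt above (wrapped lines rejoined).
log_lines = [
    "Apr 08 21:33:43 5Q:sunset zeus(feed)[1782]: topo:  zeus.lsc.vsc.edu DIFAX",
    "Apr 08 21:43:02 5Q:sunset rpc.ldmd[1697]: child 1710 terminated by signal 9",
]

# Matches the "program[pid]:" field that precedes each message.
ENTRY = re.compile(r"\s(\S+)\[(\d+)\]:")

def who_was(pid, lines):
    """Return the set of program labels that logged under the given PID."""
    labels = set()
    for line in lines:
        match = ENTRY.search(line)
        if match and int(match.group(2)) == pid:
            labels.add(match.group(1))
    return labels

print(who_was(1782, log_lines))
print(who_was(1710, log_lines))
```

With only these two lines, the lookup for 1710 comes back empty, so the
answer would have to come from earlier entries in the complete log file.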

Signal 9 is SIGKILL, which cannot be caught or ignored by a process and
is actually handled by the operating system on behalf of the "receiving"
process.  Because this signal isn't used by the LDM package, the only
way a process of the LDM package could be "sent" this signal is by an
outside source.
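
To illustrate the point, here is a minimal Python sketch (not LDM code)
that assumes a POSIX system. It shows that a process cannot arrange to
ignore SIGKILL, and how a parent sees a child that was killed by it.
The child command is made up purely for the demonstration.

```python
import signal
import subprocess
import sys

# Asking to ignore SIGKILL is refused by the operating system.
try:
    signal.signal(signal.SIGKILL, signal.SIG_IGN)
    can_ignore_sigkill = True
except (OSError, ValueError):
    can_ignore_sigkill = False

# A throwaway child that sends SIGKILL to itself. The parent observes
# "terminated by signal", much as rpc.ldmd reported for its child.
child_code = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
result = subprocess.run([sys.executable, "-c", child_code])

# subprocess reports death by signal N as a return code of -N.
killed_by = -result.returncode if result.returncode < 0 else None

print("can ignore SIGKILL:", can_ignore_sigkill)
print("child terminated by signal:", killed_by)
```

The parent learns only which signal ended the child, not who sent it,
which is why the sender has to be tracked down some other way.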

Who or what "sent" the SIGKILL to process 1710?

> Apr 08 21:43:15 3Q:sunset DCNLDN[1692]: nldninput(): no data within timeout period: returning EOF
> Apr 08 21:43:15 3Q:sunset DCNLDN[1692]: nldninput(): NLDN read error
> Apr 08 21:43:19 5Q:sunset pqact[1699]: child 1692 exited with status 110
> Apr 08 21:46:54 5Q:sunset kelvin[2688]: ldmprog_4: ldmping from kelvin.ca.uky.edu
> Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: child 1685 terminated by signal 11

What was process 1685?

Signal 11 is SIGSEGV and indicates an attempt to access memory that
isn't in the address-space of the process.
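
For reference, the signal numbers in "terminated by signal" entries can
be mapped to names mechanically. This is a rough sketch that assumes
the message wording shown in the excerpt; the function name
decode_terminations() is hypothetical. Signal numbers are platform
dependent, although 9 and 11 are SIGKILL and SIGSEGV on most Unix
systems.

```python
import re
import signal

# The two termination entries from the log excerpt above.
log_lines = [
    "Apr 08 21:43:02 5Q:sunset rpc.ldmd[1697]: child 1710 terminated by signal 9",
    "Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: child 1685 terminated by signal 11",
]

TERMINATED = re.compile(r"child (\d+) terminated by signal (\d+)")

def decode_terminations(lines):
    """Return (child_pid, signal_name) pairs for each termination entry."""
    found = []
    for line in lines:
        match = TERMINATED.search(line)
        if match:
            pid = int(match.group(1))
            # signal.Signals uses the numbering of the local platform.
            name = signal.Signals(int(match.group(2))).name
            found.append((pid, name))
    return found

for pid, name in decode_terminations(log_lines):
    print(pid, name)
```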

The rest of the log entries are what I would expect.

> Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: Killing (SIGINT) process group
> Apr 08 21:52:53 5Q:sunset rpc.ldmd[1697]: SIGINT
> Apr 08 21:52:53 5Q:sunset mapmaker[1706]: SIGINT
> Apr 08 21:52:54 5Q:sunset mapmaker[1713]: SIGINT
> Apr 08 21:52:55 3Q:sunset mapmaker[1706]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:52:55 3Q:sunset mapmaker[1713]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:52:55 3Q:sunset mapmaker[1706]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
> Apr 08 21:52:55 3Q:sunset mapmaker[1713]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
> Apr 08 21:53:15 5Q:sunset rpc.ldmd[1697]: Terminating process group
> Apr 08 21:53:15 5Q:sunset mapmaker[1706]: SIGTERM
> Apr 08 21:53:15 5Q:sunset mapmaker[1713]: SIGTERM
> Apr 08 21:53:15 5Q:sunset pqbinstats[1701]: Interrupt
> Apr 08 21:53:15 5Q:sunset io(feed)[1757]: SIGTERM
> Apr 08 21:53:15 5Q:sunset pqact[1699]: Interrupt
> Apr 08 21:53:15 5Q:sunset f5(feed)[1750]: SIGTERM
> Apr 08 21:53:16 5Q:sunset io(feed)[1757]: SIGINT
> Apr 08 21:53:15 5Q:sunset zeus(feed)[1782]: SIGTERM
> Apr 08 21:53:16 5Q:sunset f5(feed)[1750]: SIGINT
> Apr 08 21:53:16 5Q:sunset pqbinstats[1701]: Exiting
> Apr 08 21:53:15 5Q:sunset shadow(feed)[1739]: SIGTERM
> Apr 08 21:53:15 5Q:sunset storm2(feed)[1743]: SIGTERM
> Apr 08 21:53:16 5Q:sunset kelvin(feed)[1763]: SIGTERM
> Apr 08 21:53:15 5Q:sunset accas(feed)[1746]: SIGTERM
> Apr 08 21:53:16 5Q:sunset zeus(feed)[1769]: SIGTERM
> Apr 08 21:53:16 5Q:sunset shadow(feed)[1739]: SIGINT
> Apr 08 21:53:16 5Q:sunset storm2(feed)[1743]: SIGINT
> Apr 08 21:53:16 5Q:sunset kelvin(feed)[1763]: SIGINT
> Apr 08 21:53:16 5Q:sunset accas(feed)[1746]: SIGINT
> Apr 08 21:53:16 5Q:sunset zeus(feed)[1769]: SIGINT
> Apr 08 21:53:16 5Q:sunset zeus(feed)[1782]: SIGINT
> Apr 08 21:53:16 5Q:sunset pqact[1699]: Exiting
> Apr 08 21:53:16 3Q:sunset pqact[1699]: mm0_mtof: Couldn't riul_r_find 0
> Apr 08 21:53:16 5Q:sunset io(feed)[1759]: SIGTERM
> Apr 08 21:53:16 5Q:sunset rtstats[1703]: Interrupt
> Apr 08 21:53:16 5Q:sunset io(feed)[1759]: SIGINT
> Apr 08 21:53:16 5Q:sunset rtstats[1703]: Exiting
> Apr 08 21:53:17 5Q:sunset f5[1711]: SIGTERM
> Apr 08 21:53:17 5Q:sunset f5[1711]: SIGINT
> Apr 08 21:53:18 5Q:sunset thelma[1707]: SIGTERM
> Apr 08 21:53:18 5Q:sunset thelma[1707]: SIGINT
> Apr 08 21:53:18 3Q:sunset thelma[1707]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:53:18 3Q:sunset thelma[1707]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed
> Apr 08 21:53:18 3Q:sunset f5[1711]: pmap_unset(LDMPROG 300029, LDMVERS 5) failed
> Apr 08 21:53:18 3Q:sunset f5[1711]: pmap_unset(LDMPROG 300029, LDMVERS 6) failed

I'll put the rest of your email aside for now while I try to duplicate
your problem and await your answers to the above questions.

Regards,
Steve Emmerson