
[LDM #BPI-711373]: LDM segfaulted



Carissa,

It would help if I could get a dump of the stack from the core-file, 
/var/spool/abrt/ccpp-2013-06-30-21:45:03-23471. Can you get that to me?
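
In the meantime, the kernel's "segfault at" line in your log already gives a 
rough location. Subtracting the start of the library's mapping from the 
instruction pointer typically yields the offset of the faulting instruction 
inside libldm.so.0.0.0, which a tool such as addr2line(1) or gdb(1) can usually 
map to a function if the library has symbols. Here is a minimal Python sketch 
of that arithmetic. The helper name and the regular expression are mine, for 
illustration only; the input is the line from your log with the e-mail 
line-wrapping removed.

```python
import re

# Kernel message copied from the quoted log, with the wrapped lines joined.
LINE = ("ldmd[23471]: segfault at 2adea9675818 ip 00002ad5ac00be2a "
        "sp 00007fff7dc37400 error 4 in libldm.so.0.0.0[2ad5abff2000+52000]")

# Illustrative pattern for the "segfault at" format seen in this log.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<error>\d+) "
    r"in (?P<library>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: extract the fields and compute the library offset."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a kernel segfault line")
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)
    return {
        "library": match["library"],
        "fault_address": int(match["addr"], 16),
        "error": int(match["error"]),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - base,
        "mapping_size": int(match["size"], 16),
    }


info = parse_segfault(LINE)
print(info["library"], hex(info["offset"]))
```

For this line the offset works out to 0x19e2a, which lies inside the 
0x52000-byte mapping. On x86, error code 4 means a user-mode read of a page 
that was not mapped.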

Regardless, I recommend upgrading to the latest version of the LDM (6.11.6). It 
does not appear to have the bug that caused LDM version 6.11.4 to receive a 
SIGSEGV on an RHEL system.

You can check the consistency of the product-queue with the pqcheck(1) utility 
(provided no other process has the product-queue open for writing) and with the 
pqcat(1) utility (be sure to redirect its standard output stream to /dev/null).
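
If you would like to script those two checks, something along these lines might 
do. It is only a sketch: the function names are made up, it assumes pqcheck and 
pqcat are on the PATH of the LDM user, it runs both with no options so that 
they use the default product-queue, and it treats a zero exit status as success 
(see the man pages for what the non-zero values mean).

```python
import subprocess


def exit_status(argv):
    """Run a command with its standard output discarded; return its exit status."""
    return subprocess.run(argv, stdout=subprocess.DEVNULL).returncode


def queue_looks_consistent():
    """Hypothetical wrapper: True if pqcheck and pqcat both exit with status 0.

    Assumes both utilities are on PATH and that no other process has the
    product-queue open for writing. pqcat is only run if pqcheck succeeds,
    because all() stops at the first failure.
    """
    return all(exit_status([tool]) == 0 for tool in ("pqcheck", "pqcat"))
```

Call queue_looks_consistent() as the LDM user while the LDM is stopped. 
Discarding the standard output matters for pqcat, which otherwise writes every 
data-product in the queue to the terminal.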

> Unidata,
> 
> We experienced a corrupt queue on 1 of our 4 supercomputer LDM feeds. We
> have gone to the admins who say there was no system issue around that
> time period and have pointed us back to the data being the issue.
> According to our admins the core dump indicates that LDM segfaulted at
> 21:45:03 (see below):
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    98963
> 20130630214502.648 NEXRAD2 293005
> L2-BZIP2/KBUF/20130630214449/293/5/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    53099
> 20130630214501.925 NEXRAD2 87045  L2-BZIP2/KILX/20130630214256/87/45/I/V06/0
> 
> We have this same LDM feed from all 4 systems, only 1 had an issue. I
> guess our main question is there any way to tell the difference if the
> corrupt queue was data related, or system related? I do notice that the
> core dump was put into a root directory, not the LDM home directory when
> we have decoder issues. Do you folks see any evidence of what might have
> caused this issue?
> 
> The log output is below.
> 
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    98963
> 20130630214502.648 NEXRAD2 293005
> L2-BZIP2/KBUF/20130630214449/293/5/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    53099
> 20130630214501.925 NEXRAD2 87045  L2-BZIP2/KILX/20130630214256/87/45/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    37749
> 20130630214503.004 NEXRAD2 550057
> L2-BZIP2/KHTX/20130630214031/550/57/I/V06/0
> Jun 30 21:45:03 t14d1 pqact[23461] INFO: [filel.c:297] Deleting closed
> FILE entry
> "/dcom/us007003/ldmdata/obs/upperair/nexrad_level2/KCBW/KCBW_20130630_214401.bz2"
> Jun 30 21:45:03 t14d1 kernel: ldmd[23471]: segfault at 2adea9675818 ip
> 00002ad5ac00be2a sp 00007fff7dc37400 error 4 in
> libldm.so.0.0.0[2ad5abff2000+52000]
> Jun 30 21:45:03 t14d1 sshd[18728]: Accepted publickey for dbnet from
> 140.90.100.184 port 52885 ssh2
> Jun 30 21:45:03 t14d1 sshd[18728]: pam_unix(sshd:session): session
> opened for user dbnet by (uid=0)
> Jun 30 21:45:03 t14d1 sshd[18795]: Accepted publickey for dbnet from
> 140.90.100.184 port 52886 ssh2
> Jun 30 21:45:03 t14d1 sshd[18795]: pam_unix(sshd:session): session
> opened for user dbnet by (uid=0)
> Jun 30 21:45:03 t14d1 abrt[18793]: Saved core dump of pid 23471
> (/gpfs/tmv/iodprod/dbnet/ldm/ldm-6.11.1/bin/ldmd) to
> /var/spool/abrt/ccpp-2013-06-30-21:45:03-23471 (1835008 bytes)
> Jun 30 21:45:03 t14d1 abrtd: Directory 'ccpp-2013-06-30-21:45:03-23471'
> creation detected
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: child 23471 terminated by signal 11
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: Killing (SIGTERM) process group
> Jun 30 21:45:03 t14d1 t10d2p.ncep.noaa.gov(feed)[20338] NOTE: Exiting
> Jun 30 21:45:03 t14d1 t14d2p.ncep.noaa.gov(feed)[26905] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: Exiting
> Jun 30 21:45:03 t14d1 outreach.aviationweather.noaa.go[23477] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23469] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23461] ERROR: fcntl F_RDLCK failed for rgn
> (0 SEEK_SET, 4096) 4: Interrupted system call
> Jun 30 21:45:03 t14d1 pqact[23461] NOTE: Exiting
> Jun 30 21:45:03 t14d1 205.156.51.46[23473] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] ERROR: fcntl F_RDLCK failed for rgn
> (0 SEEK_SET, 4096) 4: Interrupted system call
> Jun 30 21:45:03 t14d1 pqact[23460] ERROR: pq_sequence failed:
> Interrupted system call (errno = 4)
> Jun 30 21:45:03 t14d1 pqact[23460] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=11023,
> cmd="/nwprod/exec/decod_dcbthy -v 2 -t 480 -d
> /dcom/us007003/decoder_logsdecod_dcbthy.log /nwprod/fix/bufrtab.031"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26175,
> cmd="/nwprod/exec/decod_dcacar -v 2 -t 600 -d
> /dcom/us007003/decoder_logsdecod_dcacar.log
> /nwprod/fix/bufrtab.ARINC_ACARS /nwprod/fix/bufrtab.EUROPE_ACARS
> /nwprod/fix/bufrtab.CANADA_ACARS /nwprod/fix/bufrtab.FRANCE_ACARS
> /nwprod/fix/bufrtab.004"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26718,
> cmd="/nwprod/exec/decod_dcltng -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dcltng.log /nwprod/fix/bufrtab.007"
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23463] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=12452,
> cmd="/nwprod/exec/decod_dcdrbu -v 2 -t 365 -d
> /dcom/us007003/decoder_logsdecod_dcdrbu.log /nwprod/fix/bufrtab.001"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26112,
> cmd="/nwprod/exec/decod_dcmsfc -v 2 -t 480 -d
> /dcom/us007003/decoder_logsdecod_dcmsfc.log /nwprod/fix/bufrtab.001
> /nwprod/dictionaries/msfc.tbl /nwprod/dictionaries/tidg.tbl
> /nwprod/parm/decod_restricted.ship.headers"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=12552,
> cmd="/nwprod/exec/decod_dclsfc -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dclsfc.log /nwprod/fix/bufrtab.000
> /nwprod/dictionaries/lsfc.tbl /nwprod/parm/decod_WMO.Res40.headers"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26226,
> cmd="/nwprod/exec/decod_dcacft -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dcacft.log
> /nwprod/dictionaries/pirep.tbl /nwprod/dictionaries/airep.tbl
> /nwprod/fix/bufrtab.004"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26215,
> cmd="/nwprod/exec/decod_dcshef -v 2 -t 450 -d
> /dcom/us007003/decoder_logsdecod_dcshef.log /nwprod/parm/SHEFPARM
> /nwprod/dictionaries/shef.tbl /nwprod/fix/bufrtab.000
> /nwprod/fix/bufrtab.001 /nwprod/fix/bufrtab.255 A.E.G.P.R.S.T.U.X."
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26107,
> cmd="/nwprod/exec/decod_dcmetr -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dcmetr.log /nwprod/fix/bufrtab.000
> /nwprod/dictionaries/metar.tbl"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=7641, cmd="/nwprod/exec/decod_dcears
> -v 2 -t 450 -d /dcom/us007003/decoder_logs/ecod_dcears.log
> /nwprod/fix/bufrtab.EARS /nwprod/fix/bufrtab.021"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=1068, cmd="/nwprod/exec/decod_dcrocc
> -v 2 -t 600 -d /dcom/us007003/decoder_logs/ecod_dcrocc.log
> /nwprod/fix/bufrtab.003"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=25990,
> cmd="/nwprod/exec/decod_dcepfl -v 2 -t 450 -d
> /dcom/us007003/decoder_logsdecod_dcepfl.log
> /nwprod/fix/bufrtab.EUROPE_PROFILER /nwprod/fix/bufrtab.002"
> Jun 30 21:45:03 t14d1 205.156.51.46[23470] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23465] NOTE: Exiting
> Jun 30 21:45:03 t14d1 eldm.fsl.noaa.gov[23466] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23462] NOTE: Exiting
> Jun 30 21:45:03 t14d1 140.90.85.102[23475] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: Terminating process group
> Jun 30 21:45:03 t14d1 205.156.51.46[23472] INFO:    52428
> 20130630214503.125 NEXRAD2 514008
> L2-BZIP2/KRAX/20130630214439/514/8/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23472] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] NOTE: Behind by 0.219798 s
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: [uldb.c:1298] Entry for PID
> 23455 not found
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: [uldb.c:1909] Couldn't remove
> process from database
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 26905 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 20338 exited with status 0
> Jun 30 21:45:03 t14d1 pqact[23461] NOTE: Behind by 0.442305 s
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23477 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23469 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23473 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: child 23460 exited with status
> 1: pqact -f ANY-CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.conf
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23461 exited with status
> 0: pqact -f CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.craft
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23463 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23470 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23466 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23465 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23475 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23462 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23472 exited with status 0
> Jun 30 21:45:03 t14d1 abrtd: Executable
> '/gpfs/tmv/iodprod/dbnet/ldm/ldm-6.11.1/bin/ldmd' doesn't belong to any
> package
> Jun 30 21:45:03 t14d1 abrtd: 'post-create' on
> '/var/spool/abrt/ccpp-2013-06-30-21:45:03-23471' exited with 1
> Jun 30 21:45:03 t14d1 abrtd: Corrupted or bad directory
> /var/spool/abrt/ccpp-2013-06-30-21:45:03-23471, deleting
> 
> --
> Carissa Klemmer
> NCEP Central Operations
> Production Management Branch Dataflow Team
> 301-683-3835
> 
> 
> 
> More info left off the ticket.
> 
> We are running on RHEL 6.3
> LDM version - 6.11.1

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: BPI-711373
Department: Support LDM
Priority: Normal
Status: Closed