
[LDM #BPI-711373]: LDM segfaulted



Carissa,

It would help if I could get a dump of the stack from the core-file, 
/var/spool/abrt/ccpp-2013-06-30-21:45:03-23471. Can you get that to me?
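
One possible way to produce such a dump is sketched below. It assumes gdb is 
installed on the host and that the core file is still present; the quoted log 
suggests the abrt path is a directory, so the core file would be a file inside 
it. The helper name dump_stack is made up for illustration.

```shell
# Sketch only: print a backtrace for every thread recorded in a core file.
# dump_stack is a hypothetical helper name; gdb must be installed.
dump_stack() {
    exe=$1    # executable that produced the core
    core=$2   # core file to inspect
    if [ ! -f "$core" ] || [ ! -r "$core" ]; then
        echo "not a readable core file: $core" >&2
        return 1
    fi
    # -batch makes gdb exit after running the -ex command.
    gdb -batch -ex "thread apply all bt" "$exe" "$core"
}

# Example call; set CORE to the core file inside the abrt directory:
# dump_stack /gpfs/tmv/iodprod/dbnet/ldm/ldm-6.11.1/bin/ldmd "$CORE" > ldmd-backtrace.txt
```

The backtrace is most useful if the ldmd binary passed in is the same build 
that crashed.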

Regardless, I recommend upgrading to the latest version of the LDM (6.11.6). It 
does not appear to have the bug that caused LDM version 6.11.4 to receive a 
SIGSEGV on an RHEL system.

The consistency of the product-queue can be checked via the pqcheck(1) utility 
(if no other process has the product-queue open for writing) and the pqcat(1) 
utility (be sure to redirect the standard output stream to /dev/null).
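
The two checks above could be run together roughly as follows. This is a 
sketch, not a definitive procedure: check_queue is a hypothetical helper, the 
queue path is a placeholder, and the -q option is assumed to name the 
product-queue file for both utilities (consult pqcheck(1) and pqcat(1)).

```shell
# Sketch only: run the two checks named above against one product-queue.
# check_queue is a hypothetical helper; -q is assumed to select the queue file.
check_queue() {
    queue=$1
    if [ ! -f "$queue" ]; then
        echo "no such product-queue: $queue" >&2
        return 2
    fi
    # Run only while nothing has the queue open for writing.
    pqcheck -q "$queue" || { echo "pqcheck reported a problem" >&2; return 1; }
    # Read every product; discard the data, keep only the exit status.
    pqcat -q "$queue" > /dev/null || { echo "pqcat failed to read the queue" >&2; return 1; }
    echo "queue looks consistent: $queue"
}

# Example (queue path is a placeholder):
# check_queue /path/to/ldm.pq
```

A non-zero return from either utility is treated here simply as "needs a 
closer look"; the sketch does not interpret individual exit codes.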

> Unidata,
> 
> We experienced a corrupt queue on 1 of our 4 supercomputer LDM feeds. We
> have gone to the admins who say there was no system issue around that
> time period and have pointed us back to the data being the issue.
> According to our admins the core dump indicates that LDM segfaulted at
> 21:45:03 (see below):
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    98963
> 20130630214502.648 NEXRAD2 293005
> L2-BZIP2/KBUF/20130630214449/293/5/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    53099
> 20130630214501.925 NEXRAD2 87045  L2-BZIP2/KILX/20130630214256/87/45/I/V06/0
> 
> We have this same LDM feed on all 4 systems, and only 1 had an issue. Our
> main question is: is there any way to tell whether the corrupt queue was
> data related or system related? I do notice that the core dump was put
> into a root directory, not the LDM home directory as happens when we have
> decoder issues. Do you folks see any evidence of what might have caused
> this issue?
> 
> The log output is below.
> 
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    98963
> 20130630214502.648 NEXRAD2 293005
> L2-BZIP2/KBUF/20130630214449/293/5/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    53099
> 20130630214501.925 NEXRAD2 87045  L2-BZIP2/KILX/20130630214256/87/45/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23471] INFO:    37749
> 20130630214503.004 NEXRAD2 550057
> L2-BZIP2/KHTX/20130630214031/550/57/I/V06/0
> Jun 30 21:45:03 t14d1 pqact[23461] INFO: [filel.c:297] Deleting closed
> FILE entry
> "/dcom/us007003/ldmdata/obs/upperair/nexrad_level2/KCBW/KCBW_20130630_214401.bz2"
> Jun 30 21:45:03 t14d1 kernel: ldmd[23471]: segfault at 2adea9675818 ip
> 00002ad5ac00be2a sp 00007fff7dc37400 error 4 in
> libldm.so.0.0.0[2ad5abff2000+52000]
> Jun 30 21:45:03 t14d1 sshd[18728]: Accepted publickey for dbnet from
> 140.90.100.184 port 52885 ssh2
> Jun 30 21:45:03 t14d1 sshd[18728]: pam_unix(sshd:session): session
> opened for user dbnet by (uid=0)
> Jun 30 21:45:03 t14d1 sshd[18795]: Accepted publickey for dbnet from
> 140.90.100.184 port 52886 ssh2
> Jun 30 21:45:03 t14d1 sshd[18795]: pam_unix(sshd:session): session
> opened for user dbnet by (uid=0)
> Jun 30 21:45:03 t14d1 abrt[18793]: Saved core dump of pid 23471
> (/gpfs/tmv/iodprod/dbnet/ldm/ldm-6.11.1/bin/ldmd) to
> /var/spool/abrt/ccpp-2013-06-30-21:45:03-23471 (1835008 bytes)
> Jun 30 21:45:03 t14d1 abrtd: Directory 'ccpp-2013-06-30-21:45:03-23471'
> creation detected
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: child 23471 terminated by signal 11
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: Killing (SIGTERM) process group
> Jun 30 21:45:03 t14d1 t10d2p.ncep.noaa.gov(feed)[20338] NOTE: Exiting
> Jun 30 21:45:03 t14d1 t14d2p.ncep.noaa.gov(feed)[26905] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: Exiting
> Jun 30 21:45:03 t14d1 outreach.aviationweather.noaa.go[23477] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23469] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23461] ERROR: fcntl F_RDLCK failed for rgn
> (0 SEEK_SET, 4096) 4: Interrupted system call
> Jun 30 21:45:03 t14d1 pqact[23461] NOTE: Exiting
> Jun 30 21:45:03 t14d1 205.156.51.46[23473] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] ERROR: fcntl F_RDLCK failed for rgn
> (0 SEEK_SET, 4096) 4: Interrupted system call
> Jun 30 21:45:03 t14d1 pqact[23460] ERROR: pq_sequence failed:
> Interrupted system call (errno = 4)
> Jun 30 21:45:03 t14d1 pqact[23460] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=11023,
> cmd="/nwprod/exec/decod_dcbthy -v 2 -t 480 -d
> /dcom/us007003/decoder_logsdecod_dcbthy.log /nwprod/fix/bufrtab.031"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26175,
> cmd="/nwprod/exec/decod_dcacar -v 2 -t 600 -d
> /dcom/us007003/decoder_logsdecod_dcacar.log
> /nwprod/fix/bufrtab.ARINC_ACARS /nwprod/fix/bufrtab.EUROPE_ACARS
> /nwprod/fix/bufrtab.CANADA_ACARS /nwprod/fix/bufrtab.FRANCE_ACARS
> /nwprod/fix/bufrtab.004"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26718,
> cmd="/nwprod/exec/decod_dcltng -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dcltng.log /nwprod/fix/bufrtab.007"
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23463] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=12452,
> cmd="/nwprod/exec/decod_dcdrbu -v 2 -t 365 -d
> /dcom/us007003/decoder_logsdecod_dcdrbu.log /nwprod/fix/bufrtab.001"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26112,
> cmd="/nwprod/exec/decod_dcmsfc -v 2 -t 480 -d
> /dcom/us007003/decoder_logsdecod_dcmsfc.log /nwprod/fix/bufrtab.001
> /nwprod/dictionaries/msfc.tbl /nwprod/dictionaries/tidg.tbl
> /nwprod/parm/decod_restricted.ship.headers"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=12552,
> cmd="/nwprod/exec/decod_dclsfc -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dclsfc.log /nwprod/fix/bufrtab.000
> /nwprod/dictionaries/lsfc.tbl /nwprod/parm/decod_WMO.Res40.headers"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26226,
> cmd="/nwprod/exec/decod_dcacft -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dcacft.log
> /nwprod/dictionaries/pirep.tbl /nwprod/dictionaries/airep.tbl
> /nwprod/fix/bufrtab.004"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26215,
> cmd="/nwprod/exec/decod_dcshef -v 2 -t 450 -d
> /dcom/us007003/decoder_logsdecod_dcshef.log /nwprod/parm/SHEFPARM
> /nwprod/dictionaries/shef.tbl /nwprod/fix/bufrtab.000
> /nwprod/fix/bufrtab.001 /nwprod/fix/bufrtab.255 A.E.G.P.R.S.T.U.X."
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=26107,
> cmd="/nwprod/exec/decod_dcmetr -v 2 -t 300 -d
> /dcom/us007003/decoder_logsdecod_dcmetr.log /nwprod/fix/bufrtab.000
> /nwprod/dictionaries/metar.tbl"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=7641, cmd="/nwprod/exec/decod_dcears
> -v 2 -t 450 -d /dcom/us007003/decoder_logs/ecod_dcears.log
> /nwprod/fix/bufrtab.EARS /nwprod/fix/bufrtab.021"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=1068, cmd="/nwprod/exec/decod_dcrocc
> -v 2 -t 600 -d /dcom/us007003/decoder_logs/ecod_dcrocc.log
> /nwprod/fix/bufrtab.003"
> Jun 30 21:45:03 t14d1 pqact[23460] INFO: [filel.c:295] Deleting
> least-recently-used PIPE entry: pid=25990,
> cmd="/nwprod/exec/decod_dcepfl -v 2 -t 450 -d
> /dcom/us007003/decoder_logsdecod_dcepfl.log
> /nwprod/fix/bufrtab.EUROPE_PROFILER /nwprod/fix/bufrtab.002"
> Jun 30 21:45:03 t14d1 205.156.51.46[23470] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23465] NOTE: Exiting
> Jun 30 21:45:03 t14d1 eldm.fsl.noaa.gov[23466] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldm.madis-data.noaa.gov[23462] NOTE: Exiting
> Jun 30 21:45:03 t14d1 140.90.85.102[23475] NOTE: Exiting
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: Terminating process group
> Jun 30 21:45:03 t14d1 205.156.51.46[23472] INFO:    52428
> 20130630214503.125 NEXRAD2 514008
> L2-BZIP2/KRAX/20130630214439/514/8/I/V06/0
> Jun 30 21:45:03 t14d1 205.156.51.46[23472] NOTE: Exiting
> Jun 30 21:45:03 t14d1 pqact[23460] NOTE: Behind by 0.219798 s
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: [uldb.c:1298] Entry for PID
> 23455 not found
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: [uldb.c:1909] Couldn't remove
> process from database
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 26905 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 20338 exited with status 0
> Jun 30 21:45:03 t14d1 pqact[23461] NOTE: Behind by 0.442305 s
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23477 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23469 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23473 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] NOTE: child 23460 exited with status
> 1: pqact -f ANY-CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.conf
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23461 exited with status
> 0: pqact -f CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.craft
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23463 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23470 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23466 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23465 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23475 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23462 exited with status 0
> Jun 30 21:45:03 t14d1 ldmd[23455] INFO: child 23472 exited with status 0
> Jun 30 21:45:03 t14d1 abrtd: Executable
> '/gpfs/tmv/iodprod/dbnet/ldm/ldm-6.11.1/bin/ldmd' doesn't belong to any
> package
> Jun 30 21:45:03 t14d1 abrtd: 'post-create' on
> '/var/spool/abrt/ccpp-2013-06-30-21:45:03-23471' exited with 1
> Jun 30 21:45:03 t14d1 abrtd: Corrupted or bad directory
> /var/spool/abrt/ccpp-2013-06-30-21:45:03-23471, deleting
> 
> --
> Carissa Klemmer
> NCEP Central Operations
> Production Management Branch Dataflow Team
> 301-683-3835
> 
> 
> 
> More info left off the ticket.
> 
> We are running on RHEL 6.3
> LDM version - 6.11.1

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: BPI-711373
Department: Support LDM
Priority: Normal
Status: Closed

