
[LDM #QPB-559054]: LDM 6.13.0 crashed on noaaport1/noaaport2.cod.edu simultaneously



Gilbert,

Assuming the file "/dev/shm/ldm.pq" is your LDM product-queue, did it
exist after the crash? I'm trying to determine whether it was truly deleted
by someone or something.
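
A minimal sketch of one way to answer that question, assuming a POSIX shell and that the queue lives at the path shown in the pqmon error. The function name queue_exists is made up for illustration and is not part of the LDM:

```shell
#!/bin/sh
# Hypothetical helper (not an LDM utility): report whether a
# product-queue file is present. The default path is the one named in
# the pqmon error message; pass a different path as the first argument.
queue_exists() {
    q="${1:-/dev/shm/ldm.pq}"
    if [ -e "$q" ]; then
        echo "present: $q"
        # Show owner, size, and modification time for the record.
        ls -l "$q"
        return 0
    else
        echo "missing: $q"
        return 1
    fi
}
```

Calling queue_exists as the LDM user right after a crash, before restarting anything, would show whether the file was still there at that moment.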

> Hello Steve,
> 
> I am filing this bug report on behalf of the College of DuPage, so that
> this gets seen and reported.
> 
> Late last night, LDM 6.13.0 crashed on both noaaport1 and
> noaaport2.cod.edu at the exact same time. This has happened
> before, but with the more verbose logging on 6.13.0,
> I hope this helps you.
> 
> No core file was dumped; all we have is this:
> 
> noaaport1:~/var/logs> more ldmd.log.1
> 20160507T053933.855495Z climate.cod.edu(feed)[55695] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T053933.856055Z ldmd[54126] NOTE ldmd.c:168:reap() child 55695 exited with status 3
> 20160507T054041.326482Z climate.cod.edu(feed)[56294] NOTE up6.c:445:up6_run() Starting Up(6.13.0/6): 20160507050605.301849 TS_ENDT {{NOTHER|NGRAPH|NGRID|NIMAGE|WMO, ".*"}}, SIG=de84ecc6aa5f8753274343653df0646b, Primary
> 20160507T054041.326534Z climate.cod.edu(feed)[56294] NOTE up6.c:448:up6_run() topo:  climate.cod.edu {{NOTHER|NGRAPH|NGRID|NIMAGE|WMO, (.*)}}
> 20160507T073120.499459Z ldmd[54126] NOTE ldmd.c:122:reap() child 54128 terminated by signal 6: noaaportIngester -m 224.0.1.1
> 20160507T073120.499491Z ldmd[54126] NOTE ldmd.c:148:reap() Killing  (SIGTERM) process group
> 20160507T073120.501136Z atlas.cod.edu(feed)[54285] NOTE  ldmd.c:185:cleanup() Exiting
> 20160507T073120.503248Z atlas.cod.edu(feed)[54226] NOTE  ldmd.c:185:cleanup() Exiting
> 20160507T073120.507190Z weather.cod.edu(feed)[54214] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.511172Z rtstats[54138] NOTE rtstats.c:134:cleanup() Exiting
> 20160507T073120.515191Z climate.cod.edu(feed)[56294] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.515262Z ldmd[54126] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.515304Z ldmd[54126] NOTE ldmd.c:256:cleanup() Terminating process group
> 20160507T073120.539213Z climate.cod.edu(feed)[55696] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.543203Z cdstats.cod.edu(feed)[54761] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.547191Z weather.cod.edu(feed)[54218] NOTE ldmd.c:185:cleanup() Exiting
> 
> As you can see, a benign log file with little activity and then boom!
> Down it goes. But then they also got this alarm message:
> 
> Return-Path: <address@hidden>
> X-Original-To: ldm
> Delivered-To: address@hidden
> Received: by noaaport2.cod.edu (Postfix, from userid 1000)
> id 6688948DE; Thu, 14 Apr 2016 07:30:03 +0000 (UTC)
> From: address@hidden (Cron Daemon)
> To: address@hidden
> Subject: Cron <ldm@noaaport2> /bin/bash -l -c '/home/ldm/bin/ldmadmin 
> addmetrics'
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> X-Cron-Env: <SHELL=/bin/sh>
> X-Cron-Env: <HOME=/home/ldm>
> X-Cron-Env: <PATH=/usr/bin:/bin>
> X-Cron-Env: <LOGNAME=ldm>
> Message-Id: <address@hidden>
> Date: Thu, 14 Apr 2016 07:30:03 +0000 (UTC)
> 
> 20160414T073002.135546Z pqmon[1073] ERROR pqmon.c:main():337 pq_open failed: /dev/shm/ldm.pq: No such file or directory
> 
> Note: AllisonHouse.com's satellite ingester, also running 6.13.0,
> did NOT have this same issue.
> 
> It's as if the product queue file suddenly disappeared.
> Nothing in crontab scours anything like it out; see below:
> 
> #
> # m h  dom mon dow   command
> # LDM Metrics.
> */5 * * * * /bin/bash -l -c '/home/ldm/bin/ldmadmin addmetrics'
> 0 3 * * * /bin/bash -l -c '/home/ldm/bin/ldmadmin newmetrics'
> #
> # Check on the LDM, make sure it is running.
> #1,16,31,46 * * * * /bin/bash -l -c '/home/ldm/bin/ldmadmin check >/dev/null'
> #*/30 * * * * /bin/bash -l -c 'bin/ldmadmin check >check.log 2>&1' || /usr/bin/mail -s '"ldmadmin check" problem LDM may not be running on noaaport.cod.edu' address@hidden
> #1 0,6,12,18 * * * /bin/bash -l -c 'bin/ldmadmin check >check.log 2>&1' || /usr/bin/mail -s '"ldmadmin check" problem LDM may not be running on noaaport.cod.edu' address@hidden
> 
> # Rotate logs.
> 20 19 * * * bash -l -c 'ldmadmin newlog'
> 
> # NOAAport signal check.
> #1 0,6,12,18 * * * /bin/bash -l -c 'wasReceived -f "WMO|NIMAGE|NEXRAD3" -o 180' || /usr/bin/mail -s 'NOAAPORT data has not been received in the last 3 minutes via the dish' address@hidden
> 1,31 * * * * /bin/bash -l -c 'wasReceived -f "WMO|NIMAGE|NEXRAD3" -o 180' || /usr/bin/mail -s 'NOAAPORT data has not been received in the last 3 minutes via the dish' address@hidden
> 5,35 * * * * /bin/bash -l -c 'wasReceived -f "NGRID" -o 1800' || /usr/bin/mail -s 'NOAAPORT NGRID data has not been received in the last 30 minutes via the dish' address@hidden
> 
> (Shrug)

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: QPB-559054
Department: Support LDM
Priority: Normal
Status: Closed