
[LDM #QPB-559054]: LDM 6.13.0 crashed on noaaport1/noaaport2.cod.edu simultaneously



Gilbert,

Assuming the file "/dev/shm/ldm.pq" is your LDM product-queue, did it
exist after the crash? I'm trying to determine whether it was truly deleted
by someone or something.
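
A minimal sketch of that check, assuming the queue path is the one named in the pqmon error quoted below. The helper name `pq_exists` is hypothetical and is not part of the LDM; it only tests whether the file is present.

```shell
# Hypothetical helper (not an LDM command): report whether the
# product-queue file exists. The default path is taken from the pqmon
# error in the quoted report; pass a different path as the first argument.
pq_exists() {
    pq_path="${1:-/dev/shm/ldm.pq}"
    if [ -e "$pq_path" ]; then
        echo "present: $pq_path"
        return 0
    else
        echo "missing: $pq_path"
        return 1
    fi
}
```

Running `pq_exists` as the LDM user right after a crash, before restarting the LDM, would show whether the file was still there; `ls -l /dev/shm` would additionally show its owner and modification time.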

> Hello Steve,
> 
> I am filing this bug report on behalf of the College of DuPage, so that
> this gets seen and reported.
> 
> Late last night, LDM 6.13.0 crashed on both noaaport1 and
> noaaport2.cod.edu at the exact same time. This has happened
> before, but with the more verbose logging on 6.13.0,
> I hope this helps you.
> 
> No core file was dumped; all we have is this:
> 
> noaaport1:~/var/logs> more ldmd.log.1
> 20160507T053933.855495Z climate.cod.edu(feed)[55695] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T053933.856055Z ldmd[54126] NOTE ldmd.c:168:reap() child 55695 exited with status 3
> 20160507T054041.326482Z climate.cod.edu(feed)[56294] NOTE  up6.c:445:up6_run() Starting Up(6.13.0/6): 20160507050605.301849 TS_ENDT  {{NOTHER|NGRAPH|NGRID|NIMAGE|WMO, ".*"}},  SIG=de84ecc6aa5f8753274343653df0646b, Primary
> 20160507T054041.326534Z climate.cod.edu(feed)[56294] NOTE up6.c:448:up6_run() topo:  climate.cod.edu {{NOTHER|NGRAPH|NGRID|NIMAGE|WMO, (.*)}}
> 20160507T073120.499459Z ldmd[54126] NOTE ldmd.c:122:reap() child 54128  terminated by signal 6: noaaportIngester -m 224.0.1.1
> 20160507T073120.499491Z ldmd[54126] NOTE ldmd.c:148:reap() Killing  (SIGTERM) process group
> 20160507T073120.501136Z atlas.cod.edu(feed)[54285] NOTE  ldmd.c:185:cleanup() Exiting
> 20160507T073120.503248Z atlas.cod.edu(feed)[54226] NOTE  ldmd.c:185:cleanup() Exiting
> 20160507T073120.507190Z weather.cod.edu(feed)[54214] NOTE  ldmd.c:185:cleanup() Exiting
> 20160507T073120.511172Z rtstats[54138] NOTE rtstats.c:134:cleanup() Exiting
> 20160507T073120.515191Z climate.cod.edu(feed)[56294] NOTE  ldmd.c:185:cleanup() Exiting
> 20160507T073120.515262Z ldmd[54126] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.515304Z ldmd[54126] NOTE ldmd.c:256:cleanup() Terminating process group
> 20160507T073120.539213Z climate.cod.edu(feed)[55696] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.543203Z cdstats.cod.edu(feed)[54761] NOTE ldmd.c:185:cleanup() Exiting
> 20160507T073120.547191Z weather.cod.edu(feed)[54218] NOTE ldmd.c:185:cleanup() Exiting
> 
> As you can see, a benign log file with little activity and then boom!
> Down it goes. But then they also got this alarm message:
> 
> Return-Path: <address@hidden>
> X-Original-To: ldm
> Delivered-To: address@hidden
> Received: by noaaport2.cod.edu (Postfix, from userid 1000)
> id 6688948DE; Thu, 14 Apr 2016 07:30:03 +0000 (UTC)
> From: address@hidden (Cron Daemon)
> To: address@hidden
> Subject: Cron <address@hidden> /bin/bash -l -c '/home/ldm/bin/ldmadmin addmetrics'
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> X-Cron-Env: <SHELL=/bin/sh>
> X-Cron-Env: <HOME=/home/ldm>
> X-Cron-Env: <PATH=/usr/bin:/bin>
> X-Cron-Env: <LOGNAME=ldm>
> Message-Id: <address@hidden>
> Date: Thu, 14 Apr 2016 07:30:03 +0000 (UTC)
> 
> 20160414T073002.135546Z pqmon[1073] ERROR pqmon.c:main():337 pq_open failed: /dev/shm/ldm.pq: No such file or directory
> 
> Note: AllisonHouse.com's satellite ingester, also running 6.13.0,
> did NOT have this same issue.
> 
> It's as if the product queue file suddenly disappeared.
> Nothing in crontab scours anything like it out; see below:
> 
> #
> # m h  dom mon dow   command
> # LDM Metrics.
> */5 * * * * /bin/bash -l -c '/home/ldm/bin/ldmadmin addmetrics'
> 0 3 * * * /bin/bash -l -c '/home/ldm/bin/ldmadmin newmetrics'
> #
> # Check on the LDM, make sure it is running.
> #1,16,31,46 * * * * /bin/bash -l -c '/home/ldm/bin/ldmadmin check >/dev/null'
> #*/30 * * * * /bin/bash -l -c 'bin/ldmadmin check >check.log 2>&1' || /usr/bin/mail -s '"ldmadmin check" problem LDM may not be running on noaaport.cod.edu' address@hidden
> #1 0,6,12,18 * * * /bin/bash -l -c 'bin/ldmadmin check >check.log 2>&1' || /usr/bin/mail -s '"ldmadmin check" problem LDM may not be running on noaaport.cod.edu' address@hidden
> 
> # Rotate logs.
> 20 19 * * * bash -l -c 'ldmadmin newlog'
> 
> # NOAAport signal check.
> #1 0,6,12,18 * * * /bin/bash -l -c 'wasReceived -f "WMO|NIMAGE|NEXRAD3" -o 180' || /usr/bin/mail -s 'NOAAPORT data has not been received in the last 3 minutes via the dish' address@hidden
> 1,31 * * * * /bin/bash -l -c 'wasReceived -f "WMO|NIMAGE|NEXRAD3" -o 180' || /usr/bin/mail -s 'NOAAPORT data has not been received in the last 3 minutes via the dish' address@hidden
> 5,35 * * * * /bin/bash -l -c 'wasReceived -f "NGRID" -o 1800' || /usr/bin/mail -s 'NOAAPORT NGRID data has not been received in the last 30 minutes via the dish' address@hidden
> 
> (Shrug)

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: QPB-559054
Department: Support LDM
Priority: Normal
Status: Closed


NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.