
[LDM #SAE-848662]: Fwd: LDM dies after couple days, can't restart



corepuncher,

> Hi thanks for taking my question.
> 
> I have a machine where LDM runs well, but only for a day or two. Then, it
> suddenly shuts off. Well...seemingly. There is no "pqact" or "noaaportinge"
> when I run "top", and data is not flowing.

The best ways to determine if data is flowing are "ldmadmin watch" and 
"notifyme -vl-".

> Just happened a few minutes ago. So I try to do an "ldm clean", and it says
> The LDM system is running, and to stop it first.
> 
> So I do ldmadmin stop, and I just get a perpetual:
> 
> Stopping the LDM server...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...
> Waiting for the LDM server to terminate...

It can take a while to stop an LDM system. If it doesn't stop within a minute, 
however, then something's wrong.

> So ^C to stop it.
> 
> Here is the last thing shown in ldmd.log:
> 
> Mar  2 12:00:43 newton noaaportIngester[15971] ERROR: [gb22gem.c:74] [GB 1] 
> Couldn't get parameter values
> Mar  2 12:00:43 newton noaaportIngester[15971] ERROR: [gb2param.c:89] [GB -1] 
> Couldn't get parameter info: disc=0, cat=16, id=3, pdtn=0
> Mar  2 12:00:43 newton noaaportIngester[15971] ERROR: [gb22gem.c:74] [GB 1] 
> Couldn't get parameter values
> Mar  2 12:00:43 newton noaaportIngester[15971] WARN: Gap in packet sequence: 
> 1052210005 to 1052210549 [skipped 543]
> Mar  2 12:00:43 newton noaaportIngester[15971] ERROR: Missing fragment in 
> sequence, last 565/66075757 this 1109/66075757
> Mar  2 12:00:43 newton noaaportIngester[15971] WARN: Gap in packet sequence: 
> 1052210549 to 1052214590 [skipped 4040]
> Mar  2 12:00:43 newton noaaportIngester[15971] WARN: Gap in packet sequence: 
> 1052214590 to 1052214802 [skipped 211]

Aside from missing some GEMPAK GRIB2 table entries, this looks normal.

> I did look at the "ldm pid" file, and found the number.  Then, I went
> into TOP, and although I could not see it, I did a kill on that pid,
> and it worked!

A SIGINT sent to the top-level LDM server should stop the system quickly -- at 
the risk of corrupting the product-queue.
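
For illustration only, here is a minimal Python sketch of sending that signal 
by hand. It assumes the PID of the top-level LDM server is recorded in a PID 
file; the file's location is not given above, so it is passed in as a 
parameter, and the function names are hypothetical.

```python
import os
import signal


def parse_pid(text):
    """Return the integer PID found at the start of a PID file's contents."""
    return int(text.split()[0])


def interrupt_from_pid_file(pid_file):
    """Send SIGINT to the process whose PID is stored in pid_file.

    pid_file is an assumed path supplied by the caller; nothing here
    knows where a given LDM installation keeps its PID file.
    """
    with open(pid_file) as f:
        pid = parse_pid(f.read())
    os.kill(pid, signal.SIGINT)  # same signal as a ^C at the terminal
    return pid
```

Calling interrupt_from_pid_file() with the path to the PID file, as the LDM 
user, would deliver the signal. The caveat about the product-queue applies 
equally here.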

> So that gets it to restart, but doesn't explain why it stops suddenly.
> The crazy part is, I have another server, so 2 cords coming from Novra
> receiver.  The other machine never has this issue...so it must be a
> software issue?
> 
> From address@hidden  Mon Mar  2 11:38:29 2015
> 
> Actually, I take that back.  Even though it "seemed" to start after killing
> that PID listed in the file:
> 
> The product-queue is OK.
> Checking pqact(1) configuration-file(s)...
> /home/ldm/etc/pqact.conf: syntactically correct
> etc/pqact.gempak: syntactically correct
> etc/pqact.grlevelx: syntactically correct
> Checking LDM configuration-file (/home/ldm/etc/ldmd.conf)...
> Starting the LDM server...
> 
> Again, there is no pqact or noaaportinge process running under top. So
> alas, only thing I can do is reboot.
> 
> The log, after getting a "fake" ldm start, shows this:
> 
> Mar  2 12:35:37 pqact[518] NOTE: Starting from insertion-time 2015-03-02 
> 18:01:12.401276 UTC
> Mar  2 12:35:37 noaaportIngester[520] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[520] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1201
> Mar  2 12:35:37 noaaportIngester[520] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 noaaportIngester[521] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[521] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1202
> Mar  2 12:35:37 noaaportIngester[521] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 noaaportIngester[523] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[523] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1204
> Mar  2 12:35:37 noaaportIngester[523] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 noaaportIngester[522] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[522] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1203
> Mar  2 12:35:37 noaaportIngester[522] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 ldmd[516] NOTE: child 520 exited with status 1: 
> noaaportIngester -m 224.0.1.1 -I 10.0.0.3
> Mar  2 12:35:37 noaaportIngester[524] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[524] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1205
> Mar  2 12:35:37 noaaportIngester[524] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 ldmd[516] NOTE: child 521 exited with status 1: 
> noaaportIngester -m 224.0.1.2 -I 10.0.0.3
> Mar  2 12:35:37 ldmd[516] NOTE: child 522 exited with status 1: 
> noaaportIngester -m 224.0.1.3 -I 10.0.0.3
> Mar  2 12:35:37 noaaportIngester[525] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[525] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1206
> Mar  2 12:35:37 ldmd[516] NOTE: child 523 exited with status 1: 
> noaaportIngester -m 224.0.1.4 -I 10.0.0.3
> Mar  2 12:35:37 noaaportIngester[525] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 noaaportIngester[526] ERROR: Address already in use
> Mar  2 12:35:37 noaaportIngester[526] ERROR: [multicastReader.c:97] Couldn't 
> bind to port 1207
> Mar  2 12:35:37 noaaportIngester[526] ERROR: [noaaportIngester.c:340] 
> Couldn't create multicast-reader
> Mar  2 12:35:37 pqact[527] NOTE: Starting Up
> Mar  2 12:35:37 ldmd[516] NOTE: child 524 exited with status 1: 
> noaaportIngester -m 224.0.1.5 -I 10.0.0.3
> Mar  2 12:35:37 ldmd[516] NOTE: child 525 exited with status 1: 
> noaaportIngester -m 224.0.1.6 -I 10.0.0.3
> Mar  2 12:35:37 pqact[528] NOTE: Starting Up
> Mar  2 12:35:37 ldmd[516] NOTE: child 526 exited with status 1: 
> noaaportIngester -m 224.0.1.7 -I 10.0.0.3
> Mar  2 12:35:37 pqact[528] NOTE: Starting from insertion-time 2015-03-02 
> 18:01:12.401276 UTC
> Mar  2 12:35:37 pqact[527] NOTE: Starting from insertion-time 2015-03-02 
> 18:01:12.401276 UTC

I suspect that you still have noaaportIngester(1) processes running.
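
One way to check is to look through the process table for leftover ingesters. 
The following is a rough Python sketch, not an LDM utility: the function names 
are made up for illustration, and it assumes a ps(1) that accepts the POSIX 
"-e" and "-o" options.

```python
import subprocess


def lingering(ps_output, name="noaaportIngester"):
    """Return (pid, command) pairs from ps output whose command mentions name.

    ps_output is expected to hold one process per line in the form
    "<pid> <command line>", as printed by `ps -e -o pid= -o args=`.
    """
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit() and name in parts[1]:
            found.append((int(parts[0]), parts[1].strip()))
    return found


def list_lingering(name="noaaportIngester"):
    """Run ps and return the processes whose command line mentions name."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return lingering(result.stdout, name)
```

If list_lingering() returns anything while the LDM is supposedly stopped, 
those processes would still be holding ports 1201 through 1207, which would 
account for the "Address already in use" errors above.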

Would it be possible for me to log onto the system in question as the LDM user?

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: SAE-848662
Department: Support LDM
Priority: Normal
Status: Closed