
[LDM #QZZ-219946]: LDM product queue vanishes



Mike,

Does your operating system run the out-of-memory (OOM) utility?
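
One way to check is to scan the kernel log for OOM-killer messages around the times of the incidents. The following is only a rough sketch: the function name `find_oom_events` is made up for illustration, and the message patterns are assumptions based on common kernel wording, which varies between kernel versions.

```python
import re

# Patterns for kernel OOM-killer messages. The exact wording differs between
# kernel versions, so this list is an assumption to adjust for your system.
OOM_PATTERNS = [
    re.compile(r"invoked oom-killer"),
    re.compile(r"Out of memory: Kill(ed)? process \d+ \([^)]+\)"),
]


def find_oom_events(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in OOM_PATTERNS)
    ]
```

For example, the output of `journalctl -k --no-pager` could be piped into a small script that calls `find_oom_events(sys.stdin)` and prints whatever it returns. An empty result for the period around an incident would suggest the OOM killer was not involved.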

> I have an issue where the LDM product queue will occasionally and
> quite mysteriously up and vanish.  When this happens LDM will continue
> to run, but the flow of data stops for obvious reasons.
> 
> This has happened to us in the past on our NOAAPort ingest servers
> (and not on any others), and we've never found any reason for it
> there.  But as I'm setting up a new server to visualize GOES-16 data,
> it's happening here.  The good news is I've been able to collect a bit
> more evidence, so I wanted to bring it to you.
> 
> I wrote a script that runs every 5 minutes to check the status of LDM
> (ldmadmin isrunning) and the product queue (pqcheck), and if there's a
> problem with either it remakes the queue and restarts LDM.  It also
> does a bit of logging for me, so I know when it saw a problem and what
> the problem was.  I started running this check yesterday, and since
> then it has logged that the product queue had vanished twice (2017-10-26
> 22:15Z and 2017-10-27 12:05Z).  I peered into journalctl and looked
> for things that happened just prior and found something; log snippets
> are below.
> 
> Here's my interpretation:  Both times I logged off the system as user
> ldm, and immediately thereafter some action was taken that killed a
> process.  The next time my monitor script ran, it found the product
> queue missing.  So my logging off triggered something?  I understand
> the system essentially goes through a routine upon user log-off, but
> I'm not sure how to tie that to the queue removal.  I'm also not sure
> if/how this relates to the similar issue on our NOAAPort servers,
> where I doubt ssh user log-off is the trigger.  All I know is because
> of the timing of both of these incidents, I highly doubt it could be a
> coincidence.
> 
> The first time I logged off, I had su'd to user ldm from another user,
> so I simply exited that session.  The second time I had ssh'd from
> home into another server, and from there into this one as user ldm, so
> that's another difference if it matters.  I can't seem to recreate
> this reliably by logging on and off; it appears to happen randomly.
> 
> This server runs Ubuntu 16.04.3 LTS, kernel version 4.4.0-97-generic,
> LDM version 6.13.6.  Nothing appears in the ldmd.log file; the product
> queue gets removed very quietly.  UID 1001 from the log below is user
> ldm.
> 
> journalctl log snippet from first incident:
> Oct 26 22:13:21 wxgoes sshd[6309]: pam_unix(sshd:session): session closed for user ldm
> Oct 26 22:13:21 wxgoes sshd[30437]: pam_unix(sshd:session): session closed for user ldm
> Oct 26 22:13:21 wxgoes systemd-logind[1925]: Removed session 8598.
> Oct 26 22:13:21 wxgoes systemd-logind[1925]: Removed session 8606.
> Oct 26 22:13:21 wxgoes systemd[1]: Stopping User Manager for UID 1001...
> Oct 26 22:13:21 wxgoes systemd[6981]: Reached target Shutdown.
> Oct 26 22:13:21 wxgoes systemd[6981]: Starting Exit the Session...
> Oct 26 22:13:21 wxgoes systemd[6981]: Stopped target Default.
> Oct 26 22:13:21 wxgoes systemd[6981]: Stopped target Basic System.
> Oct 26 22:13:21 wxgoes systemd[6981]: Stopped target Timers.
> Oct 26 22:13:21 wxgoes systemd[6981]: Stopped target Sockets.
> Oct 26 22:13:21 wxgoes systemd[6981]: Stopped target Paths.
> Oct 26 22:13:21 wxgoes systemd[6981]: Received SIGRTMIN+24 from PID 16791 (kill).
> Oct 26 22:13:21 wxgoes systemd[6983]: pam_unix(systemd-user:session): session closed for user ldm
> Oct 26 22:13:21 wxgoes systemd[1]: Stopped User Manager for UID 1001.
> Oct 26 22:13:21 wxgoes systemd[1]: Removed slice User Slice of ldm.
> 
> And from the second:
> Oct 27 12:03:24 wxgoes sshd[28141]: Received disconnect from 10.11.0.64 port 55338:11: disconnected by user
> Oct 27 12:03:24 wxgoes sshd[28141]: Disconnected from 10.11.0.64 port 55338
> Oct 27 12:03:24 wxgoes sshd[27975]: pam_unix(sshd:session): session closed for user ldm
> Oct 27 12:03:24 wxgoes systemd-logind[1925]: Removed session 10148.
> Oct 27 12:03:24 wxgoes systemd[1]: Stopping User Manager for UID 1001...
> Oct 27 12:03:24 wxgoes systemd[27986]: Reached target Shutdown.
> Oct 27 12:03:24 wxgoes systemd[27986]: Starting Exit the Session...
> Oct 27 12:03:24 wxgoes systemd[27986]: Stopped target Default.
> Oct 27 12:03:24 wxgoes systemd[27986]: Stopped target Basic System.
> Oct 27 12:03:24 wxgoes systemd[27986]: Stopped target Timers.
> Oct 27 12:03:24 wxgoes systemd[27986]: Stopped target Sockets.
> Oct 27 12:03:24 wxgoes systemd[27986]: Stopped target Paths.
> Oct 27 12:03:24 wxgoes systemd[27986]: Received SIGRTMIN+24 from PID 46395 (kill).
> Oct 27 12:03:24 wxgoes systemd[1]: Stopped User Manager for UID 1001.
> Oct 27 12:03:24 wxgoes systemd[1]: Removed slice User Slice of ldm.
> 
> Any thoughts?  Let me know if you need more information.
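
For readers following along, the five-minute check described in the quoted message might look roughly like the sketch below. It is not Mike's actual script. The names `check_once`, `run_command`, `CHECKS`, and `RECOVERY` are hypothetical, the recovery command sequence is an assumption that should be verified against your LDM version, and a non-zero exit status from `ldmadmin isrunning` or `pqcheck` is assumed to indicate a problem.

```python
import subprocess
from datetime import datetime, timezone

# The two checks named in the message: LDM status and product-queue status.
CHECKS = [
    ("ldm", ["ldmadmin", "isrunning"]),
    ("product queue", ["pqcheck"]),
]

# Assumed recovery sequence (remake the queue, restart LDM); verify these
# subcommands against your LDM installation before relying on them.
RECOVERY = [
    ["ldmadmin", "stop"],
    ["ldmadmin", "delqueue"],
    ["ldmadmin", "mkqueue"],
    ["ldmadmin", "start"],
]


def run_command(argv):
    """Run a command quietly and return its exit status (127 if not found)."""
    try:
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return 127
    return result.returncode


def check_once(run=run_command, log=print):
    """Run every check; on any failure, log it and run the recovery steps.

    Returns the names of the checks that failed (empty list if all passed).
    """
    failed = [name for name, argv in CHECKS if run(argv) != 0]
    if failed:
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%MZ")
        log(f"{stamp} problem with: {', '.join(failed)}")
        for argv in RECOVERY:
            run(argv)
    return failed
```

Calling `check_once()` from a cron job run as user ldm every five minutes would approximate the behavior described. Passing `run` and `log` as parameters keeps the logic testable without a live LDM installation.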

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: QZZ-219946
Department: Support LDM
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.