
[LDM #KGU-479417]: More error messages in LDM-6.13.17



Hi Gilbert,

You don't have enough memory. I found this in your system log file:

Dec  1 15:59:00 noaaport3 kernel: [ 1497.803776] Out of memory: Killed process 3110 (noaaportIngeste) total-vm:9834332kB, anon-rss:20900kB, file-rss:3427480kB, shmem-rss:0kB, UID:1001 pgtables:6796kB oom_score_adj:0

Because the noaaportIngester(1) processes are run by the keep_running(1)
script, they're immediately restarted -- and then killed again. Rinse and repeat.
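
If you want to confirm this yourself, a quick check along these lines should
show the pattern (a minimal sketch; the syslog path is the usual Linux default
and the queue path is the LDM default, so both may differ on your system):

    # Each kernel OOM-killer entry is one ingester being killed:
    grep -i 'out of memory' /var/log/syslog | tail

    # The PIDs change on every keep_running restart:
    pgrep -af noaaportIngester

    # The root cause in one glance: total RAM versus the memory-mapped queue:
    free -m
    ls -lh ~/var/queues/ldm.pq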

--Steve


address@hidden> wrote:

> New Client Reply: More error messages in LDM-6.13.17
>
> Hi Steve,
>
> I have had this running for about 2 hours now on the NOAAport ingest
> servers, and our main relay server. A couple of thoughts:
>
> 1. This version of the LDM compiled without error on the NOAAport servers
> with the --with-noaaport flag, and on our 2 main relay servers with just a
> normal configure command.
> 2. NOAAport also compiles without errors on noaaport3.cod.edu. But after
> I increased the queue from 1 GB to 10 GB before starting the LDM, we had
> serious issues once the LDM started. Within a matter of seconds the
> entire noaaport3 server gets a load average of 20+, and no data comes
> across. It also becomes painfully slow: a simple command such as "w" or
> "whoami" takes minutes to execute (see the diagnostic sketch just below).
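>
> A minimal way to watch what the machine is doing while it hangs (standard
> procps/util-linux tools; nothing here is LDM-specific):
>
>     vmstat 1 5                           # high 'si'/'so' columns mean heavy swapping
>     uptime                               # shows the load average climbing
>     dmesg -T | grep -i 'out of memory'   # any OOM kills so far?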
>
> To remedy this problem, I:
>
> A) Made sure that the registry.xml entries were all correct
> B) Verified that the 10 GB queue size was fine (the partition it is on is
> only 13% full with the queue made). Again, I had increased it from 1 GB
> to 10 GB before I made the queue.
> C) Looked in the LDM log file, where I saw these errors:
>
> ldm@noaaport3:~/var/logs$ more ldmd.log
> 20221201T172938.174332Z ldmd[31809]      ldmd.c:main:1097                NOTE  Starting Up (version: 6.14.1.4; built: Dec  1 2022 17:26:20)
> 20221201T172938.174435Z ldmd[31809]      ldmd.c:create_ldm_tcp_svc:603   NOTE  Using local address 0.0.0.0:388
> 20221201T172938.185091Z rtstats[31810]   rtstats.c:main:352              NOTE  Starting Up (31809)
> 20221201T173023.468685Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31821 terminated by signal 9: noaaportIngester -m 224.1.1.1 -l /home/ldm/var/logs/nwws.log
> 20221201T173055.442988Z ldmd[31809]      ldmd.c:cleanup:296              NOTE  Exiting
> 20221201T173055.777946Z ldmd[31809]      ldmd.c:cleanup:356              NOTE  Terminating process group
> 20221201T173055.887074Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31815 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.5 -l /home/ldm/var/logs/polar-orbiter.log
> 20221201T173055.887108Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31811 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.1 -l /home/ldm/var/logs/nwstg.log
> 20221201T173055.887130Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31812 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.2 -l /home/ldm/var/logs/goes.log
> 20221201T173055.887151Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31813 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.3 -l /home/ldm/var/logs/nwstg2.log
> 20221201T173055.887171Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31814 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.4 -l /home/ldm/var/logs/oconus.log
> 20221201T173055.887192Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31816 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.6 -l /home/ldm/var/logs/nbm.log
> 20221201T173055.887212Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31817 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.7 -l /home/ldm/var/logs/port1207.log
> 20221201T173055.887232Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31818 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.8 -l /home/ldm/var/logs/experimental.log
> 20221201T173055.887252Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31819 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.9 -l /home/ldm/var/logs/goes-west.log
> 20221201T173055.887273Z ldmd[31809]      ldmd.c:reap:224                 NOTE  child 31820 terminated by signal 15: keep_running noaaportIngester -n -m 224.0.1.10 -l /home/ldm/var/logs/goes-east.log
> 20221201T173056.057573Z rtstats[31810]   rtstats.c:cleanup:131           NOTE  Exiting
> 20221201T173058.563378Z uldbutil[32083]  uldb.c:sm_setShmId:1069         NOTE  No such file or directory
> 20221201T173058.563424Z uldbutil[32083]  uldbutil.c:main:98              NOTE  The upstream LDM database doesn't exist. Is the LDM running?
>
> I then switched back to the 1 GB queue I started with on noaaport3,
> instead of 10 GB, and everything ran fine. I then tried a 30 GB queue,
> and the same errors as above happened.
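>
> For reference, a minimal sketch of the resize-and-remake sequence, using
> the standard ldmadmin commands (the queue size is set in etc/registry.xml;
> the sizes are just the ones from this test):
>
>     ldmadmin stop
>     # edit the queue <size> element in etc/registry.xml: 1G, 10G, or 30G
>     ldmadmin delqueue    # delete the old product queue
>     ldmadmin mkqueue     # create a new one at the configured size
>     ldmadmin start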
>
> I then went into /root/syslog and found errors. I have attached the syslog
> file so that you can see them. Start with December 1.
>
> Weather3 has different hardware than weather1.cod.edu, weather2.cod.edu,
> and idd.cod.edu. I am not seeing this issue on those machines, and they
> have BIG queues (22 GB on noaaport1/2, and 75 GB on idd.cod.edu, which is
> not a direct NOAAport ingester).
>
> Here is a sample of "top" on weather3 with the 1 GB queue:
>
> top - 17:53:48 up  1:47,  2 users,  load average: 0.29, 0.35, 1.77
> Tasks: 162 total,   1 running, 161 sleeping,   0 stopped,   0 zombie
> %Cpu(s):   1.0/0.8     2[||                                                  ]
> MiB Mem :   3829.9 total,   1741.9 free,    817.7 used,   1270.2 buff/cache
> MiB Swap:   3851.0 total,   3782.9 free,     68.1 used.   1739.3 avail Mem
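>
> For the record, the same header can be captured non-interactively with
> top's batch mode, which is handy for pasting into a ticket:
>
>     top -b -n 1 | head -5    # one snapshot, summary lines only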
>
> Gilbert Sebenste
> Meteorology Support Analyst
> College of DuPage
>
> -----Original Message-----
> From: Unidata LDM Support <address@hidden>
> Sent: Wednesday, November 30, 2022 4:02 PM
> To: Sebenste, Gilbert <address@hidden>
> Cc: address@hidden
> Subject: [External] [LDM #KGU-479417]: More error messages in LDM-6.13.17
>
> Hi Gilbert,
>
> Try this LDM. We're testing it here and so far, so good. Let me know how
> it goes.
>
> Be sure to read the CHANGE_LOG file.
>
> --Steve



Ticket Details
===================
Ticket ID: KGU-479417
Department: Support LDM
Priority: Emergency
Status: Open
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.