
[LDM #WSJ-190258]: queue size question



John,

> Some portion of the LDM radar and satellite image creation failed again 
> overnight and I'm not sure what the issue is.   The "pqmon" shows the max age 
> at 11004 so that looks better now.
> 
> 20190325T150821.335052Z pqmon[13846] NOTE pqmon.c:358:main()  nprods  nfree  nempty        nbytes  maxprods  maxfree  minempty       maxext    age
> 20190325T150821.335079Z pqmon[13846] NOTE pqmon.c:466:main() 1351711      3       0  123399368448   1351713        6         0  22486561072  11004

Wow! 123 gigabytes! You took what I said and ran with it! :-)

You should be OK -- although with only 24 GB of memory your system will be 
swapping portions of the product-queue in and out continuously. I recommend 
monitoring the LDM system via the "ldmadmin addmetrics" and "ldmadmin 
plotmetrics" facilities. See the documentation for details.

If you can increase the amount of physical memory to, say, 120% of the size of 
the product-queue, that would make the system more efficient. For your 
situation, keeping the last hour's worth of data in the queue would call for 
approximately 44 GB of physical memory.
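
In case it helps to see where a number in that range comes from, here is the 
rough arithmetic (mine, based only on the pqmon numbers above):

    123399368448 bytes / 11004 s     ->  ~10.7 MiB/s sustained ingest
    ~10.7 MiB/s * 3600 s             ->  ~38 GiB for one hour of data
    ~38 GiB * 1.2 (memory headroom)  ->  ~45 GiB, i.e. roughly the 44 GB above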

> I see a lot of these in the logs, but nothing else that stands out to me.
> 20190324T000001.404224Z pqact[22051] WARN filel.c:3016:reap() Child 10270 
> terminated by signal 10

Signal 10 is SIGUSR1 on Linux. The above means that the child process that was 
started by a pqact(1) EXEC entry, and whose process ID was 10270, received a 
SIGUSR1 signal and, consequently, terminated. The LDM system uses this signal 
to tell its various processes to close and re-open their log files, which is 
necessary in order to switch to a new log file when the logs are rotated. 
Unfortunately, this particular child process handled the SIGUSR1 in the default 
manner: by terminating abnormally.

Can you determine what program corresponded to PID 10270?

I consider this a bug in the LDM system and will work on a fix for the next 
release. Thanks for reporting it.
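
In the meantime, if the program behind that EXEC entry is something you 
control, a possible stopgap -- this is only a sketch of the general idea, not 
an official LDM fix -- is to have it ignore SIGUSR1 (or install its own 
handler) so the log-rotation signal can't kill it. In C, for example:

    #include <signal.h>
    #include <stdio.h>

    int main(void)
    {
        struct sigaction sa;

        /* Ignore SIGUSR1 so the LDM's log-rotation signal doesn't
           terminate this decoder (stopgap sketch only). */
        sa.sa_handler = SIG_IGN;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        if (sigaction(SIGUSR1, &sa, NULL) == -1)
            perror("sigaction");

        /* ... read the data product from standard input and decode it ... */

        return 0;
    }

A shell-script decoder can get the same effect with "trap '' USR1" near the 
top of the script.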

> I do see these as well, but I’m not sure this is tied to the issue:
> *   20190325T022401.340805Z XXX.XXX.XXX.XXX(feed)[8890] NOTE 
> error.c:236:err_log() Couldn't flush connection; flushConnection() failure to 
> 199.133.78.4: RPC: Unable to receive; errno = Connection reset by peer
> *   20190325T022603.051454Z XXX.XXX.XXX.XXX(feed)[2789] NOTE 
> uldb.c:1535:sm_vetUpstreamLdm() Terminated redundant upstream LDM 
> (addr=199.133.78.4, pid=21698, vers=6, type=feeder, mode=alternate, 
> sub=(20190325012401.287302 TS_ENDT {{EXP, ".*"}}))
> *   20190325T022603.051555Z XXX.XXX.XXX.XXX(feed)[21698] NOTE 
> ldmd.c:306:signal_handler() SIGTERM received
> *   20190325T022603.051605Z XXX.XXX.XXX.XXX(feed)[21698] NOTE 
> ldmd.c:187:cleanup() Exiting
> *   20190325T022603.052320Z ldmd[22048] NOTE ldmd.c:170:reap() child 21698 
> exited with status 7

The above means that a receiving LDM process on host XXX.XXX.XXX.XXX subscribed 
to the same feed as a previous receiving LDM process on the same host. The 
sending LDM process started for the new subscription consequently terminated 
the sending LDM process that had been serving the previous one, because: 1) 
there's no sense in duplicating work; and 2) allowing unlimited identical 
subscriptions from a single host is a classic denial-of-service vector.

This can be safely ignored unless the two receiving LDM processes are actually 
on different hosts behind a NAT gateway, in which case they appear to the 
sending site to have the same IP address. In that case, the registry parameter 
"/server/enable-anti-DOS" at the sending site should be set to "false".

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WSJ-190258
Department: Support LDM
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.