
[LDM #WSJ-190258]: queue size question



John,

> We keep having the gempak images (radar/satellite) just stop after LDM runs 
> for approximately 2.5 days.  Here's the only errors I see during that period. 
>  Any ideas you may have are greatly appreciated.  Thanks,
> 
> LDM log from that time:
> 
> 20190403T023521.100032Z idd.unidata.ucar.edu[7193] WARN error.c:236:err_log() 
> Couldn't connect to LDM on idd.unidata.ucar.edu using either port 388 or 
> portmapper; : RPC: Remote system error - Connection timed out

The above message means that receiving LDM process 7193 couldn't connect to 
sending LDM idd.unidata.ucar.edu because that system was offline. UCAR had 
networking problems yesterday.

> 20190403T102901.679888Z 199.133.78.4(feed)[11548] ERROR pq.c:3377:fd_lock() 
> Interrupted system call
> 20190403T102901.679901Z 199.133.78.4(feed)[11548] ERROR pq.c:3377:fd_lock() 
> fcntl F_RDLCK failed for rgn (0 SEEK_SET, 4096) 4
> 20190403T102901.679910Z 199.133.78.4(feed)[11548] ERROR up6.c:532:up6_run() 
> Product send failure: Interrupted system call

The above messages mean that sending LDM process 11548 couldn't send a 
data-product to its receiving LDM because it was interrupted by a signal -- 
most likely due to an "ldmadmin stop".

> 20190403T140901.828355Z 199.133.78.4(feed)[21358] ERROR pq.c:3377:fd_lock() 
> Interrupted system call
> 20190403T140901.828370Z 199.133.78.4(feed)[21358] ERROR pq.c:3377:fd_lock() 
> fcntl F_RDLCK failed for rgn (0 SEEK_SET, 4096) 4
> 20190403T140901.828382Z 199.133.78.4(feed)[21358] ERROR up6.c:532:up6_run() 
> Product send failure: Interrupted system call

These messages have the same meaning as the previous ones, this time for 
sending LDM process 21358.

> 20190403T194347.889234Z 199.133.78.4[16693] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403194347.731079 
> TS_ENDT {{ANY, ".*"}} -> 20190403194347.731079 TS_ENDT {{EXP, ".*"}}

The above message means that the subscription request by receiving LDM 16693 
for ANY data-products was reduced by its sending LDM to EXP due to the ALLOW 
entries in the sending LDM's configuration-file.
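As a rough illustration of that reduction, the sketch below treats feedtypes 
as bit flags and grants only the overlap between what was requested and what 
the ALLOW entries permit. The bit values, the Feed class, and the 
reduce_subscription() name are assumptions made for this example; they are 
not LDM's actual constants or code, and the pattern part of the subscription 
(".*" here) is ignored.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class Feed(IntFlag):
    # Illustrative bit values only; NOT LDM's real feedtype constants.
    IDS = 1
    HDS = 2
    EXP = 4
    ANY = 7  # every bit defined above


def reduce_subscription(requested, allowed_feeds):
    """Return the part of `requested` covered by the union of `allowed_feeds`."""
    # Union of everything the (hypothetical) ALLOW entries grant this host.
    allowed = reduce(or_, allowed_feeds, Feed(0))
    # The granted subscription is the overlap of request and allowance.
    return requested & allowed


if __name__ == "__main__":
    granted = reduce_subscription(Feed.ANY, [Feed.EXP])
    print(granted == Feed.EXP)
```

Under these assumptions, a request for ANY against a single ALLOW entry for 
EXP is granted as EXP, which matches the "ANY -> EXP" reduction in the log.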

> 20190403T194601.846940Z 199.133.78.4(noti)[16693] ERROR 
> forn5_svc.c:273:noti5_sqf() /home/ldm/cellphon/CELLPHON.DAT: RPC: Unable to 
> receive
> 20190403T194601.846964Z 199.133.78.4(noti)[16693] ERROR 
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno = 
> 5)

The above messages mean that notifying LDM process 16693 was disconnected from 
its receiving LDM. This was likely because the receiving LDM terminated.

> 20190403T194744.878579Z 199.133.78.4[30952] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403194744.721951 
> TS_ENDT {{ANY, ".*"}} -> 20190403194744.721951 TS_ENDT {{EXP, ".*"}}
> 20190403T195401.932271Z 199.133.78.4(noti)[30952] ERROR 
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso031945.mtr: RPC: Unable to receive
> 20190403T195401.932304Z 199.133.78.4(noti)[30952] ERROR 
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno = 
> 5)
> 20190403T195547.587961Z 199.133.78.4[27688] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403195547.431097 
> TS_ENDT {{ANY, ".*"}} -> 20190403195547.431097 TS_ENDT {{EXP, ".*"}}
> 20190403T195653.093268Z 199.133.78.4[31798] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403195652.935582 
> TS_ENDT {{ANY, ".*"}} -> 20190403195652.935582 TS_ENDT {{EXP, ".*"}}
> 20190403T195901.611237Z 199.133.78.4(noti)[27688] ERROR 
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso031950.mtr: RPC: Unable to receive
> 20190403T195901.611269Z 199.133.78.4(noti)[27688] ERROR 
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno = 
> 5)
> 20190403T200401.265716Z 199.133.78.4(noti)[31798] ERROR 
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso031955.mtr: RPC: Unable to receive
> 20190403T200401.265740Z 199.133.78.4(noti)[31798] ERROR 
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno = 
> 5)
> 20190403T200439.508639Z 199.133.78.4[27895] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403200439.340047 
> TS_ENDT {{ANY, ".*"}} -> 20190403200439.340047 TS_ENDT {{EXP, ".*"}}
> 20190403T201601.871599Z 199.133.78.4(noti)[27895] ERROR 
> forn5_svc.c:273:noti5_sqf() /home/ldm/cellphon/CELLPHON.DAT: RPC: Unable to 
> receive
> 20190403T201601.871624Z 199.133.78.4(noti)[27895] ERROR 
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno = 
> 5)
> 20190403T204634.154906Z 199.133.78.4[26788] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403204633.989049 
> TS_ENDT {{ANY, ".*"}} -> 20190403204633.989049 TS_ENDT {{EXP, ".*"}}
> 20190403T210901.479227Z 199.133.78.4(noti)[26788] ERROR 
> forn5_svc.c:542:forn_5_svc() nullproc5(199.133.78.4): RPC: Unable to receive
> 20190403T210926.592774Z 199.133.78.4[17846] WARN forn.c:41:logIfReduced() 
> Subscription reduced by one or more ALLOW entries: 20190403210401.246756 
> TS_ENDT {{ANY, ".*"}} -> 20190403210401.246756 TS_ENDT {{EXP, ".*"}}
> 20190403T211901.233617Z 199.133.78.4(noti)[17846] ERROR 
> forn5_svc.c:273:noti5_sqf() TX.TTUmeso032110.mtr: RPC: Unable to receive
> 20190403T211901.233639Z 199.133.78.4(noti)[17846] ERROR 
> forn5_svc.c:554:forn_5_svc() pq_sequence failed: Input/output error (errno = 
> 5)

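The same explanations apply to these messages: each subscription was reduced 
to EXP by the ALLOW entries, and each notifying LDM process was later 
disconnected from its receiving LDM.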

> 20190404T000001.730946Z pqact[7179] WARN filel.c:3016:reap() Child 16321 
> terminated by signal 10
> 20190404T000001.736596Z pqact[7179] WARN filel.c:3016:reap() Child 29866 
> terminated by signal 10
> 20190404T000001.736636Z pqact[7179] WARN filel.c:3016:reap() Child 25337 
> terminated by signal 10
> 20190404T000001.736673Z pqact[7179] WARN filel.c:3016:reap() Child 9485 
> terminated by signal 10
> 20190404T000001.736702Z pqact[7179] WARN filel.c:3016:reap() Child 5313 
> terminated by signal 10
> 20190404T000001.736726Z pqact[7179] WARN filel.c:3016:reap() Child 4785 
> terminated by signal 10
> 20190404T000001.736749Z pqact[7179] WARN filel.c:3016:reap() Child 2641 
> terminated by signal 10
> 20190404T000001.736779Z pqact[7179] WARN filel.c:3016:reap() Child 3026 
> terminated by signal 10
> 20190404T000001.736815Z pqact[7179] WARN filel.c:3016:reap() Child 3211 
> terminated by signal 10
> 20190404T000001.736842Z pqact[7179] WARN filel.c:3016:reap() Child 3839 
> terminated by signal 10

The above messages mean that the indicated child processes of pqact(1) process 
7179 terminated due to reception of a USR1 signal. This signal is used by the 
LDM system as part of the process to rotate the LDM log file. Because this 
happened near 0000Z, the cause was likely a crontab(1) entry with the command 
"ldmadmin newlog".
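To check how often this happens, a small script can pull the reap() warnings 
out of the LDM log and tally them by signal number. The sketch below is only 
an illustration: the function names are made up for this example, it assumes 
each log message is on a single line (unlike the wrapped text quoted above), 
and the mapping from signal number to name depends on the platform (10 is 
USR1 on typical x86 Linux systems).

```python
import re
import signal
from collections import Counter

# Matches pqact(1) reap() warnings such as:
#   ... pqact[7179] WARN filel.c:3016:reap() Child 16321 terminated by signal 10
REAP_RE = re.compile(
    r"pqact\[(\d+)\].*?Child (\d+)\s+terminated by signal (\d+)"
)


def reaped_children(log_text):
    """Return a list of (pqact_pid, child_pid, signal_number) tuples."""
    return [tuple(int(field) for field in match)
            for match in REAP_RE.findall(log_text)]


def count_by_signal(log_text):
    """Count terminated children per signal number."""
    return Counter(sig for _, _, sig in reaped_children(log_text))


def signal_name(number):
    """Best-effort signal name on the machine running this script."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"signal {number}"
```

A burst of terminations by the same signal within the same second, right at 
0000Z, would be consistent with a log rotation rather than with individual 
decoder crashes.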

Prior to LDM 6.13.11 (which isn't out yet), programs executed by pqact(1)'s 
EXEC or PIPE actions had to block signals USR1 and USR2 to avoid such 
termination.
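For a decoder written in Python, blocking those two signals could look 
something like the sketch below. This is only an illustration under stated 
assumptions: the function name is made up, it assumes a POSIX system where 
signal.pthread_sigmask() is available, and it should run before the decoder 
does any real work. Programs in other languages would use the equivalent 
call, such as sigprocmask(2) in C.

```python
import signal


def block_log_rotation_signals():
    """Block SIGUSR1 and SIGUSR2 in the calling thread.

    Returns the previous signal mask so the caller can restore it later.
    """
    to_block = {signal.SIGUSR1, signal.SIGUSR2}
    # A blocked signal is held as pending instead of being delivered, so its
    # default action (terminating the process) does not happen.
    return signal.pthread_sigmask(signal.SIG_BLOCK, to_block)


# Typical use: call block_log_rotation_signals() first thing in the program
# that pqact(1) starts via an EXEC or PIPE action.
```

Setting the disposition of both signals to "ignore" instead of blocking them 
would have a similar effect for this purpose.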

Could this be the cause of your problems?

> /var/log/messages:
> 
> Apr  3 21:32:17 mammatus abrt-hook-ccpp: Process 9497 (gif) of user 1009 
> killed by SIGABRT - dumping core
> Apr  3 21:32:18 mammatus abrt-server: Duplicate: core backtrace
> Apr  3 21:32:18 mammatus abrt-server: DUP_OF_DIR: 
> /var/spool/abrt/ccpp-2019-03-25-15:37:40-20396
> Apr  3 21:32:18 mammatus abrt-server: Deleting problem directory 
> ccpp-2019-04-03-21:32:17-9497 (dup of ccpp-2019-03-25-15:37:40-20396)
> Apr  3 21:32:18 mammatus abrt-server: Undefined variable outside of [[ ]] 
> bracket
> 
> Contents of that ABRT trace:
> 
> Reason:
> 
> gif killed by SIGABRT
> 
> Limits:
> 
> Limit                     Soft Limit           Hard Limit           Units
> Max cpu time              unlimited            unlimited            seconds
> Max file size             unlimited            unlimited            bytes
> Max data size             unlimited            unlimited            bytes
> Max stack size            8388608              unlimited            bytes
> Max core file size        0                    unlimited            bytes
> Max resident set          unlimited            unlimited            bytes
> Max processes             65536                95277                processes
> Max open files            1024                 4096                 files
> Max locked memory         65536                65536                bytes
> Max address space         unlimited            unlimited            bytes
> Max file locks            unlimited            unlimited            locks
> Max pending signals       95277                95277                signals
> Max msgqueue size         819200               819200               bytes
> Max nice priority         0                    0
> Max realtime priority     0                    0
> Max realtime timeout      unlimited            unlimited            us

None of the above appears relevant.

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WSJ-190258
Department: Support LDM
Priority: Normal
Status: Closed
===================