[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

20050214: LDM product queue corruption



Gabriel,

I'm investigating your problem and need your help.

The logfile that you attached contained a lot of distracting entries, so
I applied the following filter to it:

    egrep -v 'to LDM 6 on striker|striker\[8872\]: Desired|eldm\[8873\]: 
Desired|pnga2area|nullproc_6 failure to eldm' log.bak

because

    1.  Striker isn't allowing your LDM to connect.

    2.  The LDM on host Eldm has problems.

    3.  The log message from pnga2area(1) are irrelevant.

This resulted in the following:

    Jan 31 14:43:17 twister flood[8875]: Starting Up(6.1.0): 
flood.atmos.uiuc.edu: TS_ZERO TS_ENDT {{FNEXRAD|FSL2|UNIWISC,  ".*"}} 
    Jan 31 14:43:17 twister flood[8876]: Starting Up(6.1.0): 
flood.atmos.uiuc.edu: TS_ZERO TS_ENDT {{IDS|DDPLUS,  ".*"}} 
    Jan 31 14:43:17 twister flood[8877]: Starting Up(6.1.0): 
flood.atmos.uiuc.edu: TS_ZERO TS_ENDT {{NNEXRAD,  "^SDUS51 .* /p(NOR|NOV)"}
    } 
    Jan 31 14:43:17 twister papagayo[8878]: Starting Up(6.1.0): 
papagayo.unl.edu: TS_ZERO TS_ENDT {{HDS,  ".*"}} 
    Jan 31 14:43:17 twister papagayo[8879]: Starting Up(6.1.0): 
papagayo.unl.edu: TS_ZERO TS_ENDT {{FNEXRAD|FSL2|UNIWISC,  ".*"}} 
    Jan 31 14:43:17 twister papagayo[8880]: Starting Up(6.1.0): 
papagayo.unl.edu: TS_ZERO TS_ENDT {{IDS|DDPLUS,  ".*"}} 
    Jan 31 14:43:17 twister papagayo[8881]: Starting Up(6.1.0): 
papagayo.unl.edu: TS_ZERO TS_ENDT {{NNEXRAD,  "^SDUS51 .* /p(NOR|NOV)"}}
     
    Jan 31 14:43:17 twister pluto[8883]: Starting Up(6.1.0): pluto.met.fsu.edu: 
TS_ZERO TS_ENDT {{UNIWISC,  "Q[01] (C[^GH]|U[^78DE])"}} 
    Jan 31 14:43:17 twister flood[8884]: Starting Up(6.1.0): 
flood.atmos.uiuc.edu: TS_ZERO TS_ENDT {{UNIWISC,  "Q[01] (C[^GH]|U[^78DE])"
    }} 
    Jan 31 14:43:17 twister flood[8885]: Starting Up(6.1.0): 
flood.atmos.uiuc.edu: TS_ZERO TS_ENDT {{NNEXRAD,  "/p... (ILN|CLE|IND)"}} 
    Jan 31 14:43:17 twister rtstats[8870]: Starting Up (8867) 
    Jan 31 14:43:17 twister papagayo[8880]: Desired product class: 
20050131134317.817 TS_ENDT {{IDS|DDPLUS,  ".*"}} 
    Jan 31 14:43:17 twister flood[8885]: Desired product class: 
20050131134317.817 TS_ENDT {{NNEXRAD,  "/p... (ILN|CLE|IND)"}} 
    Jan 31 14:43:17 twister pluto[8883]: Desired product class: 
20050131134317.817 TS_ENDT {{UNIWISC,  "Q[01] (C[^GH]|U[^78DE])"}} 
    Jan 31 14:43:17 twister papagayo[8881]: Desired product class: 
20050131134317.817 TS_ENDT {{NNEXRAD,  "^SDUS51 .* /p(NOR|NOV)"}} 
    Jan 31 14:43:17 twister flood[8884]: Desired product class: 
20050131134317.817 TS_ENDT {{UNIWISC,  "Q[01] (C[^GH]|U[^78DE])"}} 
    Jan 31 14:43:17 twister flood[8875]: Desired product class: 
20050131134317.817 TS_ENDT {{FNEXRAD|FSL2|UNIWISC,  ".*"}} 
    Jan 31 14:43:17 twister flood[8876]: Desired product class: 
20050131134317.817 TS_ENDT {{IDS|DDPLUS,  ".*"}} 
    Jan 31 14:43:17 twister flood[8877]: Desired product class: 
20050131134317.817 TS_ENDT {{NNEXRAD,  "^SDUS51 .* /p(NOR|NOV)"}} 
    Jan 31 14:43:17 twister papagayo[8878]: Desired product class: 
20050131134317.817 TS_ENDT {{HDS,  ".*"}} 
    Jan 31 14:43:17 twister papagayo[8879]: Desired product class: 
20050131134317.817 TS_ENDT {{FNEXRAD|FSL2|UNIWISC,  ".*"}} 
    Jan 31 14:43:17 twister stokes[8874]: Desired product class: 
20050131134317.817 TS_ENDT {{HDS,  ".*"}} 
    Jan 31 14:43:17 twister papagayo[8880]: Connected to upstream LDM-6 
    Jan 31 14:43:17 twister flood[8885]: Connected to upstream LDM-6 
    Jan 31 14:43:17 twister papagayo[8881]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister flood[8875]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister flood[8877]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister flood[8876]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister papagayo[8878]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister papagayo[8879]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister stokes[8874]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister papagayo[8880]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister flood[8884]: Connected to upstream LDM-6 
    Jan 31 14:43:18 twister papagayo[8881]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister papagayo[8878]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister papagayo[8879]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister flood[8885]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister flood[8884]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister flood[8876]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister flood[8877]: Upstream LDM is willing to feed 
    Jan 31 14:43:18 twister flood[8875]: Product reclassification by upstream 
LDM: 20050131134317.817 TS_ENDT {{FNEXRAD|FSL2|UNIWISC,  "
    .*"}} -> 20050131134317.817 TS_ENDT {{FNEXRAD|UNIWISC,  ".*"}} 
    Jan 31 14:43:18 twister stokes[8874]: Upstream LDM is willing to feed 
    Jan 31 14:43:19 twister flood[8875]: Upstream LDM is willing to feed 
    Jan 31 14:43:22 twister pluto[8883]: Connected to upstream LDM-6 
    Jan 31 14:43:22 twister pluto[8883]: Upstream LDM is willing to feed 
    Jan 31 20:45:34 twister ldmping[8756]: SVC_UNAVAIL   0.000639    0   
localhost    RPC: Program not registered 
    Jan 31 20:45:34 twister ldmping[8836]: SVC_UNAVAIL   0.000626    0   
localhost    RPC: Program not registered 
    Jan 31 20:45:34 twister pqcheck[8840]: Starting Up (6554) 
    Jan 31 20:45:34 twister pqcheck[8840]: The writer-counter of the 
product-queue is 10 
    Jan 31 20:45:34 twister pqcheck[8840]: Exiting 
    Feb 01 01:59:50 twister ldmping[8761]: SVC_UNAVAIL   0.000669    0   
localhost    RPC: Program not registered 
    Feb 01 01:59:50 twister ldmping[8841]: SVC_UNAVAIL   0.000624    0   
localhost    RPC: Program not registered 
    Feb 01 01:59:50 twister pqcheck[8845]: Starting Up (6835) 
    Feb 01 01:59:50 twister pqcheck[8845]: The writer-counter of the 
product-queue is 10 
    Feb 01 01:59:50 twister pqcheck[8845]: Exiting 
    Feb 01 02:28:01 twister ldmping[18154]: SVC_UNAVAIL   0.000665    0   
localhost    RPC: Program not registered 
    Feb 01 02:28:06 twister ldmping[18160]: SVC_UNAVAIL   0.000718    0   
localhost    RPC: Program not registered 
    Feb 01 02:28:06 twister pqcheck[18164]: Starting Up (18155) 
    Feb 01 02:28:06 twister pqcheck[18164]: The writer-counter of the 
product-queue is 10 
    Feb 01 02:28:06 twister pqcheck[18164]: Exiting 
    Feb 01 02:28:15 twister ldmping[18175]: SVC_UNAVAIL   0.000690    0   
localhost    RPC: Program not registered 
    Feb 01 02:28:15 twister pqcheck[18179]: Starting Up (18170) 
    Feb 01 02:28:15 twister pqcheck[18179]: The writer-counter of the 
product-queue is 10 
    Feb 01 02:28:15 twister pqcheck[18179]: Exiting 
    Feb 01 02:28:28 twister ldmping[18185]: SVC_UNAVAIL   0.000723    0   
localhost    RPC: Program not registered 
    Feb 01 02:28:30 twister ldmping[18193]: SVC_UNAVAIL   0.000696    0   
localhost    RPC: Program not registered 
    Feb 01 02:28:54 twister ldmping[18200]: SVC_UNAVAIL   0.000656    0   
localhost    RPC: Program not registered 
    Feb 01 02:28:54 twister pqcheck[18204]: Starting Up (18195) 
    Feb 01 02:28:54 twister pqcheck[18204]: The writer-counter of the 
product-queue is 0 
    Feb 01 02:28:54 twister pqcheck[18204]: Exiting 

The interesting thing here is that from 2005-01-31T14:43:22 to
2005-01-31T20:45:34 UTC the LDM was, apparently, working OK.  At that
later time, however, an ldmping(1) process was started and determined
that the LDM wasn't available.  A pqcheck(1) process that was started at
the same time, however, indicated that the LDM product-queue had ten 
writers into it.

Who started that ldmping(1) process and why?  Was it a part of a larger
script (started by crontab(1) perhaps)?  Could that script also do bad
things to the product-queue?

The logfile entry just prior to that ldmping(1) entry is

    Jan 31 20:43:28 twister eldm[8873]: ERROR: requester6.c:455; 
ldm_clnt.c:286: nullproc_6 failure to eldm.fsl.noaa.gov; ldm_clnt.c:142: RPC: 
Unable to receive; errno = Connection reset by peer 

which is 2 minutes 4 seconds earlier than the ldmping(1) entry.  So the
LDM system appears to have been working up to that time -- with no
indication of a problem other than those unreachable hosts.

When the LDM system is running, can you ldmping(1) it?

Regards,
Steve Emmerson