[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[IDD #GEH-202313]: FNEXRAD Composites



Hi Jim,

re:
> The GINI data started coming in again steadily at 1900Z. 

OK, that's good to know.

re:
> I didn't notice any problem with other data, except after
> checking our GOES data, we also had some fallout between
> 1230Z and 1900Z. 

That's also good to know.

re:
> I don't handle the LDM stuff anymore, but from what I see in
> the ldmd.conf  file, the FNEXRAD and UNIWISC feeds are both
> set to come from idd.unidata.ucar.edu.

Assuming that the Plymouth machine is:

srvlp-dpt-met02.plymouth.edu

the cause of the outage can be seen in the latency plot for the
FNEXRAD feed:

https://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?FNEXRAD+srvlp-dpt-met02.plymouth.edu

As you can see from the plot, data was flowing nicely right up until
sometime after 12, and then there were no reports until the feed
came back sometime in the 19Z hour.

The same pattern can be seen in latency plots for other feeds, but not
for all, which is strange.  I checked the real-server backends of both
the idd.unidata.ucar.edu and iddb.unidata.ucar.edu clusters, and I found
the machine that srvlp-dpt-met02.plymouth.edu (at least now) is feeding
from.  The feed REQUESTs that are in place at the moment are as follows:

761497 6 feeder srvlp-dpt-met02 20210509115018.539791 TS_ENDT {{CONDUIT, ".*"}} 
primary
813169 6 feeder srvlp-dpt-met02 20210509140336.665016 TS_ENDT {{NOTHER, ".*"}} 
primary
824831 6 feeder srvlp-dpt-met02 20210509143429.688865 TS_ENDT {{NEXRAD3, ".*"}} 
alternate
874234 6 feeder srvlp-dpt-met02 20210509164240.704017 TS_ENDT {{NIMAGE, ".*"}} 
primary
878009 6 feeder srvlp-dpt-met02 20210509165202.188573 TS_ENDT {{NGRID, ".*"}} 
primary
923632 6 feeder srvlp-dpt-met02 20210509185207.200285 TS_ENDT {{UNIWISC, ".*"}} 
primary
925843 6 feeder srvlp-dpt-met02 20210509185953.844410 TS_ENDT {{FNEXRAD, 
"^(rad/|pnga2area)"}} primary

This shows that the REQUESTs are nicely split so that no one
connection (with the possible exception of CONDUIT) is likely to
get overloaded.  This is very good.

I then search through the LDM log file on the node feeding srvlp-dpt-met02
and I saw the following around 12:35Z:

20210509T123557.202511Z srvlp-dpt-met02.plymouth.edu(feed)[721556] 
ldmd.c:cleanup:197   NOTE  Exiting
 ...
20210509T123558.529301Z srvlp-dpt-met02.plymouth.edu(feed)[4026311] 
ldmd.c:cleanup:197  NOTE  Exiting
 ...
20210509T123559.345515Z srvlp-dpt-met02.plymouth.edu(feed)[4031893] 
ldmd.c:cleanup:197  NOTE  Exiting
 ...
20210509T123559.818295Z srvlp-dpt-met02.plymouth.edu(feed)[4027608] 
ldmd.c:cleanup:197  NOTE  Exiting
 ...
20210509T123601.831048Z srvlp-dpt-met02.plymouth.edu(feed)[4027609] 
ldmd.c:cleanup:197  NOTE  Exiting
 ...
20210509T123606.270431Z srvlp-dpt-met02.plymouth.edu(feed)[4028063] 
ldmd.c:cleanup:197  NOTE  Exiting
 ...
20210509T123610.500275Z srvlp-dpt-met02.plymouth.edu(feed)[4021992] 
ldmd.c:cleanup:197  NOTE  Exiting
 ...
20210509T123612.213791Z srvlp-dpt-met02.plymouth.edu[757006] ldmd.c:cleanup:197 
        NOTE  Exiting
 ...
20210509T123617.076580Z srvlp-dpt-met02.plymouth.edu(feed)[4037579] 
ldmd.c:cleanup:197  NOTE  Exiting

These show that the exiting of 'ldmd' processes feeding 
srvlp-dpt-met02.plymouth.edu.

Soon thereafter, there were a number of Denying connection messages:

20210509T123610.677257Z ldmd[757004]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed
20210509T123610.677289Z ldmd[757004]                ldmd.c:cleanup:197          
        NOTE  Exiting
20210509T123610.677667Z ldmd[7870]                  ldmd.c:reap:176             
        NOTE  Child 757004 exited with status 3
20210509T123611.189453Z ldmd[757005]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed
20210509T123611.189493Z ldmd[757005]                ldmd.c:cleanup:197          
        NOTE  Exiting
20210509T123611.189887Z ldmd[7870]                  ldmd.c:reap:176             
        NOTE  Child 757005 exited with status 3
20210509T123612.213753Z ldmd[757007]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed
20210509T123612.213791Z srvlp-dpt-met02.plymouth.edu[757006] ldmd.c:cleanup:197 
        NOTE  Exiting
20210509T123612.213806Z ldmd[757007]                ldmd.c:cleanup:197          
        NOTE  Exiting
20210509T123612.214446Z ldmd[7870]                  ldmd.c:reap:176             
        NOTE  Child 757007 exited with status 3
20210509T123612.367762Z ldmd[757008]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed
20210509T123612.367793Z ldmd[757008]                ldmd.c:cleanup:197          
        NOTE  Exiting
20210509T123612.368150Z ldmd[7870]                  ldmd.c:reap:176             
        NOTE  Child 757008 exited with status 3
20210509T123616.824282Z ldmd[757011]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed
20210509T123616.824362Z ldmd[757011]                ldmd.c:cleanup:197          
        NOTE  Exiting
20210509T123616.824806Z ldmd[7870]                  ldmd.c:reap:176             
        NOTE  Child 757011 exited with status 3
20210509T123617.076580Z srvlp-dpt-met02.plymouth.edu(feed)[4037579] 
ldmd.c:cleanup:197  NOTE  Exiting
20210509T123621.944261Z ldmd[757012]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed
20210509T123621.944343Z ldmd[757012]                ldmd.c:cleanup:197          
        NOTE  Exiting
20210509T123621.944851Z ldmd[7870]                  ldmd.c:reap:176             
        NOTE  Child 757012 exited with status 3
20210509T123632.080600Z ldmd[757014]                ldmd.c:runChildLdm:615      
        ERROR Denying connection from "srvlp-dpt-met02" because not allowed

So, for some reason that I do not understand (yet), feeds to srvlp-dpt-met02
a) exited, b) attempted to get re-established, but c) were denied.  It is
almost like DNS (reverse DNS in this case) started failing sometime around
12:35Z and then did not get re-established until sometime after 19Z.

I need to get Steve and Mustapha (our new LDM/IDD hire) involved in 
troubleshooting this so that we actually know what happened and, hopefully,
why.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: GEH-202313
Department: Support IDD
Priority: Normal
Status: Open
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.