Re: 20020419:RPC runaway RECLASS & h_clnt_call



Kevin, 

Nice work isolating those RPC calls!

I have no idea why they do not die...we will investigate.

What is your OS and LDM version?


Have URI or NCC complained about the latencies in their feeds?

It may be worth contacting them to find out what their data needs are.
With this much latency, it may be best for them to throttle back the feed
and get what they want in a timely manner; or, if they are using the
HDS/HRS feed in a non-real-time environment, the current setup may be
working for them (but I bet they are dropping products).  They may want
to subset the HDS and MCIDAS feeds, as those are the most voluminous.
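
For example, aspre is currently asking for everything in the WMO feed; its
ldmd.conf request could be trimmed from something like

    request  WMO  ".*"  redwood.atmos.albany.edu

down to just the text data plus a narrower slice of the grids, along these
lines (the hostname and the HDS pattern here are only placeholders--they
would substitute the headers they actually use):

    request  IDS|DDPLUS  ".*"     redwood.atmos.albany.edu
    request  HDS         "^[YZ]"  redwood.atmos.albany.edu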

Thanks for the follow up,

-Jeff
____________________________                  _____________________
Jeff Weber                                    address@hidden
Unidata Support                               PH:303-497-8676 
NWS-COMET Case Study Library                  FX:303-497-8690
University Corp for Atmospheric Research      3300 Mitchell Ln
http://www.unidata.ucar.edu/staff/jweber      Boulder,Co 80307-3000
________________________________________      ______________________

On Mon, 22 Apr 2002, Kevin R. Tyle wrote:

> Hi Jeff,
> 
> Here's a follow up on last week's email:
> 
> We continue to notice poor data flow to two of our downstream
> sites, aspre.gso.uri.edu and wr.psi.ncc.edu; here's a snippet
> from our LDM logs:
> 
> Apr 22 13:54:33 redwood aspre(feed)[8826]: RECLASS: 20020422125337.696
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 13:56:06 redwood wr(feed)[4033]: h_clnt_call: wr.psi.ncc.edu: 1:
> time elapsed  21.121077
> Apr 22 13:56:56 redwood wr(feed)[4033]: h_clnt_call: wr.psi.ncc.edu: 1:
> time elapsed  49.816683
> Apr 22 13:57:38 redwood aspre(feed)[8826]: RECLASS: 20020422125643.044
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:00:31 redwood aspre(feed)[8826]: RECLASS: 20020422125935.977
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:00:57 redwood aspre[9256]: Connection from aspre.gso.uri.edu
> Apr 22 14:00:57 redwood aspre[9256]: Connection reset by peer
> Apr 22 14:00:57 redwood aspre[9256]: Exiting
> Apr 22 14:04:39 redwood aspre(feed)[8826]: RECLASS: 20020422130344.068
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:04:41 redwood wr(feed)[4033]: h_clnt_call: wr.psi.ncc.edu:
> BLKDATA: time elapsed  38.479272
> Apr 22 14:07:37 redwood aspre(feed)[8826]: RECLASS: 20020422130641.992
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:07:51 redwood 129.92.9.62(feed)[21134]: h_clnt_call:
> 129.92.9.62: BLKDATA: time elapsed  21.164570
> Apr 22 14:10:55 redwood wr(feed)[4033]: RECLASS: 20020422131258.799
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
> Apr 22 14:10:56 redwood aspre[9677]: Connection from aspre.gso.uri.edu
> Apr 22 14:10:57 redwood aspre[9677]: Connection reset by peer
> Apr 22 14:10:57 redwood aspre[9677]: Exiting
> Apr 22 14:12:29 redwood wr(feed)[4033]: RECLASS: 20020422131432.822
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
> Apr 22 14:13:58 redwood aspre(feed)[8826]: SMBZ04 SBBR 221200 RRB: RPC:
> Timed out (5)
> Apr 22 14:13:58 redwood aspre(feed)[8826]: pq_sequence failed: I/O error
> (errno = 5)
> Apr 22 14:13:58 redwood aspre(feed)[8826]: Exiting
> Apr 22 14:14:04 redwood rpc.ldmd[21105]: child 8826 exited with status 1
> Apr 22 14:14:23 redwood wr(feed)[4033]: RECLASS: 20020422131626.993
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
> Apr 22 14:14:31 redwood aspre[9737]: Connection from aspre.gso.uri.edu
> Apr 22 14:14:31 redwood aspre(feed)[9737]: Starting Up: 20020422131333.243
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:14:31 redwood aspre(feed)[9737]: topo:  aspre.gso.uri.edu WMO
> Apr 22 14:15:08 redwood wr(feed)[4033]: h_clnt_call: wr.psi.ncc.edu:
> BLKDATA: time elapsed  24.559832
> Apr 22 14:15:30 redwood wr(feed)[4033]: h_clnt_call: wr.psi.ncc.edu:
> BLKDATA: time elapsed  22.025091
> Apr 22 14:17:26 redwood wr(feed)[4033]: RECLASS: 20020422131930.272
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
> Apr 22 14:17:46 redwood aspre(feed)[9737]: RECLASS: 20020422131650.686
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:20:56 redwood aspre[9844]: Connection from aspre.gso.uri.edu
> Apr 22 14:20:57 redwood aspre[9844]: Connection reset by peer
> Apr 22 14:20:57 redwood aspre[9844]: Exiting
> Apr 22 14:22:34 redwood aspre(feed)[9737]: RECLASS: 20020422132139.452
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:23:30 redwood wr(feed)[4033]: RECLASS: 20020422132534.156
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
> Apr 22 14:25:03 redwood wr(feed)[4033]: RECLASS: 20020422132706.754
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
> Apr 22 14:28:09 redwood aspre(feed)[9737]: RECLASS: 20020422132713.527
> TS_ENDT {{WMO,  ".*"}}
> Apr 22 14:28:12 redwood wr(feed)[4033]: h_clnt_call: wr.psi.ncc.edu: 1:
> time elapsed  22.100896
> Apr 22 14:28:12 redwood wr(feed)[4033]: RECLASS: 20020422133016.090
> TS_ENDT {{MCIDAS|IDS|DDPLUS,  ".*"}}
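> 
> Something like the following would put a number on the lag for each of
> these RECLASS lines (a rough sketch only--it assumes the log times and
> the product times are both UTC, the usual year-less syslog timestamps,
> and a log file name that is just a placeholder):
> 
> #!/usr/bin/env python
> # Rough sketch: for each RECLASS line, compare the syslog timestamp with
> # the product time the RECLASS reports, to see how far behind we are.
> import re
> from datetime import datetime
> 
> LOGFILE = 'ldmd.log'    # placeholder path
> YEAR = 2002             # syslog lines carry no year
> 
> line_re = re.compile(
>     r'^(\w{3} [ \d]\d \d\d:\d\d:\d\d) \S+ (\S+?)\(feed\)\[\d+\]: '
>     r'RECLASS: (\d{14})')
> 
> for line in open(LOGFILE):
>     m = line_re.match(line)
>     if not m:
>         continue
>     logged = datetime.strptime('%d %s' % (YEAR, m.group(1)),
>                                '%Y %b %d %H:%M:%S')
>     product = datetime.strptime(m.group(3), '%Y%m%d%H%M%S')
>     print('%s  %-8s  behind by %s' % (m.group(1), m.group(2),
>                                       logged - product))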
> 
> 
> The feed to aspre seems chronically slow--i.e., always an hour or
> so behind.  At that lag, I would think the feed is of little use to URI.
> The traceroute suggests problems at the Abilene-->commercial
> internet junction:
> 
> traceroute to aspre.gso.uri.edu (131.128.102.86), 30 hops max, 40 byte
> packets
>  1  fire43.atmos.albany.edu (169.226.43.200)  0.567 ms  0.262 ms  0.239 ms
>  2  be102 (169.226.4.1)  0.669 ms  0.612 ms  0.622 ms
>  3  169.226.13.1 (169.226.13.1)  2.642 ms  2.406 ms  1.999 ms
>  4  vbns-ualbany.nysernet.net (199.109.4.50)  6.869 ms  6.289 ms  6.039 ms
>  5  jn1-so7-0-0-0.wae.vbns.net (204.147.136.136)  11.640 ms  11.536 ms
> 11.303 ms
>  6  abilene-vbns.abilene.ucaid.edu (198.32.11.9)  12.958 ms  12.954 ms
> 12.718 ms
>  7  nycm-wash.abilene.ucaid.edu (198.32.8.46)  17.393 ms  17.116 ms
> 18.959 ms
>  8  POS30-OC12-GIGAPOPNE-ABILENE.NOX.ORG (192.5.89.101)  22.815 ms  22.067
> ms  22.622 ms
>  9  OSHEAN-GIGAPOPNE.NOX.ORG (192.5.89.78)  24.091 ms  24.482 ms  24.835
> ms
> 10  138.16.2.38 (138.16.2.38)  183.874 ms  178.273 ms  175.446 ms
> 11  131.128.19.70 (131.128.19.70)  194.375 ms  182.115 ms  185.525 ms
> 12  aspre.gso.uri.edu (131.128.102.86)  174.647 ms  183.702 ms *
> 
> Likewise, the feed to Nassau C.C. shows a possible bottleneck
> near the end of the route:
> 
> traceroute to wr.psi.ncc.edu (198.38.12.245), 30 hops max, 40 byte packets
>  1  fire43.atmos.albany.edu (169.226.43.200)  0.528 ms  0.266 ms  0.242 ms
>  2  be102 (169.226.4.1)  7.907 ms  2.365 ms  0.628 ms
>  3  169.226.13.1 (169.226.13.1)  1.883 ms  2.077 ms  1.242 ms
>  4  169.130.254.129 (169.130.254.129)  2.276 ms  2.077 ms  2.018 ms
>  5  at-gsr1-nyc-3-0-OC12.appliedtheory.net (169.130.3.29)  5.183 ms  5.550
> ms  5.359 ms
>  6  12.125.51.209 (12.125.51.209)  6.216 ms  6.756 ms  5.363 ms
>  7  gbr1-p20.n54ny.ip.att.net (12.123.1.130)  5.595 ms  6.054 ms  6.166 ms
>  8  ar14-a300s1.n54ny.ip.att.net (12.123.1.25)  6.748 ms  6.986 ms  6.748
> ms
>  9  12.119.78.126 (12.119.78.126)  93.425 ms 12.119.78.122 (12.119.78.122)
> 100.454 ms 12.119.78.126 (12.119.78.126)  298.319 ms
> 10  cisco0.nocnet.sunynassau.edu (198.38.8.1)  120.220 ms  162.050 ms
> 149.426 ms
> 11  WR.PSI.SUNYNASSAU.EDU (198.38.12.245)  139.107 ms  157.937 ms  131.152
> ms
> 
> I would think that the gridded HDS feed is useless to both these
> downstream sites.  Perhaps both sites would be better off only
> requesting DDPLUS (text) data?
> 
> Secondly, we believe we have found the "culprit" for the excessive
> rpc.ldmd processes that have recently compromised our LDM server:
> we had been requesting data from, and allowing connections from,
> sysu1.wsicorp.com.  Note the following entries:
> 
> Apr 20 11:11:15 redwood sysu1[27665]: Connection from sysu1.wsicorp.com
> Apr 20 11:47:41 redwood sysu1[27665]: hiya: 20020420104741.770 TS_ENDT
> {{WSI,  ".*"}}
> Apr 20 11:47:42 redwood sysu1[27665]: Connection reset by peer
> Apr 20 11:47:42 redwood sysu1[27665]: Exiting
> 
> 
> There are many, many examples of connections from WSI like this.
> However, in this case, as well as some others, even though the log shows
> process 27665 "Exiting", it doesn't actually terminate, and it continues
> to show up when a "ps" is run.  Eventually we end up with a boatload of
> rpc.ldmd processes that do not get killed, even when "ldmadmin stop" is
> run.  Only a "kill -9" will waste these suckers.  This effectively causes
> a DoS on our LDM server.
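> 
> A sweep along these lines after "ldmadmin stop" is the sort of thing we
> end up doing by hand (a sketch only--it assumes a POSIX ps and that it
> runs as the LDM user, so kills on processes we do not own just fail):
> 
> #!/usr/bin/env python
> # Blunt cleanup sketch: after "ldmadmin stop", SIGKILL any rpc.ldmd that
> # is still hanging around.
> import os
> import signal
> 
> for line in os.popen('ps -e -o pid= -o comm=').readlines():
>     fields = line.split(None, 1)
>     if len(fields) < 2 or 'rpc.ldmd' not in fields[1]:
>         continue
>     try:
>         os.kill(int(fields[0]), signal.SIGKILL)
>         print('killed leftover rpc.ldmd, pid %s' % fields[0])
>     except OSError:
>         pass    # already gone, or not ours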
> 
> Since we are no longer using the WSI feed, we have commented out
> the WSI access, and things have been okay since then, but why
> don't these processes die off?
> 
> (Please cc David Knight on reply--thanks)
> 
> Thanks,
> 
> Kevin
> 
> ______________________________________________________________________
> Kevin Tyle, Systems Administrator               **********************
> Dept. of Earth & Atmospheric Sciences           address@hidden
> University at Albany, ES-235                    518-442-4571 (voice)
> 1400 Washington Avenue                          518-442-5825 (fax)
> Albany, NY 12222                                **********************
> ______________________________________________________________________
> 
> On Fri, 19 Apr 2002, Jeff Weber wrote:
> 
> > Hi Kevin,
> >
> > The RECLASS messages are showing us that the LDM is skipping a number
> > of products in order to catch up.
> >
> > It opens an exchange between the upstream host and the downstream client
> > to negotiate the request, reconciling what is desired with what is
> > available.
> >
> >
> > The 'h_clnt_call:' notices are letting us know that it is taking an
> > exceedingly long time to send pieces of a product--most likely because
> > the product being sent is a large one.
> >
> >
> > Both of these indicate that there may be poor connectivity between the two
> > machines.
> >
> >
> > The "runaway" rpc processes are a bit of a mystery, we have seen this
> > before, but each of the past instances do not seem germaine to your
> > situation.
> >
> > 1) Bad queue--deleting and remaking the queue solved the problem (the
> > commands are sketched just below this list)
> >
> > 2) Incompatible versions of the LDM not recognizing the feedtype
> >
> > 3) syslog and ldmd.log failing
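> >
> > For case 1) the fix was simply to rebuild the product queue with the LDM
> > stopped, along these lines (just a sketch--the exact ldmadmin commands
> > may differ with your LDM version):
> >
> >     ldmadmin stop
> >     ldmadmin delqueue
> >     ldmadmin mkqueue
> >     ldmadmin start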
> >
> >
> > comments also in body of text...
> >
> >
> > Any additional info you can provide may lead us to a solution; I look
> > forward to solving this issue.
> >
> > Thank you,
> >
> >
> > -Jeff
> > ____________________________                  _____________________
> > Jeff Weber                                    address@hidden
> > Unidata Support                               PH:303-497-8676
> > NWS-COMET Case Study Library                  FX:303-497-8690
> > University Corp for Atmospheric Research      3300 Mitchell Ln
> > http://www.unidata.ucar.edu/staff/jweber      Boulder,Co 80307-3000
> > ________________________________________      ______________________
> >
> > On Fri, 19 Apr 2002, Unidata Support wrote:
> >
> > >
> > > ------- Forwarded Message
> > >
> > > >To: <address@hidden>
> > > >cc: David Knight <address@hidden>
> > > >From: "Kevin R. Tyle" <address@hidden>
> > > >Subject: LDM log messages
> > > >Organization: UCAR/Unidata
> > > >Keywords: 200204191536.g3JFafa14575
> > >
> > > Hi,
> > >
> > > Can you explain the meaning of some of the following entries from
> > > our ldm log?
> > >
> > > Particularly the repeated RECLASS messages and the h_clnt_call
> > > messages.
> > >
> > > I am assuming they are due to slowdowns between redwood and
> > > the client LDM's, but am not sure.
> > >
> > > We also run the IP Filter firewall package, and we keep state
> > > on incoming port 388 traffic to redwood.  A look at the state
> > > table right now shows a bunch of active states between us
> > > and wr.psi.ncc.edu, although only one rpc.ldmd is running between
> > > us and them.  Again, I suspect this may be due to some internet
> > > slowdown last night, since the states were all created near the
> > > same time, about 10 hours ago.  This may be outside your realm,
> > > but I am wondering why the state tables still show these connections
> > > as fully active--typically, with a protocol like FTP
> >
> > Are these LDM-owned processes?  I am curious because of the protocol--the
> > LDM does not use FTP...
> >
> > > , once a
> > > connection breaks, the nature of the state changes to one that
> > > expires out of the state table much quicker.
> > >
> > > Although this did not happen last night, we also have recently
> > > had instances of many rpc.ldmd processes being launched--the end
> > > result being a very high load on the server and no data until
> > > the LDM is restarted and all rpc.ldmd processes manually killed.
> > > We have set up monitoring to investigate this in more detail should
> > > it happen again, but it appears to occur when our .stats files
> > > show latencies > 3600 on DDPLUS and HDS.
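> > >
> > > The sort of check we mean is roughly the following (a sketch only;
> > > the threshold and interval are arbitrary, and it assumes a POSIX ps):
> > >
> > > #!/usr/bin/env python
> > > # Every few minutes, count rpc.ldmd processes and flag it when the
> > > # count climbs past a threshold.
> > > import os
> > > import time
> > >
> > > THRESHOLD = 20   # arbitrary
> > > INTERVAL = 300   # seconds
> > >
> > > while 1:
> > >     procs = os.popen('ps -e -o comm=').read().splitlines()
> > >     count = len([c for c in procs if 'rpc.ldmd' in c])
> > >     if count > THRESHOLD:
> > >         print('%s  WARNING: %d rpc.ldmd processes running'
> > >               % (time.strftime('%b %d %H:%M:%S'), count))
> > >     time.sleep(INTERVAL)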
> > >
> > > Thanks for any info!
> > >
> > > --Kevin
> > >
> > > ______________________________________________________________________
> > > Kevin Tyle, Systems Administrator               **********************
> > > Dept. of Earth & Atmospheric Sciences           address@hidden
> > > University at Albany, ES-235                    518-442-4571 (voice)
> > > 1400 Washington Avenue                          518-442-5825 (fax)
> > > Albany, NY 12222                                **********************
> > > ______________________________________________________________________
> > >
> > > Apr 19 04:24:50 redwood aspre(feed)[5749]: RECLASS:
> > > 20020419032353.835 TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:25:51 redwood aspre(feed)[5749]: RECLASS: 20020419032455.332
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:26:57 redwood aspre(feed)[5749]: RECLASS: 20020419032600.645
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:28:07 redwood aspre(feed)[5749]: RECLASS: 20020419032710.578
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:29:13 redwood aspre(feed)[5749]: RECLASS: 20020419032817.164
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:30:29 redwood aspre(feed)[5749]: RECLASS: 20020419032933.031
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:30:57 redwood aspre[7597]: Connection from aspre.gso.uri.edu
> > > Apr 19 04:30:57 redwood aspre[7597]: Connection reset by peer
> > > Apr 19 04:30:57 redwood aspre[7597]: Exiting
> > > Apr 19 04:31:57 redwood aspre(feed)[5749]: h_clnt_call: aspre.gso.uri.edu:
> > > BLKDATA: time elapsed  22.674846
> > > Apr 19 04:32:07 redwood aspre(feed)[5749]: RECLASS: 20020419033111.112
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:33:14 redwood aspre(feed)[5749]: RECLASS: 20020419033218.555
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:34:28 redwood aspre(feed)[5749]: RECLASS: 20020419033331.960
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:35:15 redwood wr(feed)[26367]: h_clnt_call: wr.psi.ncc.edu:
> > > BLKDATA: time elapsed  21.547277
> > > Apr 19 04:35:40 redwood aspre(feed)[5749]: RECLASS: 20020419033443.938
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:36:52 redwood aspre(feed)[5749]: RECLASS: 20020419033556.206
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:37:54 redwood aspre(feed)[5749]: RECLASS: 20020419033657.879
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:38:57 redwood aspre(feed)[5749]: RECLASS: 20020419033800.965
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:40:24 redwood aspre(feed)[5749]: RECLASS: 20020419033927.848
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:40:59 redwood aspre[7831]: Connection from aspre.gso.uri.edu
> > > Apr 19 04:40:59 redwood aspre[7831]: Connection reset by peer
> > > Apr 19 04:40:59 redwood aspre[7831]: Exiting
> > > Apr 19 04:41:32 redwood aspre(feed)[5749]: RECLASS: 20020419034035.825
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:42:41 redwood aspre(feed)[5749]: RECLASS: 20020419034145.432
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:43:52 redwood aspre(feed)[5749]: RECLASS: 20020419034255.858
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:44:57 redwood aspre(feed)[5749]: RECLASS: 20020419034400.718
> > > TS_ENDT {{WMO,  ".*"}}
> > > Apr 19 04:46:38 redwood aspre(feed)[5749]: SAAU31 YBBN 190300: RPC: Timed
> > > out (5)
> > > Apr 19 04:46:38 redwood aspre(feed)[5749]: pq_sequence failed: I/O error
> > > (errno = 5)
> > > Apr 19 04:46:38 redwood aspre(feed)[5749]: Exiting
> > > Apr 19 04:46:45 redwood rpc.ldmd[21539]: child 5749 exited with status 1
> > > Apr 19 04:47:09 redwood aspre[8031]: Connection from aspre.gso.uri.edu
> > > Apr 19 04:47:09 redwood aspre(feed)[8031]: Starting Up: 20020419034643.989
> > > TS_ENDT {{WMO,  ".*"}}
> > >
> > >
> > >
> > > ------- End of Forwarded Message
> > >
> > >
> >
> >
> 
>