
RE: Top level CONDUIT relay



Hey All,

 I've got lots of other stuff going on here at the moment, so I've been
sitting on the sidelines with the LDM at Illinois, figuring that if I start
experimenting too, it'll only throw off the metrics on what you're testing.
I'll be happy to change or try anything needed, though, or even give Steve
access to do so.

 Incidentally, flood.atmos.uiuc.edu is currently connected to ncepldm, which
is != ldm1 as mentioned below (unless the same machine is answering on two
different IPs).

--------------
David Wojtowicz, Sr. Research Programmer
IT Coordinator, SESE
University of Illinois at Urbana-Champaign
address@hidden   (217) 333-8390


-----Original Message-----
From: Steve Chiswell [mailto:address@hidden] 
Sent: Wednesday, June 20, 2007 12:16 PM
To: Justin Cooke
Cc: address@hidden; address@hidden; address@hidden; Chi Y
Kang; Pete Pokrandt; Paula Freeman
Subject: Re: Top level CONDUIT relay

Justin,

The current feeds should be

Illinois connected to ncepldm=ldm1
Wisconsin connected to ldm2
NSF has a primary request to ldm2 and an alternate request to ldm1

I believe that all of those hosts use a 5-way split in their requests for
data (e.g., each request line asks for 20% of the data).
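
For reference, a 5-way split is typically done by pattern-matching the
sequence number at the end of each CONDUIT product ID, along these lines
(a sketch only; the host name here is just an example):

request CONDUIT "[05]$" ldm1.woc.noaa.gov
request CONDUIT "[16]$" ldm1.woc.noaa.gov
request CONDUIT "[27]$" ldm1.woc.noaa.gov
request CONDUIT "[38]$" ldm1.woc.noaa.gov
request CONDUIT "[49]$" ldm1.woc.noaa.gov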

Are you able to use "netstat" to view the number of connections?
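
(For example, since the LDM listens on TCP port 388, something like the
following will show the established feed connections; the exact flags
vary by OS:

netstat -an | grep 388
)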

Steve



On Wed, 2007-06-20 at 13:10 -0400, Justin Cooke wrote:
> Steve,
> 
> That's great that you're able to see our stats.
> 
> I'm on a conference call right now with Chi and persons from NCEP, and the
> question came up: how many request feeds do you have to our LDM server?
> 
> Justin
> 
> On Jun 20, 2007, at 12:54 PM, Steve Chiswell wrote:
> 
> > Justin,
> >
> > I am receiving the stats from node6:
> > Latency:
> > http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+node6.woc.noaa.gov
> > Volume:
> > http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+node6.woc.noaa.gov
> >
> > The latency there to ldm1 is climbing on the initial connection, which
> > will start off by catching up on the last hour's worth of data in the
> > upstream queue. After that, we can see what the latency is doing.
> >
> > Steve
> >
> > On Wed, 2007-06-20 at 12:43 -0400, Justin Cooke wrote:
> >> Steve and Chi,
> >>
> >> I tried to ping rtstats.unidata.ucar.edu but was unable to.
> >>
> >> Chi, would you be able to set up a static route from node6 to
> >> rtstats.unidata.ucar.edu like Steve mentions?
> >>
> >> I actually am unable to connect to ncepldm.woc.noaa.gov either.
> >> However, I did set up a feed to "ldm1" and am receiving CONDUIT data
> >> currently.
> >>
> >> Steve, how tough would it be to do the pqact step you mention and to
> >> get the stats reports from those if Chi is unable to get the static
> >> route going?
> >>
> >> Thanks for all the help,
> >>
> >> Justin
> >>
> >> On Jun 20, 2007, at 12:16 PM, Steve Chiswell wrote:
> >>
> >>> Justin,
> >>>
> >>> Is that box capable of sending stats to our rtstats.unidata.ucar.edu
> >>> host?
> >>> E.g., is it allowed to connect outside your domain?
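> >>>
> >>> (One quick check, assuming the LDM is already installed on that box,
> >>> is the LDM's own ldmping utility, which tests whether an LDM server
> >>> can be reached over port 388:
> >>>
> >>> ldmping rtstats.unidata.ucar.edu
> >>> )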
> >>>
> >>> The ldm won't need to run pqact to test out the throughput and
> >>> network, but will need these ldmd.conf lines:
> >>>
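> >>> # send a small stats report to Unidata about once a minute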
> >>> EXEC    "rtstats -h rtstats.unidata.ucar.edu"
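> >>> # request the full CONDUIT stream from the upstream relay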
> >>> request CONDUIT ".*" ncepldm.woc.noaa.gov
> >>>
> >>> The pqact EXEC action can be commented out. The request line will
> >>> start the feed from ncepldm, which flood.atmos.uiuc.edu is pointing
> >>> to and showing high latency on. If you are able to feed from ncepldm
> >>> without the latency that outside hosts are showing, that would
> >>> isolate the problem further, to the border between your network and
> >>> the outside. If you do show similar latency, then it would be either
> >>> the LDM configuration itself or the local router that the machines
> >>> are on.
> >>>
> >>> If you are able to send rtstats out to us, then we can monitor stats
> >>> on our web pages. Your network might require a static route to be
> >>> added for sending that outside your domain (that would be something
> >>> your networking folks would know). rtstats sends a small text report
> >>> about every 60 seconds, so it is not a lot of traffic.
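> >>>
> >>> (On a Linux host that would be something along the lines of
> >>> "/sbin/route add -host <rtstats-ip> gw <border-router-ip>", with the
> >>> addresses filled in by your networking folks; the exact command
> >>> depends on the OS and routing setup.)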
> >>>
> >>> If you can't configure your host to send rtstats, then we could
> >>> create a pqact.conf action to file the .status reports and calculate
> >>> the latency from those.
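> >>>
> >>> (Something along these lines would do it; an untested sketch, and
> >>> note that pqact actions must be indented with real tabs:
> >>>
> >>> CONDUIT	^\.status
> >>> 	FILE	data/conduit.status
> >>> )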
> >>>
> >>> Thanks,
> >>>
> >>> Steve
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, 2007-06-20 at 12:03 -0400, Justin Cooke wrote:
> >>>> Steve,
> >>>>
> >>>> If you provide us a pqact.conf, I can have the box Chi set up feed
> >>>> off of ldm1 and see how its latencies are.
> >>>>
> >>>> Justin
> >>>> On Jun 20, 2007, at 11:36 AM, Steve Chiswell wrote:
> >>>>
> >>>>> Justin,
> >>>>>
> >>>>> Since the change at 13Z dropping daffy.unidata.ucar.edu out of the
> >>>>> top level nodes, the ldm2 feed to NSF is showing little/no latency
> >>>>> at all. The ldm1 feed to NSF, which is connected using the
> >>>>> alternate LDM mode, is only delivering the .status messages it
> >>>>> creates, as all the other products are duplicates of products
> >>>>> already being received from ldm2, and it is showing high latency:
> >>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>
> >>>>> This configuration is getting data out to the community at the
> >>>>> moment. The downside here is that it puts a single point of failure
> >>>>> at NSF in getting the data to Unidata, but I'll monitor that end.
> >>>>>
> >>>>> It seems that ldm1 is either slow or showing network limitations
> >>>>> (since flood.atmos.uiuc.edu is feeding from ncepldm, which is
> >>>>> apparently pointing to ldm1, there is load on ldm1 besides the NSF
> >>>>> feed). ldm2 is feeding both NSF and idd.aos.wisc.edu (and Wisc
> >>>>> looks good since 13Z as well), so it is able to handle the
> >>>>> throughput to 2 downstreams, but adding daffy as the 3rd seems to
> >>>>> cross some threshold in the volume of what can be sent out.
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>> On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:
> >>>>>> Thanks Steve,
> >>>>>>
> >>>>>> Chi has set up a box on the LAN for us to run LDM on; I am
> >>>>>> beginning to get things running on there.
> >>>>>>
> >>>>>> Have you seen any improvement since dropping daffy?
> >>>>>>
> >>>>>> Justin
> >>>>>>
> >>>>>> On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:
> >>>>>>
> >>>>>>> Justin,
> >>>>>>>
> >>>>>>> Yes, this does appear to be the case. I will drop daffy from
> >>>>>>> feeding
> >>>>>>> directly and instead move it to feed from NSF. That will remove  
> >>>>>>> one
> >>>>>>> of the top level relays of data having to go out of NCEP and
> >>>>>>> we can see if the other nodes show an improvement.
> >>>>>>>
> >>>>>>> Steve
> >>>>>>>
> >>>>>>> On Wed, 20 Jun 2007, Justin Cooke wrote:
> >>>>>>>
> >>>>>>>> Steve,
> >>>>>>>>
> >>>>>>>> Did you see a slowdown to ldm2 after Pete and the other sites
> >>>>>>>> began
> >>>>>>>> making connections?
> >>>>>>>>
> >>>>>>>> Chi, considering Steve saw a good connection to ldm1 before the
> >>>>>>>> other sites connected, doesn't that point toward a network
> >>>>>>>> issue?
> >>>>>>>>
> >>>>>>>> All of our queue processing on the diskserver has been running
> >>>>>>>> without any problems, so I don't believe anything on that system
> >>>>>>>> would be impacting ldm1/ldm2.
> >>>>>>>>
> >>>>>>>> Justin
> >>>>>>>>
> >>>>>>>> On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:
> >>>>>>>>
> >>>>>>>>> I set up the test LDM server for the NCEP folks to test the
> >>>>>>>>> local pull from the LDM servers. That should give us some
> >>>>>>>>> information on whether it is a network or a system related
> >>>>>>>>> issue. We'll handle that tomorrow. I am a little bit concerned
> >>>>>>>>> that the slowdown occurred at the same time as the ldm1 crash
> >>>>>>>>> last week.
> >>>>>>>>>
> >>>>>>>>> Also, can NCEP check if there are any bad dbnet queues on the
> >>>>>>>>> backend servers? Just to verify.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>> Thanks Justin,
> >>>>>>>>>> I also had a typo in my message:
> >>>>>>>>>> ldm1 is running slower than ldm2.
> >>>>>>>>>> Now, if the feed to ldm2 all of a sudden slows down when Pete
> >>>>>>>>>> and other sites add a request to it, it would really signal
> >>>>>>>>>> some sort of total bandwidth limitation on the I2 connection.
> >>>>>>>>>> It seemed a little coincidental that we had a short period of
> >>>>>>>>>> good connectivity to ldm1, after which it slowed way down.
> >>>>>>>>>> Steve
> >>>>>>>>>> On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:
> >>>>>>>>>>> I just realized the issue. When I disabled the "pqact"  
> >>>>>>>>>>> process
> >>>>>>>>>>> on
> >>>>>>>>>>> ldm2 earlier today it caused our monitor script (in cron,
> >>>>>>>>>>> every 5
> >>>>>>>>>>> min) to kill the LDM and restart it. I have removed the check
> >>>>>>>>>>> for
> >>>>>>>>>>> the pqact in that monitor...things should be a bit better  
> >>>>>>>>>>> now.
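> >>>>>>>>>>>
> >>>>>>>>>>> (A monitor that keys on "ldmadmin check" rather than on
> >>>>>>>>>>> individual processes would avoid that failure mode; a minimal
> >>>>>>>>>>> crontab sketch, assuming a standard ~ldm install:
> >>>>>>>>>>>
> >>>>>>>>>>> */5 * * * * bin/ldmadmin check > /dev/null 2>&1
> >>>>>>>>>>> )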
> >>>>>>>>>>>
> >>>>>>>>>>> Chi.Y.Kang wrote:
> >>>>>>>>>>>> Huh, I thought you guys were on the system. Let me take a
> >>>>>>>>>>>> look on ldm2 and see what is going on.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justin Cooke wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Chi.Y.Kang wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Pete and David,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I changed the CONDUIT request lines at NSF and Unidata to
> >>>>>>>>>>>>>>> request data
> >>>>>>>>>>>>>>> from ldm1.woc.noaa.gov rather than ncepldm.woc.noaa.gov
> >>>>>>>>>>>>>>> after
> >>>>>>>>>>>>>>> seeing
> >>>>>>>>>>>>>>> lots of
> >>>>>>>>>>>>>>> disconnect/reconnects to the ncepldm virtual name.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The LDM appears to have caught up here as an interim
> >>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Still don't know the cause of the problem.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Steve
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I know NCEP was stopping and starting the LDM service on
> >>>>>>>>>>>>>> the ldm2 box, where the VIP address is pointed at this
> >>>>>>>>>>>>>> time. How is the current connection to ldm1? Is the speed
> >>>>>>>>>>>>>> of the CONDUIT feed acceptable?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Chi, NCEP has not restarted the LDM on ldm2 at all today.  
> >>>>>>>>>>>>> But
> >>>>>>>>>>>>> looking
> >>>>>>>>>>>>> at the logs it appears to be dying and getting restarted by
> >>>>>>>>>>>>> cron.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I will watch and see if I see anything.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Chi Y. Kang
> >>>>>>>>> Contractor
> >>>>>>>>> Principal Engineer
> >>>>>>>>> Phone: 301-713-3333 x201
> >>>>>>>>> Cell: 240-338-1059
> >>>>>>>>
> >>>>> -- 
> >>>>> Steve Chiswell <address@hidden>
> >>>>> Unidata
> >>> -- 
> >>> Steve Chiswell <address@hidden>
> >>> Unidata
> > -- 
> > Steve Chiswell <address@hidden>
> > Unidata
-- 
Steve Chiswell <address@hidden>
Unidata