
Re: Top level CONDUIT relay



Justin,

The current feeds should be:

Illinois connected to ncepldm (= ldm1)
Wisconsin connected to ldm2
NSF has a primary request to ldm2 and an alternate request to ldm1

I believe that all of those hosts use a 5-way split in their data
request (i.e., each request line asks for 20% of the products).
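
For reference, that kind of 5-way split is just five request lines in
ldmd.conf, each matching on the final digit of the sequence number at
the end of the CONDUIT product ID so that each line pulls roughly 20%
of the products. Something like this (hostname illustrative):

request CONDUIT "[05]$" ldm2.woc.noaa.gov
request CONDUIT "[16]$" ldm2.woc.noaa.gov
request CONDUIT "[27]$" ldm2.woc.noaa.gov
request CONDUIT "[38]$" ldm2.woc.noaa.gov
request CONDUIT "[49]$" ldm2.woc.noaa.gov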

Are you able to use "netstat" to view the number of connections?
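
For example, something along these lines (assuming the LDM is on its
standard port, 388) would count the established feed connections:

netstat -tn | grep ':388 ' | grep ESTABLISHED | wc -l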

Steve



On Wed, 2007-06-20 at 13:10 -0400, Justin Cooke wrote:
> Steve,
> 
> That's great that you're able to see our stats.
> 
> I'm on a conference call right now with Chi and people from NCEP, and
> the question came up: how many request feeds do you have to our LDM
> server?
> 
> Justin
> 
> On Jun 20, 2007, at 12:54 PM, Steve Chiswell wrote:
> 
> > Justin,
> >
> > I am receiving the stats from node6:
> > Latency:
> > http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+node6.woc.noaa.gov
> > Volume:
> > http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+node6.woc.noaa.gov
> >
> > The latency there to ldm1 is climbing on the initial connection; it
> > will start off by catching up on the last hour's worth of data in the
> > upstream queue. After that, we can see what the latency is doing.
> >
> > Steve
> >
> > On Wed, 2007-06-20 at 12:43 -0400, Justin Cooke wrote:
> >> Steve and Chi,
> >>
> >> I tried to ping rtstats.unidata.ucar.edu but was unable to.
> >>
> >> Chi, would you be able to set up a static route from node6 to
> >> rtstats.unidata.ucar.edu like Steve mentions?
> >>
> >> I actually am unable to connect to ncepldm.woc.noaa.gov either.
> >> However, I did set up a feed to "ldm1" and am receiving CONDUIT data
> >> currently.
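> >>
> >> (A quick sanity check, if useful: something along the lines of
> >>
> >> notifyme -vl- -f CONDUIT -h localhost -o 3600
> >>
> >> will list the CONDUIT products as they arrive in the local queue;
> >> the host and the one-hour offset are illustrative.)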
> >>
> >> Steve, how tough would it be to do the pqact step you mention and to
> >> get the stats reports from those if Chi is unable to get the static
> >> route going?
> >>
> >> Thanks for all the help,
> >>
> >> Justin
> >>
> >> On Jun 20, 2007, at 12:16 PM, Steve Chiswell wrote:
> >>
> >>> Justin,
> >>>
> >>> Is that box capable of sending stats to our rtstats.unidata.ucar.edu
> >>> host? That is, is it allowed to connect outside your domain?
> >>>
> >>> The LDM won't need to run pqact to test out the throughput and
> >>> network, but it will need these ldmd.conf lines:
> >>>
> >>> EXEC    "rtstats -h rtstats.unidata.ucar.edu"
> >>> request CONDUIT ".*" ncepldm.woc.noaa.gov
> >>>
> >>> The pqact EXEC action can be commented out. The request line will
> >>> start the feed from ncepldm, which flood.atmos.uiuc.edu is pointing
> >>> to and showing high latency from. If you are able to feed from
> >>> ncepldm without the latency that outside hosts are showing, that
> >>> would isolate the problem to the border between your network and the
> >>> outside. If you do show similar latency, then it would be either the
> >>> LDM configuration itself or the local router that the machines are
> >>> on.
> >>>
> >>> If you are able to send rtstats out to us, then we can monitor the
> >>> stats on our web pages. Your network might require that a static
> >>> route be added for sending that traffic outside your domain (that
> >>> would be something your networking folks would know). rtstats sends
> >>> a small text report about every 60 seconds, so it is not a lot of
> >>> traffic.
> >>>
> >>> If you can't configure your host to send rtstats, then we could
> >>> create a pqact.conf action to file the .status reports and calculate
> >>> the latency from those.
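> >>>
> >>> That would be just a couple of lines in pqact.conf, along these
> >>> lines (untested sketch; the path is illustrative, and the fields
> >>> must be tab separated):
> >>>
> >>> CONDUIT	\.status
> >>> 	FILE	-close	data/conduit.status
> >>>
> >>> That appends every .status report to one file that we could parse
> >>> for the latencies.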
> >>>
> >>> Thanks,
> >>>
> >>> Steve
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, 2007-06-20 at 12:03 -0400, Justin Cooke wrote:
> >>>> Steve,
> >>>>
> >>>> If you provide us a pqact.conf, I can have the box Chi set up feed
> >>>> off of ldm1 and see how its latencies are.
> >>>>
> >>>> Justin
> >>>> On Jun 20, 2007, at 11:36 AM, Steve Chiswell wrote:
> >>>>
> >>>>> Justin,
> >>>>>
> >>>>> Since the 13Z change dropping daffy.unidata.ucar.edu out of the top
> >>>>> level nodes, the ldm2 feed to NSF is showing little or no latency.
> >>>>> The ldm1 feed to NSF, which is connected in alternate LDM mode, is
> >>>>> only delivering the .status messages it creates (all the other
> >>>>> products are duplicates of products already received from ldm2),
> >>>>> and that feed is showing high latency:
> >>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>
> >>>>> This configuration is getting data out to the community at the
> >>>>> moment. The downside is that it puts a single point of failure at
> >>>>> NSF in getting the data to Unidata, but I'll monitor that end.
> >>>>>
> >>>>> It seems that ldm1 is either slow or is hitting network
> >>>>> limitations: since flood.atmos.uiuc.edu is feeding from ncepldm,
> >>>>> which is apparently pointing to ldm1, there is load on ldm1 besides
> >>>>> the NSF feed. ldm2 is feeding both NSF and idd.aos.wisc.edu (and
> >>>>> Wisconsin looks good since 13Z as well), so it is able to handle
> >>>>> the throughput to 2 downstreams, but adding daffy as the 3rd seems
> >>>>> to cross some threshold in the volume that can be sent out.
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>> On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:
> >>>>>> Thanks Steve,
> >>>>>>
> >>>>>> Chi has set up a box on the LAN for us to run LDM on; I am
> >>>>>> beginning to get things running on there.
> >>>>>>
> >>>>>> Have you seen any improvement since dropping daffy?
> >>>>>>
> >>>>>> Justin
> >>>>>>
> >>>>>> On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:
> >>>>>>
> >>>>>>> Justin,
> >>>>>>>
> >>>>>>> Yes, this does appear to be the case. I will drop daffy from
> >>>>>>> feeding directly and instead move it to feed from NSF. That will
> >>>>>>> remove one of the top-level relays pulling data out of NCEP, and
> >>>>>>> we can see if the other nodes show an improvement.
> >>>>>>>
> >>>>>>> Steve
> >>>>>>>
> >>>>>>> On Wed, 20 Jun 2007, Justin Cooke wrote:
> >>>>>>>
> >>>>>>>> Steve,
> >>>>>>>>
> >>>>>>>> Did you see a slowdown to ldm2 after Pete and the other sites
> >>>>>>>> began
> >>>>>>>> making connections?
> >>>>>>>>
> >>>>>>>> Chi, considering Steve saw a good connection to ldm1 before the
> >>>>>>>> other sites connected, doesn't that point toward a network
> >>>>>>>> issue?
> >>>>>>>>
> >>>>>>>> All of our queue processing on the diskserver has been running
> >>>>>>>> without any problems, so I don't believe anything on that system
> >>>>>>>> would be impacting ldm1/ldm2.
> >>>>>>>>
> >>>>>>>> Justin
> >>>>>>>>
> >>>>>>>> On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:
> >>>>>>>>
> >>>>>>>>> I set up the test LDM server for the NCEP folks to test the
> >>>>>>>>> local pull from the LDM servers. That should tell us whether
> >>>>>>>>> this is a network or a system related issue. We'll handle that
> >>>>>>>>> tomorrow. I am a little bit concerned that the slowdown
> >>>>>>>>> occurred at the same time as the ldm1 crash last week.
> >>>>>>>>>
> >>>>>>>>> Also, can NCEP check whether there are any bad dbnet queues on
> >>>>>>>>> the backend servers? Just to verify.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>> Thanks Justin,
> >>>>>>>>>> I also had a typo in my message:
> >>>>>>>>>> ldm1 is running slower than ldm2.
> >>>>>>>>>> Now, if the feed to ldm2 all of a sudden slows down when Pete
> >>>>>>>>>> and other sites add a request to it, that would really signal
> >>>>>>>>>> some sort of total bandwidth limitation on the I2 connection.
> >>>>>>>>>> It seemed a little coincidental that we had a short period of
> >>>>>>>>>> good connectivity to ldm1, after which it slowed way down.
> >>>>>>>>>> Steve
> >>>>>>>>>> On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:
> >>>>>>>>>>> I just realized the issue. When I disabled the "pqact"
> >>>>>>>>>>> process on ldm2 earlier today, it caused our monitor script
> >>>>>>>>>>> (in cron, every 5 min) to kill the LDM and restart it. I have
> >>>>>>>>>>> removed the check for pqact in that monitor, so things should
> >>>>>>>>>>> be a bit better now.
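> >>>>>>>>>>>
> >>>>>>>>>>> (For what it's worth, a safer check is probably just whether
> >>>>>>>>>>> the server itself is up, e.g. something along the lines of
> >>>>>>>>>>>
> >>>>>>>>>>> ldmadmin isrunning || ldmadmin start
> >>>>>>>>>>>
> >>>>>>>>>>> in the cron job, rather than keying on a pqact process.)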
> >>>>>>>>>>>
> >>>>>>>>>>> Chi.Y.Kang wrote:
> >>>>>>>>>>>> Huh, I thought you guys were on the system. Let me take a
> >>>>>>>>>>>> look at ldm2 and see what is going on.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justin Cooke wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Chi.Y.Kang wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Pete and David,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I changed the CONDUIT request lines at NSF and Unidata to
> >>>>>>>>>>>>>>> request data from ldm1.woc.noaa.gov rather than
> >>>>>>>>>>>>>>> ncepldm.woc.noaa.gov after seeing lots of
> >>>>>>>>>>>>>>> disconnects/reconnects to the ncepldm virtual name.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The LDM appears to have caught up here as an interim
> >>>>>>>>>>>>>>> solution.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Still don't know the cause of the problem.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Steve
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I know NCEP was stopping and starting the LDM service on
> >>>>>>>>>>>>>> the ldm2 box, where the VIP address is pointed at this
> >>>>>>>>>>>>>> time. How is the current connection to ldm1? Is the speed
> >>>>>>>>>>>>>> of the CONDUIT feed acceptable?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> Chi, NCEP has not restarted the LDM on ldm2 at all today.  
> >>>>>>>>>>>>> But
> >>>>>>>>>>>>> looking
> >>>>>>>>>>>>> at the logs it appears to be dying and getting restarted by
> >>>>>>>>>>>>> cron.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I will watch and see if I see anything.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Chi Y. Kang
> >>>>>>>>> Contractor
> >>>>>>>>> Principal Engineer
> >>>>>>>>> Phone: 301-713-3333 x201
> >>>>>>>>> Cell: 240-338-1059
> >>>>>>>>
> >>>>> -- 
> >>>>> Steve Chiswell <address@hidden>
> >>>>> Unidata
> >>> -- 
> >>> Steve Chiswell <address@hidden>
> >>> Unidata
> > -- 
> > Steve Chiswell <address@hidden>
> > Unidata
-- 
Steve Chiswell <address@hidden>
Unidata