
[CONDUIT #GSZ-336115]: Large conduit lags have reappeared



Hi Carissa,

re:
> Our Boulder data center is offline right now, that is why woc is not
> responding.

OK.

re:
> It should be back online by Friday.

Thanks.

re:
> Let me know if you need any assistance with these latency issues.

Something is definitely wrong.  The problem is we don't know what that
might be.

Observations:

- the machine we operate to REQUEST CONDUIT from both conduit.ncep.noaa.gov
  and ncepldm4.woc.noaa.gov is basically idling (meaning that its load average
  is VERY low, like 0.05 for the 5-minute load number reported by 'top').

  So, any latencies we experience on this machine (daffy.unidata.ucar.edu) are
  not likely to be caused by the machine itself.  The latencies that daffy is
  seeing from conduit.ncep.noaa.gov parallel those being experienced at Penn State
  and UW/AOS.  Here is the URL for daffy's CONDUIT latency graph:

http://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu

  Here is the same kind of plot for the UW/AOS machine:

http://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd.aos.wisc.edu

  I would list Penn State's machine, but Art stopped REQUESTing/relaying the
  0.25 degree GFS data yesterday.

- this morning, I changed the 5-way split of REQUEST lines for CONDUIT data to a
  10-way split to see if this would result in reduced latencies seen on daffy.

  The verdict is not in on this test, but there seems to be some reduction in
  the latencies being experienced.  If this trend continues, it might mean that
  there is some sort of per-connection bandwidth limitation going on somewhere.
  
  The "classic" example of bandwidth limitation is when there is "packet 
shaping"
  going on.

  Another classic example is when there is simply not enough bandwidth to
  handle the volume of data being relayed.  (A quick sketch contrasting these
  two scenarios follows this set of observations.)

- A review of the traffic flowing out of the conduit.ncep.noaa.gov cluster
  and through any/all switches seems to be in order at this point.  I say
  this because our monitoring of the bandwidth flowing out of our top level
  IDD relay cluster, idd.unidata.ucar.edu, showed us that the volumes would
  plateau on some of the real server backends, and this plateauing effect
  was not a function of a maximum in the data that was being sent.  We learned
  that some of our relay cluster backend machines were connected to UCAR
  switches that had a 1 Gbps maximum, and there were other machines that
  were also using lots of bandwidth on those switches.  We also learned
  that the volume of data that both our cluster front end "accumulator"
  machines and our backend real servers were sending was greater than 1 Gbps
  for substantial portions of the day.  We worked with the network folks to
  move our machines so as to lessen the impact of other machines, and we
  bonded two 1 Gbps Ethernet interfaces together on each machine so that up
  to 2 Gbps could be sent from each one.
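
A concrete (if oversimplified) way to see the difference between those two
"classic" limitations is the little Python sketch below.  All of the numbers
in it are made up for illustration -- none of them are measurements from
conduit.ncep.noaa.gov or daffy -- but it shows why going from a 5-way to a
10-way REQUEST split helps when the limit is per connection, and why it
changes nothing when the aggregate path is saturated:

# Back-of-the-envelope model of the two "classic" bandwidth limitations.
# Every number here is hypothetical; they only illustrate the reasoning.

def delivered_rate(feed_rate_mbps, n_connections,
                   per_conn_cap_mbps=None, aggregate_cap_mbps=None):
    """Rate (Mbps) actually delivered to one downstream.

    feed_rate_mbps     -- rate at which CONDUIT products are being sent
    n_connections      -- how many REQUEST lines the feed is split across
    per_conn_cap_mbps  -- cap imposed on each TCP connection (packet shaping)
    aggregate_cap_mbps -- cap imposed on the whole path (e.g., a 1 Gbps link)
    """
    per_conn = feed_rate_mbps / n_connections
    if per_conn_cap_mbps is not None:
        per_conn = min(per_conn, per_conn_cap_mbps)
    total = per_conn * n_connections
    if aggregate_cap_mbps is not None:
        total = min(total, aggregate_cap_mbps)
    return total

FEED = 400.0  # hypothetical CONDUIT burst rate, Mbps

# Case 1: per-connection shaping at a hypothetical 50 Mbps.  A 5-way split
# tops out at 250 Mbps (products back up and latency grows); a 10-way split
# carries the full 400 Mbps, so latencies recover.
for n in (5, 10):
    print("shaped:    %2d-way split delivers" % n,
          delivered_rate(FEED, n, per_conn_cap_mbps=50.0), "Mbps")

# Case 2: the aggregate path is the bottleneck (say only 300 Mbps of a
# 1 Gbps link is left for CONDUIT).  Splitting the REQUESTs further changes
# nothing; only more capacity or less competing traffic helps.
for n in (5, 10):
    print("saturated: %2d-way split delivers" % n,
          delivered_rate(FEED, n, aggregate_cap_mbps=300.0), "Mbps")

So, if the 10-way split keeps daffy's latencies down, per-connection behavior
(shaping or something like it) is the more likely culprit; if it does not, the
aggregate capacity somewhere between the CONDUIT machines and the outside
world is the thing to look at.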

Questions about the NCEP CONDUIT top level setup:

- how many downstream machines are currently REQUESTing CONDUIT?

  And, is it possible that the sum of the volume of data attempting to be
  relayed exceeds 1 Gbps (or whatever your Ethernet interface and network
  support)?  If yes, this is a possible source of the latency problem.  (A
  rough back-of-the-envelope check is sketched after these questions.)

- how much other traffic is on the NCEP network where conduit.ncep.noaa.gov
  operates?

- what is the network capacity for the network through which the CONDUIT
  data is flowing?

- the real-time stats plots suggest that the sources of the CONDUIT data
  are virtual machines.  For example:

  vm-lnx-conduit2.ncep.noaa.gov

  Are these machines really VMs?  If yes, is it possible that there is some
  sort of limitation in the VM networking?
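
As a back-of-the-envelope illustration of that first question, here is a
short Python sketch.  The numbers in it are placeholders only, to be replaced
with NCEP's actual per-feed volume, downstream count, and interface speed:

# Does the sum of what all downstreams are REQUESTing exceed what the
# server's interface (or the network behind it) can push?  All values
# below are hypothetical placeholders, not measurements.

feed_peak_mbps = 400.0   # hypothetical peak CONDUIT rate for one full feed
n_downstreams  = 4       # hypothetical number of sites REQUESTing CONDUIT
interface_gbps = 1.0     # a single (unbonded) Gigabit Ethernet interface

required_mbps  = feed_peak_mbps * n_downstreams
available_mbps = interface_gbps * 1000.0

print("aggregate egress needed: %.0f Mbps" % required_mbps)
print("interface capacity:      %.0f Mbps" % available_mbps)
if required_mbps > available_mbps:
    print("-> the interface alone could be the bottleneck; bonding")
    print("   interfaces or adding real servers would be needed")
else:
    print("-> the interface is not the limit; look at switches, shaping,")
    print("   VM virtual networking, or competing traffic instead")

With the placeholder numbers above, four downstreams each pulling a 400 Mbps
burst already need 1.6 Gbps of egress, which a single 1 Gbps interface cannot
supply -- exactly the kind of plateau we saw on our own relay cluster.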

I am sure that there are other questions that should be posed at this point, but
I think that it is important for the above to get through soon so that someone
on your side (Data Flow team?) can start thinking about potential bottlenecks
there.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: GSZ-336115
Department: Support CONDUIT
Priority: Normal
Status: Closed