
[CONDUIT #GSZ-336115]: Large conduit lags have reappeared



Hi Carissa,

re:
> Our Boulder data center is offline right now, that is why woc is not
> responding.

OK.

re:
> It should be back online by Friday.

Thanks.

re:
> Let me know if you need any assistance with these latency issues.

Something is definitely wrong.  The problem is we don't know what that
might be.

Observations:

- the machine we operate to REQUEST CONDUIT from both conduit.ncep.noaa.gov
  and ncepldm4.woc.noaa.gov is basically idling (meaning that its load average
  is VERY low, like 0.05 for the 5-minute load number reported by 'top').

  So, any latencies we experience on this machine (daffy.unidata.ucar.edu) are
  not likely to be caused by the machine itself.  The latencies that daffy is
  seeing from conduit.ncep.noaa.gov parallel those being experienced at Penn State
  and UW/AOS.  Here is the URL for daffy's CONDUIT latency graph:

http://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu

  Here is the same kind of plot for the UW/AOS machine:

http://rtstats.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd.aos.wisc.edu

  I would list Penn State's machine, but Art stopped REQUESTing/relaying the
  0.25 degree GFS data yesterday.

- this morning, I changed the 5-way split of REQUEST lines for CONDUIT data to a
  10-way split to see if this would result in reduced latencies seen on daffy.

  The verdict is not in on this test, but there seems to be some reduction in
  the latencies being experienced.  If this trend continues, it might mean that
  there is some sort of per-connection bandwidth limitation going on somewhere.
  
  The "classic" example of bandwidth limitation is when there is "packet 
shaping"
  going on.

  Another classic example is when there is simply not enough bandwidth to
  handle the volume of data being relayed.  (A quick sketch contrasting these
  two scenarios follows this set of observations.)

- A review of the traffic flowing out of the conduit.ncep.noaa.gov cluster
  and through any/all switches seems to be in order at this point.  I say
  this because our monitoring of the bandwidth flowing out of our top level
  IDD relay cluster, idd.unidata.ucar.edu, showed us that the volumes would
  plateau on some of the real server backends, and this plateauing effect
  was not a function of a maximum in the data that was being sent.  We learned
  that some of our relay cluster backend machines were connected to UCAR
  switches that had a 1 Gbps maximum, and there were other machines that
  were also using lots of bandwidth on those switches.  We also learned
  that the volume of data that both our cluster front end "accumulator"
  machines and our backend real servers were sending was greater than 1 Gbps
  for substantial portions of the day.  We worked with the network folks to
  move our machines so as to lessen the impact of other machines, and we
  bonded two 1 Gbps Ethernet interfaces together on each machine so that up
  to 2 Gbps could be sent from each one.
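
A concrete (if oversimplified) way to see the difference between those two
"classic" limitations is the little Python sketch below.  All of the numbers
in it are made up for illustration -- none of them are measurements from
conduit.ncep.noaa.gov or daffy -- but it shows why going from a 5-way to a
10-way REQUEST split helps when the limit is per connection, and why it
changes nothing when the aggregate path is saturated:

# Back-of-the-envelope model of the two "classic" bandwidth limitations.
# Every number here is hypothetical; they only illustrate the reasoning.

def delivered_rate(feed_rate_mbps, n_connections,
                   per_conn_cap_mbps=None, aggregate_cap_mbps=None):
    """Rate (Mbps) actually delivered to one downstream.

    feed_rate_mbps     -- rate at which CONDUIT products are being sent
    n_connections      -- how many REQUEST lines the feed is split across
    per_conn_cap_mbps  -- cap imposed on each TCP connection (packet shaping)
    aggregate_cap_mbps -- cap imposed on the whole path (e.g., a 1 Gbps link)
    """
    per_conn = feed_rate_mbps / n_connections
    if per_conn_cap_mbps is not None:
        per_conn = min(per_conn, per_conn_cap_mbps)
    total = per_conn * n_connections
    if aggregate_cap_mbps is not None:
        total = min(total, aggregate_cap_mbps)
    return total

FEED = 400.0  # hypothetical CONDUIT burst rate, Mbps

# Case 1: per-connection shaping at a hypothetical 50 Mbps.  A 5-way split
# tops out at 250 Mbps (products back up and latency grows); a 10-way split
# carries the full 400 Mbps, so latencies recover.
for n in (5, 10):
    print("shaped:    %2d-way split delivers" % n,
          delivered_rate(FEED, n, per_conn_cap_mbps=50.0), "Mbps")

# Case 2: the aggregate path is the bottleneck (say only 300 Mbps of a
# 1 Gbps link is left for CONDUIT).  Splitting the REQUESTs further changes
# nothing; only more capacity or less competing traffic helps.
for n in (5, 10):
    print("saturated: %2d-way split delivers" % n,
          delivered_rate(FEED, n, aggregate_cap_mbps=300.0), "Mbps")

So, if the 10-way split keeps daffy's latencies down, per-connection behavior
(shaping or something like it) is the more likely culprit; if it does not, the
aggregate capacity somewhere between the CONDUIT machines and the outside
world is the thing to look at.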

Questions about the NCEP CONDUIT top level setup:

- how many downstream machines are currently REQUESTing CONDUIT?

  And, is it possible that the sum of the volume of data attempting to be
  relayed exceeds 1 Gbps (or whatever your Ethernet interface and network
  support)?  If yes, this is a possible source of the latency problem.  (A
  rough back-of-the-envelope check is sketched after these questions.)

- how much other traffic is on the NCEP network where conduit.ncep.noaa.gov
  operates?

- what is the network capacity for the network through which the CONDUIT
  data is flowing?

- the real-time stats plots suggest that the sources of the CONDUIT data
  are virtual machines.  For example:

  vm-lnx-conduit2.ncep.noaa.gov

  Are these machines really VMs?  If yes, is it possible that there is some
  sort of limitation in the VM networking?
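
As a back-of-the-envelope illustration of that first question, here is a
short Python sketch.  The numbers in it are placeholders only, to be replaced
with NCEP's actual per-feed volume, downstream count, and interface speed:

# Does the sum of what all downstreams are REQUESTing exceed what the
# server's interface (or the network behind it) can push?  All values
# below are hypothetical placeholders, not measurements.

feed_peak_mbps = 400.0   # hypothetical peak CONDUIT rate for one full feed
n_downstreams  = 4       # hypothetical number of sites REQUESTing CONDUIT
interface_gbps = 1.0     # a single (unbonded) Gigabit Ethernet interface

required_mbps  = feed_peak_mbps * n_downstreams
available_mbps = interface_gbps * 1000.0

print("aggregate egress needed: %.0f Mbps" % required_mbps)
print("interface capacity:      %.0f Mbps" % available_mbps)
if required_mbps > available_mbps:
    print("-> the interface alone could be the bottleneck; bonding")
    print("   interfaces or adding real servers would be needed")
else:
    print("-> the interface is not the limit; look at switches, shaping,")
    print("   VM virtual networking, or competing traffic instead")

With the placeholder numbers above, four downstreams each pulling a 400 Mbps
burst already need 1.6 Gbps of egress, which a single 1 Gbps interface cannot
supply -- exactly the kind of plateau we saw on our own relay cluster.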

I am sure that there are other questions that should be posed at this point, but
I think that it is important for the above to get through soon so that someone
on your side (Data Flow team?) can start thinking about potential bottlenecks
there.

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: GSZ-336115
Department: Support CONDUIT
Priority: Normal
Status: Closed