
Re: Top level CONDUIT relay



Steve,

Here is a netstat on ldm1:

tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:60281   ESTABLISHED
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:60283   ESTABLISHED
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:60287   ESTABLISHED
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:60276   ESTABLISHED
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:60278   ESTABLISHED
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:53622   TIME_WAIT
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:58640    ESTABLISHED
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:58637    ESTABLISHED
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:58636    ESTABLISHED
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:58639    ESTABLISHED
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:58638    ESTABLISHED
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:61685    TIME_WAIT
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:61681    TIME_WAIT
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:61682    TIME_WAIT
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:61679    TIME_WAIT
tcp   0   0   ldm1.woc.noaa.gov:388     atm.cise-nsf.gov:61678    TIME_WAIT
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:48265   TIME_WAIT
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:48264   TIME_WAIT
tcp   0   0   ncepldm.woc.noaa.go:388   flood.atmos.uiuc.:48266   TIME_WAIT
tcp   0   0   ldm1.local:388            node6.local:34593         ESTABLISHED

There you can see the connections from atm and flood.

ldm2 only has atm:
tcp   0   0       ldm2.woc.noaa.gov:388   atm.cise-nsf.gov:61662   ESTABLISHED
tcp   0   0       ldm2.woc.noaa.gov:388   atm.cise-nsf.gov:61687   ESTABLISHED
tcp   0   0       ldm2.woc.noaa.gov:388   atm.cise-nsf.gov:61690   TIME_WAIT
tcp   0   0       ldm2.woc.noaa.gov:388   atm.cise-nsf.gov:61691   ESTABLISHED
tcp   0   74132   ldm2.woc.noaa.gov:388   atm.cise-nsf.gov:61689   ESTABLISHED
tcp   0   0       ldm2.woc.noaa.gov:388   atm.cise-nsf.gov:61664   ESTABLISHED
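
For reference, output like the above can be generated by filtering netstat
on the LDM port, 388; the exact invocation below is an assumption, not
necessarily the command that was run:

netstat -t | grep ':388'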

Is there a way for you to check whether flood is still receiving data from ldm1? Is atm still using ldm2 as its primary feed?

Thanks,

Justin


Chi.Y.Kang wrote:
Justin Cooke wrote:
Chi,

Was the change made to both ldm1 and ldm2?

Yes.


Justin

Chi.Y.Kang wrote:
Yes, I made the change to the LDM servers to test the shared memory
configuration.

# SHMMAX: maximum shared memory segment size, 4 GB
kernel.shmmax = 4294967296
# SHMMNI: max number of segments; SHMALL is counted in pages (getconf PAGE_SIZE)
kernel.shmmni = 4096
kernel.shmall = 2097152
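
Assuming these settings live in /etc/sysctl.conf (where such changes are
typically made), they can be applied and then verified with:

sysctl -p    # reload /etc/sysctl.conf
ipcs -l      # show the kernel's current shared memory limits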

However, this doesn't explain the performance relief, because LDM doesn't
seem to be using shared memory, or at least none is listed in the table.
Mr. Cano thought LDM might be using it.

ldm1:~$ ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       600        3976       4          dest

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

------ Message Queues --------
key        msqid      owner      perms      used-bytes  messages
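
If the question is whether LDM maps its product queue rather than using
SysV shared memory, one way to check is to look for the queue file in the
server process's memory mappings. Both the process name (rpc.ldmd) and the
queue file name (ldm.pq) below are conventional LDM 6 defaults, assumed
here rather than taken from this system:

pmap -x $(pgrep rpc.ldmd | head -1) | grep ldm.pq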
Justin Cooke wrote:
Chi,

Has anything at all changed on ldm1 since yesterday? Starting at 04Z the
feed on node6 improved dramatically, and all other subscribers to ldm1
also noticed improved performance.

Justin

Steve Chiswell wrote:
Justin,

I noticed that the feeds from ldm1 dropped as you said. Do you know if
anything changed related to that machine?

I can add daffy back to ldm1 and see if things maintain their performance,
but I will wait to find out whether any changes were made. Since ldm2 is
still lagging, it seems this is not a network-wide issue.

Steve

On Thu, 21 Jun 2007, Justin Cooke wrote:

Steve,

Looking at the graphs, it appears that transfers improved greatly after
04Z today. I did a netstat on ldm1 and I still see atm and flood
subscribed to it, same as yesterday.

However, looking at the latency graphs you provide, those subscribing to
ldm2 are still seeing delays.

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov



Justin

Steve Chiswell wrote:
Justin,

I am receiving the stats from node6:
Latency:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+node6.woc.noaa.gov


Volume:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+node6.woc.noaa.gov



The latency there to ldm1 is climbing on the initial connection; it will
start off by catching up on the last hour's worth of data in the upstream
queue. After that, we can see what the latency is doing.

Steve

On Wed, 2007-06-20 at 12:43 -0400, Justin Cooke wrote:

Steve and Chi,

I tried to ping rtstats.unidata.ucar.edu but was unable to.

Chi, would you be able to set up a static route from node6 to
rtstats.unidata.ucar.edu like Steve mentions?

I actually am unable to connect to ncepldm.woc.noaa.gov either. However,
I did set up a feed to "ldm1" and am receiving CONDUIT data currently.

Steve, how tough would it be to do the pqact step you mention, and to get
the stats reports from those, if Chi is unable to get the static route
going?

Thanks for all the help,

Justin

On Jun 20, 2007, at 12:16 PM, Steve Chiswell wrote:


Justin,

Is that box capable of sending stats to our rtstats.unidata.ucar.edu host?
E.g., is it allowed to connect outside your domain?

The LDM won't need to run pqact to test out the throughput and network,
but it will need these ldmd.conf lines:

EXEC    "rtstats -h rtstats.unidata.ucar.edu"
request CONDUIT ".*" ncepldm.woc.noaa.gov
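
After editing ldmd.conf to add these, the LDM would need to be restarted
to pick up the new lines; assuming a standard installation, that is:

ldmadmin restart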

The pqact EXEC action can be commented out. The request line will start
the feed from ncepldm, which flood.atmos.uiuc.edu is pointing to and which
is showing high latency. If you are able to feed from ncepldm without the
latency that outside hosts are showing, then it would isolate the problem
further to the border between your network and the outside. If you do show
similar latency, then it would be either the LDM configuration itself or
the local router that the machines are on.

If you are able to send rtstats out to us, then we can monitor the stats
on our web pages. Your network might require that a static route be added
for sending that outside your domain (that would be something your
networking folks would know). rtstats sends a small text report about
every 60 seconds, so it is not a lot of traffic.

If you can't configure your host to send rtstats, then we could create a
pqact.conf action to file the .status reports and calculate the latency
from those.
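
A minimal sketch of such an entry, assuming pqact.conf's tab-separated
format; the output path is illustrative, not from an actual configuration:

CONDUIT	\.status
	FILE	data/conduit.status

(The field separators and the leading indent on the action line must be
tabs.) This would append each CONDUIT .status product to
data/conduit.status, from which the latencies could be computed.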

Thanks,

Steve




On Wed, 2007-06-20 at 12:03 -0400, Justin Cooke wrote:

Steve,

If you provide us a pqact.conf, I can have the box Chi set up feed off of
ldm1 and see how its latencies are.

Justin
On Jun 20, 2007, at 11:36 AM, Steve Chiswell wrote:


Justin,

Since the change at 13Z dropping daffy.unidata.ucar.edu out of the top
level nodes, the ldm2 feed to NSF is showing little/no latency at all. The
ldm1 feed to NSF, which is connected using the alternate LDM mode, is only
delivering the .status messages it creates, as all the other products are
duplicates of products already being received from ldm2, and that feed is
showing high latency:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
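
For context, and assuming LDM 6 semantics: a redundant feed like this is
typically configured with two request lines for the same feed, e.g.

request CONDUIT ".*" ldm1.woc.noaa.gov
request CONDUIT ".*" ldm2.woc.noaa.gov

The downstream LDM then negotiates primary versus alternate mode between
the two upstreams, and duplicate products arriving on the alternate
connection are discarded. The host names here are illustrative, not taken
from an actual ldmd.conf.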

This configuration is getting data out to the community at the moment. The
downside here is that it puts a single point of failure at NSF in getting
the data to Unidata, but I'll monitor that end.

It seems that ldm1 is either slow or showing network limitations (since
flood.atmos.uiuc.edu is feeding from ncepldm, which is apparently pointing
to ldm1, there is load on ldm1 besides the NSF feed). ldm2 is feeding both
NSF and idd.aos.wisc.edu (and Wisc looks good since 13Z as well), so it is
able to handle the throughput to two downstreams, but adding daffy as a
third seems to cross some threshold in the volume that can be sent out.

Steve

On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:

Thanks Steve,

Chi has set up a box on the LAN for us to run LDM on, and I am beginning
to get things running on there.

Have you seen any improvement since dropping daffy?

Justin

On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:


Justin,

Yes, this does appear to be the case. I will drop daffy from feeding
directly and instead move it to feed from NSF. That will remove one of the
top level relays of data having to go out of NCEP, and we can see if the
other nodes show an improvement.

Steve

On Wed, 20 Jun 2007, Justin Cooke wrote:


Steve,

Did you see a slowdown to ldm2 after Pete and the other sites began making
connections?

Chi, considering Steve saw a good connection to ldm1 before the other
sites connected, doesn't that point toward a network issue?

All of our queue processing on the diskserver has been running without any
problems, so I don't believe anything on that system would be impacting
ldm1/ldm2.

Justin

On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:


I set up the test LDM server for the NCEP folks to test the local pull
from the LDM servers. That should tell us whether this is a network- or
system-related issue. We'll handle that tomorrow. I am a little bit
concerned that the slowdown occurred at the same time as the ldm1 crash
last week.

Also, can NCEP check if there are any bad dbnet queues on the backend
servers? Just to verify.



Steve Chiswell wrote:

Thanks Justin,

I also had a typo in my message: ldm1 is running slower than ldm2.

Now, if the feed to ldm2 all of a sudden slows down when Pete and other
sites add a request to it, it would really signal some sort of total
bandwidth limitation on the I2 connection. It seemed a little coincidental
that we had a short period of good connectivity to ldm1, after which it
slowed way down.
Steve
On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:

I just realized the issue. When I disabled the "pqact" process on ldm2
earlier today, it caused our monitor script (in cron, every 5 min) to kill
the LDM and restart it. I have removed the check for pqact in that
monitor... things should be a bit better now.
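
As an illustration of that failure mode (the check below is hypothetical,
not the actual monitor script): a cron job that does something like

pgrep pqact > /dev/null || ldmadmin restart    # hypothetical monitor logic

would restart the whole LDM whenever pqact is deliberately disabled, which
matches the behavior described above.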

Chi.Y.Kang wrote:

Huh, I thought you guys were on the system. Let me take a look at ldm2 and
see what is going on.


Justin Cooke wrote:


Chi.Y.Kang wrote:


Steve Chiswell wrote:


Pete and David,

I changed the CONDUIT request lines at NSF and Unidata to request data
from ldm1.woc.noaa.gov rather than ncepldm.woc.noaa.gov after seeing lots
of disconnects/reconnects to the ncepldm virtual name.

The LDM appears to have caught up here as an interim
solution.

Still don't know the cause of the problem.

Steve


I know NCEP was stopping and starting the LDM service on the ldm2 box,
where the VIP address is pointed at this time. How is the current
connection to ldm1? Is the speed of the CONDUIT feed acceptable?


Chi, NCEP has not restarted the LDM on ldm2 at all today. But looking at
the logs, it appears to be dying and getting restarted by cron.

I will watch and see if I notice anything.

Justin


--
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059

--
Steve Chiswell <address@hidden>
Unidata

--
Steve Chiswell <address@hidden>
Unidata