
Re: issues with LDM



Hi Justin,

The latencies are still looking bad using ldm2. This has led to continued large numbers of missing fields (identical on both receiving machines, NCAR and Unidata) from all model cycles over the weekend and into this morning.

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd.unidata.ucar.edu+LOG

Doug

On Jun 18, 2007, at 6:32 AM, Justin Cooke wrote:

You're right, Chi; I misread the graph. We are on ldm2 and have been since Saturday. My apologies.

Justin

Chi Y Kang wrote:
Justin Cooke wrote:
Chi,

It looks like you switched us back to ldm1 on Saturday, but according to the graphs from Steve the downstream sites experienced the same delays.

Running on ldm2 right now. The send-Q on our end looks okay. All of these connections are going via I2; let me see what rate limit is set on the I2 connection coming out of the campus.



Justin

Steve Chiswell wrote:
Chi & Justin,

The latency of data today has been high, like yesterday, even with the switch to ldm2. The throughput looks restricted, either by a router or by firewall/packet shaping. I was wondering whether this coincided with Justin's restart simply because the connections had to be re-established then, so any such change took effect at that time.

Thanks for all your efforts,

Steve Chiswell
Unidata User Support



On Fri, 15 Jun 2007, Chi Y Kang wrote:


Wait a minute here,

128.117.140.208 isn't in the mix.  The other hosts are.

I updated the LDM access list. Should we just allow some class C ranges rather than one IP at a time?
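
For example, a class C range can be expressed as a single ALLOW pattern in ldmd.conf instead of one entry per host. A rough sketch only; the feed type and the exact regexes below are illustrative, not what is currently configured:

    # one entry covering all of 128.117.140.0/24
    ALLOW   CONDUIT   ^128\.117\.140\.[0-9]+$
    # versus a per-host entry
    ALLOW   CONDUIT   ^140\.90\.193\.19$

The patterns are extended regular expressions matched against however the connecting host identifies itself (name or dotted-quad), so that would need to be checked before switching over.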

Also, I noticed that the send-Qs are pretty normal on the ldm2 server right now but were pretty high on ldm1. It might just be an issue with the ACL list.
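
One quick way to watch those send queues is netstat filtered to the LDM port; a sketch only, assuming a Linux netstat and LDM's registered port 388:

    # per-connection Recv-Q/Send-Q for LDM (port 388) connections;
    # a Send-Q that stays large points at throttling downstream of this host
    netstat -tn | awk 'NR <= 2 || /:388 /'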


128.117.12.2
128.117.12.3
128.117.130.220
128.117.140.208
128.117.140.220
128.117.149.220
128.117.156.220
128.174.80.16
128.174.80.47
140.90.193.19
140.90.193.227
140.90.193.228
140.90.193.99
140.90.226.201
140.90.226.202
140.90.226.203
140.90.226.204
140.90.37.12
140.90.37.13
140.90.37.15
140.90.37.16
140.90.37.40
144.92.130.88
144.92.131.244
150.9.117.128
192.12.209.57
192.58.3.194
192.58.3.195
192.58.3.196
192.58.3.197
193.61.196.74
198.181.231.53
208.64.117.128


Justin Cooke wrote:

Chi,

The reboot doesn't seem to have helped. Is there anything else that may be causing these issues, perhaps something network related after I performed the restart of LDM? Steve has a few possibilities:

"It seems to be network related at your end, but strange that it occurred at the time when you restarted the LDM, unless there was some sort of firewall or packet filter that took effect when the LDMs re-connected."

Justin

Steve Chiswell wrote:

Justin,

I haven't seen any improvement from ncepldm to the top level relays daffy.unidata.ucar.edu (Unidata), idd.aos.wisc.edu (U. Wisconsin), flood.atmos.uiuc.edu (U. Illinois), or atm.cise-nsf.gov (NSF, DC).

It seems to be network related at your end, but strange that it occurred at the time when you restarted the LDM, unless there was some sort of firewall or packet filter that took effect when the LDMs re-connected.

Thanks for your time in looking at this,

Steve


On Fri, 2007-06-15 at 15:31 -0400, Justin Cooke wrote:


Steve and Doug,

I just got a call from Chi at the WOC; he rebooted ldm1 after noticing an unusual load on the machine. LDM is again running on that box and it remains primary. Can you check to see how the latencies are now?

Thanks,

Justin

Doug Schuster wrote:


Justin,

28,079 products are missing from the 12Z cycle. You'll be getting the automated email shortly.

-Doug

On Jun 15, 2007, at 12:48 PM, Justin Cooke wrote:



Steve,

I've turned off the feed to LDM2.

There is no other load on the ldm1 system except for LDM.

Doug, are you missing many of the TIGGE params for 12Z?

Justin

Steve Chiswell wrote:


Justin,

That didn't change the behavior; I'm still seeing latency. Perhaps try turning off the other feed. Is there any load other than LDM on the system?

Steve


On Fri, 2007-06-15 at 12:56 -0400, Justin Cooke wrote:



Steve,

I've recreated the queue; let me know if you are still seeing issues.

If so I'll turn off the feed to ldm2 to see if that corrects things.

Justin

Steve Chiswell wrote:



Justin,

I don't know if they saw a disk space problem with the log files not being rotated, but it might just be best today to build a new queue:

ldmadmin stop        # stop the LDM server
ldmadmin delqueue    # delete the existing product queue
ldmadmin mkqueue     # create a fresh, empty queue
ldmadmin start       # start the LDM again

That will mean some queued data would be lost, but if users aren't getting it anyway, then it's best to ensure that the queue isn't corrupt going into the weekend.
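
As an aside, if there is any doubt about whether the existing queue really is corrupt before throwing it away, the LDM's pqcheck utility can be pointed at it first. A sketch only; the queue path below is just an example, use whatever ldmadmin is configured with:

    # exits 0 if the product queue looks consistent; run as the ldm user
    pqcheck -q /usr/local/ldm/data/ldm.pq && echo "queue looks OK"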

Happy Friday....

Thanks,

Steve


On Fri, 2007-06-15 at 12:13 -0400, Justin Cooke wrote:



Steve,

Our logs on the primary ldm system "ldm1" had not rotated for nearly a week. I sent email to the WOC support and this was the response:

Looks like the seed file was missing after we brought the system back up from the last outage. Should be good now.

Justin Cooke wrote:




WOC,

I noticed that our logs for LDM have not been rotated on machine ldm1 since 06/05/2007. We have a cron entry that runs "ldmadmin newlog" at 00Z every day.
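
For reference, the entry is along these lines (illustrative; the actual ldmadmin path on ldm1 may differ, and this assumes the system clock runs in UTC so that 00 local is 00Z):

    # ldm user's crontab: start a new LDM log at 00Z each day
    0 0 * * * bin/ldmadmin newlog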

I attempted to run the command by hand and got the following back:

ldm@ldm1:~$ bin/ldmadmin newlog
hupsyslog: couldn't open /var/run/syslogd.pid

I checked, and /var/run/syslogd.pid is not there, though it does exist on ldm2.
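
In case it is useful, the sort of check I mean is simply the following (a sketch; it assumes a stock syslogd that writes its PID under /var/run):

    # is syslogd running, and did it leave the PID file hupsyslog expects?
    ps -C syslogd -o pid=
    ls -l /var/run/syslogd.pid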

Could there be a problem with syslogd on ldm1?

Justin



Also, around that time I turned on our backup feed to the ldm2 system, which had been off since that system had issues a few weeks ago (we were asked by WOC to turn it back on). I have sent email to their support group asking whether both ldm1 and ldm2 are responding to the ncepldm.woc.noaa.gov address or whether something else is going on.

Justin

Steve Chiswell wrote:



Justin,

Yesterday just after 18Z, the data flow from ncepldm.woc.noaa.gov to the top level sites at NSF and Unidata both began showing high latency:

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov

and

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu

Data volume out has dropped as a result:

http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+atm.cise-nsf.gov

Since the behavior is similar at both sites, which are at separate locations, the problem would appear to be near your end. Since that coincides with your restart of the LDM, could you fill me in on the issues you were experiencing?

Thanks

Steve Chiswell
Unidata User Support



On Fri, 2007-06-15 at 11:38 -0400, Justin Cooke wrote:



Doug,

I had to restart our LDM yesterday right before the 18Z cycle; we had an issue with our logging, but none of the configuration files changed. Could one of your feeds have lost the connection to our LDM during that restart?

Justin

Douglas Schuster wrote:



Yes, we've received partial cycles. More than half of the expected fields have been missing in each cycle from June 14 18Z to June 15 06Z. The number of missing fields varies between cycles.

Doug

On Jun 15, 2007, at 9:11 AM, Justin Cooke wrote:




Doug,

Have you received any GEFS data from us today? Or is it just certain fields you are missing?

Justin



--
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059