
Re: issues with LDM



Chi & Justin,

Data latency today has been as high as yesterday, even after the ldm2
switch. The throughput looks restricted by either a router or
firewall/packet shaping. I was wondering whether the connections had to be
re-established when Justin restarted the LDM, so that any such changes took
effect at that time.

Thanks for all your efforts,

Steve Chiswell
Unidata User Support



On Fri, 15 Jun 2007, Chi Y Kang wrote:

> Wait a minute here,
>
> 128.117.140.208 isn't in the mix.  The other hosts are.
>
> I updated the LDM access list.  Should we just give some class C ranges
> access rather than adding one IP at a time?
>
> Also, I noticed that the send-Qs are pretty normal on the ldm2 server
> right now but were pretty high on ldm1.  It might just be an issue with
> the ACL.
>
>
> 128.117.12.2
> 128.117.12.3
> 128.117.130.220
> 128.117.140.208
> 128.117.140.220
> 128.117.149.220
> 128.117.156.220
> 128.174.80.16
> 128.174.80.47
> 140.90.193.19
> 140.90.193.227
> 140.90.193.228
> 140.90.193.99
> 140.90.226.201
> 140.90.226.202
> 140.90.226.203
> 140.90.226.204
> 140.90.37.12
> 140.90.37.13
> 140.90.37.15
> 140.90.37.16
> 140.90.37.40
> 144.92.130.88
> 144.92.131.244
> 150.9.117.128
> 192.12.209.57
> 192.58.3.194
> 192.58.3.195
> 192.58.3.196
> 192.58.3.197
> 193.61.196.74
> 198.181.231.53
> 208.64.117.128
>
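As a rough sketch of the class C idea raised above, the listed addresses can
be collapsed into their enclosing /24 networks with Python's standard
ipaddress module. The helper name group_by_class_c is hypothetical, and the
output is plain CIDR notation, not LDM access-list syntax; turning it into
actual ACL entries is left aside here.

```python
import ipaddress
from collections import defaultdict

# Addresses copied from the access list quoted above.
HOSTS = """
128.117.12.2 128.117.12.3 128.117.130.220 128.117.140.208 128.117.140.220
128.117.149.220 128.117.156.220 128.174.80.16 128.174.80.47 140.90.193.19
140.90.193.227 140.90.193.228 140.90.193.99 140.90.226.201 140.90.226.202
140.90.226.203 140.90.226.204 140.90.37.12 140.90.37.13 140.90.37.15
140.90.37.16 140.90.37.40 144.92.130.88 144.92.131.244 150.9.117.128
192.12.209.57 192.58.3.194 192.58.3.195 192.58.3.196 192.58.3.197
193.61.196.74 198.181.231.53 208.64.117.128
""".split()


def group_by_class_c(hosts):
    """Map each enclosing /24 network to the sorted listed hosts inside it."""
    groups = defaultdict(list)
    for host in hosts:
        # strict=False masks off the host bits, yielding the enclosing /24.
        net = ipaddress.ip_network(f"{host}/24", strict=False)
        groups[net].append(ipaddress.ip_address(host))
    # Networks are unique keys, so sorting the items orders by network.
    return {net: sorted(addrs) for net, addrs in sorted(groups.items())}


if __name__ == "__main__":
    for net, addrs in group_by_class_c(HOSTS).items():
        print(f"{net}  ({len(addrs)} listed host(s))")
```

One trade-off to keep in mind: each /24 admits every address in that block,
so range-based entries are considerably more permissive than the 33
individual hosts listed.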
>
> Justin Cooke wrote:
> > Chi,
> >
> > The reboot doesn't seem to have helped. Is there anything else that may
> > be causing these issues? Could something network related have changed
> > after I restarted the LDM? Steve has a few possibilities:
> >
> > /It seems to be network related at your end, but strange that it
> > occurred at the time when you restarted the LDM, unless there was some
> > sort of firewall or packet filter that took effect when the LDMs
> > re-connected. /
> >
> > Justin
> >
> > Steve Chiswell wrote:
> >> Justin,
> >>
> >> I haven't seen any improvement from ncepldm to the top level relays
> >> daffy.unidata.ucar.edu (Unidata), idd.aos.wisc.edu (U. Wisconsin),
> >> flood.atmos.uiuc.edu (U. Illinois) or atm.cise-nsf.gov (NSF, DC).
> >>
> >> It seems to be network related at your end, but strange that it occurred
> >> at the time when you restarted the LDM, unless there was some sort of
> >> firewall or packet filter that took effect when the LDMs re-connected.
> >>
> >> Thanks for your time in looking at this,
> >>
> >> Steve
> >>
> >>
> >> On Fri, 2007-06-15 at 15:31 -0400, Justin Cooke wrote:
> >>
> >>> Steve and Doug,
> >>>
> >>> I just got a call from Chi at the WOC. He rebooted LDM1 after noticing
> >>> an unusual load on the machine. LDM is again running on that box and it
> >>> remains primary. Can you check to see how the latencies are now?
> >>>
> >>> Thanks,
> >>>
> >>> Justin
> >>>
> >>> Doug Schuster wrote:
> >>>
> >>>> Justin,
> >>>>
> >>>> 28,079 products are missing from the 12z cycle.  You'll be getting the
> >>>> automated email shortly.
> >>>>
> >>>> -Doug
> >>>>
> >>>> On Jun 15, 2007, at 12:48 PM, Justin Cooke wrote:
> >>>>
> >>>>
> >>>>> Steve,
> >>>>>
> >>>>> I've turned off the feed to LDM2.
> >>>>>
> >>>>> There is no other load on the ldm1 system except for LDM.
> >>>>>
> >>>>> Doug, are you missing many of the TIGGE params for 12Z?
> >>>>>
> >>>>> Justin
> >>>>>
> >>>>> Steve Chiswell wrote:
> >>>>>
> >>>>>> Justin,
> >>>>>>
> >>>>>> That didn't change the behavior. Still seeing latency.
> >>>>>> Perhaps try turning off the other feed. Is there any load
> >>>>>> other than LDM on the system?
> >>>>>>
> >>>>>> Steve
> >>>>>>
> >>>>>>
> >>>>>> On Fri, 2007-06-15 at 12:56 -0400, Justin Cooke wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Steve,
> >>>>>>>
> >>>>>>> I've recreated the queue, let me know if you are still seeing issues.
> >>>>>>>
> >>>>>>> If so I'll turn off the feed to ldm2 to see if that corrects things.
> >>>>>>>
> >>>>>>> Justin
> >>>>>>>
> >>>>>>> Steve Chiswell wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>> Justin,
> >>>>>>>>
> >>>>>>>> I don't know if they saw a disk space problem with
> >>>>>>>> log files not being rotated, but it might just be
> >>>>>>>> best today to build a new queue:
> >>>>>>>>
> >>>>>>>> ldmadmin stop
> >>>>>>>> ldmadmin delqueue
> >>>>>>>> ldmadmin mkqueue
> >>>>>>>> ldmadmin start
> >>>>>>>>
> >>>>>>>> That will mean some queued data will be lost, but if users
> >>>>>>>> aren't getting it anyway, then it's best to ensure that the
> >>>>>>>> queue isn't corrupt for the weekend.
> >>>>>>>>
> >>>>>>>> Happy Friday....
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Steve
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, 2007-06-15 at 12:13 -0400, Justin Cooke wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Steve,
> >>>>>>>>>
> >>>>>>>>> Our logs on the primary ldm system "ldm1" had not rotated for
> >>>>>>>>> nearly a week. I sent email to the WOC support and this was the
> >>>>>>>>> response:
> >>>>>>>>>
> >>>>>>>>> Looks like the seed file was missing after we brought the system
> >>>>>>>>> back up from the last outage.  Should be good now.
> >>>>>>>>>
> >>>>>>>>> Justin Cooke wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> WOC,
> >>>>>>>>>>
> >>>>>>>>>> I noticed that our logs for LDM have not been rotated on
> >>>>>>>>>> machine ldm1 since 06/05/2007. We have a cron entry that runs
> >>>>>>>>>> "ldmadmin newlog" at 00Z every day.
> >>>>>>>>>>
> >>>>>>>>>> I attempted to run the command by hand and got the following back:
> >>>>>>>>>>
> >>>>>>>>>> ldm@ldm1:~$ bin/ldmadmin newlog
> >>>>>>>>>> hupsyslog: couldn't open /var/run/syslogd.pid
> >>>>>>>>>>
> >>>>>>>>>> I checked but /var/run/syslogd.pid is not there but it is on ldm2.
> >>>>>>>>>>
> >>>>>>>>>> Could there be a problem with syslogd on ldm1?
> >>>>>>>>>>
> >>>>>>>>>> Justin
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>> Also around that time I turned on our backup feed to the ldm2
> >>>>>>>>> system which had been off since that system had issues a few
> >>>>>>>>> weeks ago (we were asked by WOC to turn it back on). I have sent
> >>>>>>>>> email to their support group asking if both ldm1 and ldm2 are
> >>>>>>>>> responding to the ncepldm.woc.noaa.gov address or if something
> >>>>>>>>> else is going on.
> >>>>>>>>>
> >>>>>>>>> Justin
> >>>>>>>>>
> >>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Justin,
> >>>>>>>>>>
> >>>>>>>>>> Yesterday just after 18Z, the data flow from ncepldm.woc.noaa.gov
> >>>>>>>>>> to top level sites at NSF and Unidata both began showing high
> >>>>>>>>>> latency:
> >>>>>>>>>>
> >>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>>>>  and
> >>>>>>>>>>
> >>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu
> >>>>>>>>>>
> >>>>>>>>>> Data volume out has dropped as a result:
> >>>>>>>>>>
> >>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>>>>
> >>>>>>>>>> Since the behavior is similar at both sites at separate
> >>>>>>>>>> locations, the problem would appear to be near your end. Since
> >>>>>>>>>> that coincides with your restart of the LDM, could you fill me
> >>>>>>>>>> in on the issues you were experiencing?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>>
> >>>>>>>>>> Steve Chiswell
> >>>>>>>>>> Unidata User Support
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, 2007-06-15 at 11:38 -0400, Justin Cooke wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> Doug,
> >>>>>>>>>>>
> >>>>>>>>>>> I had to restart our LDM yesterday right before the 18Z cycle.
> >>>>>>>>>>> We had an issue with our logging, but none of the configuration
> >>>>>>>>>>> files changed. Could one of your feeds have lost the connection
> >>>>>>>>>>> to our LDM during that restart?
> >>>>>>>>>>>
> >>>>>>>>>>> Justin
> >>>>>>>>>>>
> >>>>>>>>>>> Douglas Schuster wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> Yes, we've received partial cycles.  More than half of the
> >>>>>>>>>>>> expected fields have been missing in each cycle from
> >>>>>>>>>>>> June 14 18Z to June 15 06Z.  The number of missing fields
> >>>>>>>>>>>> varies from cycle to cycle.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Doug
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Jun 15, 2007, at 9:11 AM, Justin Cooke wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Doug,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Have you received any GEFS data from us today? Or is it just
> >>>>>>>>>>>>> certain fields you are missing?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
>
>
> --
> Chi Y. Kang
> Contractor
> Principal Engineer
> Phone: 301-713-3333 x201
> Cell: 240-338-1059
>