
Re: issues with LDM



Justin,

I haven't seen any improvement in latency from ncepldm to the top-level
relays daffy.unidata.ucar.edu (Unidata), idd.aos.wisc.edu (U. Wisconsin),
flood.atmos.uiuc.edu (U. Illinois), or atm.cise-nsf.gov (NSF, DC).

It seems to be network related at your end, but it is strange that it began
at the time when you restarted the LDM, unless some sort of firewall or
packet filter kicked in when the LDMs reconnected.

Thanks for your time in looking at this,

Steve


On Fri, 2007-06-15 at 15:31 -0400, Justin Cooke wrote:
> Steve and Doug,
> 
> I just got a call from Chi at the WOC; he rebooted LDM1 after noticing 
> an unusual load on the machine. LDM is running on that box again, and it 
> remains primary. Can you check to see how the latencies are now?
> 
> Thanks,
> 
> Justin
> 
> Doug Schuster wrote:
> > Justin,
> >
> > 28,079 products are missing from the 12z cycle.  You'll be getting the 
> > automated email shortly.
> >
> > -Doug
> >
> > On Jun 15, 2007, at 12:48 PM, Justin Cooke wrote:
> >
> >> Steve,
> >>
> >> I've turned off the feed to LDM2.
> >>
> >> There is no other load on the ldm1 system except for LDM.
> >>
> >> Doug, are you missing many of the TIGGE params for 12Z?
> >>
> >> Justin
> >>
> >> Steve Chiswell wrote:
> >>> Justin,
> >>>
> >>> That didn't change the behavior. I'm still seeing latency.
> >>> Perhaps try turning off the other feed. Is there any load
> >>> other than LDM on the system?
> >>>
> >>> Steve
> >>>
> >>>
> >>> On Fri, 2007-06-15 at 12:56 -0400, Justin Cooke wrote:
> >>>
> >>>> Steve,
> >>>>
> >>>> I've recreated the queue; let me know if you are still seeing issues.
> >>>>
> >>>> If so I'll turn off the feed to ldm2 to see if that corrects things.
> >>>>
> >>>> Justin
> >>>>
> >>>> Steve Chiswell wrote:
> >>>>
> >>>>> Justin,
> >>>>>
> >>>>> I don't know if they saw a disk space problem with
> >>>>> log files not being rotated, but it might just be
> >>>>> best today to build a new queue:
> >>>>>
> >>>>> ldmadmin stop
> >>>>> ldmadmin delqueue
> >>>>> ldmadmin mkqueue
> >>>>> ldmadmin start
> >>>>>
> >>>>> That will mean some queued data will be lost, but if users aren't
> >>>>> getting it anyway, then it's best to ensure that the queue isn't
> >>>>> corrupt for the weekend.
> >>>>>
> >>>>> Happy Friday....
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>>
> >>>>> On Fri, 2007-06-15 at 12:13 -0400, Justin Cooke wrote:
> >>>>>
> >>>>>> Steve,
> >>>>>>
> >>>>>> Our logs on the primary ldm system "ldm1" had not rotated for 
> >>>>>> nearly a week. I sent email to the WOC support and this was the 
> >>>>>> response:
> >>>>>>
> >>>>>> Looks like the seed file was missing after we brought the system 
> >>>>>> back up from the last outage. Should be good now.
> >>>>>>
> >>>>>> Justin Cooke wrote:
> >>>>>>
> >>>>>>
> >>>>>>> WOC,
> >>>>>>>
> >>>>>>> I noticed that our logs for LDM have not been rotated on machine
> >>>>>>> ldm1 since 06/05/2007. We have a cron entry that runs
> >>>>>>> "ldmadmin newlog" at 00Z every day.
> >>>>>>>
> >>>>>>> I attempted to run the command by hand and got the following back:
> >>>>>>>
> >>>>>>> ldm@ldm1:~$ bin/ldmadmin newlog
> >>>>>>> hupsyslog: couldn't open /var/run/syslogd.pid
> >>>>>>>
> >>>>>>> I checked, and /var/run/syslogd.pid is not there on ldm1, although
> >>>>>>> it is present on ldm2.
> >>>>>>>
> >>>>>>> Could there be a problem with syslogd on ldm1?
> >>>>>>>
> >>>>>>> Justin
> >>>>>>>
> >>>>>> Also around that time I turned on our backup feed to the ldm2 
> >>>>>> system which had been off since that system had issues a few 
> >>>>>> weeks ago (we were asked by WOC to turn it back on). I have sent 
> >>>>>> email to their support group asking if both ldm1 and ldm2 are 
> >>>>>> responding to the ncepldm.woc.noaa.gov address or if something 
> >>>>>> else is going on.
> >>>>>>
> >>>>>> Justin
> >>>>>>
> >>>>>> Steve Chiswell wrote:
> >>>>>>
> >>>>>>> Justin,
> >>>>>>>
> >>>>>>> Yesterday just after 18Z, the data flow from ncepldm.woc.noaa.gov
> >>>>>>> to top level sites at NSF and Unidata both began showing high 
> >>>>>>> latency:
> >>>>>>>
> >>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>  and
> >>>>>>>
> >>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu
> >>>>>>>
> >>>>>>> Data volume out has dropped as a result:
> >>>>>>>
> >>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>
> >>>>>>> Since the behavior is similar at both sites at separate locations,
> >>>>>>> the problem would appear to be near your end. Since that coincides
> >>>>>>> with your restart of the LDM, could you fill me in on the issues
> >>>>>>> you were experiencing?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>> Steve Chiswell
> >>>>>>> Unidata User Support
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, 2007-06-15 at 11:38 -0400, Justin Cooke wrote:
> >>>>>>>
> >>>>>>>> Doug,
> >>>>>>>>
> >>>>>>>> I had to restart our LDM yesterday right before the 18Z cycle; 
> >>>>>>>> we had an issue with our logging, but none of the configuration 
> >>>>>>>> files changed. Could one of your feeds have lost the connection 
> >>>>>>>> to our LDM during that restart?
> >>>>>>>>
> >>>>>>>> Justin
> >>>>>>>>
> >>>>>>>> Douglas Schuster wrote:
> >>>>>>>>
> >>>>>>>>> Yes, we've received partial cycles. More than half of the
> >>>>>>>>> expected fields have been missing in each cycle from June 14 18Z
> >>>>>>>>> to June 15 06Z. The number of missing fields varies from cycle
> >>>>>>>>> to cycle.
> >>>>>>>>>
> >>>>>>>>> Doug
> >>>>>>>>>
> >>>>>>>>> On Jun 15, 2007, at 9:11 AM, Justin Cooke wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Doug,
> >>>>>>>>>>
> >>>>>>>>>> Have you received any GEFS data from us today? Or is it just 
> >>>>>>>>>> certain fields you are missing?
> >>>>>>>>>>
> >>>>>>>>>> Justin
> >>>>>>>>>>
> >
-- 
Steve Chiswell <address@hidden>
Unidata
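
The two manual steps discussed in the quoted thread (checking for the
syslogd pid file that hupsyslog could not open, and rebuilding the product
queue with the four ldmadmin commands) could be wrapped roughly as below.
This is only a sketch under stated assumptions: the function names
check_syslog_pidfile and rebuild_queue and the LDMADMIN override variable
are hypothetical and are not part of the LDM distribution. Only the pid
file path and the ldmadmin subcommands come from the thread itself.

```shell
#!/bin/sh
# Sketch only: the helper names and the LDMADMIN override are hypothetical.

# Report whether the syslogd pid file is present and readable.
# The default path is the one shown in the hupsyslog error in the thread.
check_syslog_pidfile() {
    pidfile=${1:-/var/run/syslogd.pid}
    if [ -r "$pidfile" ]; then
        echo "ok: $pidfile is present"
        return 0
    fi
    echo "missing: $pidfile" >&2
    return 1
}

# Run the four queue-rebuild steps from the thread in order and stop at
# the first one that fails, so a failed step is never followed by the next.
# Setting LDMADMIN=echo gives a dry run that only prints the steps.
rebuild_queue() {
    ldmadmin_cmd=${LDMADMIN:-ldmadmin}
    for step in stop delqueue mkqueue start; do
        echo "running: $ldmadmin_cmd $step"
        $ldmadmin_cmd "$step" || {
            echo "step '$step' failed; not continuing" >&2
            return 1
        }
    done
}
```

A dry run such as "LDMADMIN=echo rebuild_queue" prints the steps without
touching the queue. As noted in the thread, a real rebuild discards
whatever data is still in the queue.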