
Re: issues with LDM



Chi and Justin,

You mentioned a rate limit for your I2 connection. The throughput before
Thursday had been up to 4GB per hour (with a few periods higher). As of
Thursday we are seeing about 500MB per hour, which is much closer to the
capacity of a T1 line:
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc1?CONDUIT+atm.cise-nsf.gov
Could a default route or gateway have changed?
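
As a rough check on the T1 comparison above, the hourly volumes can be converted to average line rates. This is only a sketch: it assumes decimal units (1 GB = 10^9 bytes), uses the nominal T1 line rate of 1.544 Mbit/s, and ignores protocol overhead. The function names are illustrative.

```python
# Convert hourly transfer volumes to average line rates and compare with a T1.
# Assumptions: decimal units (1 MB = 1e6 bytes), nominal T1 rate, no overhead.

T1_BITS_PER_SEC = 1.544e6  # nominal T1 line rate


def mbit_per_sec(bytes_per_hour):
    """Average rate in Mbit/s for a given volume in bytes per hour."""
    return bytes_per_hour * 8 / 3600 / 1e6


def t1_bytes_per_hour():
    """Bytes a fully used T1 could carry in one hour at the nominal rate."""
    return T1_BITS_PER_SEC / 8 * 3600


if __name__ == "__main__":
    print(f"4 GB/hour   = {mbit_per_sec(4e9):.2f} Mbit/s")
    print(f"500 MB/hour = {mbit_per_sec(500e6):.2f} Mbit/s")
    print(f"T1 ceiling  = {t1_bytes_per_hour() / 1e6:.1f} MB/hour")
```

Under those assumptions, 500 MB per hour averages roughly 1.1 Mbit/s, somewhat under what a T1 could carry, while 4 GB per hour is roughly 8.9 Mbit/s.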

Steve Chiswell
Unidata User Support



On Mon, 2007-06-18 at 08:03 -0400, Chi Y Kang wrote:
> Justin Cooke wrote:
> > Chi,
> > 
> > It looks like you switched us back to ldm1 on Saturday but according to 
> > the graphs from Steve they experienced the same delays.
> 
> Running on ldm2 right now.  It looks like the send-Q on our end is 
> okay.  All of these connections are going via I2, so let me see what 
> rate limit is set on the I2 connection coming out of the campus.
> 
> 
> > 
> > Justin
> > 
> > Steve Chiswell wrote:
> >> Chi & Justin,
> >>
> >> Data latency today has been as high as yesterday, even with the
> >> switch to ldm2. The throughput looks restricted by either a router or
> >> firewall/packet shaping. I was wondering whether Justin's restart
> >> forced the connections to be re-established, so that any such
> >> changes took effect at that time.
> >>
> >> Thanks for all your efforts,
> >>
> >> Steve Chiswell
> >> Unidata User Support
> >>
> >>
> >>
> >> On Fri, 15 Jun 2007, Chi Y Kang wrote:
> >>
> >>  
> >>> Wait a minute here,
> >>>
> >>> 128.117.140.208 isn't in the mix.  The other hosts are.
> >>>
> >>> I updated the LDM access list.  Should we just give some class C ranges
> >>> access rather than adding one IP at a time?
> >>>
> >>> Also, I noticed that the send-Qs are pretty normal on the ldm2 server
> >>> right now but were pretty high on ldm1.  It might just be an issue with
> >>> the ACL list.
> >>>
> >>>
> >>> 128.117.12.2
> >>> 128.117.12.3
> >>> 128.117.130.220
> >>> 128.117.140.208
> >>> 128.117.140.220
> >>> 128.117.149.220
> >>> 128.117.156.220
> >>> 128.174.80.16
> >>> 128.174.80.47
> >>> 140.90.193.19
> >>> 140.90.193.227
> >>> 140.90.193.228
> >>> 140.90.193.99
> >>> 140.90.226.201
> >>> 140.90.226.202
> >>> 140.90.226.203
> >>> 140.90.226.204
> >>> 140.90.37.12
> >>> 140.90.37.13
> >>> 140.90.37.15
> >>> 140.90.37.16
> >>> 140.90.37.40
> >>> 144.92.130.88
> >>> 144.92.131.244
> >>> 150.9.117.128
> >>> 192.12.209.57
> >>> 192.58.3.194
> >>> 192.58.3.195
> >>> 192.58.3.196
> >>> 192.58.3.197
> >>> 193.61.196.74
> >>> 198.181.231.53
> >>> 208.64.117.128
> >>>
> >>>
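
On the question above of allowing ranges rather than single addresses, one quick way to see which /24 blocks (the size usually meant by "class C") the listed hosts fall into is to group them. This is a minimal sketch using Python's standard `ipaddress` module. Only a few hosts from the list are included, the helper name is hypothetical, and translating the result into the LDM access list's own syntax is not shown.

```python
import ipaddress
from collections import defaultdict


def group_by_slash24(addresses):
    """Map each /24 network to the listed hosts that fall inside it."""
    groups = defaultdict(list)
    for addr in addresses:
        # strict=False lets us derive the enclosing /24 from a host address.
        net = ipaddress.ip_network(f"{addr}/24", strict=False)
        groups[net].append(addr)
    return dict(groups)


# A subset of the hosts from the access list above.
hosts = [
    "128.117.12.2", "128.117.12.3",
    "140.90.226.201", "140.90.226.202", "140.90.226.203", "140.90.226.204",
    "192.58.3.194", "192.58.3.195", "192.58.3.196", "192.58.3.197",
]

groups = group_by_slash24(hosts)

if __name__ == "__main__":
    for net, members in groups.items():
        print(net, len(members))
```

Whether a whole /24 should be allowed for each group is a policy decision; the grouping only shows where a single range would replace several individual entries.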
> >>> Justin Cooke wrote:
> >>>    
> >>>> Chi,
> >>>>
> >>>> The reboot doesn't seem to have helped. Is there anything else that may
> >>>> be causing these issues? Network related after I performed the restart
> >>>> of LDM? Steve has a few possibilities:
> >>>>
> >>>> /It seems to be network related at your end, but it is strange that it
> >>>> occurred at the time when you restarted the LDM, unless there was some
> >>>> sort of firewall or packet filter that took effect when the LDMs
> >>>> re-connected./
> >>>>
> >>>> Justin
> >>>>
> >>>> Steve Chiswell wrote:
> >>>>      
> >>>>> Justin,
> >>>>>
> >>>>> I haven't seen any improvement from ncepldm to the top level relays
> >>>>> daffy.unidata.ucar.edu (Unidata), idd.aos.wisc.edu (U. Wisconsin),
> >>>>> flood.atmos.uiuc.edu (U. Illinois) or atm.cise-nsf.gov (NSF, DC).
> >>>>>
> >>>>> It seems to be network related at your end, but it is strange that it
> >>>>> occurred at the time when you restarted the LDM, unless there was some
> >>>>> sort of firewall or packet filter that took effect when the LDMs
> >>>>> re-connected.
> >>>>>
> >>>>> Thanks for your time in looking at this,
> >>>>>
> >>>>> Steve
> >>>>>
> >>>>>
> >>>>> On Fri, 2007-06-15 at 15:31 -0400, Justin Cooke wrote:
> >>>>>
> >>>>>        
> >>>>>> Steve and Doug,
> >>>>>>
> >>>>>> I just got a call from Chi at the WOC; he rebooted LDM1 after noticing
> >>>>>> an unusual load on the machine. LDM is again running on that box and it
> >>>>>> remains primary. Can you check to see how the latencies are now?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Justin
> >>>>>>
> >>>>>> Doug Schuster wrote:
> >>>>>>
> >>>>>>          
> >>>>>>> Justin,
> >>>>>>>
> >>>>>>> 28,079 products are missing from the 12z cycle.  You'll be getting the
> >>>>>>> automated email shortly.
> >>>>>>>
> >>>>>>> -Doug
> >>>>>>>
> >>>>>>> On Jun 15, 2007, at 12:48 PM, Justin Cooke wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>            
> >>>>>>>> Steve,
> >>>>>>>>
> >>>>>>>> I've turned off the feed to LDM2.
> >>>>>>>>
> >>>>>>>> There is no other load on the ldm1 system except for LDM.
> >>>>>>>>
> >>>>>>>> Doug, are you missing many of the TIGGE params for 12Z?
> >>>>>>>>
> >>>>>>>> Justin
> >>>>>>>>
> >>>>>>>> Steve Chiswell wrote:
> >>>>>>>>
> >>>>>>>>              
> >>>>>>>>> Justin,
> >>>>>>>>>
> >>>>>>>>> That didn't change the behavior. Still seeing latency.
> >>>>>>>>> Perhaps try turning off the other feed. Is there any load
> >>>>>>>>> other than LDM on the system?
> >>>>>>>>>
> >>>>>>>>> Steve
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, 2007-06-15 at 12:56 -0400, Justin Cooke wrote:
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>                
> >>>>>>>>>> Steve,
> >>>>>>>>>>
> >>>>>>>>>> I've recreated the queue; let me know if you are still seeing issues.
> >>>>>>>>>>
> >>>>>>>>>> If so I'll turn off the feed to ldm2 to see if that corrects things.
> >>>>>>>>>>
> >>>>>>>>>> Justin
> >>>>>>>>>>
> >>>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>                  
> >>>>>>>>>>> Justin,
> >>>>>>>>>>>
> >>>>>>>>>>> I don't know if they saw a disk space problem with
> >>>>>>>>>>> log files not being rotated, but it might just be
> >>>>>>>>>>> best today to build a new queue:
> >>>>>>>>>>>
> >>>>>>>>>>> ldmadmin stop
> >>>>>>>>>>> ldmadmin delqueue
> >>>>>>>>>>> ldmadmin mkqueue
> >>>>>>>>>>> ldmadmin start
> >>>>>>>>>>>
> >>>>>>>>>>> That will mean some queued data would be lost, but if users aren't
> >>>>>>>>>>> getting it anyway, then it's best to ensure that the queue isn't
> >>>>>>>>>>> corrupt for the weekend.
> >>>>>>>>>>>
> >>>>>>>>>>> Happy Friday....
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Steve
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, 2007-06-15 at 12:13 -0400, Justin Cooke wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>                    
> >>>>>>>>>>>> Steve,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Our logs on the primary ldm system "ldm1" had not rotated for
> >>>>>>>>>>>> nearly a week. I sent email to the WOC support and this was the
> >>>>>>>>>>>> response:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Looks like the seed file was missing after we brought the system
> >>>>>>>>>>>> back up from the last outage.  Should be good now.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justin Cooke wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>                      
> >>>>>>>>>>>>> WOC,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I noticed that our logs for LDM have not been rotated on machine
> >>>>>>>>>>>>> ldm1 since 06/05/2007. We have a cron entry that runs "ldmadmin
> >>>>>>>>>>>>> newlog" at 00Z every day.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I attempted to run the command by hand and got the following back:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> ldm@ldm1:~$ bin/ldmadmin newlog
> >>>>>>>>>>>>> hupsyslog: couldn't open /var/run/syslogd.pid
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I checked, and /var/run/syslogd.pid is not there on ldm1, but it is
> >>>>>>>>>>>>> on ldm2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Could there be a problem with syslogd on ldm1?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>                         
> >>>>>>>>>>>> Also around that time I turned on our backup feed to the ldm2
> >>>>>>>>>>>> system, which had been off since that system had issues a few
> >>>>>>>>>>>> weeks ago (we were asked by WOC to turn it back on). I have sent
> >>>>>>>>>>>> email to their support group asking if both ldm1 and ldm2 are
> >>>>>>>>>>>> responding to the ncepldm.woc.noaa.gov address or if something
> >>>>>>>>>>>> else is going on.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Justin
> >>>>>>>>>>>>
> >>>>>>>>>>>> Steve Chiswell wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>                      
> >>>>>>>>>>>>> Justin,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Yesterday just after 18Z, the data flow from ncepldm.woc.noaa.gov
> >>>>>>>>>>>>> to the top level sites at NSF and Unidata began showing high
> >>>>>>>>>>>>> latency:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> and
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Data volume out has dropped as a result:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+atm.cise-nsf.gov
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Since the behavior is similar at both sites at separate
> >>>>>>>>>>>>> locations, the problem would appear to be near your end. Since
> >>>>>>>>>>>>> that coincides with your restart of the LDM, could you fill me
> >>>>>>>>>>>>> in on the issues you were experiencing?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Steve Chiswell
> >>>>>>>>>>>>> Unidata User Support
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, 2007-06-15 at 11:38 -0400, Justin Cooke wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>                        
> >>>>>>>>>>>>>> Doug,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I had to restart our LDM yesterday right before the 18Z cycle.
> >>>>>>>>>>>>>> We had an issue with our logging, but none of the configuration
> >>>>>>>>>>>>>> files changed. Could one of your feeds have lost the connection
> >>>>>>>>>>>>>> to our LDM during that restart?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Douglas Schuster wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>                          
> >>>>>>>>>>>>>>> Yes, we've received partial cycles.  More than half of the
> >>>>>>>>>>>>>>> expected fields have been missing in each cycle from June 14
> >>>>>>>>>>>>>>> 18Z to June 15 06Z.  The number of missing fields varies
> >>>>>>>>>>>>>>> between cycles.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Doug
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Jun 15, 2007, at 9:11 AM, Justin Cooke wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>                            
> >>>>>>>>>>>>>>>> Doug,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Have you received any GEFS data from us today? Or is it just
> >>>>>>>>>>>>>>>> certain fields you are missing?
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Justin
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>                               
> >>> -- 
> >>> Chi Y. Kang
> >>> Contractor
> >>> Principal Engineer
> >>> Phone: 301-713-3333 x201
> >>> Cell: 240-338-1059
> >>>
> >>>     
> 
> 
-- 
Steve Chiswell <address@hidden>
Unidata