
Re: issues with LDM



Justin, are all the data feeds getting into LDM2 and LDM1? Also, is
there a way we can test the connection to the LDM system over I1
rather than I2?
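
One low-impact way to exercise the path (a sketch; it assumes the LDM
server answers at ncepldm.woc.noaa.gov, as mentioned below, and that we
can source the test from a host that routes via I1) would be:

  # ask for product notifications only; no data is transferred
  notifyme -vl- -h ncepldm.woc.noaa.gov
  # and compare the route each test host takes
  traceroute ncepldm.woc.noaa.gov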


Douglas Schuster wrote:
> Hi Justin,
>
> The latencies are still looking bad using ldm2. This has led to a
> continuation of large numbers of missing fields (identical on both
> receiving machines, NCAR and Unidata) from all model cycles over the
> weekend, continuing this morning.
>
> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd.unidata.ucar.edu+LOG
>
>
> Doug
>
> On Jun 18, 2007, at 6:32 AM, Justin Cooke wrote:
>
>> You're right, Chi; I misread the graph. We are on ldm2 and have been
>> since Saturday. My apologies.
>>
>> Justin
>>
>> Chi Y Kang wrote:
>>> Justin Cooke wrote:
>>>> Chi,
>>>>
>>>> It looks like you switched us back to ldm1 on Saturday, but
>>>> according to the graphs from Steve they experienced the same delays.
>>>
>>> Running on ldm2 right now. The send-Q on our end looks okay. All of
>>> these connections are going via I2; let me see what rate limit is
>>> set on the I2 connection coming out of the campus.
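>>>
>>> For reference, the quick check I use for backed-up connections (a
>>> sketch; Linux netstat assumed) is to look for nonzero Send-Q values:
>>>
>>> # established TCP connections with bytes stuck in the send queue
>>> netstat -tn | awk '$3 > 0 && /ESTABLISHED/'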
>>>
>>>
>>>>
>>>> Justin
>>>>
>>>> Steve Chiswell wrote:
>>>>> Chi & Justin,
>>>>>
>>>>> The latency of data today has been high, like yesterday, even with
>>>>> the switch to ldm2. The throughput looks restricted, either by a
>>>>> router or by firewall/packet shaping, but I was wondering whether,
>>>>> coincident with Justin's restart, the connections had to be
>>>>> re-established, so that any such changes took effect at that time.
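>>>>>
>>>>> If it would help isolate things, a raw TCP throughput test between
>>>>> the two ends would rule the LDM in or out (a sketch; the host name
>>>>> is a placeholder, and it assumes iperf is available on both sides):
>>>>>
>>>>> # on the receiving side
>>>>> iperf -s
>>>>> # on the sending side: report achievable TCP throughput over 30s
>>>>> iperf -c receiver.example.edu -t 30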
>>>>>
>>>>> Thanks for all your efforts,
>>>>>
>>>>> Steve Chiswell
>>>>> Unidata User Support
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 15 Jun 2007, Chi Y Kang wrote:
>>>>>
>>>>>
>>>>>> Wait a minute here,
>>>>>>
>>>>>> 128.117.140.208 isn't in the mix. The other hosts are.
>>>>>>
>>>>>> I updated the LDM access list. Should we just give some class C
>>>>>> ranges access rather than one IP at a time? (See the sketch after
>>>>>> the host list below.)
>>>>>>
>>>>>> Also, I noticed that the send-Qs are pretty normal on the ldm2
>>>>>> server right now but were pretty high on ldm1. It might just be an
>>>>>> issue with the ACL.
>>>>>>
>>>>>>
>>>>>> 128.117.12.2
>>>>>> 128.117.12.3
>>>>>> 128.117.130.220
>>>>>> 128.117.140.208
>>>>>> 128.117.140.220
>>>>>> 128.117.149.220
>>>>>> 128.117.156.220
>>>>>> 128.174.80.16
>>>>>> 128.174.80.47
>>>>>> 140.90.193.19
>>>>>> 140.90.193.227
>>>>>> 140.90.193.228
>>>>>> 140.90.193.99
>>>>>> 140.90.226.201
>>>>>> 140.90.226.202
>>>>>> 140.90.226.203
>>>>>> 140.90.226.204
>>>>>> 140.90.37.12
>>>>>> 140.90.37.13
>>>>>> 140.90.37.15
>>>>>> 140.90.37.16
>>>>>> 140.90.37.40
>>>>>> 144.92.130.88
>>>>>> 144.92.131.244
>>>>>> 150.9.117.128
>>>>>> 192.12.209.57
>>>>>> 192.58.3.194
>>>>>> 192.58.3.195
>>>>>> 192.58.3.196
>>>>>> 192.58.3.197
>>>>>> 193.61.196.74
>>>>>> 198.181.231.53
>>>>>> 208.64.117.128
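>>>>>>
>>>>>> If we go with ranges, ldmd.conf ALLOW entries take extended
>>>>>> regular expressions (matched against the downstream host name, or
>>>>>> its dotted-quad address when reverse lookup fails), so whole
>>>>>> blocks collapse to one line each. A sketch, with the ANY feed
>>>>>> type used for illustration only:
>>>>>>
>>>>>> # covers all of the 140.90.37.x and 140.90.226.x hosts above
>>>>>> ALLOW ANY ^140\.90\.37\.
>>>>>> ALLOW ANY ^140\.90\.226\.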
>>>>>>
>>>>>>
>>>>>> Justin Cooke wrote:
>>>>>>
>>>>>>> Chi,
>>>>>>>
>>>>>>> The reboot doesn't seem to have helped. Is there anything else
>>>>>>> that may be causing these issues? Something network-related after
>>>>>>> I performed the restart of LDM? Steve has a few possibilities:
>>>>>>>
>>>>>>> /It seems to be network related at your end, but strange that it
>>>>>>> occurred at the time when you restarted the LDM, unless there was
>>>>>>> some sort of firewall or packet filter that took effect when the
>>>>>>> LDMs re-connected./
>>>>>>>
>>>>>>> Justin
>>>>>>>
>>>>>>> Steve Chiswell wrote:
>>>>>>>
>>>>>>>> Justin,
>>>>>>>>
>>>>>>>> I haven't seen any improvement from ncepldm to the top level
>>>>>>>> relays daffy.unidata.ucar.edu (Unidata), idd.aos.wisc.edu
>>>>>>>> (U. Wisconsin), flood.atmos.uiuc.edu (U. Illinois), or
>>>>>>>> atm.cise-nsf.gov (NSF, DC).
>>>>>>>>
>>>>>>>> It seems to be network related at your end, but strange that it
>>>>>>>> occurred at the time when you restarted the LDM, unless there
>>>>>>>> was some sort of firewall or packet filter that took effect when
>>>>>>>> the LDMs re-connected.
>>>>>>>>
>>>>>>>> Thanks for your time in looking at this,
>>>>>>>>
>>>>>>>> Steve
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, 2007-06-15 at 15:31 -0400, Justin Cooke wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Steve and Doug,
>>>>>>>>>
>>>>>>>>> I just got a call from Chi at the WOC; he rebooted LDM1 after
>>>>>>>>> noticing an unusual load on the machine. LDM is again running
>>>>>>>>> on that box and it remains primary. Can you check to see how
>>>>>>>>> the latencies are now?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Justin
>>>>>>>>>
>>>>>>>>> Doug Schuster wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Justin,
>>>>>>>>>>
>>>>>>>>>> 28,079 products are missing from the 12Z cycle. You'll be
>>>>>>>>>> getting the automated email shortly.
>>>>>>>>>>
>>>>>>>>>> -Doug
>>>>>>>>>>
>>>>>>>>>> On Jun 15, 2007, at 12:48 PM, Justin Cooke wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Steve,
>>>>>>>>>>>
>>>>>>>>>>> I've turned off the feed to LDM2.
>>>>>>>>>>>
>>>>>>>>>>> There is no other load on the ldm1 system except for LDM.
>>>>>>>>>>>
>>>>>>>>>>> Doug, are you missing many of the TIGGE params for 12Z?
>>>>>>>>>>>
>>>>>>>>>>> Justin
>>>>>>>>>>>
>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Justin,
>>>>>>>>>>>>
>>>>>>>>>>>> That didn't change the behavior; still seeing latency.
>>>>>>>>>>>> Perhaps try turning off the other feed. Is there any load
>>>>>>>>>>>> other than LDM on the system?
>>>>>>>>>>>>
>>>>>>>>>>>> Steve
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 2007-06-15 at 12:56 -0400, Justin Cooke wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> Steve,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've recreated the queue; let me know if you are still
>>>>>>>>>>>>> seeing issues.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If so, I'll turn off the feed to ldm2 to see if that
>>>>>>>>>>>>> corrects things.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>
>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Justin,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know if they saw a disk space problem with
>>>>>>>>>>>>>> log files not being rotated, but it might just be
>>>>>>>>>>>>>> best today to build a new queue:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ldmadmin stop      # stop the LDM server
>>>>>>>>>>>>>> ldmadmin delqueue  # delete the (possibly corrupt) product queue
>>>>>>>>>>>>>> ldmadmin mkqueue   # create a fresh queue
>>>>>>>>>>>>>> ldmadmin start     # restart the LDM server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That will mean some queued data would be lost, but if
>>>>>>>>>>>>>> users aren't getting it anyway, then it's best to ensure
>>>>>>>>>>>>>> that the queue isn't corrupt over the weekend.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Happy Friday....
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 2007-06-15 at 12:13 -0400, Justin Cooke wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steve,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Our logs on the primary LDM system "ldm1" had not rotated
>>>>>>>>>>>>>>> for nearly a week. I sent email to WOC support and this
>>>>>>>>>>>>>>> was the response:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looks like the seed file was missing after we brought the
>>>>>>>>>>>>>>> system back up from the last outage. Should be good now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Justin Cooke wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> WOC,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I noticed that our logs for LDM have not been rotated
>>>>>>>>>>>>>>>> on machine
>>>>>>>>>>>>>>>> ldm1
>>>>>>>>>>>>>>>> since 06/05/2007. We have a cron entry that runs "ldmadmin
>>>>>>>>>>>>>>>> newlog" at
>>>>>>>>>>>>>>>> 00Z every day.
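>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> For reference, the crontab entry is along these lines (a
>>>>>>>>>>>>>>>> sketch; the actual path may differ, and it assumes the
>>>>>>>>>>>>>>>> system clock runs UTC):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # rotate the LDM logs daily at 00Z
>>>>>>>>>>>>>>>> 0 0 * * * bin/ldmadmin newlog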
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I attempted to run the command by hand and got the
>>>>>>>>>>>>>>>> following back:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ldm@ldm1:~$ bin/ldmadmin newlog
>>>>>>>>>>>>>>>> hupsyslog: couldn't open /var/run/syslogd.pid
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I checked, and /var/run/syslogd.pid is not there, but it
>>>>>>>>>>>>>>>> is on ldm2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could there be a problem with syslogd on ldm1?
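>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> A quick sanity check (a sketch) would be to confirm that
>>>>>>>>>>>>>>>> syslogd is actually running and see which pid files
>>>>>>>>>>>>>>>> exist:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # is syslogd up? (bracket trick excludes the grep itself)
>>>>>>>>>>>>>>>> ps -ef | grep '[s]yslogd'
>>>>>>>>>>>>>>>> # which pid files are present on this box?
>>>>>>>>>>>>>>>> ls -l /var/run/*.pid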
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, around that time I turned on our backup feed to the
>>>>>>>>>>>>>>> ldm2 system, which had been off since that system had
>>>>>>>>>>>>>>> issues a few weeks ago (we were asked by the WOC to turn
>>>>>>>>>>>>>>> it back on). I have sent email to their support group
>>>>>>>>>>>>>>> asking whether both ldm1 and ldm2 are responding to the
>>>>>>>>>>>>>>> ncepldm.woc.noaa.gov address or whether something else is
>>>>>>>>>>>>>>> going on.
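>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the meantime, one way to see what that name resolves
>>>>>>>>>>>>>>> to from our side (a sketch; either tool works if
>>>>>>>>>>>>>>> installed) is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # list the address(es) behind the service name
>>>>>>>>>>>>>>> host ncepldm.woc.noaa.gov
>>>>>>>>>>>>>>> dig +short ncepldm.woc.noaa.gov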
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Justin,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yesterday, just after 18Z, the data flows from
>>>>>>>>>>>>>>>> ncepldm.woc.noaa.gov to top level sites at NSF and
>>>>>>>>>>>>>>>> Unidata both began showing high latency:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+daffy.unidata.ucar.edu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Data volume out has dropped as a result:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_vol_nc?CONDUIT+atm.cise-nsf.gov
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since the behavior is similar at both sites, in separate
>>>>>>>>>>>>>>>> locations, the problem would appear to be near your end.
>>>>>>>>>>>>>>>> And since that coincides with your restart of the LDM,
>>>>>>>>>>>>>>>> could you fill me in on the issues you were experiencing?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Steve Chiswell
>>>>>>>>>>>>>>>> Unidata User Support
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 2007-06-15 at 11:38 -0400, Justin Cooke wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Doug,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I had to restart our LDM yesterday right before the 18Z
>>>>>>>>>>>>>>>>> cycle; we had an issue with our logging, but none of the
>>>>>>>>>>>>>>>>> configuration files changed. Could one of your feeds
>>>>>>>>>>>>>>>>> have lost the connection to our LDM during that restart?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Douglas Schuster wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, we've received partial cycles. More than half of
>>>>>>>>>>>>>>>>>> the expected fields have been missing in each cycle
>>>>>>>>>>>>>>>>>> from June 14 18Z to June 15 06Z. The number of missing
>>>>>>>>>>>>>>>>>> fields varies from cycle to cycle.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Doug
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Jun 15, 2007, at 9:11 AM, Justin Cooke wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Doug,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Have you received any GEFS data from us today? Or is
>>>>>>>>>>>>>>>>>>> it just
>>>>>>>>>>>>>>>>>>> certain fields you are missing?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Justin
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>> --
>>>>>> Chi Y. Kang
>>>>>> Contractor
>>>>>> Principal Engineer
>>>>>> Phone: 301-713-3333 x201
>>>>>> Cell: 240-338-1059
>>>>>>
>>>>>>
>>>
>>>
>


-- 
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059