
Re: Top level CONDUIT relay



LDM2 is sitting on the same local network as node6.  I also wanted to ask
about another thing: is LDM coded to use shared memory (as Oracle does,
for example) when accessing the ldm.pq file?  What method does it use to
keep the LDM queues in memory?




ldm2:~$ ipcs -a

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 0          root       600        3976       4          dest

------ Semaphore Arrays --------
key        semid      owner      perms      nsems

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
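For what it's worth, my understanding is that the LDM memory-maps the queue
file with mmap(2) rather than using SysV shared memory, which would explain
the empty ipcs output: file-backed mappings never show up there.  A minimal
Python sketch of the mechanism (illustrative only; the file name is made up,
and this is not LDM source):

```python
import mmap
import os
import tempfile

# Sketch only -- not LDM code.  A file-backed mmap is a per-process view
# of the page cache, so it does NOT appear in `ipcs`, which lists only
# SysV IPC objects (shmget/semget/msgget) -- consistent with the empty
# ipcs output above.
path = os.path.join(tempfile.mkdtemp(), "demo.pq")
with open(path, "wb") as f:
    f.write(b"\0" * 4096)                # pre-size the "queue" file

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)     # map the file into our address space
    mm[:5] = b"hello"                    # writes go through the page cache
    mm.flush()                           # msync the dirty pages back to disk
    mm.close()

with open(path, "rb") as f:
    data = f.read(5)
assert data == b"hello"                  # the mapping updated the file
```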
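On the 5-way split of REQUEST lines discussed below: LDM applies each
REQUEST pattern as an extended regular expression against the product
identifier, which for CONDUIT ends in a sequence number, so each of the
five patterns claims exactly two of the ten possible trailing digits.
A quick sketch (the sample identifier is made up):

```python
import re

# The five patterns from the REQUEST CONDUIT lines.
patterns = ["[09]$", "[18]$", "[27]$", "[36]$", "[45]$"]

def bucket(product_id):
    """Indices of the request lines whose pattern matches this identifier."""
    return [i for i, p in enumerate(patterns) if re.search(p, product_id)]

# Hypothetical CONDUIT-style identifier ending in sequence number 123:
print(bucket("grib/ncep/GFS/... seqno 123"))   # [3]: "3" matches "[36]$"

# Every trailing digit 0-9 is claimed by exactly one pattern, so the five
# connections split the feed with full coverage and no overlap.
for digit in "0123456789":
    assert len(bucket("seqno 12" + digit)) == 1
```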

Steve Chiswell wrote:
> On Wed, 2007-06-20 at 14:21 -0500, Pete Pokrandt wrote:
>   
>> Latencies from ldm2.woc to idd.aos.wisc.edu are rising again.
>>     
>
> Yes, and identically rising with NSF's connection. 
>
> The latency plot to node6 is staying high with a single request line
> (aside from the clock on node6 apparently being off by about 80
> seconds).
>
> Justin, do you want to try the 5-way split of request lines on node6
> like Pete showed
> previously:
> REQUEST CONDUIT "[09]$" ldm1.woc.noaa.gov
> REQUEST CONDUIT "[18]$" ldm1.woc.noaa.gov
> REQUEST CONDUIT "[27]$" ldm1.woc.noaa.gov
> REQUEST CONDUIT "[36]$" ldm1.woc.noaa.gov
> REQUEST CONDUIT "[45]$" ldm1.woc.noaa.gov
>
> The only downside may be that if node6 is more successful at getting
> data, we outsiders may see more trouble.
>
> How "close" are node6 and ldm1/2? Are they on the same network
> internally
> or separated by a switch/router or something?
>
> Steve
>
>
>
>
>
>   
>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+idd.aos.wisc.edu
>>
>> Pete
>>
>> Steve Chiswell wrote:
>>     
>>> Justin,
>>>
>>> Since the change at 13Z dropping daffy.unidata.ucar.edu out of the
>>> top-level nodes, the ldm2 feed to NSF is showing little/no latency at
>>> all. The ldm1 feed to NSF, which is connected using the alternate LDM
>>> mode, is only delivering the .status messages it creates, as all the
>>> other products are duplicates of products already being received from
>>> LDM2, and that feed is showing high latency:
>>> http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?CONDUIT+atm.cise-nsf.gov
>>>
>>> This configuration is getting data out to the community at the moment.
>>> The downside here is that it puts a single point of failure at NSF in
>>> getting the data to Unidata, but
>>> I'll monitor that end.
>>>
>>> It seems that ldm1 is either slow or showing network limitations:
>>> since flood.atmos.uiuc.edu is feeding from ncepldm, which is apparently
>>> pointing to ldm1, there is load on ldm1 besides the NSF feed. LDM2 is
>>> feeding both NSF and idd.aos.wisc.edu (and Wisc looks good since 13Z as
>>> well), so it is able to handle the throughput to 2 downstreams, but
>>> adding daffy as the 3rd seems to cross some threshold in the volume of
>>> what can be sent out.
>>>
>>> Steve
>>>
>>> On Wed, 2007-06-20 at 09:45 -0400, Justin Cooke wrote:
>>>   
>>>       
>>>> Thanks Steve,
>>>>
>>>> Chi has set up a box on the LAN for us to run LDM on; I am beginning
>>>> to get things running on there.
>>>>
>>>> Have you seen any improvement since dropping daffy?
>>>>
>>>> Justin
>>>>
>>>> On Jun 20, 2007, at 9:03 AM, Steve Chiswell wrote:
>>>>
>>>>     
>>>>         
>>>>> Justin,
>>>>>
>>>>> Yes, this does appear to be the case. I will drop daffy from feeding
>>>>> directly and instead move it to feed from NSF. That will remove one
>>>>> of the top level relays of data having to go out of NCEP and
>>>>> we can see if the other nodes show an improvement.
>>>>>
>>>>> Steve
>>>>>
>>>>> On Wed, 20 Jun 2007, Justin Cooke wrote:
>>>>>
>>>>>       
>>>>>           
>>>>>> Steve,
>>>>>>
>>>>>> Did you see a slowdown to ldm2 after Pete and the other sites began
>>>>>> making connections?
>>>>>>
>>>>>> Chi, considering Steve saw a good connection to ldm1 before the other
>>>>>> sites connected, doesn't that point toward a network issue?
>>>>>>
>>>>>> All of our queue processing on the diskserver has been running without
>>>>>> any problems, so I don't believe anything on that system would be
>>>>>> impacting ldm1/ldm2.
>>>>>>
>>>>>> Justin
>>>>>>
>>>>>> On Jun 20, 2007, at 12:04 AM, Chi Y Kang wrote:
>>>>>>
>>>>>>         
>>>>>>             
>>>>>>> I set up the test LDM server for the NCEP folks to test the local
>>>>>>> pull from the LDM servers.  That should tell us whether this is a
>>>>>>> network or a system-related issue.  We'll handle that tomorrow.  I am
>>>>>>> a little bit concerned that the slowdown occurred at the same time as
>>>>>>> the ldm1 crash last week.
>>>>>>>
>>>>>>> Also, can NCEP also check if there are any bad dbnet queues on the
>>>>>>> backend servers?  Just to verify.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Steve Chiswell wrote:
>>>>>>>           
>>>>>>>               
>>>>>>>> Thanks Justin,
>>>>>>>> I also had a typo in my message:
>>>>>>>> ldm1 is running slower than ldm2
>>>>>>>> Now if the feed to ldm2 all of a sudden slows down when Pete and
>>>>>>>> other sites add a request to it, it would really signal some sort of
>>>>>>>> total bandwidth limitation on the I2 connection. It seemed a little
>>>>>>>> coincidental that we had a short period of good connectivity to
>>>>>>>> ldm1, after which it slowed way down.
>>>>>>>> Steve
>>>>>>>> On Tue, 2007-06-19 at 17:01 -0400, Justin Cooke wrote:
>>>>>>>>             
>>>>>>>>                 
>>>>>>>>> I just realized the issue. When I disabled the "pqact" process on
>>>>>>>>> ldm2 earlier today it caused our monitor script (in cron, every 5
>>>>>>>>> min) to kill the LDM and restart it. I have removed the check for
>>>>>>>>> the pqact in that monitor...things should be a bit better now.
>>>>>>>>>
>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>               
>>>>>>>>>                   
>>>>>>>>>> Huh, I thought you guys were on the system.  Let me take a look
>>>>>>>>>> on ldm2 and see what is going on.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Justin Cooke wrote:
>>>>>>>>>>
>>>>>>>>>>                 
>>>>>>>>>>                     
>>>>>>>>>>> Chi.Y.Kang wrote:
>>>>>>>>>>>
>>>>>>>>>>>                   
>>>>>>>>>>>                       
>>>>>>>>>>>> Steve Chiswell wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>                     
>>>>>>>>>>>>                         
>>>>>>>>>>>>> Pete and David,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I changed the CONDUIT request lines at NSF and Unidata to
>>>>>>>>>>>>> request data
>>>>>>>>>>>>> from ldm1.woc.noaa.gov rather than ncepldm.woc.noaa.gov after
>>>>>>>>>>>>> seeing
>>>>>>>>>>>>> lots of
>>>>>>>>>>>>> disconnect/reconnects to the ncepldm virtual name.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The LDM appears to have caught up here as an interim solution.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Still don't know the cause of the problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Steve
>>>>>>>>>>>>>
>>>>>>>>>>>>>                       
>>>>>>>>>>>>>                           
>>>>>>>>>>>> I know NCEP was stopping and starting the LDM service on the
>>>>>>>>>>>> ldm2 box to which the VIP address currently points.  How is the
>>>>>>>>>>>> current connection to ldm1?  Is the speed of the CONDUIT feed
>>>>>>>>>>>> acceptable?
>>>>>>>>>>>>
>>>>>>>>>>>>                     
>>>>>>>>>>>>                         
>>>>>>>>>>> Chi, NCEP has not restarted the LDM on ldm2 at all today. But
>>>>>>>>>>> looking
>>>>>>>>>>> at the logs it appears to be dying and getting restarted by cron.
>>>>>>>>>>>
>>>>>>>>>>> I will watch and see if I see anything.
>>>>>>>>>>>
>>>>>>>>>>> Justin
>>>>>>>>>>>
>>>>>>>>>>>                   
>>>>>>>>>>>                       
>>>>>>>>>>                 
>>>>>>>>>>                     
>>>>>>> --
>>>>>>> Chi Y. Kang
>>>>>>> Contractor
>>>>>>> Principal Engineer
>>>>>>> Phone: 301-713-3333 x201
>>>>>>> Cell: 240-338-1059
>>>>>>>           
>>>>>>>               
>>     


-- 
Chi Y. Kang
Contractor
Principal Engineer
Phone: 301-713-3333 x201
Cell: 240-338-1059