[Support #WCE-307093]: ldm routing problem and solution



Tom,

> The problem was seen on three systems
> 
> Linux l2-irads.baronservices.com 2.6.9-89.EL #1 Mon Apr 20 10:22:29 EDT 2009 
> x86_64 x86_64 x86_64 GNU/Linux
> CentOS release 4.7 (Final)
> 
> Linux l2-ldm.baronservices.com 2.6.9-55.0.6.EL #1 Thu Aug 23 11:01:08 EDT 
> 2007 x86_64 x86_64 x86_64 GNU/Linux
> Red Hat Enterprise Linux WS release 4 (Nahant Update 5)
> 
> Linux txwx.baron.hsv 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 2009 
> x86_64 x86_64 x86_64 GNU/Linux
> CentOS release 5.3 (Final)
> 
> On the last, the problem is much less severe.  At least I think it is, but I 
> did not actually run ldm there, just the test program.  The test program 
> detected backwards time about 5 times a day, whereas on the others it is 
> several times an hour.
> 
> The host hardware is a 16-cpu system running a recent version of VMware ESX. 
> I can get the exact version if you need it - I think it is kept reasonably 
> up-to-date.
> 
> Also, I tried the test program on several systems that were not VMs.  In all 
> cases there was never an instance of time going backwards.  This was over the 
> last 48 hours or so.
> 
> The patch has not shown any bad side effects.  We still may end up moving 
> this to a non-VM system, but there are many advantages to VMs, so we're still 
> pondering what to do.  Other than the time issue, performance is great.

A monotonic system clock is a prerequisite for correct operation of an LDM 
system.  I didn't realise that certain Linux systems were risky in this respect 
until you brought it to my attention.  I sincerely thank you.

Based on what you've told me, I strongly recommend making your clocks monotonic 
by following the advice in

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006113

and

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006086

The reason for this requirement is to ensure that products are in the queue in 
the order in which they were received (which can't be guaranteed if the clock 
runs backwards), so that queue-processors (e.g., upstream LDMs, pqact(1), 
pqcat(1)) won't miss anything.
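
If it's useful to check a suspect machine independently of the LDM, something 
along the following lines should catch the problem.  This is only a sketch, and 
I'm guessing it's close to what your test program already does; the 
one-millisecond polling interval and the output format are arbitrary choices of 
mine, not anything from the LDM.

    /*
     * timecheck.c -- minimal backwards-time detector.
     * Build:  cc -o timecheck timecheck.c
     */
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct timeval prev;
        struct timeval now;

        (void)gettimeofday(&prev, NULL);

        for (;;) {
            (void)gettimeofday(&now, NULL);

            /* Report any step backwards in the realtime clock. */
            if (now.tv_sec < prev.tv_sec ||
                    (now.tv_sec == prev.tv_sec && now.tv_usec < prev.tv_usec)) {
                (void)printf("time went backwards: %ld.%06ld -> %ld.%06ld\n",
                        (long)prev.tv_sec, (long)prev.tv_usec,
                        (long)now.tv_sec, (long)now.tv_usec);
                (void)fflush(stdout);
            }

            prev = now;
            (void)usleep(1000);         /* poll about once a millisecond */
        }
    }

On an affected VM it should print a line each time the clock steps backwards; 
on a healthy system it should stay silent.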

> At the risk of complicating the issue, there is one other thing.  I think our 
> upstream provider is having an issue that "looks" like the same problem.  I 
> say it looks like it because there are primary/alternate feeds and I cannot 
> imagine the problem happening to two systems at the same time very often.
> 
> We have 8 connections with them, 4 primary and 4 alternate.  You are probably 
> familiar with this but their pqcat lines are of the format
> [ldm@l2-irads ~]$ pqcat -o 60 -p KLIX
> Aug 17 04:08:38 pqcat NOTE: Starting Up (29801)
> Aug 17 04:08:38 pqcat INFO:     7659 20090817040741.118 NEXRAD2 395084  
> L2-BZIP2/KLIX/20090817040219/395/84/I/V03/0
> Aug 17 04:08:38 pqcat INFO:     7854 20090817040753.016 NEXRAD2 395085  
> L2-BZIP2/KLIX/20090817040219/395/85/E/V03/0
> 
> From this, one can extract the radar site ID, the volume number (eg 395) and 
> the sequence within the volume (eg 84 or 85).  Also there is info on whether 
> it is the beginning or end of the volume.  So there is enough information to 
> know what files/products are expected.
> 
> I made a script that takes the output of pqcat (open $pq, "pqcat -vl - -o 0 
> -i 2 2>&1 1>/dev/null |") and keeps track of what is expected next.  It logs 
> any discrepancies.  The past few hours are typical and the log is below:
> 
> 16:00:00 RUNNING
> 17:00:01 RUNNING
> 18:00:00 RUNNING
> 18:30:00 KRAX - expected seq 373-60 got 373-61
> 18:30:02 KGSP - expected seq 17-14 got 17-16
> 18:30:02 KPBZ - expected seq 67-17 got 67-19
> 18:30:02 KFCX - expected seq 876-3 got 876-4
> 18:30:02 KLTX - expected seq 489-44 got 489-45
> 18:30:04 KCLE - expected seq 65-60 got 65-61
> 18:30:04 KCBW - expected seq 857-48 got 857-49
> 18:30:04 KRLX - expected seq 534-40 got 534-41
> 18:30:04 KLWX - expected seq 527-10 got 527-11
> 18:30:06 KTYX - expected seq 724-8 got 724-10
> 18:30:06 KAKQ - expected seq 321-40 got 321-41
> 18:30:06 KMHX - expected seq 299-33 got 299-34
> 18:30:08 KCAE - expected seq 657-30 got 657-31
> 18:30:08 KBGM - expected seq 832-28 got 832-29
> 18:30:08 KCLX - expected seq 765-29 got 765-30
> 18:30:10 KENX - expected seq 6-29 got 6-30
> 18:30:12 KILN - expected seq 914-18 got 914-19
> 18:54:27 KLIX - expected seq 281-33 got 281-34
> 18:54:27 KLIX - expected seq 281-35 got 281-33
> 18:54:31 KLIX - expected seq 281-34 got 281-35
> 19:00:00 RUNNING
> 20:00:00 RUNNING
> 
> The RUNNING line is logged at the top of the hour to add context if nothing 
> else happened.  Note how there is a burst of missing files.  (The last three 
> are out of order, and that may be a different issue.)  This pattern has the 
> same signature as what we were seeing on our internal feeds, which was solved 
> with the time patch.
> 
> By the way, we are also getting the feed from another provider, and the 
> missing files here are usually (always?) present there, so it does not appear 
> to be a source problem.  Also, BTW, I've double-checked to make sure the 
> files were not actually in the queue with the logging program somehow missing 
> them.
> 
> Every few minutes, there is a primary/alternate switch, but I cannot see any 
> correlation with the missing files. The only way I could see it being the 
> same problem would be if there was actually an upstream to our upstream that 
> fed both their primary and alternate and the problem was there. [Also tried 
> it with alternates commented out and still similar results].
> 
> Later I hope to coordinate with them: when we see a missing file, get on the 
> phone and have them see if it is in their queue.  If it is not, then they can 
> do the same thing with their upstream.

Find out if they're running a version of Linux in which the system clock is 
susceptible to occasionally running backwards.
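
Incidentally, in case it helps to compare notes with them, here is a rough 
sketch in C of the sequence check I gather your script performs.  The parsing 
of the product ID follows the pqcat lines you show; the rest is a simplifying 
assumption on my part (it keeps state for a single site only and ignores 
volume changeovers, so run it on a per-site feed).

    /*
     * seqcheck.c -- report gaps in NEXRAD Level-II sequence numbers.
     * Feed it pqcat log lines on standard input, for example:
     *     pqcat -vl - -o 0 -i 2 -p KLIX 2>&1 1>/dev/null | ./seqcheck
     */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[1024];
        char prev_site[8] = "";
        int  prev_vol = -1;
        int  prev_seq = -1;

        while (fgets(line, sizeof(line), stdin) != NULL) {
            const char *id = strstr(line, "L2-BZIP2/");
            char        site[8];
            int         vol;
            int         seq;

            if (id == NULL)
                continue;           /* no Level-II product ID on this line */

            /* e.g. L2-BZIP2/KLIX/20090817040219/395/84/I/V03/0 */
            if (sscanf(id, "L2-BZIP2/%7[^/]/%*[^/]/%d/%d", site, &vol, &seq)
                    != 3)
                continue;           /* unexpected layout */

            if (strcmp(site, prev_site) == 0 && vol == prev_vol &&
                    seq != prev_seq + 1)
                (void)printf("%s - expected seq %d-%d got %d-%d\n",
                        site, vol, prev_seq + 1, vol, seq);

            (void)strcpy(prev_site, site);
            prev_vol = vol;
            prev_seq = seq;
        }

        return 0;
    }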

> Tom


Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: WCE-307093
Department: Support LDM
Priority: Normal
Status: Closed