[LDM #VWS-137049]: IO thrashing with 6.7.1 client against 6.9.7 server



Hi Daryl,

> I've been running into an issue with a downstream 6.7.1 client causing my
> server to start thrashing IO and eventually grind the data flow to a
> halt.  For example, here is sysstat output from this morning:
> 
> 08:20:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
> 08:30:02 AM     all      4.07      0.00      8.36      7.35      0.00     80.22
> 08:40:01 AM     all      4.33      0.00      9.33      7.08      0.00     79.26
> 08:50:02 AM     all      5.49      0.00     13.13     20.69      0.00     60.69
> 09:00:02 AM     all      6.05      0.00     23.20     52.89      0.00     17.86
> 09:10:01 AM     all      5.40      0.00     22.39     56.16      0.00     16.05
> 09:20:03 AM     all      2.39      0.00     19.45     63.24      0.00     14.92
> 09:30:03 AM     all      1.88      0.00     19.11     64.66      0.00     14.34
> 
> At 8:50 Z, the train comes off the tracks.  This was when a downstream
> 6.7.1 host connected.  The system gets behind, but doesn't log anything
> too interesting other than simple things like:
> 
> Jun  5 11:16:27 metfs1 pqact[8693] WARN: Processed oldest product in
> queue: 6390.94 s

The message from pqact(1) indicates that the process is far behind: if it had 
sufficient resources, it would be working on recently-arrived products rather 
than one that's almost two hours old (6390.94 s is about 1.8 hours).
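
To make that concrete, here's a minimal sketch of such an age check, assuming 
the reported age is simply the current time minus the product's creation time 
(warn_if_behind() and its threshold are illustrative names of my own, not 
pqact's actual internals):

    #include <stdio.h>
    #include <time.h>

    /* Illustrative only: report how far behind processing is. */
    static void warn_if_behind(time_t product_creation)
    {
        double age = difftime(time(NULL), product_creation);
        if (age > 3600.0)   /* illustrative one-hour threshold */
            printf("WARN: Processed oldest product in queue: %.2f s\n", age);
    }

    int main(void)
    {
        warn_if_behind(time(NULL) - 6391);   /* reproduces the logged age */
        return 0;
    }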

> At this time, IO is pegged.  My RAID array maxes out around 4,000 TPS.  So
> I wake up and try to stop the LDM, and this is logged for all connected hosts:

Is the LDM product-queue in question on a RAID? We've had mixed results doing 
that: sometimes it works and sometimes it doesn't. An easy thing to try would 
be to move the product-queue to local disk to see if the situation improves. 
Can you do that?
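
For background on why the underlying disk matters so much: the product-queue 
is a memory-mapped file, so every page the LDM dirties must eventually be 
written back to whatever device holds the queue. Here's a minimal sketch of 
that pattern (the path and sizes are made up; this is not the LDM's actual 
code):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char*  path = "/tmp/example.pq";   /* hypothetical queue file */
        const size_t size = 4096;

        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)size) < 0) { perror(path); return 1; }

        char* region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(region, "product", 7);   /* dirty a page */
        msync(region, size, MS_SYNC);   /* force write-back to the device */

        munmap(region, size);
        close(fd);
        return 0;
    }

If the device can't keep up with the write-back rate, the %iowait numbers in 
your sysstat output are exactly what you'd expect to see.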

> Jun  5 12:00:23 metfs1 cumulus.dmes.fit.edu(feed)[5493] ERROR: fcntl
> F_RDLCK failed for rgn (0 SEEK_SET, 4096) 4: Interrupted system call
> 
> I assume this is from some harsher shutdown of the LDM to get it to stop.

I haven't seen this particular error-message before, but your analysis seems 
right.
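
The mechanics behind the message are ordinary, though: the LDM locks regions 
of the product-queue with fcntl(2), and a blocked fcntl() fails with EINTR 
when a signal arrives, which is exactly what a shutdown delivers (the "4" in 
your log line is the errno, and errno 4 is EINTR on Linux). A minimal sketch 
of that pattern, with an illustrative wrapper of my own rather than the LDM's 
actual code:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* lock_region() is illustrative, not an LDM function. */
    static int lock_region(int fd, off_t offset, off_t len)
    {
        struct flock lock;
        lock.l_type   = F_RDLCK;     /* shared read-lock, as in the message */
        lock.l_whence = SEEK_SET;
        lock.l_start  = offset;
        lock.l_len    = len;

        /* F_SETLKW blocks until the lock is granted; a signal delivered in
         * the meantime (e.g. during shutdown) makes it fail with EINTR,
         * the "Interrupted system call" in the log. */
        while (fcntl(fd, F_SETLKW, &lock) == -1) {
            if (errno != EINTR)
                return -1;           /* a real error */
            /* interrupted by a signal: retry, or give up if shutting down */
        }
        return 0;
    }

    int main(void)
    {
        int fd = open("/tmp/example.pq", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || lock_region(fd, 0, 4096) != 0) {
            perror("lock_region");
            return 1;
        }
        close(fd);
        return 0;
    }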

> Anyway, I comment out the allow for the downstream 6.7.1 host and start
> the LDM back up, and there's no more IO thrashing.
> 
> Any ideas about this?  Is there some known issue with old ldm clients and
> 6.9 servers?

I'm not aware of any such issue, and there shouldn't be one: the LDM protocol 
and the handling of data-products didn't change between 6.7 and 6.9.

> Perhaps this is why Unidata still runs pre-6.9 LDM on most
> of its systems? :)

I'm not in charge of the IDD, so I couldn't say.

> daryl
> 
> --
> /**
> * Daryl Herzmann
> * Assistant Scientist -- Iowa Environmental Mesonet
> * http://mesonet.agron.iastate.edu
> */

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: VWS-137049
Department: Support LDM
Priority: Normal
Status: Closed