
[LDM #VWS-137049]: IO thrashing 6.7.1 client against 6.9.7 server



Hi Daryl,

> I've been running into an issue with a downstream 6.7.1 client causing my
> server to start thrashing IO and eventually grind the data flow to a
> halt.  For example, here is sysstat output from this morning:
> 
> 08:20:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
> 08:30:02 AM     all      4.07      0.00      8.36      7.35      0.00     80.22
> 08:40:01 AM     all      4.33      0.00      9.33      7.08      0.00     79.26
> 08:50:02 AM     all      5.49      0.00     13.13     20.69      0.00     60.69
> 09:00:02 AM     all      6.05      0.00     23.20     52.89      0.00     17.86
> 09:10:01 AM     all      5.40      0.00     22.39     56.16      0.00     16.05
> 09:20:03 AM     all      2.39      0.00     19.45     63.24      0.00     14.92
> 09:30:03 AM     all      1.88      0.00     19.11     64.66      0.00     14.34
> 
> At 08:50 Z, the train comes off the tracks.  This was when a downstream
> 6.7.1 host connected.  The system gets behind, but doesn't log anything
> too interesting other than simple things like:
> 
> Jun  5 11:16:27 metfs1 pqact[8693] WARN: Processed oldest product in queue: 6390.94 s

The message from pqact(1) indicates that the process is far behind: if it had 
sufficient resources, it would be working on recently-arrived products rather 
than on one that's almost two hours old (6390.94 s).
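
As an aside, pqmon(1) reports statistics on the product-queue, including the 
age of the oldest product, so you can watch a backlog like this develop 
directly. A minimal invocation against the default queue looks something like 
this (check pqmon(1) on your version for the exact output fields):

    # As the LDM user: print product-queue statistics, among them the
    # age, in seconds, of the oldest product in the queue.
    pqmon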

> At this time, IO is pegged.  My raid array maxes out around 4,000 TPS.  So
> I wake up and try to stop LDM and this logs for all connected hosts.

Is the LDM product-queue in question on a RAID? We've had mixed results doing 
that: sometimes it works and sometimes it doesn't. An easy thing to try would 
be to move the product-queue to local disk to see if the situation improves. 
Can you do that?
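
If you do, the steps would look something like the following. This is a 
sketch, not a recipe: "/localdisk/ldm/ldm.pq" is a made-up local path, and the 
registry parameter name should be verified with regutil(1) on your system 
before you trust it.

    ldmadmin stop                    # quiesce the LDM first
    # Point the registry at a queue on a local, non-RAID disk;
    # "/queue/path" is the registry parameter for the queue location.
    regutil -s /localdisk/ldm/ldm.pq /queue/path
    ldmadmin mkqueue                 # create the new, empty queue
    ldmadmin start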

> Jun  5 12:00:23 metfs1 cumulus.dmes.fit.edu(feed)[5493] ERROR: fcntl F_RDLCK failed for rgn (0 SEEK_SET, 4096) 4: Interrupted system call
> 
> I assume this is from a harsher shutdown of LDM to get it to stop.

I haven't seen this particular error-message, but your analysis seems likely: 
error 4 is EINTR, which means a signal arrived while the process was blocked 
waiting for a read-lock on a region of the product-queue.
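
For what it's worth, here's a minimal sketch (illustrative only, not LDM 
source) of that failure mode: a process blocked in fcntl() waiting for a 
read-lock on a 4096-byte region gets EINTR when a signal handler runs. The 
file name and the use of SIGALRM as the interrupting signal are arbitrary.

    #include <errno.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void on_alarm(int sig) { (void)sig; }  /* interrupt; don't restart */

    int main(void)
    {
        int fd = open("lockfile.tmp", O_RDWR | O_CREAT, 0600);
        struct flock wr = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 4096 };
        pid_t child = fork();

        if (child == 0) {            /* child: hold a write-lock on the region */
            fcntl(fd, F_SETLKW, &wr);
            pause();
            _exit(0);
        }

        sleep(1);                    /* give the child time to take the lock */

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_alarm;    /* no SA_RESTART: fcntl() fails with EINTR */
        sigaction(SIGALRM, &sa, NULL);
        alarm(2);                    /* the "shutdown" signal arrives mid-wait */

        struct flock rd = { .l_type = F_RDLCK, .l_whence = SEEK_SET,
                            .l_start = 0, .l_len = 4096 };
        if (fcntl(fd, F_SETLKW, &rd) == -1)     /* blocks, then is interrupted */
            printf("fcntl F_RDLCK failed for rgn (0 SEEK_SET, 4096) %d: %s\n",
                   errno, strerror(errno));     /* 4: Interrupted system call */

        kill(child, SIGTERM);
        waitpid(child, NULL, 0);
        unlink("lockfile.tmp");
        return 0;
    }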

> Anyway, I comment out the allow for the downstream 6.7.1 host and start
> LDM back up, and there's no more IO thrashing.
> 
> Any ideas about this?  Is there some known issue with old ldm clients and
> 6.9 servers?

I'm not aware of any such issue, and there shouldn't be one: the LDM protocol 
and the handling of data-products didn't change between 6.7 and 6.9.

> Perhaps this is why unidata still runs pre-6.9 ldm on most
> of its systems? :)

I'm not in charge of the IDD, so I couldn't say.

> daryl
> 
> --
> /**
> * Daryl Herzmann
> * Assistant Scientist -- Iowa Environmental Mesonet
> * http://mesonet.agron.iastate.edu
> */
Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: VWS-137049
Department: Support LDM
Priority: Normal
Status: Closed


NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.