Re: [ldm-users] LDM Metrics

On 06/14/2011 03:29 PM, Gilbert Sebenste wrote:
> On Tue, 14 Jun 2011, Jeff Lake wrote:
> 
>> I have been using the plotMetrics of LDM for a few months now ...
>> http://ldm01.michiganwxsystem.net/vnstat/index.php
>> I'm a bit lost as to what it's telling me..
>> Is my machine healthy??
>> Is there any place I can dummy these up?
> 
> "Dummy these up"? Hmmmm. Not sure what you mean by that..."dummy up"
> means to "shut up", and I don't think you mean that. Anyway...
> 
Agreed...not sure what you mean by dummy...

> I was waiting for an explanation as well from UNIDATA, but right now Steve
> Emmerson is real busy (so busy he didn't even send out the announcement
> from Friday that LDM 6.9.8 is available, and fixes some significant bugs
> for Solaris/Redhat RHEL/CentOS users...but the announcement is available
> on UNIDATA's web site at http://www.unidata.ucar.edu/software/ldm .
> Go get you some!).
> 
Good to know about this...I'll have to take a look to see what's new.

> Anyhoo, I'd like to know more about the various parameters as well. Load
> average is obvious, and as long as the incoming data amount stays
> roughly the same every day, things look good...but beyond that...I don't
> know.
> 
I'll just go chart by chart.



Number of Bytes Used in Queue vs. Time

The number of bytes used in the queue. This one is a bit deceiving: it
should always be very, very close to full. I can tell you have a 6 GB
queue, and it's usually full, which means you have a stable queue. In
theory, if you're not using the full queue you could reduce its size,
and I think that's the idea behind this graph. The inherent problem is
that LDM only clears space when it's needed...at least that's how I
understand it. So the graph has good intentions (helping you decide
whether you can get away with a smaller queue), but I don't think it
provides the greatest analysis...I like the Age vs. Time graph much
better...




Space vs Time - Memory consumption. Pretty obvious. I won't elaborate.





Number of Products in Queue vs. Time

OK, this is my #2. It tells you how many products you currently hold in
the queue. If this graph levels off (flat-lines), that means you've
stopped receiving products. It should reset to zero when you remake the
queue but NOT necessarily when you restart: only after a delqueue
command (because you've deleted the queue and thus all items in it!).
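
A rough sketch of how you could spot that flat line automatically,
assuming you already have the product counts as a plain list of samples
(the sample values and the threshold of 4 below are made up for
illustration; nothing here reads LDM's own metrics files):

```python
def flat_line_run(counts, tolerance=0):
    """Length of the trailing run of samples that stay within
    `tolerance` of the most recent sample."""
    if not counts:
        return 0
    last = counts[-1]
    run = 0
    for value in reversed(counts):
        if abs(value - last) > tolerance:
            break
        run += 1
    return run


# Hypothetical product counts, oldest first.
samples = [41200, 41950, 42310, 42310, 42310, 42310]

# Treat four identical samples in a row as "stopped receiving products".
stalled = flat_line_run(samples) >= 4
print(stalled)
```

How many identical samples count as a stall depends on your sampling
interval and how bursty your feeds are, so pick the threshold to suit.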




CPU Context Switch Rate vs. Time

Context switching means you're doing a lot of things. To ME, this looks
like a pretty busy box. It's more of a machine-relevant item than an
LDM/data-related one.





Age of Oldest Product in Queue vs. Time

This is my #1 most important graph. This is your recovery window. It is
the age in hours of your oldest product.

So let's say you have a crash downstream and lose a host. To get it back
up without losing any data, you have to know how long data is held in the
LDM queue upstream. The critical points are the low ones, because at
those times you have the shortest window. So based on your chart I would
say you have 3.5 hours to get the host back online without losing any
information.
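
Put as arithmetic, the recovery window is just the minimum of the
oldest-product ages over the period you care about. A minimal sketch,
assuming the ages are available as a list of hours (the values below
are invented so that the low point matches the 3.5 hours I read off the
chart):

```python
def recovery_window_hours(oldest_age_hours):
    """Worst-case recovery window: the lowest point on the
    age-of-oldest-product graph."""
    if not oldest_age_hours:
        raise ValueError("need at least one age sample")
    return min(oldest_age_hours)


def outage_is_recoverable(outage_hours, oldest_age_hours):
    """True if a downstream outage of this length still fits inside
    the worst-case window."""
    return outage_hours <= recovery_window_hours(oldest_age_hours)


# Hypothetical samples of "age of oldest product", in hours.
ages = [5.2, 4.8, 3.5, 4.1, 6.0]

print(recovery_window_hours(ages))
print(outage_is_recoverable(2.0, ages))
print(outage_is_recoverable(4.0, ages))
```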

To utilize this you would tune downstream max_latency and offset
parameters to look back far enough and ensure you request everything
from upstream.
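
As a sketch of turning the chart reading into a number you can tune
with: I'm assuming here that the look-back is expressed in seconds
(check the ldmd documentation for your version), and the 10 minute
safety margin is purely my own choice, not anything LDM prescribes.

```python
def lookback_seconds(window_hours, margin_seconds=600):
    """Convert a recovery window in hours to a look-back in seconds,
    leaving a margin because the oldest products may already be gone
    by the time the downstream host reconnects."""
    seconds = int(window_hours * 3600) - margin_seconds
    return max(seconds, 0)


# 3.5 hours is the low point read off the age graph.
print(lookback_seconds(3.5))
```

The result is only a candidate value for the downstream look-back; how
max_latency and offset combine is best confirmed against the LDM docs.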

Also, for some of our development tasks we make small queues and need
to make sure processing finishes before the items are deleted from the
queue. If a process takes a long time, as can be the case during
development, you might miss products along the way. This is mainly
because we've found LDM doesn't exactly multitask: it seems to take a
product and traverse the pqact entries one after another. So if one
entry takes an hour to process and the product is deleted from the queue
in the meantime, the next pqact entry will never execute for that
product, because the product is gone. This is from observation and
experimentation; maybe someone else knows better, but it's what I've
seen.

This graph has been priceless for tweaking different use cases.

If you stop receiving products, the chart should show a 45-degree,
Bernie Madoff-style growth line. It will also do that whenever you
remake the queue, until you max out the size and start replacing items.
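
The 45-degree line just means the oldest product ages one hour for
every hour that passes, i.e. a slope of about 1. A quick sketch of
checking for that, assuming you have matching lists of sample times and
ages in hours (the data and the 0.05 tolerance are made up):

```python
def age_slope(times_hours, ages_hours):
    """Average slope of the age graph between the first and last
    sample, in hours of age per hour of wall-clock time."""
    if len(times_hours) != len(ages_hours) or len(times_hours) < 2:
        raise ValueError("need two or more matching samples")
    elapsed = times_hours[-1] - times_hours[0]
    if elapsed <= 0:
        raise ValueError("sample times must increase")
    return (ages_hours[-1] - ages_hours[0]) / elapsed


def looks_like_45_degrees(times_hours, ages_hours, tolerance=0.05):
    """True if the oldest product is aging at roughly real time."""
    return abs(age_slope(times_hours, ages_hours) - 1.0) <= tolerance


times = [0.0, 1.0, 2.0, 3.0]
stalled_ages = [3.5, 4.5, 5.5, 6.5]   # nothing is being replaced
healthy_ages = [3.5, 3.6, 3.4, 3.5]   # old products keep rolling off

print(looks_like_45_degrees(times, stalled_ages))
print(looks_like_45_degrees(times, healthy_ages))
```

Note that the slope alone can't tell a stalled feed from a freshly
remade queue that is still filling up; you need the other graphs for
that.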





LDM Connections vs. Time

The number of LDM connections the host has over time. My only complaint
with this is that under perfect circumstances the lines sometimes end up
matching the top and bottom exactly, so it looks empty.




CPU-Modes vs Time & Load-Average vs Time.

Typical CPU information. Looks pretty busy but healthy.





* Disclaimer - These are my interpretations. I was in the LDM training
when Steve introduced these for the first time. If I got something wrong
I hope Steve or someone else will correct me.


-- 
Eric M. Hudish
174 Faith Circle
Boalsburg, PA 16827
Cell: +1.724.977.3314
Goog: +1.814.689.9148

"Duh! To make room for Tuna!"
<http://www.google.com/profiles/eric.hudish>
Search <http://keyserver.pgp.com>