Re: [ldm-users] LDM Metrics

Eric,

Gilbert did a good job of explaining the plots. I'll just take this
opportunity to add some minor refinements.

> Number of Bytes Used in Queue vs. Time
> 
> Umm...the number of bytes used in the queue. Basically this is kind of
> deceiving. It should always be very, very close to full. I can tell you
> have a 6G queue, and it's usually full, which means you have a stable
> queue. Theoretically if you're not using your full queue you could
> reduce your size, I think that's the theory behind this, but I think the
> inherent problem with LDM is it only clears space when it's needed...at
> least that's how I understand it. So this has good intentions to help
> you know if you can get away with a smaller queue, but I don't think it
> provides the greatest analysis...I like the Age vs Time much better...

I used to use this plot in order to determine if the product-queue was
limited by the amount of data that it could hold or by the number of
products that it could hold. With the advent of the queue-reconciliation
feature in LDM 6.9, the importance of this plot has diminished.
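As a rough illustration of that comparison (a sketch only, not anything
the LDM itself provides), one could compare how full the queue is in
bytes against how full it is in product slots. The function name, its
parameters, and the numbers in the usage lines are all hypothetical:

```python
def limiting_resource(bytes_used, bytes_capacity, products_held, slot_capacity):
    """Report which queue resource is closer to exhaustion.

    All four arguments are hypothetical readings taken from the plots
    and from the queue's configured size; none come from a real LDM API.
    """
    if bytes_capacity <= 0 or slot_capacity <= 0:
        raise ValueError("capacities must be positive")
    byte_fraction = bytes_used / bytes_capacity
    slot_fraction = products_held / slot_capacity
    # Whichever fraction is larger is the resource that fills first.
    return "bytes" if byte_fraction >= slot_fraction else "slots"


# Made-up example: a 6 GB queue that is nearly full by size but has
# plenty of free product slots is limited by the amount of data.
print(limiting_resource(5_900_000_000, 6_000_000_000, 100_000, 400_000))
```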

> Space vs Time - Memory consumption. Pretty obvious. I won't elaborate.

The important line is the "Used (Memory + Swap)" one. If it gradually
increases over weeks, then the LDM might have a memory leak.
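
One way to check for such a trend, rather than judging the plot by eye,
is to fit a straight line to the readings. The following is a minimal
sketch under assumed inputs: the (day, bytes) samples are invented, and
how you extract them from your own metrics is left open.

```python
def slope_per_day(samples):
    """Least-squares slope of (day, used_bytes) samples, in bytes/day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    num = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("all samples share the same time")
    return num / den


# Invented data: four weeks of daily "Used (Memory + Swap)" readings
# that creep upward by about 5 MB per day.
samples = [(day, 2_000_000_000 + 5_000_000 * day) for day in range(28)]
print(f"growth: {slope_per_day(samples) / 1e6:.1f} MB/day")
```

A slope that stays positive over several weeks is only a hint of a
leak; other processes on the host contribute to the same line.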

> Number of Products in Queue vs. Time
> 
> Ok, this is my #2. It tells you how many products you currently hold in
> queue. So if this graph levels off (flat lines)...that means you've
> stopped receiving products. It should reset to zero when you remake the
> queue but NOT necessarily when you restart. Only after a delqueue
> command (because you've deleted the queue and thus all items in it!).

Like the "Number of Bytes Used in Queue vs. Time" plot, the importance
of this plot has diminished for the same reason.

> Age of Oldest Product in Queue vs. Time
> 
> This is my #1 most important graph. This is your recovery window. It is
> the age in hours of your oldest product.
> 
> So let's say you have a crash downstream and lose a host. To get it back
> up without losing any data you have to know how long data is held in the
> LDM queue upstream. The critical points are the low ones, because at
> those times you have the shortest window. So based on your chart I would
> say you have 3.5 hours to get the host back online without losing any
> information.

This plot also has meaning for the downstream sites: it indicates how
far back they can request data and expect to receive it.
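
The "low points" reasoning above amounts to taking the minimum of the
plotted ages. A small sketch, assuming you have the oldest-product ages
as a list of hours (the values below are invented, not read from
Eric's plot):

```python
def recovery_window_hours(oldest_age_hours):
    """Shortest observed age of the oldest product, in hours.

    This is the worst-case time available to bring a downstream host
    back before the upstream queue no longer holds the missed data.
    """
    if not oldest_age_hours:
        raise ValueError("no samples")
    return min(oldest_age_hours)


def can_recover(outage_hours, oldest_age_hours):
    """True if an outage of the given length fits inside the window."""
    return outage_hours <= recovery_window_hours(oldest_age_hours)


# Invented readings from an "Age of Oldest Product" plot.
ages = [5.0, 4.2, 3.5, 4.8]
print(recovery_window_hours(ages), can_recover(3.0, ages), can_recover(4.0, ages))
```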

> To utilize this you would tune downstream max_latency and offset
> parameters to look back far enough and ensure you request everything
> from upstream.
> 
> Also, as is with some of our development tasks, we make small queues and
> need to make sure the processing finishes before the items are deleted
> from the queue. So if a process takes a long time, as can be the case
> during development, you might miss products along the way. This is
> mainly because we've found LDM doesn't exactly multitask; it seems to
> take a product and traverse the pqact entries. So if one entry takes an
> hour to process and the product is deleted from the queue in the
> meantime, the next pqact entry will never execute because the product
> has been deleted. This is observation and experimentation; maybe
> someone else knows better, but it's what I've seen.

Yes, pqact(1) processes a product by sequentially matching it against
entries in its configuration-file. The product is "locked", however,
while it's being processed and can't be deleted from the queue.
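
To picture the sequential matching, here is a toy model in Python. It
is not pqact's implementation: the entries are reduced to hypothetical
(pattern, action-name) pairs, feedtypes are ignored, and the patterns
and product identifier are made up.

```python
import re


def matching_actions(product_id, entries):
    """Return the actions whose pattern matches, in configuration order.

    Each entry is tried in turn against the same product identifier;
    a match on one entry does not stop later entries from being tried.
    """
    return [action for pattern, action in entries
            if re.search(pattern, product_id)]


# Hypothetical configuration entries, in file order.
entries = [
    (r"^SA", "file_surface_obs"),
    (r"KORD", "notify_chicago"),
    (r"^FT", "file_forecasts"),
]

print(matching_actions("SAUS70 KORD 121200", entries))
```

In this model the product is held for the whole loop, which mirrors
the locking described above: every entry gets its turn before the
product can go away.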


