
Re: 20011103: Characterization of data load on LDM and predicting impacts on LDM queue load



>To: address@hidden
>From: Gregory Grosshans <address@hidden>
>Subject: Characterization of data load on LDM and predicting impacts on LDM 
>queue load
>Organization: NOAA/SPC
>Keywords: 200111022252.fA2MqI102912 LDM performance

Gregg,

> Thanks for the response.  We are using LDM 5.1.2.  In regards to
> pbuf_flush log messages, when should one become concerned about
> them, if at all?  You also mentioned Unidata will be watching
> motherlode to see how it handles the increased radar data and if it
> can't handle it you may farm out some of the data and processing to
> another machine.  How will you determine that motherlode can't
> handle the increased data (e.g. corrupted gempak decoded files like
> metar, a larger number of pbuf_flush messages, the load on the
> system climbing high)?

It may be that you are getting pbuf_flush messages as an artifact of
trying to process very large products through pqact.  The code in
pqact that's writing the messages is just testing if it takes longer
than 1 second to write a product, either to a file or to a pipe, and
if so, it emits the message.  I think this code was written back when
no products were bigger than about 20Kbytes, so maybe the arbitrary 1
second threshold needs to be larger.  I think the intent was just to
indicate when pqact was falling behind, either because of slow writes
or slow decoders.  It looks like almost all your pbuf_flush messages
would go away if that threshold were set to 10 seconds instead of 1
second.  The relevant code is in pqact/pbuf.c, line 190:

#ifdef INSTRUMENT
        gettimeofday(&afta, 0);
        diff = diff_timeval(&afta, &b4);
        if(diff.tv_sec > 1)
        {
                uerror("pbuf_flush %d: time elapsed %3ld.%06ld",
                        buf->pfd,
                        diff.tv_sec, diff.tv_usec
                        ); 
        }
#endif

It looks like another way to eliminate these messages from your log
file would be to undefine the "INSTRUMENT" macro and recompile, but I
haven't tested that.
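
As a rough illustration of the threshold change suggested above, here
is a minimal sketch that pulls the hard-wired limit out into a
compile-time constant.  The names FLUSH_WARN_SECS, elapsed_between()
and flush_was_slow() are hypothetical and are not part of the LDM
source; elapsed_between() is only a stand-in for diff_timeval(), whose
definition isn't shown here.  Note that, assuming tv_sec holds whole
seconds, the existing test "diff.tv_sec > 1" only fires once a write
has taken 2 seconds or more, so the sketch uses ">=" to make the
limit explicit.

```c
#include <sys/time.h>

/* Hypothetical tunable (not in LDM); the quoted code hard-wires 1. */
#ifndef FLUSH_WARN_SECS
#define FLUSH_WARN_SECS 10
#endif

/*
 * Return (after - before), normalized so that 0 <= tv_usec < 1000000.
 * Stand-in for LDM's diff_timeval(), which is assumed to behave similarly.
 */
struct timeval
elapsed_between(const struct timeval *after, const struct timeval *before)
{
        struct timeval diff;

        diff.tv_sec = after->tv_sec - before->tv_sec;
        diff.tv_usec = after->tv_usec - before->tv_usec;
        if (diff.tv_usec < 0)
        {
                /* Borrow one second to keep the microseconds non-negative. */
                diff.tv_sec -= 1;
                diff.tv_usec += 1000000;
        }
        return diff;
}

/* Nonzero if the elapsed time is at least threshold_secs whole seconds. */
int
flush_was_slow(const struct timeval *diff, long threshold_secs)
{
        return diff->tv_sec >= threshold_secs;
}
```

With something like this in place, the test in pbuf.c could read
"if(flush_was_slow(&diff, FLUSH_WARN_SECS))", and the limit could be
changed at build time with -DFLUSH_WARN_SECS=N.  Like the INSTRUMENT
suggestion, this is untested against the actual LDM build.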

But I wouldn't worry too much about these messages; the products are
getting filed/decoded, but it's taking a while.  You might look for
ways to put less load on pqact or the processes it calls by filing
products in larger batches or optimizing the decoders you are using or
doing some of the processing on a different machine.

We would probably determine that motherlode couldn't handle the load
by seeing product latencies climb uniformly to all downstream sites, or
by noticing that pqact couldn't keep up with handling all the products
in the queue in a timely fashion.  Determining when a product is first
inserted in the queue and how much later pqact finishes with it can be
done with verbose logging.

--Russ