
20040920: Possible pqact issue in LDM?



Steven,

>Date: Mon, 20 Sep 2004 14:35:13 -0500
>From: "Steven Danz" <address@hidden>
>Organization: Aviation Weather Center
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20040920: Possible pqact issue in LDM?
>Keywords: 200409091803.i89I3pnJ023109

The above message contained the following:

> ...  So far a missed
> product doesn't show up in the log when running in real-time.  Running
> it by hand after the fact and it shows up fine.

The first thing that pqact(1) does with every product is to log
it to the logfile (when in verbose mode).  This is done before
executing any action.  Consequently, it's very hard to believe that
the manually-executed pqact(1) does this but the LDM pqact(1) doesn't
because this would mean that the two pqact(1)s aren't matching the same
data-products.  A more believable hypothesis is that the data-product in
question hasn't yet arrived in the product-queue (but see below).

> So you are saying that it takes over 600 seconds (10 minutes) on a
> system with a load of ~0.05 to act on a product once it is received in
> the queue?

No.  I'm saying that it can take more than ten minutes from the time
that the AWC transmits a data-product destined for NOAAPORT to the time 
that it's received by the AWC's LDM from NOAAPORT.

> I thought the SIGCONT that the pq_insert() generates would 'kick' the
> pqact into action alot sooner than that...

It will.

> Also, the 'missed' product is bracketed by 'caught' products in every
> case so far.  Which was another cause for concern.  So I would guess
> it should be taking care of these things in order, so it should have
> caught the missed product.

Well, that kills my "more believable hypothesis" scenario, above.

I hope you're sure about this.

Are there any reconnections by the monitoring LDM system that's
downstream from the Northrop Grumman NOAAPort system at this time?

Also, as I explained in an email on Mon, 13 Sep 2004 15:44:34 -0600,
a data-product that was inserted into the product-queue just after
the system clock was set backwards could be missed by a reader of the
product-queue.  The behavior of the LDM system in this case would be
consistent with everything that you've related.
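
To make the clock scenario concrete, here is a minimal toy model in
Python.  It is not LDM code: the ToyQueue class, the drain() helper, and
the timestamps are all hypothetical, and the only assumption carried
over from the real system is that a reader advances a time cursor and
asks for the next product inserted strictly later than that cursor.

```python
import bisect


class ToyQueue:
    """Toy stand-in for a product-queue: products sorted by insertion time."""

    def __init__(self):
        self._times = []  # sorted insertion timestamps
        self._names = []  # product names, parallel to _times

    def insert(self, name, clock_now):
        # The product is stamped with whatever the system clock says now.
        i = bisect.bisect_right(self._times, clock_now)
        self._times.insert(i, clock_now)
        self._names.insert(i, name)

    def next_after(self, cursor):
        """Return (time, name) of the first product later than cursor, or None."""
        i = bisect.bisect_right(self._times, cursor)
        if i == len(self._times):
            return None
        return self._times[i], self._names[i]


def drain(queue, cursor, seen):
    """Process every product later than cursor; return the advanced cursor."""
    while True:
        item = queue.next_after(cursor)
        if item is None:
            return cursor
        cursor, name = item
        seen.append(name)


def simulate():
    queue, seen, cursor = ToyQueue(), [], 0.0
    queue.insert("A", 100.0)
    cursor = drain(queue, cursor, seen)  # reader sees A; cursor is now 100
    queue.insert("B", 80.0)              # clock was set back: stamp < cursor
    cursor = drain(queue, cursor, seen)  # nothing later than 100, B is skipped
    queue.insert("C", 110.0)             # clock has moved past the old cursor
    cursor = drain(queue, cursor, seen)  # reader sees C
    return seen


print(simulate())  # ['A', 'C']
```

In this model the skipped product B is bracketed by products that are
processed normally, and a later run that starts its cursor at the
beginning of the queue would find all three.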

Is the clock on the Northrop Grumman NOAAPort system (on which the LDM is
running) kept accurate somehow?  Is an ntpd(8) daemon running?  Does
root's crontab(1) execute ntpdate(1) periodically?

> I'm wondering if there isn't something 'bad' happening because
> of how the ingesters were written.  I find it odd that they
> pq_open()/pq_close() for each product inserted, not just once for the
> duration of the program.

These LDM ingesters are getting data-products from the AWIPS CP
software?  Are they started by EXEC entries in the LDM
configuration-file?

I can't see how opening and closing the product-queue for every
data-product would cause the problem you're seeing, but that mechanism
is grossly inefficient -- especially for large, memory-mapped files --
and should be corrected ASAP.
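
As a rough illustration of the difference, the following Python sketch
contrasts the two patterns on an ordinary memory-mapped scratch file.
It does not use the LDM API: the file layout, RECORD size, and function
names are hypothetical stand-ins for pq_open(), pq_insert(), and
pq_close().

```python
import mmap
import os
import tempfile

RECORD = 64  # hypothetical fixed slot size, in bytes


def make_queue_file(n_slots):
    """Create a zero-filled scratch file standing in for a product-queue."""
    fd, path = tempfile.mkstemp()
    os.ftruncate(fd, n_slots * RECORD)
    os.close(fd)
    return path


def write_slot(mem, slot, payload):
    """Copy one payload into its slot, padded or truncated to RECORD bytes."""
    data = payload.ljust(RECORD, b"\0")[:RECORD]
    mem[slot * RECORD:(slot + 1) * RECORD] = data


def ingest_reopening(path, products):
    """Open, map, write, unmap, and close once per product."""
    cycles = 0
    for slot, payload in enumerate(products):
        with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mem:
            write_slot(mem, slot, payload)
        cycles += 1
    return cycles  # number of open/map/close cycles performed


def ingest_open_once(path, products):
    """Open and map once, write every product, then unmap and close."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mem:
        for slot, payload in enumerate(products):
            write_slot(mem, slot, payload)
    return 1  # a single open/map/close cycle
```

Both functions leave identical bytes in the file; the first simply pays
the open-and-map cost once per product rather than once per run.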

> The way the pqact is behaving, I'm wondering about signals and if
> something isn't causing pqact to 'jump' away from its normal routine
> and miss a section of the queue.

I don't see how.

Waiting with bated breath,
Steve Emmerson