
[IDD #OMZ-874415]: GOESR ldm feed

Hi Carol,

> I'm glad Greg and I were able to create something that you've never seen
> before!

We're not ;-)  Actually, this has been very interesting for me, at least.

> --is it possible that the ordering of making a new queue was different than
> what you reported?
> Yes and no.
> I confirmed with Greg what we did yesterday.
> 1. restart ldm using 'ldmadmin restart'
> 2. edit the queue size to 8 GBs
> 3. restart ldm using 'ldmadmin restart'
> 4. found that the feed stopped a second time
> 5. stop ldm, delqueue, mkqueue, start ldm, etc.
> 6. we haven't had any issues since deleting the queue.

Ah Ha!  This fits with the only explanation we could come up with:
processes not being able to write to the queue because all of its slots
were locked.  The good news is that after the LDM was stopped, the queue
deleted and remade, and the LDM restarted, there have been no
hiccups/errors.
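For the record, the stop/delqueue/mkqueue/start sequence from step 5 can be
sketched as a small script.  This is a sketch only: it assumes it runs as the
LDM user with ldmadmin on the PATH, and the DRY_RUN guard is our addition for
safe previewing, not part of the LDM itself:

```shell
#!/bin/sh
# Sketch of the queue rebuild that cleared the lock-up.  Assumes the
# commands run as the LDM user with ldmadmin on the PATH.  DRY_RUN=yes
# (the default here) only prints each step; set DRY_RUN=no to execute.
DRY_RUN=${DRY_RUN:-yes}

run() {
    if [ "$DRY_RUN" = yes ]; then
        echo "would run: $*"
    else
        "$@"        # execute the real command
    fi
}

run ldmadmin stop       # stop the LDM and its child processes
run ldmadmin delqueue   # delete the (corrupted) product queue
run ldmadmin mkqueue    # recreate the queue at the size in etc/registry.xml
run ldmadmin start      # restart the LDM with the fresh queue
```

With DRY_RUN at its default of "yes" the script merely lists the steps; run it
as `DRY_RUN=no sh rebuild-queue.sh` to actually rebuild the queue.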

> I would say we did step 5 sometime after 4 pm MDT. Based on this, your
> conclusion explaining the "new queue, 8G in your case, should have remedied
> the problem" is correct!

Yup.  To be clear: it was the deleting and remaking of the queue that
fixed the problem.  Increasing the queue size from 500M to 8G was done
so that the queue has about an hour of data in it.  This is so that
your LDM can detect duplicate products that might be resent in the
feed within an hour of their original transmission.  We are NOT
expecting this situation since there is currently only one source of the
SATELLITE (aka DIFAX) products, but in the future there may be multiple
sources, and then one could experience so-called "second trip" products.
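In case it is useful later, the queue size that step 2 edited lives in the
LDM registry.  The fragment below is an excerpt of what that setting looks
like; the exact surrounding entries assume a stock LDM installation under the
ldm account:

```xml
<!-- excerpt from ~ldm/etc/registry.xml (assumes a stock LDM layout) -->
<queue>
    <size>8G</size>  <!-- big enough to hold roughly an hour of data -->
</queue>
```

After changing this value, the new size only takes effect when the queue is
remade (delqueue/mkqueue), not on a plain restart.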

> Also, since this seems like an investigation haha.... I said "Ok we've
> increased the queue size to 8 GB" at 3:22 pm. I didn't mention that we
> deleted the queue. You had told me to run the stop, delqueue, mkqueue,
> start, but we had only run a restart.

That is what we decided was the only possible explanation.

> After some thinking between Greg and
> me, we realized that we forgot to delete the queue sometime between 3:22 pm
> and 4:15 pm. I would say my reporting was poor here, so that's why there
> was added confusion.

It was a good brain teaser :-)

> Good points about the logs. When we have some time, we'll take a look into
> that.

Very good.

> Finally, any thoughts on how to avoid the corrupted queue in the first
> place?

Actually no.  We run LDMs on our data server machines for LONG stretches
of time without experiencing any problems.  We have _never_ experienced
a situation where LDM processes were unable to write into an existing
queue due to all regions being locked, so we have no advice to offer.

> Also, the grbfile.sh script has been super helpful in getting this
> all setup.

Excellent, I'm glad that it has helped.

> Thanks again for all your efforts into investigating this error yesterday.
> It is much appreciated because we were very confused.

We were really confused also, and that is why all three of us hashed this
out as a group.  Just so you know, we wouldn't have been able to come to
the guesstimate/conclusion that we did if we didn't have login access
to your machine!

One last thing:

- I notice that GRB ingest and CSPP GEO processes are still running on typhoon

  Is this by design, or is turning it off something on the list of things
  to do?


Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
Unidata HomePage                       http://www.unidata.ucar.edu

Ticket Details
Ticket ID: OMZ-874415
Department: Support IDD
Priority: Normal
Status: Closed
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.