
[Support #SBB-325304]: Re: 20110712: CONDUIT request -- fire weather grids



Hi Justin,

re:
> (Went ahead and dropped the EMC guys off this since this is more on the
> dataflow side).

OK.  I removed the same folks from the CCs here.

re:
> comments are inline...

re: measurably smaller volumes from test CONDUIT datastream than from
"operational" one

> I think the issue was a hung process on our ldm2 system. It looks like about
> 30% of our grib inserts were having some errors (but not enough for the
> process to totally fail). I didn't see anything wrong with the number of
> gribinserts we were attempting; they were the same on all three ldm boxes,
> with ldm2 having a few more due to the parallel NAM. But when I did a 'pqmon'
> on ldm2 the max age was 1/3 of what it is on ldm0. It looks like we had a
> hung gribinsert from a few months ago consuming a lot of CPU. Once I killed
> the process the max age in the queue has been steadily increasing. I'll be
> watching the log to see if that took care of the failed inserts.

OK, this is good news from my perspective.

Questions:

- to be clear: ldm2 is the system on which the fireweather products are being
  inserted?

- what were the errors being reported by gribinsert?

- is the reference to the parallel NAM a reference to the fireweather
  products?

re:
> We are running the 'ldmadmin addmetrics' action every minute via cron.

Excellent!


re:
> We are also running rtstats, doesn't that send some data back to you?

Yes, but rtstats doesn't tell us the age of the oldest product in the
LDM queue, so it does not provide the information needed to judge whether
or not the queue is large enough.
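
As a rough illustration of the age check, here is a minimal sketch that
pulls the age of the oldest product out of a pqmon-style summary line
and compares it to a threshold.  It assumes that the age (in seconds) is
the last whitespace-separated field on the line.  The function names and
the 3600 second default are mine, for illustration only; they are not
part of the LDM.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical helpers.

# Print the last field of the last non-empty input line.
# ASSUMPTION: on a pqmon summary line this field is the age, in seconds,
# of the oldest product in the queue.
oldest_age() {
    awk 'NF { age = $NF } END { if (age != "") print age }'
}

# Succeed if the age read from stdin is at least the given minimum.
# ASSUMPTION: the 3600 s default is an illustrative threshold only.
queue_age_ok() {
    min_age=${1:-3600}
    age=$(oldest_age)
    [ -n "$age" ] && [ "$age" -ge "$min_age" ]
}
```

You would feed it the pqmon summary line however that output is captured
on your system, e.g. '... | queue_age_ok 3600 || echo "queue may be too
small"'.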

re:
> Doesn't look like we are running gnuplot though, is there an easy way
> to dump the stats you would be interested in?

'ldmadmin addmetrics' writes to the file ~ldm/logs/metrics.txt.  That
file has all of the information needed to evaluate the queue size
(through the age of the oldest product) and system performance (through
load average, etc.).  It would be useful if you could make that file
(or rather those files, since the file gets rotated every so often)
available to us so we can get a good picture of how things are running.
You can get the same picture yourself on some other machine on which
the LDM is installed and on which gnuplot is available.

re: merge of GRIB table entries later today
> Ok, Thanks!

No worries.

Given the hiccup you describe above, I will want to continue ingesting the
test CONDUIT datastream for the next few days.  This will tell us:

- if the "hung" gribinsert process was really the cause of lower volumes

- if the queue is too small or sufficient for what is being attempted

- how well your system is performing

To assess the last item, it would be useful for us to get copies of
your ~ldm/logs/metrics.txt* files today and again on Monday morning.
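
One possible way to gather the files is sketched below.  It bundles the
current and rotated metrics files into a single gzipped tarball.  The
function name, the default log directory ($HOME/logs, which assumes you
run it as the LDM user) and the output file name are all assumptions for
illustration.

```shell
#!/bin/sh
# Sketch only: bundle_metrics is a hypothetical helper, not an LDM command.
# ASSUMPTIONS: run as the LDM user, so $HOME/logs is ~ldm/logs; the
# default output name is illustrative; the output path must be absolute
# because the function changes directory before running tar.
bundle_metrics() {
    log_dir=${1:-"$HOME/logs"}
    out=${2:-"/tmp/ldm-metrics-$(date +%Y%m%d).tar.gz"}
    # cd in a subshell so the metrics.txt* glob expands inside the log
    # directory and the archive holds bare file names, not full paths.
    ( cd "$log_dir" && tar -czf "$out" metrics.txt* )
}
```

For example, 'bundle_metrics' with no arguments would write a dated
tarball under /tmp, and 'tar -tzf' on that file lists what was included.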

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: SBB-325304
Department: Support CONDUIT
Priority: Normal
Status: Closed