
[LDM #CZE-486853]: GFS grid question



Hi,

re:
> I have these 2 commands running
> notifyme -v -f ANY -h idd.unidata.ucar.edu -l idd.log -p gfs
> notifyme -v -f ANY -h localhost -l onnoaaport -p gfs

Very good.

re:
> grep f306 idd.log | grep 20210826 | grep gfs | wc
>    1430   18556  390100
> grep f306 onnoaaport | grep 20210826 | grep gfs | wc
>    2976   38596  812168
> 
> To me that means I am seeing roughly double the grids on localhost
> (which is on-noaaport) than on idd.unidata.ucar.edu.
> Do you know why this would be happening?

Assuming that the 'notifyme' invocations ran over the exact same
period, it is very strange that your local count would be on the
order of twice what is available from the idd.unidata.ucar.edu top
level IDD relay cluster machines.  This would typically indicate that
the LDM queue on the local machine is too small, resulting in the
receipt of second trip products.  But since you are only REQUESTing
from a single upstream, and since all of the real-server backends of
that upstream (idd.unidata.ucar.edu) are running with 160 GB LDM
queues, you should not be getting second trip products.
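
For reference, a quick way to confirm the queue size and how long
products are staying resident in the queue on your machine (assuming
a reasonably recent LDM 6.x) is:

ldmadmin config    # should print the configured queue size and location
pqmon              # prints queue usage, including the age of the oldest product

If the oldest-product age that 'pqmon' reports drops below the latency
of the products being received, the LDM can no longer recognize
duplicates, and second trip products will be accepted.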

Since I think we have discussed LDM queue size with you before, and I
seem to recall you saying that the LDM queue size on on-dcserve is on
the order of 32 GB, I decided to dig deeper into what may be
happening.  I checked to see which real-server backend your machine
connects to for its CONDUIT feed, and found that node3.unidata.ucar.edu
is currently feeding on-dcserve.ssec.wisc.edu.

Here are the ssec.wisc.edu machines that are being fed CONDUIT from
idd.unidata.ucar.edu:

node0:

2086753 6 feeder nps1.ssec.wisc.edu 20210826160557.363689 TS_ENDT {{CONDUIT, 
"[05]$"}} primary
2086754 6 feeder nps1.ssec.wisc.edu 20210826160557.363902 TS_ENDT {{CONDUIT, 
"[16]$"}} primary
2086755 6 feeder nps1.ssec.wisc.edu 20210826160557.364594 TS_ENDT {{CONDUIT, 
"[38]$"}} primary
2086756 6 feeder nps1.ssec.wisc.edu 20210826160557.364732 TS_ENDT {{CONDUIT, 
"[32]$"}} primary
2086757 6 feeder nps1.ssec.wisc.edu 20210826160557.364816 TS_ENDT {{CONDUIT, 
"[49]$"}} primary
2086758 6 feeder nps1.ssec.wisc.edu 20210826160557.364704 TS_ENDT {{CONDUIT, 
"[27]$"}} primary
2086759 6 feeder nps1.ssec.wisc.edu 20210826160557.364900 TS_ENDT {{CONDUIT, 
"[242]$"}} primary
2086760 6 feeder nps1.ssec.wisc.edu 20210826160557.365722 TS_ENDT {{CONDUIT, 
"[130]$"}} primary

node2:

1934367 6 feeder nps2.ssec.wisc.edu 20210826161124.932914 TS_ENDT {{CONDUIT, 
"[05]$"}} primary
1934368 6 feeder nps2.ssec.wisc.edu 20210826161124.933315 TS_ENDT {{CONDUIT, 
"[38]$"}} primary
1934369 6 feeder nps2.ssec.wisc.edu 20210826161124.933561 TS_ENDT {{CONDUIT, 
"[27]$"}} primary
1934370 6 feeder nps2.ssec.wisc.edu 20210826161124.933861 TS_ENDT {{CONDUIT, 
"[242]$"}} primary
1934371 6 feeder nps2.ssec.wisc.edu 20210826161124.933614 TS_ENDT {{CONDUIT, 
"[16]$"}} primary
1934373 6 feeder nps2.ssec.wisc.edu 20210826161124.934057 TS_ENDT {{CONDUIT, 
"[49]$"}} primary
1934372 6 feeder nps2.ssec.wisc.edu 20210826161124.934083 TS_ENDT {{CONDUIT, 
"[130]$"}} primary
1934375 6 feeder nps2.ssec.wisc.edu 20210826161124.934327 TS_ENDT {{CONDUIT, 
"[32]$"}} primary

node3:

1539276 6 feeder on-dcserve.ssec.wisc.edu 20210825235324.633946 TS_ENDT 
{{NGRID|CONDUIT, ".GFS*"}} primary
1539277 6 feeder on-dcserve.ssec.wisc.edu 20210825235324.644884 TS_ENDT 
{{NGRID|CONDUIT, ".gfs*"}} primary

node4:

1919029 6 feeder npserve1.ssec.wisc.edu 20210826155401.271648 TS_ENDT 
{{CONDUIT, "[16]$"}} primary
1919031 6 feeder npserve1.ssec.wisc.edu 20210826155401.273073 TS_ENDT 
{{CONDUIT, "[27]$"}} primary
1919034 6 feeder npserve1.ssec.wisc.edu 20210826155401.272559 TS_ENDT 
{{CONDUIT, "[242]$"}} primary
1919030 6 feeder npserve1.ssec.wisc.edu 20210826155401.272958 TS_ENDT 
{{CONDUIT, "[32]$"}} primary
1919035 6 feeder npserve1.ssec.wisc.edu 20210826155401.271657 TS_ENDT 
{{CONDUIT, "[49]$"}} primary
1919036 6 feeder npserve1.ssec.wisc.edu 20210826155401.273332 TS_ENDT 
{{CONDUIT, "[130]$"}} primary
1919038 6 feeder npserve1.ssec.wisc.edu 20210826155401.273872 TS_ENDT 
{{CONDUIT, "[05]$"}} primary
1919039 6 feeder npserve1.ssec.wisc.edu 20210826155401.275002 TS_ENDT 
{{CONDUIT, "[38]$"}} primary

node5:

1749046 6 feeder romulus.ssec.wisc.edu 20210826000524.783862 TS_ENDT {{CONDUIT, 
"[32]$"}} primary
1749047 6 feeder romulus.ssec.wisc.edu 20210826000524.783806 TS_ENDT {{CONDUIT, 
"[05]$"}} primary
1749048 6 feeder romulus.ssec.wisc.edu 20210826000524.783508 TS_ENDT {{CONDUIT, 
"[38]$"}} primary
1749050 6 feeder romulus.ssec.wisc.edu 20210826000524.783819 TS_ENDT {{CONDUIT, 
"[16]$"}} primary
1749051 6 feeder romulus.ssec.wisc.edu 20210826000524.783631 TS_ENDT {{CONDUIT, 
"[49]$"}} primary
1749052 6 feeder romulus.ssec.wisc.edu 20210826000524.783928 TS_ENDT {{CONDUIT, 
"[130]$"}} primary
1749053 6 feeder romulus.ssec.wisc.edu 20210826000524.783738 TS_ENDT {{CONDUIT, 
"[27]$"}} primary
1749054 6 feeder romulus.ssec.wisc.edu 20210826000524.783679 TS_ENDT {{CONDUIT, 
"[242]$"}} primary
1749095 6 feeder npserve2.ssec.wisc.edu 20210826000524.813062 TS_ENDT 
{{CONDUIT, "[05]$"}} primary
1749100 6 feeder npserve2.ssec.wisc.edu 20210826000524.807008 TS_ENDT 
{{CONDUIT, "[32]$"}} primary
1749094 6 feeder npserve2.ssec.wisc.edu 20210826000524.811143 TS_ENDT 
{{CONDUIT, "[130]$"}} primary
1749096 6 feeder npserve2.ssec.wisc.edu 20210826000524.815204 TS_ENDT 
{{CONDUIT, "[16]$"}} primary
1749104 6 feeder npserve2.ssec.wisc.edu 20210826000524.826388 TS_ENDT 
{{CONDUIT, "[38]$"}} primary
1749105 6 feeder npserve2.ssec.wisc.edu 20210826000524.835181 TS_ENDT 
{{CONDUIT, "[49]$"}} primary
1749108 6 feeder npserve2.ssec.wisc.edu 20210826000524.846593 TS_ENDT 
{{CONDUIT, "[242]$"}} primary
1749107 6 feeder npserve2.ssec.wisc.edu 20210826000524.843118 TS_ENDT 
{{CONDUIT, "[27]$"}} primary

Some comments about the REQUEST lines on the various machines are in order:

- the REQUESTs from on-dcserve look OK, but they could be consolidated:

  The consolidation and cleanup would be done by:

change:

REQUEST NGRID|CONDUIT ".GFS* idd.unidata.ucar.edu
REQUEST NGRID|CONDUIT ".gfs" idd.unidata.ucar.edu

to:

REQUEST NGRID|CONDUIT "(gfs|GFS)" idd.unidata.ucar.edu

  Do I think that this is the cause of your seeing twice as many GFS
  products on your machine as are available on the upstream?  No,
  I don't, but cleaning things up is always a good idea.
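
  A quick way to sanity-check the consolidated pattern before
  restarting the LDM would be to run 'notifyme' against the upstream
  with the same feedtype and extended regular expression, for example
  (the '-o 3600' looks back one hour):

notifyme -v -f "NGRID|CONDUIT" -h idd.unidata.ucar.edu -o 3600 -p "(gfs|GFS)"

  The products listed should be the ones that the consolidated REQUEST
  would ask for.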

- now, about the other REQUESTs:

  It looks like someone was attempting to split the CONDUIT feed
  REQUESTs, and, at first glance, nothing bothered me.  A closer
  look, however, shows that the way the split was done on all of the
  other ssec.wisc.edu machines is not what I think was intended.
  Let's look at one set as an example:

2086753 6 feeder nps1.ssec.wisc.edu 20210826160557.363689 TS_ENDT {{CONDUIT, 
"[05]$"}} primary
2086754 6 feeder nps1.ssec.wisc.edu 20210826160557.363902 TS_ENDT {{CONDUIT, 
"[16]$"}} primary
2086755 6 feeder nps1.ssec.wisc.edu 20210826160557.364594 TS_ENDT {{CONDUIT, 
"[38]$"}} primary
2086756 6 feeder nps1.ssec.wisc.edu 20210826160557.364732 TS_ENDT {{CONDUIT, 
"[32]$"}} primary
2086757 6 feeder nps1.ssec.wisc.edu 20210826160557.364816 TS_ENDT {{CONDUIT, 
"[49]$"}} primary
2086758 6 feeder nps1.ssec.wisc.edu 20210826160557.364704 TS_ENDT {{CONDUIT, 
"[27]$"}} primary
2086759 6 feeder nps1.ssec.wisc.edu 20210826160557.364900 TS_ENDT {{CONDUIT, 
"[242]$"}} primary
2086760 6 feeder nps1.ssec.wisc.edu 20210826160557.365722 TS_ENDT {{CONDUIT, 
"[130]$"}} primary

  This setup is actually REQUESTing the same products multiple times:
  each pattern is an extended regular expression, so something like
  "[242]$" is a character class that matches any product ID ending in
  '2' or '4', not the literal string '242'.  Because the character
  classes in the eight REQUESTs overlap (an ID ending in '0', for
  example, is matched by both "[05]$" and "[130]$"), many products are
  REQUESTed two or three times.  The receiving LDM should, if its queue
  is large enough, eliminate the duplicates, since the MD5 signature of
  each newly received product is checked against the list of products
  currently in the LDM queue.
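
  A quick way to see the overlap concretely is to test each possible
  final digit of a product ID against the eight patterns with
  'grep -E' (nothing LDM-specific here, just the shell):

# the eight CONDUIT patterns currently REQUESTed by nps1
for d in 0 1 2 3 4 5 6 7 8 9; do
  printf 'IDs ending in %s are matched by: ' "$d"
  for pat in '[05]$' '[16]$' '[38]$' '[32]$' '[49]$' '[27]$' '[242]$' '[130]$'; do
    echo "someproduct$d" | grep -E -q "$pat" && printf '%s ' "$pat"
  done
  echo
done

  Running this shows that IDs ending in '0' through '4' are matched by
  two or three of the patterns each, while IDs ending in '5' through
  '9' are matched by exactly one.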

  I strongly recommend that the CONDUIT REQUESTs on all of the other
  SSEC machines be redone so that there are N mutually exclusive
  REQUESTs whose union will be all of the products in the CONDUIT
  feed.  For instance:

REQUEST         CONDUIT         "[0]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[1]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[2]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[3]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[4]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[5]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[6]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[7]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[8]$"  idd.unidata.ucar.edu
REQUEST         CONDUIT         "[9]$"  idd.unidata.ucar.edu

  This will split the requests into 10 mutually exclusive REQUESTs
  whose union is all of the products in the CONDUIT feed.

re:
> On the XCD side, there was no grb file for forecast hour 306 created on
> on-noaaport. On nps2, we are seeing grib files for forecast hour 306 and
> from what I can tell the full complement.
> XCD spool files for CONDUIT are set to the same size 1936 megabytes.

OK.  It is nearly impossible for us to comment on results on the XCD side.

re:
> Jerry reminded me on-dcserve is a VM, so we are going to start to see if
> something is at the system level.

We ran into one situation where an LDM running in a VM on a user's
machine was having a hard time keeping up with the feeds that were
being REQUESTed.  We were allowed to log on to the user's machine/VM
and found that the file system being written to was on an SSD, which
you would assume would be very fast, but the SSD was simply worn out,
so writes were glacially slow, and this was causing all sorts of
problems with processing the data being received by the LDM.  Is it
likely that a similar situation exists on your machine?  We have no
way of saying.
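
If you would like to rule out the same kind of storage problem, a
rough check of write performance on the file system that holds the
LDM queue and decoded data (the path below is just a placeholder;
'oflag=direct' assumes GNU dd, and 'iostat' assumes the sysstat
package is installed) would be something like:

# time an uncached 1 GB write to the data file system
dd if=/dev/zero of=/data/ldm/ddtest bs=1M count=1024 oflag=direct
rm /data/ldm/ddtest

# watch per-device wait times and utilization while data is arriving
iostat -x 5

Sustained write rates in the low tens of MB/s, or a device sitting at
or near 100% utilization in the 'iostat' output, would be a hint that
something similar is going on.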

Cheers,

Tom
--
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: CZE-486853
Department: Support LDM
Priority: Normal
Status: Closed
===================
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata 
inquiry tracking system and then made publicly available through the web.  If 
you do not want to have your interactions made available in this way, you must 
let us know in each email you send to us.