
[TIGGE #LGY-600646]: Re: Missing fields from CMA



YangXin, Manuel, Baudouin, and Doug:

The following is a bulleted-form outline of the discussions that were
held here in Unidata last Friday and today concerning the data reception
efforts at the CMA.

Unidata TIGGE Data Relay Review 20070420


I. Comments on CMA setup from Baudouin, YangXin, etc.:

  1. Port 388 traffic is not volume limited (a practice commonly known as
     packet shaping) by CMA

  2. FTP has been used to send large volumes of data from CMA to ECMWF

  3. There are two large users of the 100 Mbps link from CMA to CSTNET: FTP
     and the LDM.  FTP uses 0.75 GB/hr for 2 hours, 4 times a day (approx.
     6 GB/day).


II. What we have learned from our investigations:

  1. CMA LDM/IDD system setup:

     * 3 GB LDM queue, which holds approx. 400 seconds of data (an implied
       ingest rate of roughly 7.5 MB/sec)

     * 4 GB RAM

     * 4 x 3 GHz dual-core, 64-bit Xeon processors

     * RedHat Enterprise 4.0, 2.6.9-42 kernel

     * 11 redundant feed requests (10 + one for manifest) to 2 sites,
       ECMWF and NCAR

       o ECMF and EGRR products

     * 11 feed requests to NCAR by IP address

       o KWBC products

     * a high number of PERL script "resend" requests (up to 80 running
       concurrently) invoked from pqact on receipt of "missing" product
       identifier products sent by ECMWF.  Each of these invokes pqinsert,
       increasing the number of queue write locks and decreasing the
       residency time of real-time data in the queue.  (An illustrative
       sketch of the request and pqact entries follows at the end of this
       list.)

  2. ECMWF was pacing BABJ data (FTPed from CMA) into its LDM queue to
     backfill the archive

  3. Port 388 traffic is being packet shaped

     * we are able to transfer 30-40 times as much data using port 8080
       (NB: we do not know whether port 8080 traffic is also being shaped;
       if it is, the shaping is less strict than that applied to port 388)

  4. Data Processing is being done on LDM relay machines at all TIGGE centers

  5. On various days and at various times we have seen high packet loss on
     the CMA-to-NCAR link (iperf tests)

  6. CMA can receive ECMWF products faster from NCAR than directly from ECMWF
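
  Illustrative note on item II.1: the following is a minimal sketch of the
  kind of ldmd.conf request entries and pqact.conf "resend" action described
  above.  The host names, the feed type (EXP), the product-ID pattern, and
  the resend script path are placeholders/assumptions, not CMA's actual
  configuration.

      # ldmd.conf: redundant requests for the same data from two upstream sites
      REQUEST EXP ".*" tigge-ldm.ecmwf.example.int
      REQUEST EXP ".*" tigge-ldm.ucar.example.edu

      # pqact.conf: run a resend script when a "missing" product identifier
      # product arrives (fields are tab-separated); the script in turn calls
      # pqinsert (e.g., "pqinsert -f EXP <file>") to put the requested data
      # back into the local queue
      EXP     ^tigge/missing/(.*)
              EXEC    /usr/local/ldm/util/resend.pl \1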


III. What we need to help troubleshoot data transfers to/from CMA

  1. Bandwidth usage statistics (time series plots of network bandwidth use)

     * the GLORIAD website (http://www.gloriad.org) has some usage plots, but
       much of the website has not been updated since 2004, so the
       information is suspect

  2. Other items (this list will likely evolve as we learn more); one
     example, the iperf checks already in use, is sketched below
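
  The iperf checks mentioned in items II.3 and II.5 can be repeated as
  needed; a rough sketch follows (host name, ports, and rates are
  placeholders, and the server end must be run at the remote center):

      # server side (e.g., at NCAR), listening on the port under test
      iperf -s -p 388

      # client side (e.g., at CMA): 60-second TCP throughput test on port 388,
      # then the same test on port 8080 for comparison
      iperf -c test-host.ncar.example.edu -p 388  -t 60
      iperf -c test-host.ncar.example.edu -p 8080 -t 60

      # UDP variant, which reports packet loss directly
      iperf -s -u -p 388                                      # server side
      iperf -c test-host.ncar.example.edu -u -p 388 -b 10M    # client side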


IV. Implications of what has been observed

  1. Small LDM product queue on CMA machine:

     * lots of 2nd trip products (products ingested more than once) received
       from ECMWF:  It is likely that the upstream LDM processes at ECMWF
       are reading products from the older end of the queue because the
       processes are using the ALTERNATE transfer-mode.  The downstream LDM
       processes at CMA are probably requesting those products because
       of the small product-queue at CMA.  This is the likely cause of
       CMA appearing to receive more data than ECMWF injects.

  2. Packet Shaping on Port 388:

     * the LDM cannot be used out-of-the-box, since its default port (388)
       is being shaped

     * the shaping could be extended (by whom?) to other ports as high usage
       continues (according to the GLORIAD website, TIGGE data movement
       dominates the traffic on GLORIAD links)

  3. Pacing of CMA data into ECMWF LDM queue:

    * CMA data lulls (periods when no data is received at the CMA) were likely
      caused by there being no new data inserted into the ECMWF queue that
      would otherwise flow to CMA (i.e., CMA requests for data would not, and
      should not, include CMA's own (aka BABJ) data)


V. UPC Recommendations

  1. CMA

     * increase system RAM to _at least_ 16 GB

     * increase LDM queue size to _at least_ 12 GB (dependent on addition of
       RAM; a sketch of the resize procedure follows at the end of this
       section)

     * remove redundant feed requests (at least until the LDM queue is large
       enough to detect and reject redundant data)

     * contact CSTNET to have the packet shaping on port 388 found and removed

     * set up a connection between Unidata and CSTNET

     * run UPC's "uptime" script so that we can view time-series plots of
       operational parameters (we can install this monitoring tool whenever
       permission is granted)

     * possibility: stop processing "missing" requests from other TIGGE
       centers until the data flow issues have been resolved.  This
       processing is currently done by invoking a PERL script from pqact
       that re-inserts (paces) the requested data into the local queue,
       which also shortens the residency time of real-time data in that
       queue.

  2. ECMWF and NCAR

     * increase LDM queue size to at least 12 GB

       o this will require in-depth investigation of problems seen at ECMWF
         when using a queue larger than 4 GB

     * install a development system on the ECMWF machine currently being used
       to ingest and relay data

       If memory serves correctly, the size of the LDM product queue used
       during initial throughput testing from ECMWF to NCAR in January of
       2006 was much larger than what is currently being used at ECMWF.
       Since the operating system on the current ECMWF LDM/IDD machine is
       the same as what was being used during the original tests, and since
       the configuration of the machine (RAM, etc.) is more-or-less the same
       as the machine that was used for the original tests, we are at a loss
       for why the current machine performs so poorly with LDM queues
       greater than 4 GB.  Having the ability to build the LDM from source
       on the machine running the relay would be an invaluable tool for
       troubleshooting the small queue restriction.

     * agree that only one site (ECMWF or NCAR, not both) should request data
       from the CMA: since CMA's queue holds only about 400 seconds of data
       and port 388 throughput is limited, it is unlikely that either center
       will be able to transfer the data successfully if both are requesting
       it at the same time

  3. All centers

     * offload data processing to machine(s) other than the one(s) being used
       for data transfers

     * separate ingest and send processing (LDM queue residency time for
       ingested data is being affected by insertion of local products)
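
  A rough sketch of the queue-size change recommended above (V.1 and V.2)
  follows.  The exact file and variable names depend on the LDM version in
  use (on LDM 6.x of this vintage the queue size is set in the ldmadmin
  configuration, e.g., etc/ldmadmin-pl.conf); the steps below are an
  outline, not a tested procedure:

      # 1. edit the ldmadmin configuration so the queue is recreated at 12 GB,
      #    e.g., something like:  $pq_size = "12G";
      # 2. then, as the LDM user:
      ldmadmin stop        # stop the LDM
      ldmadmin delqueue    # delete the existing (too small) product queue
      ldmadmin mkqueue     # recreate the queue at the new size
      ldmadmin start       # restart the LDM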


VI. Future Considerations

  1. TIGGE topology review and possible redesign to accommodate more
     participants (e.g., Australian BoM, Brazil's INPE/CPTEC, etc.)

  2. TIGGE product resend review and possible redesign


VII. Comments

  1. Problems seen at CMA are likely caused by:

     * attempts to ingest data through a port whose traffic is being
       artificially limited (packet shaped).  Successful demonstration of
       ingesting through port 8080 reinforces the notion that efforts must be
       made to locate the source of the packet shaping and lobby to have it
       removed.

     * attempts to ingest high volumes of data into a too-small LDM product
       queue while inserting local data into the same queue for transmission
       to downstream sites; this is the most likely cause of the 2nd trip
       products being received

  2. As more sites participate in TIGGE, the problems currently being seen
     will likely grow.  Segregating ingest activities (product requests from
     upstream sites) from feed activities (insertion of local products into
     the LDM queue for transmission to downstream sites) should mitigate
     these problems (a sketch of one such split follows below).
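
  The following is one possible way to begin the segregation described
  above; it is only a sketch, all host names are placeholders, and the EXP
  feed type is an assumption.  Decoding/processing is moved off the relay
  (per V.3), and local product insertion (pqinsert) would likewise be kept
  off the ingest relay:

      # ldmd.conf on the dedicated relay machine: request from upstream and
      # serve downstream only
      REQUEST EXP ".*" upstream-tigge.example.int
      ALLOW   EXP ^downstream-host\.example\.edu$
      # note: no "EXEC pqact" entry and no local pqinsert activity here

      # ldmd.conf on a separate processing machine: request the data from the
      # local relay and run all decoding/processing (pqact) there
      REQUEST EXP ".*" local-relay.example.int
      EXEC    "pqact"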


We welcome discussion of any/all of the comments made above.  Please let us
know if you would like a detailed explanation of any comment or recommendation
made.

Cheers,

Tom
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                   Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************


Ticket Details
===================
Ticket ID: LGY-600646
Department: Support IDD TIGGE
Priority: Normal
Status: Open