YangXin, Manuel, Baudouin, and Doug:

The following is a bulleted outline of the discussions that were held
here at Unidata last Friday and today concerning the data reception
efforts at the CMA.

Unidata TIGGE Data Relay Review 20070420

I. Comments on CMA setup from Baudouin, YangXin, etc.:

   1. Port 388 traffic is not volume limited (commonly known as packet
      shaped) by CMA

   2. FTP has been used to send large volumes of data from CMA to ECMWF

   3. There are two large users of the 100 Mbps link from CMA to CSTNET:
      FTP and LDM. FTP uses 0.75 GB/hr for 2 hours, 4 times a day.

II. What we have learned from our investigations:

   1. CMA LDM/IDD system setup:

      * 3 GB LDM queue which holds approx. 400 seconds of data
      * 4 GB RAM
      * 4 x 3 GHz dual-core, 64-bit Xeon processors
      * RedHat Enterprise 4.0, 2.6.9-42 kernel
      * 11 redundant feed requests (10 + one for manifest) to 2 sites,
        ECMWF and NCAR
          o ECMF and EGRR products
      * 11 feed requests to NCAR by IP address
          o KWBC products
      * high number of PERL script "resend" requests (up to 80 running
        concurrently) invoked from pqact on receipt of "missing" product
        identifier products sent by ECMWF. Each of these invokes
        pqinsert, increasing the number of queue write locks and
        decreasing the residency of real-time data.

   2. ECMWF was pacing BABJ data (FTPed from CMA) into their LDM queue
      to backfill the archive

   3. Port 388 traffic is being packet shaped

      * we are able to transfer 30-40 times the amount of data using
        port 8080 (NB: we don't know if port 8080 traffic is also being
        shaped; if it is, it is being shaped less strictly than port 388
        traffic)

   4. Data processing is being done on LDM relay machines at all TIGGE
      centers

   5. On various days at various times we have seen high packet loss on
      the CMA-to-NCAR link (iperf tests)

   6. CMA can receive ECMWF products faster from NCAR than directly from
      ECMWF

III. What we need to help troubleshoot data transfers to/from CMA

   1. Bandwidth usage statistics (time series plots of network bandwidth
      use)

      * the GLORIAD website (http://www.gloriad.org) has some usage
        plots, but much of the website has not been updated since 2004,
        so the information is suspect

   2. Other (list will likely evolve as we learn more)

IV. Implications of what has been observed

   1. Small LDM product queue on CMA machine:

      * lots of 2nd trip products (products ingested more than once)
        received from ECMWF:

        It is likely that the upstream LDM processes at ECMWF are
        reading products from the older end of the queue because the
        processes are using the ALTERNATE transfer-mode. The downstream
        LDM processes at CMA are probably requesting those products
        because of the small product-queue at CMA. This is the likely
        cause of CMA appearing to receive more data than ECMWF injects.

   2. Packet shaping on port 388:

      * LDM cannot be used out-of-the-box
      * shaping could be extended (by whom?) to other ports as high
        usage continues (according to the GLORIAD website, the TIGGE
        data movement dominates the traffic on GLORIAD links)

   3. Pacing of CMA data into ECMWF LDM queue:

      * CMA data lulls (periods when no data is received at the CMA)
        were likely caused by no new data being inserted into the ECMWF
        queue that would otherwise flow to CMA (i.e., CMA requests for
        data would, and should, not include their own (aka BABJ) data).

V. UPC Recommendations

   1. CMA

      * increase system RAM to _at least_ 16 GB
      * increase LDM queue size to _at least_ 12 GB (dependent on
        addition of RAM)
      * remove redundant feed requests (at least until the LDM queue is
        large enough to detect and reject redundant data)
      * contact CSTNET to have the packet shaping for port 388 found and
        removed
      * set up a connection between Unidata and CSTNET
      * run UPC's "uptime" script so that we can view time-series plots
        of operational parameters (we can install this monitoring tool
        whenever permission is granted)
      * possibility: do not attempt to process "missing" requests from
        other TIGGE centers to re-insert data into the queue until data
        flow issues have been resolved. (Resends are done by invoking a
        PERL script from pqact to pace data into the local queue; this
        also shortens residency time in the queue.)

   2. ECMWF and NCAR

      * increase LDM queue size to at least 12 GB
          o this will require in-depth investigation of problems seen at
            ECMWF when using a queue larger than 4 GB
      * install a development system on the ECMWF machine currently
        being used to ingest and relay data

        If memory serves correctly, the size of the LDM product queue
        used during initial throughput testing from ECMWF to NCAR in
        January of 2006 was much larger than what is currently being
        used at ECMWF. Since the operating system on the current ECMWF
        LDM/IDD machine is the same as what was being used during the
        original tests, and since the configuration of the machine (RAM,
        etc.) is more-or-less the same as the machine that was used for
        the original tests, we are at a loss for why the current machine
        performs so poorly with LDM queues greater than 4 GB. Having the
        ability to build the LDM from source on the machine running the
        relay would be an invaluable tool for troubleshooting the small
        queue restriction.

      * agree that only one site (ECMWF or NCAR, not both) should
        request data from the CMA. Since CMA's queue only holds 400
        seconds of data and port 388 throughput is limited, it is
        unlikely that either center will be able to transfer data
        successfully if both request it.

   3. All centers

      * offload data processing to machine(s) other than the one(s)
        being used for data transfers
      * separate ingest and send processing (LDM queue residency time
        for ingested data is being affected by insertion of local
        products)

VI. Future Considerations

   1. TIGGE topology review and possible redesign to accommodate more
      participants (e.g., Australian BoM, Brazil's INPE/CPTEC, etc.)

   2. TIGGE product resend review and possible redesign

VII. Comments

   1. Problems seen at CMA are likely caused by:

      * attempts to ingest data through a port whose traffic is being
        artificially limited (packet shaped). The successful
        demonstration of ingesting through port 8080 reinforces the
        notion that efforts must be made to locate the source of the
        packet shaping and lobby to have it removed.
      * attempts to ingest high volumes of data into an LDM product
        queue that is too small, while inserting local data into the
        same queue for transmission to downstream sites. This is most
        likely the cause of the receipt of 2nd trip products.

   2. As more sites participate in TIGGE, the problems currently being
      seen will likely grow. Segregating ingest activities (product
      requests from upstream sites) from feed activities (insertion of
      local products into the LDM queue for transmission to downstream
      sites) should mitigate these problems.

We welcome discussion of any/all of the comments made above. Please let
us know if you would like a detailed explanation of any comment or
recommendation made.

Cheers,

Tom
****************************************************************************
Unidata User Support                                    UCAR Unidata Program
(303) 497-8642                                                 P.O. Box 3000
address@hidden                                             Boulder, CO 80307
----------------------------------------------------------------------------
Unidata HomePage                       http://www.unidata.ucar.edu
****************************************************************************

Ticket Details
===================
Ticket ID: LGY-600646
Department: Support IDD TIGGE
Priority: Normal
Status: Open
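
A rough back-of-the-envelope check of the queue figures quoted in
sections II and V may be useful. The sketch below assumes decimal
gigabytes (the review does not say which unit is meant) and a constant
insertion rate; the function names are made up for illustration. Note
that the implied rate covers everything written to the queue, including
local products and resends, so it is not a measured network rate.

```python
GB = 10**9  # assumption: decimal gigabytes; the review does not specify


def insertion_rate(queue_bytes, residency_s):
    """Average bytes/s written to a queue that turns over in residency_s."""
    return queue_bytes / residency_s


def residency(queue_bytes, rate_bytes_per_s):
    """Seconds a product stays in the queue at a constant insertion rate."""
    return queue_bytes / rate_bytes_per_s


def to_mbps(rate_bytes_per_s):
    """Convert bytes per second to megabits per second."""
    return rate_bytes_per_s * 8 / 1e6


# 3 GB queue holding approx. 400 seconds of data (section II.1)
rate = insertion_rate(3 * GB, 400)
print(f"implied insertion rate: {rate / 1e6:.1f} MB/s ({to_mbps(rate):.0f} Mbps)")

# Current queue size versus the recommended minimum of 12 GB (section V.1)
for size_gb in (3, 12):
    secs = residency(size_gb * GB, rate)
    print(f"{size_gb:>2} GB queue: {secs:.0f} s ({secs / 60:.1f} min)")

# FTP usage of 0.75 GB/hr (section I.3), expressed as a link rate
ftp_mbps = to_mbps(0.75 * GB / 3600)
print(f"FTP at 0.75 GB/hr: {ftp_mbps:.2f} Mbps")
```

Under these assumptions, a 12 GB queue would hold roughly 1600 seconds
(about 27 minutes) of data at the current insertion rate, four times
the present residency.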
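
The link between a small product queue and 2nd trip products (section
IV.1) can be illustrated with a toy model. This is only a sketch, not
the LDM implementation: ToyProductQueue, count_second_trips, and all
sizes and delays are hypothetical. It assumes a queue can reject a
duplicate only while the first copy is still resident, and that a late
copy of every product arrives from a redundant feed after a fixed delay.

```python
from collections import OrderedDict


class ToyProductQueue:
    """Fixed-capacity FIFO queue that rejects duplicates still resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.products = OrderedDict()  # signature -> size, oldest first

    def insert(self, signature, size=1):
        """Return True if the product was accepted, False if a duplicate."""
        if signature in self.products:
            return False
        # Evict the oldest products until the new one fits
        while self.products and self.used + size > self.capacity:
            _, old_size = self.products.popitem(last=False)
            self.used -= old_size
        self.products[signature] = size
        self.used += size
        return True


def count_second_trips(capacity, n_products=100, delay=30):
    """Count late copies accepted because the first copy was already evicted."""
    queue = ToyProductQueue(capacity)
    second_trips = 0
    for step in range(n_products + delay):
        if step < n_products:
            queue.insert(f"prod-{step}")  # first arrival
        late = step - delay
        if late >= 0 and queue.insert(f"prod-{late}"):  # late redundant copy
            second_trips += 1
    return second_trips


print(count_second_trips(capacity=10))  # queue shorter than the delay -> 100
print(count_second_trips(capacity=40))  # queue longer than the delay  -> 0
```

In this model every late copy is accepted as a 2nd trip product when the
queue holds fewer products than arrive during the delay, and none are
accepted once the queue is large enough to still hold the first copy.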