
20030723: Problem with LDM/NOAAport ingestor



Kevin,

The 4-channel system I wrote reads the incoming data, computes the
MD5 checksum as the data streams in, and then inserts the product into
the queue directly. This avoids other processes, named pipes, and the like.
It also allows the MD5 to be computed as the data blocks arrive, rather than
waiting for the entire product to arrive and then having
pqing compute the checksum, which matters much more with
26MB satellite images. Also, your PC card probably has a RAM
buffer on it, so if necessary the card will provide the
buffer space.
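
To illustrate the streaming checksum idea, here is a minimal Python sketch
(not the ingestor's actual code). The function name md5_streaming and the
sample product are made up for the example; the point is only that updating
the digest block by block gives the same MD5 as hashing the whole product
after it has arrived.

```python
import hashlib


def md5_streaming(blocks):
    """Fold each block into the digest as it arrives.

    Returns the hex digest and the total number of bytes seen.
    """
    digest = hashlib.md5()
    total = 0
    for block in blocks:
        digest.update(block)  # incremental work, done while waiting for more data
        total += len(block)
    return digest.hexdigest(), total


# Hypothetical product, split into arbitrary 4000-byte blocks
product = bytes(range(256)) * 64
blocks = [product[i:i + 4000] for i in range(0, len(product), 4000)]

streamed, size = md5_streaming(blocks)
whole = hashlib.md5(product).hexdigest()
print(streamed == whole, size)
```

With this approach the checksum is ready the moment the last block arrives,
so nothing has to re-read a large product just to compute its signature.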

Some points here you may want to consider:

1) You didn't say so, but I'm assuming you are using pqing to read from
your named pipes. It sounds like your named pipe would have to be full
in order to drop something. Is your program checking for this
condition? How do you handle it, or do things get dropped on the floor?

2) You generally don't want a program to depend on something else without
buffering. In your approach, you are losing the benefit of the
on-board memory of the card. Your buffer in the pipe is probably limited.
One alternative is to have your program write to a cyclical file
(buffer), and have a separate process read from the cyclical file and feed the
FIFO, but you would still need to check for write errors.
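
As a rough illustration of both points, the Python sketch below detects a
full pipe and keeps a bounded backlog in memory instead of dropping frames
silently. It is an assumption-laden example, not your ingestor: the class
name BufferedPipeWriter, the max_backlog limit, and the use of an anonymous
pipe in place of your named pipes are all hypothetical.

```python
import os
from collections import deque


def make_nonblocking_pipe():
    """Create a pipe whose write end fails fast instead of blocking."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    return read_fd, write_fd


class BufferedPipeWriter:
    """Queue frames in memory while the pipe is full, and count real drops."""

    def __init__(self, fd, max_backlog):
        self.fd = fd
        self.backlog = deque()
        self.max_backlog = max_backlog
        self.dropped = 0

    def flush(self):
        """Write as much backlog as the pipe accepts; True if it all went out."""
        while self.backlog:
            frame = self.backlog[0]
            try:
                written = os.write(self.fd, frame)
            except BlockingIOError:
                return False  # pipe is full: the reader is not keeping up
            if written < len(frame):
                self.backlog[0] = frame[written:]  # keep the unwritten tail
            else:
                self.backlog.popleft()
        return True

    def put(self, frame):
        """Accept a frame, or record a drop once the backlog limit is hit."""
        if len(self.backlog) >= self.max_backlog:
            self.dropped += 1
            return False
        self.backlog.append(frame)
        self.flush()
        return True
```

The useful part is that a full pipe becomes a visible event (a growing
backlog, then a drop counter) rather than a lost frame that you only find
out about from the sequence numbers.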


The LDM queue cleaning is generally efficient; pqexpire is much more costly
since it has to search the queue. A fast machine should not notice
that overhead. You probably want a larger queue anyhow, since a T-1 is capable
of exceeding 400MB an hour.
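
The arithmetic behind that figure, as a small Python sketch. It assumes the
nominal T-1 line rate of 1.544 Mbit/s (framing included, so this is an upper
bound on payload) and takes 400 MB as the queue size, which is roughly what
the byte counts in your pqmon output appear to show. The function names are
just for the example.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T-1 line rate, framing included


def bytes_per_hour(bits_per_second):
    """Bytes delivered in one hour at a sustained bit rate."""
    return bits_per_second / 8 * 3600


def hours_to_fill(queue_bytes, bits_per_second):
    """Hours of sustained input needed to fill a queue of the given size."""
    return queue_bytes / bytes_per_hour(bits_per_second)


hourly = bytes_per_hour(T1_BITS_PER_SECOND)  # 694,800,000 bytes
minutes = hours_to_fill(400_000_000, T1_BITS_PER_SECOND) * 60

print(f"{hourly / 1e6:.1f} MB per hour at full line rate")
print(f"{minutes:.1f} minutes to fill a 400 MB queue")
```

At full line rate a 400 MB queue holds well under an hour of data, so the
queue size needs to be chosen against how much history you actually want to
keep.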

Steve Chiswell







>From: "Kevin R. Tyle" <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200307232218.h6NMI8Ld008737

>Hi,
>
>First, this question pertains to work that I am doing as a
>consultant for a non-Unidata member (MESO, Inc.), so I understand this
>might not be the right place to send this, but hey, it's an
>interesting problem.
>
>MESO basically did a "do-it-yourself" installation of a NOAAport
>system.  Besides the appropriate satellite dish/EFR-54 Receiver system,
>we use a 2.6 GHz dual CPU Intel P-4 that is running RH 8.0.  A Cyclades
>PC300 card is used on the PC to receive the data from the receiver.
>Presently, we are only ingesting data on the NCEP/NWSTG channel.
>The PC has three 36 GB SCSI disks, and uses the EXT3 logging
>filesystem (although I have experimented placing the LDM
>product queue on its own disk separate from the rest
>of the LDM-related files, on a non-logging ext2 filesystem.)
>
>Our ingest program receives frames from the card, strips out the
>extraneous headers, and basically puts everything into an
>LDM-friendly format.  Depending on the WMO ID, products are
>separated into DDPLUS, HDS, and NMC3 feeds.  The data is output
>into three named pipes, corresponding to the three data feeds.
>The LDM then reads from these named pipes.
>
>Basically, when I start the LDM, everything goes well, for a time.
>All frames are received (we check for sequential frame #'s and
>product ID's).  But, after a certain period of time, say an
>hour or so, we begin to lose frames.  Sometimes a couple, sometimes
>about 10 or so.  And once it starts, it's basically useless until
>the ingestor and LDM are restarted.  If I run the ingestor without
>the LDM (e.g., just cat'ing the named pipes into /dev/null), no
>frame skipping occurs.
>
>I knew I was onto something when I found that when I remade the
>queue, things would always work well for an hour or so.  I began
>to suspect that when the queue reached its full size, we started
>to see the frame loss.
>
>Here is an example from today.  I started the ingestor at
>1845 UTC.  All goes well for about 90 minutes.  Then, I get
>this in the output from the ingestor:
>
>WMOID = SPAK32, Cat. = 1,LDM sqnm = 688, feed = DDS,Product ID # = 940200
>030723/20:22:45
>
>Previous Frame ID = 634, Current Frame ID = 635
>
>WMOID = UANT01, Cat. = 7,LDM sqnm = 689, feed = DDS,Product ID # = 940201
>030723/20:22:45
>
>Previous Frame ID = 635, Current Frame ID = 636
>
>WMOID = SDUS23, Cat. = 1,LDM sqnm = 690, feed = RAD,Product ID # = 940202
>030723/20:22:45
>
>Previous Frame ID = 636, Current Frame ID = 643
>
>*** BREAK IN FRAME # SEQUENCE!! ***
>
>WMOID = SDUS22, Cat. = 1,LDM sqnm = 691, feed = RAD,Product ID # = 940203
>
>030723/20:22:45
>
>Previous Frame ID = 643, Current Frame ID = 644
>030723/20:22:45
>
>Previous Frame ID = 644, Current Frame ID = 645
>030723/20:22:45
>
>Previous Frame ID = 645, Current Frame ID = 646
>030723/20:22:45
>
>Previous Frame ID = 646, Current Frame ID = 647
>
>WMOID = SDUS51, Cat. = 1,LDM sqnm = 692, feed = RAD,Product ID # = 940205
>
>*** BREAK IN PRODUCT NUMBER SEQUENCE!! ***
>
>Now look at the pqmon output from about that time:
>
>Jul 23 20:22:29 lightning2 pqmon[15276]:  36025     1   61630   397335184   59835        4     37820   2667888 3169
>Jul 23 20:23:29 lightning2 pqmon[15276]:  36018     1   61637   398231640   59835        4     37820   1771432 3169
>Jul 23 20:24:29 lightning2 pqmon[15276]:  36027     1   61628   399151056   59835        4     37820    852016 3169
>Jul 23 20:25:29 lightning2 pqmon[15276]:  36238     1   61417   399592472   59835        4     37820    410600 3168
>Jul 23 20:26:29 lightning2 pqmon[15276]:  36216     1   61439   399991480   59835        4     37820     11592 3163
>Jul 23 20:27:29 lightning2 pqmon[15276]:  36186     1   61469   399994920   59835        4     37820      8152 3147
>Jul 23 20:28:29 lightning2 pqmon[15276]:  36057     1   61598   399999864   59835        4     37820      3208 3137
>Jul 23 20:29:30 lightning2 pqmon[15276]:  35646     1   62009   399980208   59835        4     37820     22864 3124
>Jul 23 20:30:30 lightning2 pqmon[15276]:  35283     1   62372   399994480   59835        4     37820      8592 3112
>Jul 23 20:31:30 lightning2 pqmon[15276]:  35462     1   62193   400000696   59835        4     37820      2376 3120
>Jul 23 20:32:30 lightning2 pqmon[15276]:  34906     1   62749   400000632   59835        4     37820      2440 3077
>Jul 23 20:33:30 lightning2 pqmon[15276]:  34858     1   62797   399999192   59835        4     37820      3880 3057
>Jul 23 20:34:30 lightning2 pqmon[15276]:  34290     1   63365   399996784   59835        4     37820      6288 2959
>Jul 23 20:35:30 lightning2 pqmon[15276]:  33682     1   63973   399997352   59835        4     37820      5720 2885
>Jul 23 20:36:30 lightning2 pqmon[15276]:  33596     1   64059   399998024   59835        4     37820      5048 2861
>Jul 23 20:37:30 lightning2 pqmon[15276]:  32904     1   64751   399992952   59835        4     37820     10120 2805
>Jul 23 20:38:30 lightning2 pqmon[15276]:  32380     1   65275   399999784   59835        4     37820      3288 2708
>Jul 23 20:39:30 lightning2 pqmon[15276]:  32487     1   65168   399989456   59835        4     37820     13616 2706
>Jul 23 20:40:30 lightning2 pqmon[15276]:  32657     1   64998   400001480   59835        4     37820      1592 2717
>Jul 23 20:41:30 lightning2 pqmon[15276]:  32764     0   64892   400003072   59835        4     37820         0 2737
>Jul 23 20:42:30 lightning2 pqmon[15276]:  32956     1   64699   399985120   59835        4     37820     17952 2752
>
>The queue is just about filled up by 20:22, and that's when we see the
>problems start.
>
>I experimented with running pqexpire, running it @ 30 second intervals
>to keep only the last 30 minutes of data.  That cleared the
>queue, but I then found that each time pqexpire ran corresponded almost
>to the second to frame loss errors in the ingestor program.
>
>So it seems to me that the product queue cleanup process, whether it
>is run "automatically" in the modern LDM, or "the old way" using
>pqexpire, slows up pqing reading from the named pipes just enough
>so it can't keep up with the main ingestor program.  By the time
>the data is read from the pipe, some frames have lost their
>"window of opportunity" to get ingested.
>
>Any ideas as to how I might be able to solve this problem
>would be much appreciated.  I am sure that this has to have
>been done before by the outfits that use a Linux box to
>ingest data via the LDM.
>
>For what it's worth, we have the same problem on a much older
>PIII 600 MHz system running RH 6.1.
>
>Many thanks . . .
>
>--Kevin
>
>______________________________________________________________________
>Kevin Tyle, Systems Administrator               **********************
>Dept. of Earth & Atmospheric Sciences           address@hidden
>University at Albany, ES-235                    518-442-4571 (voice)
>1400 Washington Avenue                          518-442-5825 (fax)
>Albany, NY 12222                                **********************
>______________________________________________________________________
>