
20030428: 2 dcgrib2's running



Art,

A second dcgrib2 could start up from the same pqact.conf
entry if:

1) the iowait on your system were causing the output to back up such that
pqact couldn't push any more data down the PIPE. This is a likely
cause of corrupt files, since both invocations would be actively writing.
It's possible that if the problem is increasingly frequent, you are
now seeing greater loads; in particular, if your CONDUIT data
is now getting through more quickly with the new LDM, you could expect
higher peak loads on your system.
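For what it's worth, the backpressure in case 1 is just the kernel's
finite pipe buffer at work. A minimal Python sketch (not LDM code, purely
an illustration) showing that writes to a pipe back up once the reader
stops consuming, the same way pqact's pbuf_flush stalls when the decoder
can't keep up:

```python
import os
import fcntl

# A pipe whose read end nobody drains -- standing in for a decoder
# that has stopped consuming its input.
r, w = os.pipe()

# Make the write end non-blocking so we can observe the limit instead
# of hanging the way a blocking pqact write would.
flags = fcntl.fcntl(w, fcntl.F_GETFL)
fcntl.fcntl(w, fcntl.F_SETFL, flags | os.O_NONBLOCK)

written = 0
chunk = b"x" * 4096
try:
    while True:
        written += os.write(w, chunk)
except BlockingIOError:
    # The kernel pipe buffer is full (typically 64 KiB on Linux);
    # a blocking writer would simply sit here, elapsed time climbing.
    pass

print(f"pipe buffer filled after {written} bytes")
os.close(r)
os.close(w)
```

With a blocking write end, that last write is exactly where pqact would
hang, which is why the "time elapsed" numbers in your log keep growing
before the Broken pipe.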

2) the decoder has hit an infinite loop somewhere and has gotten 
stuck. If this were the case, you probably wouldn't see corrupt
data files since there would be no output coming from the decoder.
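One rough way to spot the stuck-decoder case: a looping decoder writes
nothing, so its log file stops growing. A small Python sketch of that
heuristic -- the log path and the 600-second threshold below are my
assumptions, substitute your site's dcgrib2 log and tolerance:

```python
import os
import time

LOG = "/home/gempak/logs/dcgrib2.log"  # hypothetical path; use your site's log
STALE = 600  # seconds without log growth before suspecting a hung decoder


def looks_stuck(logfile=LOG, stale=STALE):
    """Heuristic check: a decoder spinning in an infinite loop produces
    no new output, so a log file that hasn't been modified in a while
    is suspicious while the process is still running."""
    age = time.time() - os.path.getmtime(logfile)
    return age > stale
```

Cron could run a check like this and kill/restart the decoder when it
trips, though you'd want to confirm the dcgrib2 process actually exists
first so a quiet period isn't mistaken for a hang.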

The [DM 0] error means that the maximum number of grids allowed in a
particular file needs to be increased. I can check the gribkey.tbl
entries for data sets where the number of products is growing.

You might need to split out your pqact.conf actions into separate
dcgrib2 entries. If you log your system performance, you might check
whether iowait has been increasing lately, and whether any other
processes are using more than their normal share of CPU/IO.
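If you don't already log performance, here is a rough Python sketch that
samples the iowait percentage from /proc/stat (this assumes a Linux
kernel that reports the iowait column -- 2.6 and later; the field layout
is taken from proc(5)):

```python
import time


def cpu_times():
    """Read the aggregate CPU counters from the first line of /proc/stat.
    Fields after the 'cpu' label: user nice system idle iowait irq ..."""
    with open("/proc/stat") as f:
        fields = f.readline().split()[1:]
    return [int(v) for v in fields]


def iowait_percent(interval=1.0):
    """Percentage of CPU time spent in iowait over the sample interval."""
    a = cpu_times()
    time.sleep(interval)
    b = cpu_times()
    delta = [y - x for x, y in zip(a, b)]
    total = sum(delta)
    if total == 0 or len(delta) < 5:
        return 0.0  # older kernels have no iowait column
    return 100.0 * delta[4] / total


if __name__ == "__main__":
    print(f"iowait: {iowait_percent():.1f}%")
```

Pairing a sample like this with something along the lines of
"ps -C dcgrib2 -o pid,etime,pcpu" when iowait spikes would show whether
the second decoder coincides with the load.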

Steve Chiswell


>From: "Arthur A. Person" <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200304281841.h3SIfw7U023762

>Hi...
>
>I've been seeing 2 dcgrib2's showing up in a "top" lately which causes our
>grib data collection to falter.  The only clue I see in ldmd.log is:
>
>Apr 28 17:11:19 ls1 pqact[2486]: pbuf_flush 4: time elapsed   2.157145
>Apr 28 17:12:09 ls1 pqact[2486]: pbuf_flush 4: time elapsed   2.740808
>Apr 28 17:12:35 ls1 pqact[2486]: pbuf_flush 4: time elapsed   3.482053
>Apr 28 17:16:00 ls1 pqact[2486]: pbuf_flush (4) write: Broken pipe
>Apr 28 17:16:00 ls1 pqact[2486]: pbuf_flush 4: time elapsed   4.050873
>Apr 28 17:16:00 ls1 pqact[2486]: pipe_dbufput:
>/home/gempak/NAWIPS/bin/linux/dcgrib2-ddata/gempak/lo
>gs/dcgrib2.log-eGEMTBL=/home/gempak/NAWIPS/gempak/tables write error
>Apr 28 17:16:00 ls1 pqact[2486]: pipe_prodput: trying again
>Apr 28 17:16:00 ls1 pqact[2486]: child 1442 terminated by signal 11
>Apr 28 17:16:08 ls1 pnga2area[19021]: Starting Up
>Apr 28 17:16:08 ls1 pqact[2483]: pbuf_flush 5: time elapsed   2.528829
>Apr 28 17:16:08 ls1 pnga2area[19021]: unPNG::    75863    242800  3.2005
>Apr 28 17:16:08 ls1 pnga2area[19021]: Exiting
>
>In the dcgrib2.log file around this time I see:
>
>[1442] 030428/1311 [DM 0]
>[1442] 030428/1311 [DM 0]
>[1442] 030428/1311 [DM 0]
>[1442] 030428/1311 [DM 0]
>[1442] 030428/1311 [DM 0]
>[1442] 030428/1311 [DM 0]
>[18248] 030428/1316 [DC 3]
>[18248] 030428/1316 [DC -11]
>[18248] 030428/1318 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1331 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1336 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1338 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1340 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1342 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1342 [DCGRIB -50] griblen [1111142] > maxgribsize [1000000]
>[18248] 030428/1342 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1342 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1342 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>[18248] 030428/1342 [DCGRIB -55] Grid too large 1830 x 918 [1679940 >
>400000]
>
>Any clues what's causing the second dcgrib2 to start up?  This seems to
>have become more frequent in the last few weeks.  This is on RH 7.2 with
>gempak 5.6.h.
>
>                                 Thanks.
>
>                                   Art.
>
>Arthur A. Person
>Research Assistant, System Administrator
>Penn State Department of Meteorology
>email:  address@hidden, phone:  814-863-1563
>