
20000313: LDM problem



David,

It sounds to me like dcgrib may be in an infinite loop, because in general
there should only be one copy of dcgrib running for the ruc2 data. Since you
are seeing multiple copies running (that aren't exiting), the LDM has to open
another pipe to the decoder because the previous decoder is no longer draining
its pipe (the pipe is full). I'm not seeing that behavior here on our live x86
LDM with the CONDUIT data, so it may be a problem specific to Linux.
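If it helps, you can get a quick look at how many dcgrib copies are piling up
(and how long each has been running) with something along these lines, run as
the account that owns the LDM (assumed here to be "ldm"):

   # list every dcgrib still running under the LDM account
   ps -fu ldm | grep dcgrib | grep -v grep

   # or just count them
   ps -fu ldm | grep dcgrib | grep -v grep | wc -l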

I have two possible solutions-

1) I copied a fresh "unstripped" Linux executable for dcgrib into
   ~gbuddy/nawips-5.4/binary/linux. The one in the binary tarfile doesn't have
   the debug flags and is stripped. Can you issue a "kill -6" to dcgrib after
   it has been stuck for a long time? That should cause the decoder to dump
   core (presumably in your ~ldm home directory, where the LDM was launched
   from). Once the core file is created, running
   "gdb ~ldm/decoders/dcgrib ~ldm/core" and then the command "where"
   may tell us in what routine the decoder is hung (see the sketch after
   this list).

2) You can try running my dcgrib2 from GEMPAK 6 and see if it solves your
   problem, though I'd like to fix the first problem if possible. I've
   re-written the decoder to use the gb gemlib routines rather than the old
   gribtonc code, so all of that old code is gone.
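Roughly, the sequence for suggestion 1 would look like this (the PID is a
placeholder for whichever dcgrib is stuck, and the core file should land
wherever the LDM was started from, assumed here to be ~ldm):

   # send SIGABRT (signal 6) so the stuck decoder aborts and dumps core
   kill -6 <pid_of_stuck_dcgrib>

   # load the unstripped executable and the core file into gdb, then
   # print the stack trace to see which routine it is hung in
   gdb ~ldm/decoders/dcgrib ~ldm/core
   (gdb) where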


Steve Chiswell


>From: David Wojtowicz <address@hidden>
>Organization: .
>Keywords: 200003131812.LAA05784

>
>Hi,
>
> We're having an LDM problem, specifically with pqact.  It seems that it
>is not terminating decoder processes.  For example, we have a GEMPAK
>dcgrib decoding the hourly RUC output.  After 24 hours, there's
>a process for each of the hourly files: one started at 00Z, one
>started at 01Z, 02Z, etc.  Add the other decoder invocations to
>this and the machine quickly fills up with many decoder processes
>still running, and it becomes very difficult to work with at that
>point; the load average is over 90.0!  They all go away once you type
>"ldmadmin stop".
>
>
> The PIPE commands in the pqact.conf file don't specify the -close flag, so
>that when there is a steady stream of WMO bulletins destined for a given
>decoder it can process them all in one invocation instead of starting up a
>separate decoder for each one.  It is my understanding that in this case the
>action remains open until a maximum limit of open file descriptors is
>reached, at which point the least recently used is closed.
>This doesn't seem to be happening, though.  It would appear that
>the maximum number of entries is defined as 32 by MAXENTRIES
>in pqact, but we see over 90 of them... probably limited by
>the OS limit on the maximum number of processes.
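>For illustration, an action that did force the pipe to close after every
>product would look something like this (the feedtype, pattern, and decoder
>path are just placeholders):
>
>SOMEFEED    some_pattern
>        PIPE    -close  /path/to/decoder
>                -d /path/to/decoder.log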
>
> This problem is only being observed on machines with a handful
>of actions in their pqact file.   Our main LDM machine has
>a large pqact file with many FILE entries in addition to
>a good number of PIPE commands to GEMPAK decoders (for the
>HRS stream).  This machine experiences no problem of this sort.
>
>  The machine that we are again using exclusively for CONDUIT ingest has
>only a handful of pqact entries for invoking the GEMPAK decoders on the
>CONDUIT data.  This machine experiences the problem.  I'd blame
>it on dcgrib, except that we also have dcgrib running on the other
>machine without the problem.  We also have the problem
>on another machine that pipes to a simple shell script but likewise
>has few pqact entries, none of them GEMPAK related.
>
> The machines in question are all running RedHat Linux 6.0/6.1
>with LDM v5.0.9.
>
>
> Here's the pqact file:
>
># GEMPAK pattern/actions
>#
># RUC2/MAPS
>NMC2    ruc2
>        PIPE    /usr/local/packages/gempak/NAWIPS-5.4/bin/linux/dcgrib -d /home/flood/ldm2/runtime/logs/dcgrib_nmc2.log
>                -t 60 -g /usr/local/packages/gempak/NAWIPS-5.4/gempak5.4/tables -m 7500
>                PACK /home/flood/data/conduit/YYMMDDHH_maps2_grid@@@.gem
>#
># ETA grids
>NMC2    eta
>        PIPE    /usr/local/packages/gempak/NAWIPS-5.4/bin/linux/dcgrib -d /home/flood/ldm2/runtime/logs/dcgrib_uswrp.log
>                -t 60 -g /usr/local/packages/gempak/NAWIPS-5.4/gempak5.4/tables -m 9000
>                PACK /home/flood/data/conduit/YYMMDDHH_eta_grid@@@.gem
>#
># MRF grids
>NMC2    mrf/(mrf|ens).(......)/drfmr.T(..)Z
>        PIPE    /usr/local/packages/gempak/NAWIPS-5.4/bin/linux/dcgrib -d /home/flood/ldm2/runtime/logs/dcgrib_uswrp.log
>                -t 60 -g /usr/local/packages/gempak/NAWIPS-5.4/gempak5.4/tables -m 9000
>                PACK /home/flood/data/conduit/YYMMDDHH_mrf_grid@@@.gem
>#
># Ensemble mrf grids
>NMC2    ens/ens.(......)/tar
>        PIPE    /usr/local/packages/gempak/NAWIPS-5.4/bin/linux/dcgrib -d /home/flood/ldm2/runtime/logs/dcgrib_uswrp.log
>                -t 60 -g /usr/local/packages/gempak/NAWIPS-5.4/gempak5.4/tables -m 9500
>                PACK /home/flood/data/conduit/YYMMDDHH_ens_grid@@@.gem
>#
># AVN grids
>NMC2    avn/avn.(......)/gblav.T(..)Z
>        PIPE    /usr/local/packages/gempak/NAWIPS-5.4/bin/linux/dcgrib -d /home/flood/ldm2/runtime/logs/dcgrib_uswrp.log
>                -t 60 -g /usr/local/packages/gempak/NAWIPS-5.4/gempak5.4/tables -m 9000
>                PACK /home/flood/data/conduit/YYMMDDHH_avn_grid@@@.gem
>#
># Status message logging
>NMC2    ^.status\.(.*) [0-9][0-9][0-9][0-9][0-9][0-9]
>        FILE    -close  /home/flood/ldm2/runtime/logs/%Y%m%d%H_nmc2.status
>
>
>
>--------------------------------------------------------
> David Wojtowicz, Research Programmer/Systems Manager
> Department of Atmospheric Sciences Computer Services
> University of Illinois at Urbana-Champaign
> email: address@hidden  phone: (217)333-8390
>--------------------------------------------------------
>
>
>
>