
20001218: write errors to mcidas directory after 17:52 today



>From: Robert Mullenax <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200012190130.eBJ1Uqo23841

Robert,

>Even though I have data flowing and the GEMPAK decoders are writing
>output, I am getting constant write error from the McIDAS
>XCD DDS and HRS decoders.

Where were you seeing the errors?

>Data stopped being decoded at 17:52
>today (I know there was a McIDAS feed problem)

Yes, but that was earlier, and the XCD decoders do not work with the
Unidata-Wisconsin image products; the ldm-mcidas decoders do.

>and now
>I have deleted the queue, made a new one, checked for permission
>problems, stopped and restarted the LDM, but no dice..still continuous
>write errors.

It would have been nice to see a sample of those errors.

>This is on the same disk that the GEMPAK data is being
>written to and there were no changes at all to the system.

Weird.

>It just stopped working.  This is on our Sparc system
>psnldm.nsbf.nasa.gov.
>
>Help!!?

More below.

>From address@hidden Mon Dec 18 18:38:51 2000

>Okay after doing an ldm stop again ( third time) and an ldm clean
>it is working now.  The question is what happened in the first place..

That was what I was going to ask.

>I saw this the other day on the x86 system in New Mexico.  I remade the
>queue and that fixed it. The SPARC is running McIDAS-X 7.6/ldm-5.1.2/
>Solaris 7 and the x86 Solaris 8 with the same Unidata versions.

The XCD decoders should not care about the LDM queue.  The sequence of
events is:

o LDM gets products from upstream sites
o pqact sends products to either ingebin.k or ingetext.k depending on
  what kind of products we are talking about (binary/HRS or textual/DDS)
o ingebin.k and ingetext.k write the products they get from pqact to
  a spool: ingetext.k to the daily .XCD file; ingebin.k to HRS.SPL
o the XCD data monitors work their way through the spool to decode
  products into McIDAS files

The write error would have to come from ingetext.k and/or ingebin.k
having execute problems or not being able to write to their respective
spool files.  Are you sure that no changes were made to the McIDAS
binaries during this process?

Tom

>From address@hidden Mon Dec 18 19:38:37 2000

Sorry, Tom I did not give you much to work on.  I am in a slight
panic mode trying to get things ready for Australia..(this working
two jobs thing can get hectic).  Here is what I saw this evening
in ldmd.log after I noticed the errors, remade the queue, and
stopped and restarted the LDM.  Later on I got HDS errors as well:


Dec 19 01:00:00 psnldm 140.172.240.73[3163]: run_requester: 20001218233000.988 TS_ENDT {{HDS,  ".*"}}
Dec 19 01:00:00 psnldm cirrus[3165]: run_requester: Starting Up: cirrus.al.noaa.gov
Dec 19 01:00:00 psnldm cirrus[3165]: run_requester: 20001218233000.999 TS_ENDT {{FSL2|IDS|DDPLUS,  ".*"},{MCIDAS,  "^pnga2area Q[01]"
Dec 19 01:00:01 psnldm pqact[3159]: child 3164 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3161 exited with status 127
Dec 19 01:00:01 psnldm 140.172.240.73[3163]: FEEDME(140.172.240.73): OK
Dec 19 01:00:01 psnldm cirrus[3165]: FEEDME(cirrus.al.noaa.gov): OK
Dec 19 01:00:01 psnldm pqact[3159]: child 3167 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3169 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3171 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3173 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3175 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3177 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3180 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3183 exited with status 127
Dec 19 01:00:01 psnldm pqact[3159]: child 3185 exited with status 127
Dec 19 01:00:02 psnldm pqact[3159]: pbuf_flush (4) write: Broken pipe
Dec 19 01:00:02 psnldm pqact[3159]: pipe_dbufput: xcd_runDDS write error
Dec 19 01:00:02 psnldm pqact[3159]: pipe_prodput: trying again
Dec 19 01:00:02 psnldm pqact[3159]: child 3187 exited with status 127
Dec 19 01:00:02 psnldm pqact[3159]: child 3189 exited with status 127
Dec 19 01:00:02 psnldm pqact[3159]: child 3191 exited with status 127
Dec 19 01:00:02 psnldm pqact[3159]: child 3193 exited with status 127
Dec 19 01:00:02 psnldm pqact[3159]: child 3195 exited with status 127
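A quick way to summarize an excerpt like the one above is to tally the exit statuses that pqact reports for its children. The following is only a sketch: the regular expression is written against the line format shown in this excerpt, and the function name `tally_child_exits` and the `SAMPLE` list are made up for illustration.

```python
import re
from collections import Counter

# Pattern written against the pqact lines in the excerpt above (an assumption
# about the log format, not taken from LDM documentation).
CHILD_EXIT = re.compile(r"pqact\[(\d+)\]: child (\d+) exited with status (\d+)")


def tally_child_exits(log_lines):
    """Count how many pqact children exited with each status."""
    counts = Counter()
    for line in log_lines:
        match = CHILD_EXIT.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


# Three lines copied from the excerpt above.
SAMPLE = [
    "Dec 19 01:00:01 psnldm pqact[3159]: child 3164 exited with status 127",
    "Dec 19 01:00:02 psnldm pqact[3159]: pbuf_flush (4) write: Broken pipe",
    "Dec 19 01:00:02 psnldm pqact[3159]: child 3187 exited with status 127",
]

if __name__ == "__main__":
    for status, count in tally_child_exits(SAMPLE).items():
        print(f"status {status}: {count} children")
```

Running it over the whole of ldmd.log would show whether every failed child reported the same status or whether several failure modes were mixed together.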

Shortly after this I did an ldmadmin stop,clean, start and
all is well now (and continues to be fine).  The strange
thing is that even though DDS is running (a ps -eaf shows
ingetext.k DDS and I have new obs) the XCD_START.LOG
in ~mcidas/workdata only shows the HRS starting.  I am
sure no changes were made to the system.. It just started
spewing errors and then stopped after doing an ldmadmin clean
after having stopped and started a couple of times.  I checked
the inge*.k binaries and they were from April 28, 2000 and
have not been messed with.  So I am really stumped.
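For reference, the two symptoms in the log excerpt (children exiting with status 127, and "Broken pipe" on write) are what one would expect when a parent pipes data to a command that never actually started. By shell convention, status 127 means the command could not be found; whether pqact uses exactly that convention for a failed exec is an assumption here. The sketch below reproduces both symptoms on a POSIX system with a deliberately nonexistent, made-up command name.

```python
import subprocess


def run_missing_decoder(command="xcd_run_hypothetical_missing DDS"):
    """Pipe data to a command that does not exist and report what happens.

    The command name is invented for this demonstration; it stands in for
    a decoder that cannot be found or executed.
    """
    proc = subprocess.Popen(
        command,
        shell=True,                 # /bin/sh reports "not found" as status 127
        stdin=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # hide the shell's "not found" message
    )
    status = proc.wait()            # the child is gone before we write anything

    write_failed = False
    try:
        proc.stdin.write(b"product bytes\n")
        proc.stdin.flush()          # no reader is left on the pipe
    except BrokenPipeError:
        write_failed = True
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass
    return status, write_failed


if __name__ == "__main__":
    status, write_failed = run_missing_decoder()
    print("exit status:", status)
    print("write hit a broken pipe:", write_failed)
```

This does not explain why a decoder that had been working would suddenly fail to start; it only shows that the two log messages fit together.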


Robert

>From address@hidden Mon Dec 18 20:21:40 2000

Tom,

I went over to wxmcidas which was fine and found it was doing the
same thing now, after I switched its feed to a host other
than psnldm.  I have found the problem.  After doing ldmadmin
stop a couple of times and clean, I did an ldmadmin ps which
said no ldm running, etc..  However look at this:
/usr/local/ldm/logs% ps -lu ldm
F S   UID   PID  PPID  C PRI NI     ADDR     SZ    WCHAN TTY      TIME CMD
8 S  1002  1671  1670  0  99 20 e17347d8    652 e1734844 ?        0:00 startxcd
8 S  1002  1501  1498  0  40 20 e172f7e0  87231 e172fa0c ?        0:03 pqbinsta
8 S  1002  1682  1671  0  40 20 e1b34140    652 e15b4bf6 ?        0:13 startxcd
8 S  1002  1673  1670  0  40 20 e17a47f8  87252 e17a4a24 ?       12:39 pqbinsta
8 S  1002  1687  1682  0  40 20 e17230a8   4669 e15a00f6 ?       782:51 dmgrid.k
8 R  1002 24678 14922  0  51 20 e19e6860    484          pts/2    0:00 tcsh
8 S  1002 18637  1682  0  40 20 e1b81158    920 e19aa6f6 ?        0:02 dmmisc.k
8 S  1002 17789  1682  0  40 20 e1eb4860    918 e126d676 ?        0:15 dmsfc.k
8 S  1002 18635  1682  0  41 20 e1b52840    902 e195fb96 ?        0:09 dmsyn.k
8 S  1002 22471  1682  0  40 20 e0dd7760    854 e19f93d6 ?        0:02 dmraob.k

I can't kill these off except by killing them one by one.
I have had trouble getting ldm-5.1.2 to stop, but have
not seen this before.
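To see which of the leftover processes hang off which parent, the PID and PPID columns of the listing above can be turned into a small tree. This is a sketch only: the `PROCS` table is transcribed by hand from that listing, and the function name `descendants` is made up for illustration.

```python
from collections import defaultdict

# (PID, PPID, CMD) transcribed from the ps -lu ldm listing above.
PROCS = [
    (1671, 1670, "startxcd"),
    (1501, 1498, "pqbinsta"),
    (1682, 1671, "startxcd"),
    (1673, 1670, "pqbinsta"),
    (1687, 1682, "dmgrid.k"),
    (24678, 14922, "tcsh"),
    (18637, 1682, "dmmisc.k"),
    (17789, 1682, "dmsfc.k"),
    (18635, 1682, "dmsyn.k"),
    (22471, 1682, "dmraob.k"),
]


def descendants(procs, root_pid):
    """Return the PIDs of all processes below root_pid, nearest first."""
    children = defaultdict(list)
    for pid, ppid, _cmd in procs:
        children[ppid].append(pid)

    found = []
    queue = list(children.get(root_pid, []))
    while queue:
        pid = queue.pop(0)
        found.append(pid)
        queue.extend(children.get(pid, []))
    return found


if __name__ == "__main__":
    names = {pid: cmd for pid, _ppid, cmd in PROCS}
    for pid in descendants(PROCS, 1671):
        print(pid, names[pid])
```

In this listing every XCD data monitor (the dm*.k processes) is a child of the second startxcd (PID 1682), which is itself a child of the first (PID 1671), so the whole XCD side forms a single chain under one process.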

Robert