
Re: 20040630: 20040630: potential LDM/pqact problem on OSF/1



David,

>Date: Wed, 30 Jun 2004 10:08:00 -0700
>From: David Ovens <address@hidden>
>Organization: University of Washington
>To: Steve Emmerson <address@hidden>
>Subject: Re: 20040630: 20040630: potential LDM/pqact problem on OSF/1
>Keywords: 200406241954.i5OJsCWb010248 LDM PIPE Perl

The above message contained the following:

> I am looking in ~ldm/logs/ldmd.log*, I hope these are the correct
> files.

They are.

> Anyhow, these files seem to be written to by two machines
> sunny and glacier.

TWO machines.  That's rather odd.

> I am noticing the problems on glacier.  Here are
> the glacier entries surrounding a failure that occurred with a 1554
> radar file for today:
> 
> file sizes (.1 is from PERL, .2 is from Bourne-shell):
> -rw-r--r--   1 ldm      ldm      1088055 Jun 30 09:02 n0r_20040630_1554
> -rw-r--r--   1 ldm      ldm       262144 Jun 30 09:02 n0r_20040630_1554.1
> -rw-r--r--   1 ldm      ldm      1088055 Jun 30 09:02 n0r_20040630_1554.2

Most of the following log entries don't seem to correspond to the above
files: they refer to satellite images for time "1545" rather than the
above time of "1554".  Only the png2gif.pl entry names
n0r_20040630_1554.

The log entries do indicate a problem, however.

> ldmd.log entries:
> Jun 30 16:02:02 glacier pqact[523476]: pbuf_flush (4) write: Interrupted 
> system call
> Jun 30 16:02:02 glacier pqact[523476]: pipe_dbufput: 
> -close/home/disk/ldm/local/bin/gini/zlib2gif.pl/home/glacier/ldm/nport/IMAGE/NHEM-COMP/24km/VIS/VIS_20040630_1545satz/ch1/GOES-12/VIS/200406301545/NHEM-COMP/24km
>  write error
> Jun 30 16:02:03 glacier pqact[523476]: pbuf_flush (5) write: Interrupted 
> system call
> Jun 30 16:02:03 glacier pqact[523476]: pipe_dbufput: 
> -close/home/disk/ldm/local/bin/gini/png2gif.pl/home/glacier/ldm/nport/RADAR/1km/n0r/n0r_20040630_1554
>  write error
> Jun 30 16:02:05 glacier pqact[523476]: pbuf_flush (4) write: Broken pipe
> Jun 30 16:02:05 glacier pqact[523476]: pipe_dbufput: 
> -close/home/disk/ldm/local/bin/gini/zlib2gif.pl/home/glacier/ldm/nport/IMAGE/PR-NATIONAL/8km/IR/IR_20040630_1545satz/ch1/GOES-12/IR/200406301545/PR-NATIONAL/8km
>  write error
> Jun 30 16:02:05 glacier pqact[523476]: pipe_prodput: trying again
> Jun 30 16:02:05 glacier pqact[523476]: pbuf_flush (4) write: Broken pipe
> Jun 30 16:02:05 glacier pqact[523476]: pipe_dbufput: 
> -close/home/disk/ldm/local/bin/gini/zlib2gif.pl/home/glacier/ldm/nport/IMAGE/PR-NATIONAL/8km/IR/IR_20040630_1545satz/ch1/GOES-12/IR/200406301545/PR-NATIONAL/8km
>  write error

> Is that what you were looking for?

Yup.

Apparently, the pqact(1) process is receiving a signal while it's trying
to write data to a pipe.  The signal shouldn't be a SIGCONT, because
that signal should be ignored at that time.  The signal also shouldn't
be a SIGALRM because that signal should cause a "pbuf_flush" log entry
to be made.  The signal also shouldn't be a SIGPIPE or SIGIO because
they should cause a different system error message to be logged.

I wonder what the signal is and who's sending it.

Can you put the pqact(1) process (pid 523476) into verbose logging mode
by having the LDM user send it a SIGUSR2?  For example:

    kill -USR2 523476

Then send me any new log entries that look relevant.
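
Because ldmd.log is being written by both sunny and glacier, it may help
to narrow it down to the one pqact(1) process before looking for new
entries.  A minimal sketch, assuming the "HOST pqact[PID]:" layout seen
in the quoted log entries; the function name pqact_entries is
hypothetical and not part of the LDM:

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM): print the log lines that
# one pqact process on one host has written.
# Usage: pqact_entries LOGFILE HOST PID
pqact_entries() {
    # -F matches a fixed string, so "[" and "]" are taken literally
    # instead of being parsed as a bracket expression.
    grep -F "$2 pqact[$3]:" "$1"
}
```

It could be used as something like
`pqact_entries ~ldm/logs/ldmd.log glacier 523476 | tail -n 50`
after sending the signal.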

Alternatively, can I log onto your system as the LDM user?  That would
be great.

Sending the pqact(1) process a second SIGUSR2 will put it into debug
logging mode.  That greatly increases the number of log messages, so it
should be done with caution.

A third SIGUSR2 will put the pqact(1) process back into regular logging
mode.
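
To keep track of where the process is in that cycle, the behavior
described above can be written down as a small model.  This is only a
sketch of the regular, verbose, debug sequence; it assumes the process
starts in regular mode, and mode_after is a hypothetical name that does
not touch pqact(1) at all:

```shell
#!/bin/sh
# Hypothetical model of the logging cycle: print the mode that results
# from sending N SIGUSR2 signals to a process that starts in regular
# logging mode.
# Usage: mode_after N
mode_after() {
    case $(( $1 % 3 )) in
        0) echo regular ;;
        1) echo verbose ;;
        2) echo debug ;;
    esac
}
```

So after one signal the model reports verbose mode, and two further
signals bring it back to regular.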

Regards,
Steve Emmerson