
[LDM #XND-654237]: Can viewing ldm.pq cause queue corruption?



Justin,

> Here at NCEP we are running ldm-6.7.0 on our IBM P6 cluster supercomputer 
> (running AIX 5.2). We have been running this version of LDM for over a year 
> with no trouble, but this morning our ldm processes began to fail.
> 
> First all of our requests and downstream feeds stopped:
> Aug 04 15:26:41 c1n5 local0:info pqact[237704] INFO:    34884 
> 20110804152641.591 NEXRAD2 124004  L2-BZIP2/KTBW/20110804152630/124/4/I/V03/0
> Aug 04 15:26:41 c1n5 local0:info pqact[237704] INFO:                file: 
> -close        
> /dcom/us007003/ldmdata/obs/upperair/nexrad_level2/KTBW/KTBW_20110804_152630.bz2
> Aug 04 15:26:42 c1n5 local0:info pqing[483668] INFO:    14867 
> 20110804152637.206 FNEXRAD 023  SXUS72 KWBC 041524
> Aug 04 15:26:45 c1n5 local0:info pqing[459286] INFO:     1358 
> 20110804152640.008     HDS 617  SDUS41 KGYX 041517 /pRCMGYX
> Aug 04 15:27:37 c1n5 local0:info c3n3p.ncep.noaa.gov[913510] INFO: Connection 
> from c3n3p.ncep.noaa.gov
> Aug 04 15:27:37 c1n5 local0:info c3n3p.ncep.noaa.gov[913510] INFO: Exiting
> Aug 04 15:27:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 913510 exited 
> with status 0
> Aug 04 15:28:37 c1n5 local0:info c3n3p.ncep.noaa.gov[918520] INFO: Connection 
> from c3n3p.ncep.noaa.gov
> Aug 04 15:28:37 c1n5 local0:info c3n3p.ncep.noaa.gov[918520] INFO: Exiting
> Aug 04 15:28:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 918520 exited 
> with status 0
> Aug 04 15:29:37 c1n5 local0:info c3n3p.ncep.noaa.gov[278546] INFO: Connection 
> from c3n3p.ncep.noaa.gov
> Aug 04 15:29:37 c1n5 local0:info c3n3p.ncep.noaa.gov[278546] INFO: Exiting
> Aug 04 15:29:37 c1n5 local0:info rpc.ldmd[484050] INFO: child 278546 exited 
> with status 0

The connection from host "c3n3p" and the termination of the associated upstream 
LDM process are odd but don't indicate a problem with the LDM on host "c1n5". I 
suspect that the TCP connection is being destroyed somewhere in the network, for 
example by a firewall rule.
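
If the disconnections recur, one way to test that theory is to run notifyme(1) 
on c3n3p against c1n5 and see whether that connection is also dropped while the 
LDM on c1n5 keeps logging products. A minimal sketch, assuming the name "c1n5" 
resolves from c3n3p (substitute the real host name):

    notifyme -vl- -h c1n5 -o 3600

If the notification stream is also cut off, then the problem is almost 
certainly in the network path rather than in either LDM.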

> Aug 04 15:29:42 c1n5 local0:info 140.90.85.102[578372] INFO:    25548 
> 20110804152641.639 NEXRAD2 697049  L2-BZIP2/KOHX/20110804152305/697/49/I/V03/0
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: readtcp(): 
> EOF on socket 4
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: 
> one_svc_run(): RPC layer closed connection
> Aug 04 15:29:42 c1n5 local0:err|error 140.90.85.102[578372] ERROR: 
> Disconnecting due to LDM failure; Connection to upstream LDM closed
> Aug 04 15:29:42 c1n5 local0:notice 140.90.85.102[578372] NOTE: LDM-6 desired 
> product-class: 20110804142942.773 TS_ENDT {{NEXRAD2,  ".*"},{NONE,  
> "SIG=1ad0d1658b39396d3715d4eafd4f9f9e"}}
> ...

The termination of the downstream LDM process 578372 was due to a closed TCP 
connection to upstream host 140.90.85.102. Again, I suspect the network. 
Because the downstream LDM process will repeatedly try to reestablish the 
connection to the upstream LDM, this temporary disconnection shouldn't matter, 
provided the underlying cause was transitory.
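
Should this happen again, you can verify that the downstream LDM reconnected by 
looking for the upstream host in the LDM log and confirming that NEXRAD2 
products resumed flowing. A sketch, with the log path an assumption (use 
wherever your local0 messages land):

    grep '140.90.85.102' /var/log/ldmd.log | tail   # assumed log location
    ldmadmin watch   # shows products as they are inserted into the queue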

> Then our two pqact processes exited:
> Aug 04 15:30:00 c1n5 local0:notice pqact[790878] NOTE: Exiting
> Aug 04 15:30:00 c1n5 local0:notice pqact[237704] NOTE: Exiting
> Aug 04 15:30:00 c1n5 local0:notice pqact[237704] NOTE: Behind by 17.3867 s
> Aug 04 15:30:00 c1n5 local0:notice pqact[790878] NOTE: Behind by 17.4668 s
> Aug 04 15:30:00 c1n5 local0:info rpc.ldmd[484050] INFO: child 237704 exited 
> with status 0: pqact -f CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.craft
> Aug 04 15:30:00 c1n5 local0:info rpc.ldmd[484050] INFO: child 790878 exited 
> with status 0: pqact -f ANY-CRAFT -v -o 900 /iodprod/dbnet/ldm/etc/pqact.conf

This is very odd. The only ways a pqact(1) process can terminate with a status 
of 0 are to have reached the end of the queue after being started with the "-i 
0" option (unlikely) or to have received a TERM signal, which could only have 
been sent by the LDM user or the superuser.
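
Because both pqact(1) processes exited at exactly 15:30:00, it might be worth 
checking whether anything scheduled on c1n5 could have delivered that TERM 
signal; an "ldmadmin stop" or "ldmadmin restart", for example, would terminate 
the pqact processes with SIGTERM. A sketch of where to look (the AIX cron-log 
path is an assumption):

    crontab -l                        # as the LDM user: any job firing at :30?
    grep ldmadmin /var/adm/cron/log   # what cron actually ran around 15:30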

> Our monitoring began to flag, indicating that pqact had stopped running. We 
> logged in and attempted to stop the LDM by hand, but our cron entries to 
> restart the LDM were already trying to restart the software, so we ended up 
> giving the LDM processes a kill -9, remaking the queue, and restarting the 
> LDM. Everything has been running fine since.
> 
> Around the time of the failure we know a 'more' was done to the ldm.pq file 
> by a userid that does not have write permission to the ldm.pq file. Could 
> viewing the ldm.pq file cause queue corruption or some other situation which 
> explains why our LDM began to fail?

I'm afraid that I don't see how a read-only more(1) of the product-queue could 
cause a pqact(1) process to terminate, much less with a status of 0. This 
assumes, of course, that the operating-system behaves correctly.
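
If you want extra assurance about a queue after an incident like this, 
pqcheck(1) can be run against it without modifying it, and pqmon(1) will print 
usage statistics from a healthy queue. A sketch, with the queue path an 
assumption (use whatever your ldmadmin configuration points at):

    pqcheck -q /path/to/ldm.pq   # checks the product-queue for corruption
    pqmon -q /path/to/ldm.pq     # prints product-queue usage statistics

A clean pqcheck on the old queue would have been strong evidence that the 
more(1) left it untouched.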

> We run this version of ldm on both of our supercomputers; both receive an 
> independent feed of the same observational data. However, the failure only 
> occurred on the system where the more was performed. All of the evidence is 
> pointing to that being the cause, but before we pin the failure on it we just 
> wanted to run this by you.

Good idea. Always glad to help.

> Thanks,
> 
> Justin Cooke
> NCEP Central Operations

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: XND-654237
Department: Support LDM
Priority: Normal
Status: Closed