Justin, >Date: Fri, 14 Oct 2005 08:39:44 -0400 >From: Justin Cooke <address@hidden> >Organization: NOAA/FSL >To: Steve Emmerson <address@hidden> >Subject: Re: "pbuf_flush: time elapsed" problem The above message contained the following: > >How do you know that NEXRAD2 data-products stopped being inserted into > >the product-queue around 15:21 UTC? > > > Oct 09 15:21:56 b2n1 188.8.131.52 INFO: > c3fc16cf594ab8cf9d5cfeaf5c91941d 9114 20051009152154.075 NEXRAD2 2030 > L2-BZIP2/KSFX/20051009151924/2/30 > Oct 09 15:21:56 b2n1 pqact INFO: 1147 20051009152153.496 FNEXRAD > 035 SNCN19 CWAO 091519 > Oct 09 15:21:56 b2n1 pqact INFO: file: > /dcomdev/us007003/ldmdata/obs/surface/land/20051009.land > Oct 09 15:21:56 b2n1 pqact INFO: pipe: > bin/decod_dclsfc -v 2 -t 300 -d > /dcomdev/us007003/decoder_logs/decod_dclsfc.log > /dcomdev/us007003/bufrtab.000 tables/stns/lsfc.tbl > tables/decod_WMO.Res40.headers > Oct 09 15:21:56 b2n1 184.108.40.206 INFO: > 29416a274c2aabe7a6c45009e0173dcf 8349 20051009152154.233 NEXRAD2 523005 > L2-BZIP2/KGGW/20051009152121/523/5 > > These are the last entries for the rpc.ldmd process  that is > putting NEXRAD2 products into the queue, there is nothing else in the > log for PID 1622180 after that. I expect, then, that an strace(1) of downstream-LDM process 1622180 would have shown that it's hung-up in making futex(2) FUTEX_WAIT calls that never return. > >Were data-products of feedtypes other than NEXRAD2 also being inserted > >into the product-queue? > > > Yes, we were still receiving PCWS and FSL2 feedtypes: > > Oct 09 15:22:39 b2n1 eldm4 INFO: 727508 20051009152237.671 PCWS > 000 FSL.CompressedNetCDF.MADIS.acars.20051009_1300.gz > Oct 09 15:26:46 b2n1 eldm4 INFO: 34605 20051009152616.747 FSL2 > 000 FSL.CompressedNetCDF.MADIS.coop.20051009_1500.gz That would explain the output of the pqact(1) process that indicated some data-products didn't match its selection-criteria. > As a test I have put the rpc.ldmd for NEXRAD2 into silent mode (it has > always been in verbose mode) to see if it has any effect. When we were > in version 6.0.15 putting that feed into silent mode seemed to "fix" > whatever was causing the stoppage. As far as I've been able to determine, the problem is caused by this: ... The output shows that the top-level LDM is calling the futex(2) system call using the FUTEX_WAIT operation (i.e., it's waiting on a mutex). The wait is interrupted by a SIGCONT signal, which occurs everytime a data-product is inserted into the product-queue. The thread then returns to whatever runtime function made the futex(2) call (the LDM code doesn't call that function directly) whereupon the runtime function again calls futex(2). The problem is that the futex(2) call never returns due to a FUTEX_WAKE operation on the given futex. Unfortunately, we're not currently equiped to debug the Linux futex(2) call -- which is what it would take to diagnose this problem. What version of the kernel are you running? Regards, Steve Emmerson
NOTE: All email exchanges with Unidata User Support are recorded in the Unidata inquiry tracking system and then made publicly available through the web. If you do not want to have your interactions made available in this way, you must let us know in each email you send to us.