
Re: the saga continues



On Mon, 19 Aug 2002, Benjamin Cotton wrote:

> Anne,
>  
> I made the changes to ldmd.conf, ldmadmin and pqact.conf that you
> suggested.  I have logs now, but still no satellite data.  The rest of
> my data is either incomplete and/or a day late.  The interesting thing
> is that the incoming data is only about 20 seconds late.  So something
> is getting lost in the shuffle.  I knew I shoulda got luggage tags.
> Haha anyway... well at least we're getting somewhere.
>  
> Ben
>  
> P.S. My horoscope for today read in part: "The world is your oyster."  I
> just can't escape those oysters...
>  
> ===================
> Benjamin J. Cotton
> LDM Administrator
> Department of Earth and Atmospheric Science,
> Purdue University
>  
> 165 Cary Quadrangle                          cell: (502) 551-5403
> West Lafayette, IN  47906         campus: (765) 49-52298
>  
> address@hidden
> www.eas.purdue.edu/~bcotton
>  
> 

Hi Ben,

I'm glad that the logging is working now.  Did you see these messages in 
the log?  

Aug 20 16:10:59 flood[16519]: run_requester: 20020820154739.805 TS_ENDT 
{{NNEXRAD|DIFAX|UNIDATA,  ".*"}}
Aug 20 16:10:59 flood[16519]: FEEDME(flood.atmos.uiuc.edu): OK
Aug 20 16:11:00 flood[16519]: pq_del_oldest: conflict on 40654520
Aug 20 16:11:00 flood[16519]: hereis: pq_insert failed: Resource 
temporarily unavailable: 68934acfb1d8e490a644914e27bbe686     8539 
20020820154900.986 NNEXRAD 030  SDUS53 KMQT 201546 /pN0RMQT
Aug 20 16:11:00 flood[16519]: Connection reset by peer
Aug 20 16:11:00 flood[16519]: Disconnect

Your connection to flood is continually being broken and reestablished, 
presumably because anvil's disk is unavailable.  This can be caused by 
a full disk, but your disk isn't full:

(anvil.eas.purdue.edu) [/project/ldm]% cd data
(anvil.eas.purdue.edu) [/project/ldm/data]% df -k .
Filesystem   1K-blocks     Used    Avail Capacity  Mounted on
/dev/ad0s1g   20202730 15325911  3260601    82%    /net/anvil

So, back to this problem in a moment.

I see that you are requesting lots of data from flood:

request DIFAX|UNIDATA|NNEXRAD ".*" flood.atmos.uiuc.edu


Do you really want or need the entire NEXRAD feed?  I see that you're 
filing only the N0R products.  If you don't need the entire feed for 
relay purposes, I strongly suggest that you request only the N0R products 
from flood, as they are a small percentage of the entire feed. 
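
For example, something like this in ldmd.conf would do it.  This is just a
sketch - the "/pN0R" pattern is based on the product IDs visible in the log
above, and how multiple request lines to the same upstream host get combined
depends on your LDM version, so check it against what you actually file:

# keep the full DIFAX and UNIDATA feeds, but only the N0R radar products
request DIFAX|UNIDATA   ".*"      flood.atmos.uiuc.edu
request NNEXRAD         "/pN0R"   flood.atmos.uiuc.edu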

I see that there are lots of 'find' processes running on anvil:

(anvil.eas.purdue.edu) [/project/ldm/etc]% ps -ax | grep find
  875  ??  D     48:19.36 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
 1795  ??  D     27:29.13 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
 6277  ??  D     69:01.29 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
 8663  ??  D    242:41.01 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
10368  ??  D     83:40.45 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
10826  ??  D    199:54.24 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
12099  ??  DN    61:37.30 find -s / ! ( -fstype ufs ) -prune -or -path 
/tmp -prune -or -path /usr/tmp -prune -or -path 
12588  ??  D    144:34.92 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
13471  ??  D    261:46.83 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
13691  ??  D    105:58.58 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
17146  ??  D    117:23.19 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -
23366  ??  DN   184:42.46 find -s / ! ( -fstype ufs ) -prune -or -path 
/tmp -prune -or -path /usr/tmp -prune -or -path 
92142  ??  D      8:20.79 find /net/anvil -xdev -type f ( -perm -u+x -or 
-perm -g+x -or -perm -o+x ) ( -perm -u+s -or -

These processes are owned by root.  I don't know how these are being 
generated.  Are they for security reasons?  Are they necessary?  Some have 
been running for many hours.  On a large, fairly full filesystem like 
/net/anvil they can be *very* disk intensive, and they could be contributing 
to the disk being unavailable. 

Until I noticed that the find processes were owned by root I thought 
perhaps the find was coming from the LDM scour program, so I 
checked the LDM crontab to see how scour was being invoked.  I found this:

(anvil.eas.purdue.edu) [/project/ldm/etc]% crontab -l | grep scour
35 * * * * /project/ldm/bin/scour_anvil > /dev/null

I looked at scour_anvil - it's running your own scour program, so I have 
no idea what that's doing.  Could that be what's invoking the 'find' 
processes above?
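
If you do decide to kill them, something like this (done as root, since the 
find processes are owned by root) should clear them out.  Just a sketch - the 
PIDs are the ones from the listing above, so re-run ps first to get the 
current ones, and note that processes stuck in disk wait ('D' state) may take 
a little while to actually exit:

ps -ax | grep find       # get the current list of find PIDs
kill 875 1795 6277 8663 10368 10826 12099 12588 13471 13691 17146 23366 92142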

Given the delays you're experiencing with the filed data, I wondered if 
pqact was keeping up. So, I put it in verbose mode, and grabbed this out 
of the log:

Aug 20 17:57:21 pqact[16518]:      149 20020820140009.143     HDS 121  
NXUS65 KPSR 201359 /pGSMIWA
Aug 20 17:57:21 pqact[16518]:     7332 20020820140009.152 NNEXRAD 122  
SDUS25 KABQ 201357 /pN1SABX
Aug 20 17:57:21 pqact[16518]:     6158 20020820140009.154 NNEXRAD 123  
SDUS23 KGRB 201352 /pN2RGRB
Aug 20 17:57:21 pqact[16518]:    18639 20020820140009.171 NNEXRAD 124  
SDUS54 KFWD 201351 /pNCRGRK
Aug 20 17:57:21 pqact[16518]:     1625 20020820140009.311     HDS 130  
SDUS83 KMKX 201351 /pDPAMKX
Aug 20 17:57:21 pqact[16518]:     5429 20020820140009.186 NNEXRAD 125  
SDUS33 KMKX 201351 /pN3RMKX
Aug 20 17:57:21 pqact[16518]:     8470 20020820140009.188 NNEXRAD 126  
SDUS24 KAMA 201358 /pN1RAMA
Aug 20 17:57:21 pqact[16518]:     5685 20020820140009.203 NNEXRAD 127  
SDUS75 KABQ 201357 /pN1VABX
Aug 20 17:57:21 pqact[16518]:     9936 20020820140009.205 NNEXRAD 128  
SDUS56 KSGX 201357 /pN0RNKX

This shows us that pqact is running about 3 hours behind - it's not able 
to keep up with the volume of data.  Perhaps killing the 'find' processes 
will free up enough disk bandwidth for pqact to catch up.  
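
In case you want to watch this yourself: I bumped pqact's logging level by 
sending it SIGUSR2.  Each SIGUSR2 sent to an LDM process should step its 
logging level up one notch (normal, then verbose, then debug, then back to 
normal).  The PID below, 16518, is the pqact PID from the log lines above.

kill -USR2 16518     # first one: normal -> verbose
kill -USR2 16518     # second one: verbose -> debug (a third cycles back)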


I also think that your 300MB queue is too small for what you are 
requesting.  If you look at the data volumes in your workshop binder, 
you'll see that on that day the NEXRAD max was 150MB.  (Over the past 
24 hours the NEXRAD max was 206MB.)  Let's just look at the math based 
on the volumes in the workshop notebook (MB per hour):

NEXRAD          150
HDS             141
IDS|DDPLUS        5.5
DIFAX             7
MCIDAS            7

Those add up to roughly 310MB per hour, so you would be unable to keep even 
an hour's worth of data in a 300MB queue.

However, pqmon is reporting that you have several hours' worth of data in 
your queue (see the "age" column, which is the age of the oldest product in 
the queue, in seconds): 


(anvil.eas.purdue.edu) [/project/ldm/data]% pqmon -i2
Aug 20 17:15:47 pqmon: Starting Up (29021)
Aug 20 17:15:47 pqmon: nprods nfree  nempty      nbytes  maxprods  maxfree  minempty    maxext  age
Aug 20 17:15:47 pqmon:  40458     1   32783   299996608     56208        7     17033      6720 10350
Aug 20 17:15:49 pqmon:  40458     1   32783   299996608     56208        7     17033      6720 10352
Aug 20 17:15:51 pqmon:  40458     1   32783   299996608     56208        7     17033      6720 10354


I assume this is because writes to the queue aren't succeeding, based on 
the 'resource temporarily unavailable' message.

So, in summary, I recommend killing all the 'find' processes and figuring 
out where they are coming from and whether they're necessary. I also 
recommend cutting back on your NEXRAD request, or, if you must have 
it all, then using a bigger queue, at least 500MB.  Then if these don't 
solve the problem I would stop filing so much data and see if the results are 
better.  Then you can gradually add in more filing until you find the 
threshold where things start falling apart.
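
If you do go with a bigger queue, the resize looks roughly like this.  A 
sketch only - stop the LDM first, and check the queue path and the queue 
size setting your ldmadmin uses before deleting anything; I'm assuming the 
default data/ldm.pq location under the LDM home:

ldmadmin stop
ldmadmin delqueue                        # removes the existing 300MB queue
pqcreate -q /project/ldm/data/ldm.pq -s 500000000
ldmadmin start

(Or raise the queue size setting that ldmadmin uses and run 'ldmadmin 
mkqueue' instead of calling pqcreate directly.)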
 
This is a little scattered. Please let me know if you have any questions.

Anne
 -- 
***************************************************
Anne Wilson                     UCAR Unidata Program            
address@hidden                 P.O. Box 3000
                                  Boulder, CO  80307
----------------------------------------------------
Unidata WWW server       http://www.unidata.ucar.edu/
****************************************************