
[LDM #HCO-891524]: Unidata Performance Question



Mike,

Yes, if you have to kill LDM processes like that then you should execute the 
command "ldmadmin clean" *and* re-create the product-queue because it's likely 
been corrupted.
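
As a rough sketch only, that recovery could be scripted along the following 
lines. It assumes ldmadmin is on the LDM user's PATH and that losing whatever 
is still in the queue is acceptable; the function name recover_ldm and the 
LDMADMIN override (for dry runs) are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reset the LDM after its processes had to be killed.
# Set LDMADMIN=echo for a dry run that only prints the subcommands.
recover_ldm() {
    cmd=${LDMADMIN:-ldmadmin}
    "$cmd" stop || true          # may fail if nothing is running; ignore
    "$cmd" clean    || return 1  # clean up after the abnormal termination
    "$cmd" delqueue || return 1  # delete the (likely corrupt) product-queue
    "$cmd" mkqueue  || return 1  # create a fresh, empty product-queue
    "$cmd" start
}
```

Calling recover_ldm with LDMADMIN unset would run the real commands, so a dry 
run first is probably wise.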

> Steve,  Thanks for looking at this head scratcher.
> 
> I suspect I corrupted the product queue when I killed the pqinsert that had 
> been running for over 24 hours.  When I did a top I had the single process 
> that had accumulated over 24 hours of cpu time.  I tried sending a TERM 
> signal to the process.  I waited a couple of minutes and when that didn’t 
> take I issued a KILL to shut it down.  I hoped that after that the other 
> inserts would start running but it never happened.  I then tried to shut down 
> the ldmd through ldmadmin but it never terminated.  I started to then 
> systematically kill all the pqinserts still waiting.  Once I got them all 
> killed ldmd shut down successfully.  After that I tried to run the pqcheck 
> and that’s when I had to wait over 40 minutes for a check that never 
> finished.  In retrospect reading the instructions again I think I should have 
> run the clean option through ldmadmin.
> 
> The ldm system is running via a start command.  Whether it starts while the 
> perl script executes is a good question.  There is nothing to prevent the 
> ldmadmin start from occurring while the scripts are running.  The perl 
> scripts run on a cron 
> to pull data from external gps receivers.  I guess that is something to 
> consider.
> 
> During this instance the ldm was up and running before the cron jobs were 
> started and this product queue had been populating for over a week before we 
> hit this snag.
> 
> There shouldn’t be any problems with power on the servers.  They’re all UPS 
> protected, with a generator as a secondary electrical source to regular 
> utilities.
> 
> This seemed really strange that the pqinsert got stuck on a single file 
> trying to insert it.  As I say I don’t have any good theories on what may 
> have occurred, other than to say I hope it’s a one-time cosmic pixie dust 
> anomaly that never happens again.
> 
> I still suspect it might be a file access error, that the pqinsert was called 
> before the file was fully written out.  I’m looking at building a more robust 
> way to see that the system is done with the file before it tries to call 
> pqinsert.  I’m looking at deeper system-level calls to see that the OS is 
> done writing out the file, rather than simply monitoring the mod time of the 
> file.
> 
> Again Thanks for lending your expertise.  At least I know I’m not missing 
> something completely obvious.
> 
> -Mike
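
On the idea of waiting until a file is complete before calling pqinsert, one 
simple heuristic (not a guarantee) is to require that the file's size stop 
changing between two samples. The sketch below assumes a POSIX shell; the 
function name wait_until_stable, its defaults, and the feedtype and path in 
the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only once FILE's size has stopped changing.
# Usage: wait_until_stable FILE [INTERVAL_SECONDS] [MAX_TRIES]
# e.g.:  wait_until_stable /path/to/file && pqinsert -f EXP /path/to/file
wait_until_stable() {
    file=$1
    interval=${2:-5}
    tries=${3:-12}
    [ -f "$file" ] || return 1
    prev=$(wc -c < "$file" | tr -d '[:space:]')
    while [ "$tries" -gt 0 ]; do
        sleep "$interval"
        [ -f "$file" ] || return 1
        cur=$(wc -c < "$file" | tr -d '[:space:]')
        # Two equal, non-zero samples in a row: treat the file as complete.
        if [ "$cur" -eq "$prev" ] && [ "$cur" -gt 0 ]; then
            return 0
        fi
        prev=$cur
        tries=$((tries - 1))
    done
    return 1   # still changing (or empty) after all tries
}
```

If the download scripts can be changed, having them write to a temporary name 
and then rename the finished file into place would likely make this check 
unnecessary, because a rename within one filesystem is atomic.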

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: HCO-891524
Department: Support LDM
Priority: Normal
Status: Closed