
[LDM #USJ-914724]: LDM on Snow Leopard



Dave,

> Yeah, I might have been the one who originally encountered this
> problem and asked you'all for help with it.

You are (now that you reminded me -- it's truly been a while).

> Apple's lack of response
> has been disappointing.

To say the least.

> Maybe they'll be more responsive during this
> period immediately following Snow Leopard's release, while they try to
> get the bugs out of it. (Either that, or they'll be more overwhelmed
> than usual with bug reports.)

From your fingers to their eyes.  :-)

I suspect that there just aren't that many programs running on Mac OS X 10.4 
(or higher) systems that use fcntl() file-locking and mmap() memory-mapping as 
much as the LDM.
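
For readers who haven't seen the two calls used together, here is a minimal, generic sketch in Python (not the LDM's own code, which is written in C). It takes an exclusive fcntl() byte-range lock on a file and then modifies that range through an mmap() mapping. The function name `locked_mmap_write` and its parameters are made up for illustration, and the file is assumed to already exist and to be large enough to hold the write.

```python
import fcntl
import mmap
import os


def locked_mmap_write(path, offset, data):
    """Write `data` at `offset` through a shared memory mapping while
    holding an exclusive fcntl() lock on that byte range.

    Hypothetical helper for illustration only; assumes the file exists
    and that offset + len(data) does not exceed its size.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Block until the exclusive lock on [offset, offset + len(data))
        # is granted (lockf() is implemented with fcntl() locking).
        fcntl.lockf(fd, fcntl.LOCK_EX, len(data), offset, os.SEEK_SET)
        try:
            # Length 0 maps the whole file; the default mapping on Unix
            # is shared and read/write.
            with mmap.mmap(fd, 0) as region:
                region[offset:offset + len(data)] = data
                region.flush()
        finally:
            # Release the byte-range lock before closing the descriptor.
            fcntl.lockf(fd, fcntl.LOCK_UN, len(data), offset, os.SEEK_SET)
    finally:
        os.close(fd)
```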

> Unidata's "known bug" entry for this problem notes that there is no
> workaround. That's true in a sense, but if it were strictly true then
> I'd never be able to stop the LDM at all, even when processes hang
> (which eventually some of them do on a semi-regular basis). 

I think a hung downstream LDM will, nevertheless, terminate upon reception of a 
SIGTERM, which is what the top-level LDM server sends all child processes when 
it's told to terminate.  I could be wrong, however.  One thing I have noticed 
is that attaching to the hung process with gdb(1) and then exiting gdb(1) will 
free the process from its hung state.  I'm at a loss to understand how that 
happens without intervention by the operating system.
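
To make the SIGTERM case concrete, here is a small sketch of a parent sending SIGTERM to one of its own child processes and then waiting a bounded time to see whether the child exits. It is only an illustration in Python, not the LDM server's code; the name `terminate_and_wait` and the 60-second default are assumptions.

```python
import signal
import subprocess


def terminate_and_wait(proc, timeout=60.0):
    """Send SIGTERM to a child process and report whether it exited.

    `proc` is a subprocess.Popen object for a child of the caller.
    Returns True if the child exited within `timeout` seconds and
    False if it is still running (i.e., apparently hung).
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False
```

A False return would correspond to the case where the process ignores or never gets to handle the signal, which is where something stronger (or the gdb(1) trick) would come in.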

> I've
> written scripts that try to deal with the inability to run "ldmadmin
> stop" to stop the LDM; maybe you could comment on whether or not I've
> got the bases covered acceptably:
> 
> (1) Run "ldmadmin stop", redirecting the output to a file.
> 
> (2) Check that file for the word "isn't" (as in "the LDM isn't
> running", or something like that).
> 
> (3a) If the LDM isn't running, check to see if there's a ldmd.pid
> file.
> -- If there's no ldmd.pid file, run pqcat & pqcheck, then run
> "ldmadmin clean".
> 
> (3b) If the LDM is running, wait 30 seconds to give it a chance to
> shut down. (This is typically doomed, at least for some rpc processes.)

Maybe you should give it a full minute (60 seconds) rather than 30 seconds.

> (4) Get a list of pids for rpc processes owned by the ldm account.
> 
> (5) If this list isn't empty:
> --  Run "kill -9" on them all.
> --  Run "ldmadmin clean".
> --  Run pqcat and pqcheck.  If this doesn't produce a "tallied
> consistent" message, run "ldmadmin delqueue" and "ldmadmin mkqueue".
> 
> (6) Run "ldmadmin start".

This procedure should result in a restarted LDM.  I just wish it weren't 
necessary.
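
For the archive, the steps above can be sketched roughly as follows. This is a Python outline of the logic as described in the list, not the attached script, which isn't reproduced here. Several things are assumptions: the pid-file path is passed in by the caller, `list_rpc_pids` is a hypothetical caller-supplied function that returns the pids of the ldm account's rpc processes, pqcat and pqcheck are run without options (real invocations may well need some), and the wait defaults to 60 seconds.

```python
import os
import signal
import subprocess
import time


def run(cmd):
    """Run a command and return its combined stdout/stderr as text."""
    done = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT)
    return done.stdout.decode(errors="replace")


def queue_is_consistent(run_cmd):
    """Run pqcat and pqcheck; look for the "tallied consistent" message."""
    out = run_cmd(["pqcat"]) + run_cmd(["pqcheck"])
    return "tallied consistent" in out


def restart_ldm(pid_file, list_rpc_pids, run_cmd=run, kill=os.kill,
                sleep=time.sleep, grace=60):
    """Follow steps (1)-(6); the arguments are hypothetical hooks."""
    out = run_cmd(["ldmadmin", "stop"])            # (1)
    if "isn't" in out:                             # (2) LDM isn't running
        if not os.path.exists(pid_file):           # (3a)
            queue_is_consistent(run_cmd)
            run_cmd(["ldmadmin", "clean"])
    else:
        sleep(grace)                               # (3b) let it shut down
    pids = list_rpc_pids()                         # (4)
    if pids:                                       # (5)
        for pid in pids:
            try:
                kill(pid, signal.SIGKILL)          # same as "kill -9"
            except ProcessLookupError:
                pass                               # already gone
        run_cmd(["ldmadmin", "clean"])
        if not queue_is_consistent(run_cmd):
            run_cmd(["ldmadmin", "delqueue"])
            run_cmd(["ldmadmin", "mkqueue"])
    run_cmd(["ldmadmin", "start"])                 # (6)
```

Passing `run_cmd`, `kill`, and `sleep` as arguments is only there so the logic can be exercised without a real LDM installation.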

> (Script attached.)
> 
> -- Dave

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: USJ-914724
Department: Support LDM
Priority: Normal
Status: Closed