>From: "David B. Bukowski" <address@hidden>
>Organization: COD
>Keywords: 200504042331.j34NV1v2029887 McIDAS Xvfb

Hi David,

>Very odd Tom.  Basically, since we stopped the McIDAS processes from
>running (taking them out of ldm and pqact), the server has NOT
>crashed/hung, nor did any Xvfb windows die.  Nothing else has changed.

I would guess that the Xvfb failures were related to system resource
starvation, and that at least part of the starvation was being caused by
script-initiated McIDAS processing not releasing, upon exit, the shared
memory segments that were allocated for its use.

>On your point about the SHMEM segments.  Yes, I've noticed lots of
>SHMIDs, especially when the system hangs/crashes or Xvfb dies.  As of
>right now there are none, but when the McIDAS processes are running we
>have at least 5 constantly (owned by 'ldm' at that time), which I think
>may be the decoders.

Yes, there should be two or three depending on what McIDAS-XCD decoding
has been set up:

1 - segment created by the 'exec xcd_run MONITOR' entry in ldmd.conf
2 - segment created by IDS|DDPLUS decoding
3 - segment created by HRS decoding

On some OSes these are actually created in pairs, so the numbers could
be doubled.  Why you have 5 segments when the LDM is running is a bit of
a mystery to me, but I would tend to believe that some were created by
pqact.conf entry-initiated processing.

>but then when the system gets really hosed up i've had 2 pages of just
>shmids.  So yes, I have seen the runaway shmid case.

Sounds like this may be the root of the problem.

>Just must be total coincidence then that the Xvfb's die when McIDAS is
>running, cause now they've been up for over 24 hours without running
>those processes.

Well, I would venture to guess that the Xvfbs require a substantial
fraction of system resources, and when those resources are unavailable,
the Xvfb(s) die.
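A quick way to watch the segments accumulate is with the stock System V
IPC tools.  This is just a sketch; the 'ldm' owner name matches David's
setup, and the column positions assume the usual Linux 'ipcs' layout:

```shell
# List all System V shared memory segments: key, shmid, owner,
# permissions, size in bytes, and number of attached processes.
ipcs -m

# Count the segments owned by the 'ldm' user; a count that keeps
# growing over time suggests that exiting processes are not
# releasing their segments.
ipcs -m | awk '$3 == "ldm"' | wc -l

# A stale segment with zero attached processes can be removed by id:
#   ipcrm -m <shmid>
```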
>Any other insight would be greatly appreciated.  I hope Paul checks the
>scripts to make sure that things are being executed correctly and being
>properly cleaned up.

The only thing I can suggest is making sure that the scripts cleanly
exit the 'mcenv' invocation that created the shared memory segment.  By
the way, this is akin to GEMPAK products needing to run GPEND to get rid
of the message queue created for their communications.

>side note on system hangs: when the cpu load is up to like 200 on
>occasion, I cannot end the LDM process with 'ldmadmin stop'.  It seems
>to hang on trying to stop the McIDAS decoders; even 'kill -9' doesn't
>kill them -- they are in a D state in top/ps.  Even ipcrm won't get rid
>of most shmids.  I have tried to do a safe reboot with the 'reboot' or
>'shutdown' commands, with no success.  At these times I've found myself
>having to physically go to the machine (an hour-long drive from home at
>night) to hit the power button, which I hate doing, but it's the only
>way to recover.  On some of these occasions I've noticed a kernel panic
>too.  So from your response I'm thinking the kernel panic may possibly
>come from a runaway shmid case.

Situations where one can not kill processes with a 'kill -9' are rare
and not limited to McIDAS.  We have seen instances under Solaris SPARC
where the same thing can happen for processes that are using a lot of
system resources.  I have always considered these situations to be more
a function of the OS than of the package, since _no_ user-space
application should be allowed by the OS to interfere with system-level
functions.

>FYI.
>is there anything from the sysctl that you see I may need to change?
>I have allocated shmmax at 512 MB as per the install instructions for
>McIDAS:
>
>kernel/threads-max = 102400
>kernel/cad_pid = 1
>kernel/sem = 250 32000 32 128
>kernel/msgmnb = 16384
>kernel/msgmni = 16
>kernel/msgmax = 8192
>kernel/shmmni = 4096
>kernel/shmall = 2097152
>kernel/shmmax = 536870912

The last 8 of these settings are identical to what I have set up on a
Fedora Core 1 Linux box here in my office EXCEPT that I did not increase
shared memory to 512 MB.  Since your problem appears to be related to
shared memory starvation, I would try decreasing the shmmax value to
something like 32 MB.  This is especially the case if you do not try to
run multiple interactive McIDAS sessions.

>thanks for any help

I am curious whether you see problems when just running McIDAS-XCD
decoding (i.e., not kicking off scripts to do processing).  I could
imagine a case where the release of the shared memory is not working on
your system.  Perhaps Debian uses a slightly different call to release
shared memory (you are using Debian, aren't you)?

Cheers,

Tom
--
NOTE: All email exchanges with Unidata User Support are recorded in the
Unidata inquiry tracking system and then made publicly available through
the web.  If you do not want to have your interactions made available in
this way, you must let us know in each email you send to us.

>From address@hidden Tue Apr 5 13:58:28 2005

Ok, I'll give the shmmax variable a drop to 32 MB to see if that could
be something.  The system does have 6 GB of physical RAM and another
4 GB of swap, running kernel 2.6.4.  I need to upgrade the kernel a bit
too, since they are like on 188.8.131.52 now.  I'll let you know what I
find over the next few days, and will redirect the ipcs output to a text
file to post next time.  But yes, there are 3 pairs, 6 total, ownership
LDM.  Just a quick overview: LDM runs the processes that are executed
from pqact and the decoders, while MCIDAS runs cronjobs.
I will have to see what the ownership is again the next time the system
goes all wacko.  Thanks for the tips; will keep you posted.

-dave (yes, Debian)
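Tom's suggestion above, dropping shmmax from 512 MB to 32 MB, comes down
to one sysctl change.  A sketch (requires root; the persistent-setting
file path is the conventional one on Debian):

```shell
# 32 MB = 32 * 1024 * 1024 = 33554432 bytes.
# Apply immediately to the running kernel:
sysctl -w kernel.shmmax=33554432

# To make the change survive a reboot, add to /etc/sysctl.conf:
#   kernel.shmmax = 33554432
```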