"ldmadmin scour" takes too long
If that is inconclusive, then start the LDM in interactive mode by explicitly telling it to log to the standard error stream via the -l option.
Execute the command
ldmd -l- [-v]
This will prevent the LDM from daemonizing itself, and it will log directly to the terminal. The running LDM can be stopped by typing ^C (control-C).
ALLOW entry for localhost and 127.0.0.1
ulogger This is a test 2>/dev/null
tail -1 ~/var/logs/ldmd.log
If the above prints "This is a test", then LDM logging works.
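The logging test above can be scripted. The following is a minimal sketch, assuming the log file is ~/var/logs/ldmd.log as in the commands above; last_line_contains is a hypothetical helper, not part of the LDM package. Using a unique marker avoids matching an older "This is a test" entry.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM package).

# last_line_contains FILE TEXT
# Succeeds if the final line of FILE contains TEXT as a literal string.
last_line_contains() {
    tail -n 1 "$1" | grep -qF -- "$2"
}

# Example use, with a marker that is unique to this shell process:
#   marker="logging-test-$$"
#   ulogger "$marker" 2>/dev/null
#   last_line_contains ~/var/logs/ldmd.log "$marker" && echo "LDM logging works"
```

If the system logging daemon buffers its output, the message may take a moment to appear, so a short pause before the check may be needed.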
ls -l ~/var/logs/ldmd.log
If it's not, then make it so:
sudo chown ldm ~/var/logs/ldmd.log
chmod u+w ~/var/logs/ldmd.log
df ~/var/logs/ldmd.log
If the disk partition is full, then free up space by removing files that are no longer needed.
"~/bin/refresh_logging" doesn't exist or simply executes the utility hupsyslog(1), then
ps -ef | grep syslog
If it's not running, then start it.
grep local /etc/*syslog.conf | grep ldm
hupsyslog(1) is owned by root and setuid:
ls -l ~/bin/hupsyslog
"make root-actions" in the top-level LDM source-directory.
ulogger command above.
getenforce
To change from enforcing mode to permissive mode, execute, as root, the command
setenforce permissive
To disable SELinux, edit the file /etc/selinux/config and set the variable SELINUX to disabled. Then, reboot the system.
~/bin directory does not have the nosuid attribute:
dev=`df ~/bin | tail -1 | awk '{print $1}'`
mount | grep $dev | grep nosuid
If the nosuid attribute is enabled, then hupsyslog will not work. Either that attribute must be disabled or the LDM package must be re-installed on a disk partition that has that attribute disabled.
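The setuid and nosuid checks in the steps above can be wrapped in two small functions. This is a sketch only: has_nosuid and is_setuid_root are hypothetical helpers, not part of the LDM package, and the usage comment assumes a Linux host where findmnt(8) from util-linux is available. Matching on the whole option name avoids the false positives that a plain grep for "nosuid" on mount(8) output can give.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the LDM package).

# has_nosuid OPTIONS
# Succeeds if the comma-separated mount-option string contains "nosuid".
has_nosuid() {
    case ",$1," in
        *,nosuid,*) return 0 ;;
        *)          return 1 ;;
    esac
}

# is_setuid_root FILE
# Succeeds if FILE has the setuid bit set and is owned by UID 0.
is_setuid_root() {
    [ -u "$1" ] || return 1
    # -n prints the numeric owner in field 3; -L follows a symbolic link.
    owner=$(ls -lnL "$1" | awk '{print $3}')
    [ "$owner" = "0" ]
}

# Example use (assumes findmnt is installed):
#   opts=$(findmnt -n -o OPTIONS --target "$HOME/bin")
#   has_nosuid "$opts" && echo "~/bin is on a nosuid filesystem"
#   is_setuid_root "$HOME/bin/hupsyslog" || echo "hupsyslog is not setuid root"
```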
notifyme -v [-f feedtype] [-p pattern] -o 9999999
where feedtype and pattern are a feed specification and extended regular expression, respectively, that match the missing data-products.
If this command indicates that the LDM is unavailable, then start it; otherwise, continue.
notifyme -v [-f feedtype] [-p pattern] -o 9999999 -h host
where feedtype and pattern are as before and host is the hostname of the upstream LDM system (you can get this from the relevant REQUEST entries in the LDM configuration-file).
If the notifyme(1) command indicates that
telnet host 388
Contact your network administrator and show them the telnet(1) command.
ALLOW entry for the downstream LDM in its configuration-file. This could be because the upstream LDM allows the downstream LDM by name but that name can't be determined by the upstream LDM by performing a reverse-DNS lookup on the downstream host's IP address. A reverse-DNS lookup can be verified via the command
dig -x IP_address
or
nslookup IP_address
where IP_address is the IP address of the downstream LDM's host.
If the reverse-DNS lookup fails and the upstream LDM allows by name, then you should contact your network administrator and show them the result of the above command. If the reverse-DNS lookup succeeds, then the upstream LDM likely doesn't have an ALLOW entry for the downstream LDM and you should contact the upstream LDM user.
REQUEST entry in the downstream LDM's configuration-file (Does it exist? Is it correct?); or
ALLOW entry in the upstream LDM's configuration-file to prevent the downstream LDM from receiving the data-products. You'll need to contact the upstream LDM user.
REQUEST entry to a host whose LDM is receiving the data-products or execute this section on the upstream system.
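To make the reverse-DNS check described earlier more concrete: a reverse lookup is a PTR query for a name built from the address with its octets reversed. The sketch below shows that construction for IPv4; reverse_name is a hypothetical helper and 192.0.2.10 is a documentation-only placeholder address, not a real LDM host.

```shell
#!/bin/sh
# Hypothetical helper (not part of the LDM package).

# reverse_name IPV4
# Print the in-addr.arpa name that a reverse-DNS (PTR) query uses for IPV4.
# Prints nothing if the argument does not have four dot-separated fields.
reverse_name() {
    echo "$1" | awk -F. 'NF == 4 { print $4 "." $3 "." $2 "." $1 ".in-addr.arpa" }'
}

# Example use (requires dig and network access):
#   dig +short -x 192.0.2.10
#   dig +short PTR "$(reverse_name 192.0.2.10)"   # the same query, spelled out
```

An empty answer from either dig command means that no PTR record was returned for that address, which is the situation in which an upstream ALLOW entry by name cannot match.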
SIGKILL with extreme prejudice.
It turns out, that's exactly what happened. Only, it wasn't the superuser per se, but the out-of-memory manager acting on behalf of the superuser. The smoking gun is an entry in the system log file from the out-of-memory manager about terminating the LDM process around the time that it disappears.
The current workaround is to tell the out-of-memory (OOM) manager that the LDM processes are important by assigning the LDM process-group a particular "score". LDM user Daryl Herzmann explains:
So there is a means to set a "score" on each Linux process to inform the OOM killer about how it should prioritize the killing. For RHEL/CentOS 6 and 7, this can be done by `echo -1000 > /proc/$PID/oom_score_adj`. For some other Linux flavours, the score should be -17 and the proc file is oom_adj. Google is your friend!
A simple crontab(1) entry like so will set this value for ldmd automatically each hour.
1 * * * * root pgrep -f "ldmd" | while read PID; do echo -1000 > /proc/$PID/oom_score_adj; done
Of course, this solution would have a small window of time between an LDM restart and the top of the next hour whereby the score would not be set. There are likely more robust solutions here I am blissfully ignorant of.
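One way to narrow the window mentioned above is to run the same adjustment by hand, as root, right after restarting the LDM. The following is a sketch under stated assumptions: set_oom_score_adj and protect_ldmd are hypothetical helpers, not part of the LDM package, and the PROC_ROOT variable exists only so that the function can be tried against a scratch directory instead of the real /proc.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the LDM package).

# set_oom_score_adj PID [SCORE]
# Write SCORE (default -1000) to the oom_score_adj file of process PID.
# PROC_ROOT is an assumption added for testing; it defaults to /proc.
set_oom_score_adj() {
    pid=$1
    score=${2:--1000}
    file="${PROC_ROOT:-/proc}/$pid/oom_score_adj"
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    printf '%s\n' "$score" > "$file"
}

# protect_ldmd
# Apply the default score to every process whose name is exactly "ldmd".
protect_ldmd() {
    pgrep -x ldmd | while read -r pid; do
        set_oom_score_adj "$pid"
    done
}

# Example use, as root, after the LDM has been started:
#   protect_ldmd
```

Lowering a score below its current value normally requires root privileges, which is why the crontab entry above runs as root.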
The OOM killer can be completely disabled with the following command. This is not recommended for production environments, because if an out-of-memory condition does present itself, there could be unexpected behavior depending on the available system resources and configuration. This unexpected behavior could be anything from a kernel panic to a hang depending on the resources available to the kernel at the time of the OOM condition.
sysctl vm.overcommit_memory=2
echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
"ldmadmin scour" takes too long
Please, I don't wish to start a war regarding which filesystem is the best here... If you have used XFS (now the default filesystem in RHEL7) in the past, you may have suffered from very poor performance with IO related to small files. For me and LDM, this would rear its very ugly head when I wished to `ldmadmin scour` the /data/ folder. It would take 4+ hours to scour out a day's worth of NEXRAD III files. If you looked at output like sysstat, you would see the process at 100% iowait. I created a thread about this on the redhat community forums[1] and was kindly responded to by one of the XFS developers, Eric Sandeen. He wrote the following:
This is because your xfs filesystem does not store the filetype in the directory, and so every inode in the tree must be stat'd (read) to determine the filetype when you use the "-type f" qualifier. This is much slower than just reading directory information. In RHEL7.3, mkfs.xfs will enable filetypes by default. You can do so today with "mkfs.xfs -n ftype=1".
So what he is saying is that you have to reformat your filesystem to take advantage of this setting. So I did some testing and now `ldmadmin scour` takes only 4 minutes to traverse the NEXRAD III directory tree!