
20030612: data feed problems (cont.)



>From: Unidata Support <address@hidden>
>Organization: UCAR/Unidata
>Keywords: 200306111400.h5BE0HLd016841 IDD

Hi Adam (with CC to Chance),

I logged onto tornado this morning and upgraded it to use the
latest LDM available, LDM-6.0.13.  I also tuned its ~ldm/etc/ldmd.*
files to split feed requests and add some documentation.  Here is
a blow-by-blow of what I did:

<login as 'ldm'>
cd ~ldm

ftp ftp.unidata.ucar.edu
  <user> anonymous
  <pass> address@hidden
  cd pub/ldm
  binary
  get ldm-6.0.13.tar.Z
  quit

- Check to see if LDMHOME was set; it wasn't AND even though the default
  SHELL for 'ldm' is set to be csh, there is no .cshrc file.  I created
  .cshrc and populated it with:

umask 002

setenv LDMHOME /home/ldm

- Then I made those settings active:

source .cshrc

- Then on with the build:

cd ldm-6.0.13

./configure
make
make install

su
make install_setuids
exit

- Then, I adjusted the settings in the LDM-6.0.13 version of ldmadmin to
  match what you already have set up:

1 GB queue
10 log files

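For reference, those two settings live as Perl variables near the top of
the ldmadmin script itself.  The variable names below are from memory
($pq_size and $numlogs in the 6.0.x series, if I recall correctly), so
double-check them against your copy:

```perl
$pq_size = "1G";   # size of the product queue (1 GB)
$numlogs = 10;     # number of rotated ldmd.log files to keep
```
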
- There was no need to set $hostname in ldmadmin since 'uname -n' returns
  your fully qualified hostname.

After getting the LDM-6 ready to run, I next tuned your ~ldm/etc/ldmd.conf
entries.  They are now basically:

###############################################################################
#
# LDM5 servers request data from Data Sources
#
#       request <feedset> <pattern> <hostname pattern>
#
#request        WMO ".*" uni0.unidata.ucar.edu
#request        FNEXRAD ".*"    130.39.188.204
#request        UNIDATA|FSL|FNEXRAD     ".*"    129.15.192.81
#request        NLDN    "."     169.226.43.58
#request        NEXRAD  "/p...(SHV|JAN|LZK|LCH|POS|FWS|LIX)"    129.15.192.81
#request        NOGAPS  "^US058GMET-GR1mdl.0058_0240.*" 152.80.61.203

#
# History:  20030612 - split feed requests to decrease latency
#                      request all feeds by fully qualified host names
#
# Unidata-Wisconsin images, FSL wind profiler, NEXRAD floaters and composites
#
request UNIWISC|FSL2|FNEXRAD    ".*" stokes.metr.ou.edu

#
# Global observational data
#
request IDS|DDPLUS      ".*" stokes.metr.ou.edu

#
# NOAAPORT model output
#
request HDS     ".*" stokes.metr.ou.edu

#
# All Level III products from select NEXRADs
#
request NNEXRAD "/p...(SHV|JAN|LZK|LCH|POS|FWS|LIX)"    stokes.metr.ou.edu

#
# NLDN lightning data from SUNY Albany
#
request NLDN    "."     striker.atmos.albany.edu

#
# NOGAPS data from FNMOC
#
request NOGAPS  "^US058GMET-GR1mdl.0058_0240.*" usgodae3.fnoc.navy.mil


Notice the following:

1) all requests are made to fully qualified hostnames, not IP addresses.
   This was done so that the real time statistics reporting can compute
   differential latencies from your machine to its upstream feed
   host(s).

2) I split up the compound request UNIDATA|FSL|FNEXRAD into several
   separate requests (LDM-6 does not accumulate requests to an upstream
   host into a single rpc.ldmd; this is by design and a _good_ thing).
   In doing the split, particularly notice that I made a single request
   for HDS.  More on this below.

3) (minor) I added some documentation to make the file easier to read.


Since you are running ldmfail, I modified all of the ldmd.* files in
~ldm/etc.  They all now read more or less the same as the listing
above.

I noticed that your ~ldm/logs directory was full of .stats files.
I took a look at your crontab entries for 'ldm' and added the
appropriate entry to report the pqbinstats logs (the ~ldm/logs/*.stats
files) back to the UPC.  This entry also scours those .stats files,
so no more than 24 of them will remain on disk at any one time.  I
also added some documentation to your crontab entries (OK, I am anal
about things like documentation :-).
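The scour/log-rotation entries follow the standard pattern from the LDM
documentation; roughly this (the times below are illustrative, and I am
leaving out the exact pqbinstats-reporting command since that entry is
site-specific):

```
# rotate the LDM log files once a day; ldmadmin keeps 10 of them
0 0 * * * bin/ldmadmin newlog
# scour old data files according to ~ldm/etc/scour.conf
0 1,4,7,10,13,16,19,22 * * * bin/ldmadmin scour
```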

After making all of the changes to the ldmd.conf files, I stopped and
restarted your LDM:

ldmadmin stop
<waited for all LDM rpc.ldmd processes to exit>

<check the queue to make sure it is OK>

pqcat -s > /dev/null

Seeing that the queue was OK, I started the new LDM:

cd ~ldm
rm runtime
ln -s ldm-6.0.13 runtime
ldmadmin start

tornado is now running LDM-6 and reporting real time LDM-6 stats back
to Unidata (it was reporting back LDM-5 stats previously).  You can
take a look at:

Real Time Stats homepage:
http://www.unidata.ucar.edu/staff/chiz/rtstats

  Statistics by Host
  http://www.unidata.ucar.edu/staff/chiz/rtstats/siteindex.shtml
  
  tornado.geos.ulm.edu [ 6.0.13 ] 
  
http://www.unidata.ucar.edu/staff/chiz/rtstats/siteindex.shtml?tornado.geos.ulm.edu

From the last page, you will see what feeds you are receiving (as
opposed to the list of feeds you are requesting) laid out in a table
whose entries are links to time series plots of things like latency,
log(latency), volume, # of products, and topology.

A quick look at latency plots for the various feeds pinpoints the data
reception problems you are having on tornado:  your original request
line for UNIDATA|FSL|FNEXRAD is actually a request for the following:

UNIWISC|HDS|IDS|DDPLUS|FSL2|FNEXRAD

and the latencies for the HDS feed are very high.  Also, I notice that
you are not getting any FNEXRAD data from seistan, so I am thinking
that they don't have it to relay in the first place.

After splitting the HDS ingest off from the rest of the ingests, the
latency for all other feeds rapidly fell to zero.  The latencies for
the HDS feed have remained unusually high, which may indicate one or
both of two things:

- your internet connection to LSU (seistan) is not nearly as good as
  you might think

- seistan is having a problem in getting the HDS data itself

Since I had 'root' login, I decided to transfer over a program named
'mtr' into /usr/sbin.  'mtr' (Matt's TraceRoute) is a nifty tool for
showing the connectivity from your machine to any upstream host.  For
instance:

<as root>
/usr/sbin/mtr seistan.srcc.lsu.edu

'mtr' runs continuously, so it shows the connection over a period of
time (unlike traceroute, which is a one-shot peek).  What 'mtr' does not
show, however, is how "big" the pipe is.  It shows that the connection
from ULM to LSU is electronically "near" (latencies are small), but it
does not show how well large products (files, etc.) could be moved
between the two.

So, what's the point, you may ask?  Our observation is that you are now
able to receive all feeds except HDS with little latency from upstream
IDD hosts.  The HDS feed from seistan is, however, a big problem that
must be investigated.  My initial thought is that there is some sort of
a firewall/packet limiting issue involved either at ULM or LSU.

As a first test, I added a request line to your current ldmd.conf
file for HDS data from emo.unidata.ucar.edu.

#
# NOAAPORT model output
#
request HDS     ".*" seistan.srcc.lsu.edu ALTERNATE
request HDS     ".*" emo.unidata.ucar.edu PRIMARY

PRIMARY in LDM-6 means that the upstream host will send all requested
data to the requestor without asking.  ALTERNATE means that the upstream
host will first ask the downstream host whether it wants each product,
and send the product only if the answer is yes.
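
To make the distinction concrete, here is a toy sketch in Python (this
is not LDM code; the function and names are made up for illustration)
of how the two modes differ:

```python
# Toy model (NOT actual LDM code) of PRIMARY vs ALTERNATE transfer modes.
# PRIMARY: the upstream sends every requested product unconditionally.
# ALTERNATE: the upstream first asks the downstream whether it wants each
# product (an extra round trip, but duplicates can be declined).

def deliver(products, mode, downstream_wants):
    """Return the list of products actually transferred downstream."""
    sent = []
    for p in products:
        if mode == "PRIMARY":
            sent.append(p)               # send without asking
        elif mode == "ALTERNATE":
            if downstream_wants(p):      # ask first; send only on "yes"
                sent.append(p)
    return sent

# Suppose the downstream already has product "b" (say, via another feed),
# so in ALTERNATE mode it declines the duplicate.
have = {"b"}
wants = lambda p: p not in have

print(deliver(["a", "b", "c"], "PRIMARY", wants))    # ['a', 'b', 'c']
print(deliver(["a", "b", "c"], "ALTERNATE", wants))  # ['a', 'c']
```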

This test should tell us if tornado is able to receive the HDS data
rapidly enough to drop their latencies down to zero, or if the
network connection at ULM is a bottleneck.  If the latencies do drop
to zero, it will mean that the HDS problem lies entirely at LSU.

-- after tornado has had a chance to ingest data from emo.unidata.ucar.edu --

The latencies for the HDS feed are _not_ dropping to zero as I would
expect _if_ the network pipe into ULM were "big".  I suspect,
therefore, that there is some limiting being done on your connection
to the internet.

More investigation is needed...

Tom