
20050325: ldm/pqact load question



>From:  Gerry Creager N5JXS <address@hidden>
>Organization:  Texas A&M University -- AATLT
>Keywords:  200503251531.j2PFViv2028115 LDM processing relay cluster

Hi Gerry,

>OK: I'm getting and storing all the data on bigbird.  I'm starting to 
>process, as it comes in, a lot of the MADIS data to databases, and make 
>gifs of most/all of the Level II dBz 0.5deg stuff, as well as some other 
>elevations, and products.  And I'm also starting to do some on-demand 
>gifs of the radar for another little project.

I just jumped on bigbird to take a quick look at load averages and
noticed that the age of the oldest product in your LDM queue is very
small:

 ...                                              +
                                                  V
20050326.1453   7.07  6.34  6.49   31  14  45    465   6M   6M  0
20050326.1454   8.86  7.24  6.80   31  14  45    502  37M   6M  0
20050326.1455   8.37  7.36  6.87   31  14  45    512  27M   6M  0

Because of this, I decided to check your queue size and saw that it is
only 400 MB:

[ldm@bigbird ldm]$ ls -alt data/ldm.pq
-rw-rw-r--  1 ldm ldm 407732224 Mar 15 16:34 data/ldm.pq

Given the volume of data you are ingesting, it would be better if your
queue size was substantially larger than 400 MB.  I believe that you
were running a 2 GB or 4 GB queue in the past.  Since you have upgraded
your LDM from 6.1 to 6.2.1 and then 6.3.0 relatively recently (Feb 18
for 6.2.1 and March 15 for 6.3.0), I am assuming that the need to set
the queue size in the new ~ldm/etc/ldmadmin-pl.conf configuration file
was missed.  In LDM 6.2.1, configuration settings such as $hostname,
$pq_size, etc. were moved from ~ldm/bin/ldmadmin into a persistent file,
~ldm/etc/ldmadmin-pl.conf.  The configuration entries in the new file
are almost all the same as those in ~ldm/bin/ldmadmin.  The only
exceptions are new entries that allow one to better tune the LDM queue
(e.g., $pq_slots).  I see that $pq_size was left at the default 400 MB
in ldmadmin-pl.conf:

$pq_size = "400M";

It may be wise to change this to something like 2 GB:

$pq_size = "2G";

and then remake the queue to the larger size:

<as 'ldm'>
-- edit ~ldm/etc/ldmadmin-pl.conf
ldmadmin stop
ldmadmin delqueue
ldmadmin mkqueue -f
ldmadmin start

The process can be sped up considerably by stringing all ldmadmin
invocations together on a single command line:

ldmadmin stop && ldmadmin delqueue && ldmadmin mkqueue -f && ldmadmin start
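
Once the LDM is back up, it is worth a quick sanity check that the
resize took.  Something like the following (a sketch; pqmon's exact
output and options may vary by LDM version, so check its man page):

<as 'ldm'>
ls -l ~/data/ldm.pq     # the queue file should now be roughly 2 GB
pqmon                   # reports product-queue usage (slots, bytes, age)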

>I'm running outta horsepower in the several machines I've been working 
>with.  One thing I'd been doing is getting the little subset of stuff I 
>needed to run on each machine, as a feed from bigbird, then processed it 
>locally.

OK.  The LDM overhead itself should not amount to much, so it must be
your own processing (decoding, image generation, etc.) that is consuming
the available horsepower.

>Where I'm heading is to nfs mount the data from bigbird, and then 
>process it into image directories either on the local machine and 
>cross-mount those to the webserver, or write them onto another 
>nfs-mounted directory.
>
>What are your thoughts on efficiencies and potential problems?

The one problem with NFS mounts is the dependency they create on the
order in which machines come up after shutdowns.  This is especially the
case when using the automounter.

Are you thinking that the NFS mounts will save CPU on the machines doing
the processing?  I am not sure if this will or will not be the case.  It
really depends on the NFS implementation.
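
If you do stick with NFS, a background, hard mount takes some of the
sting out of the boot-order problem, since a client that comes up before
bigbird will keep retrying in the background instead of hanging.  A
rough /etc/fstab sketch (the export and mount-point paths are made up):

bigbird:/data/ldm   /data/bigbird   nfs   ro,bg,hard,intr   0 0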

While talking about OSes, I want to let you know that we have upgraded
to Fedora Core 3 on our 32-bit AND 64-bit platforms.  FC3 appears to be
quite a bit more stable AND faster than either FC2 or FC1.  While
experimenting with a cluster approach to IDD data relay (more
information included at the end of this email), we had a shootout
between Sun Solaris x86 5.10, FreeBSD 5.3, and Fedora Core 3 64-bit
Linux on three identically equipped Sun Sunfire V20Z boxes (dual
Opteron, 4 GB RAM, 2x36 GB 10000 RPM SCSI, in 1U rackmount cases).  All
three are 64-bit OSes, so the comparison was as fair as we could make it.
The _clear_ winner for IDD relay was 64-bit FC3; FreeBSD 5.3 came in
second (not bad, but not nearly as good as FC3); and Solaris x86 5.10
was a _distant_ third (performance was dismal in our testing).  Because
of our testing, we replaced FreeBSD and Solaris x86 with FC3 on all of
our boxes.

I mention our testing since I see that bigbird is running FC2.  Do
you know if your 3Ware RAID card is supported under FC3?  I have
a hunch that it is, since I have been led to believe that Redhat
Enterprise WS 4 is the RH-supported version of FC3.

>I've 
>started running LDM as a trigger for the aforementioned architecture: 
>When the Bird flings a dataset across, rather than filing or decoding, 
>etc., it triggers the processing script telling the code to look at the 
>NFS-mounted data.  Seems to be working pretty well so far.

OK.  You are using up bandwidth by sending the products across the
wire, but this may not be so bad depending on your resources.  Another
approach would be to create an EXP product on bigbird that contained
only the metadata from the IDD product and use that as a trigger on
the processing machines.
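
As a rough sketch of what I mean (the feed pattern, product IDs, script
names, and hostname below are all invented, and remember that the
whitespace between pqact.conf fields must be tabs):

On bigbird, a pqact.conf entry hands the product ID off to a small
script that re-inserts it as a tiny EXP "trigger" product:

NEXRAD2   ^L2-BZIP2/(K[A-Z][A-Z][A-Z])/([0-9]*)
          EXEC      /home/ldm/util/insert_trigger.sh \1 \2

where insert_trigger.sh is something like:

#!/bin/sh
# hypothetical helper: publish a one-line EXP product whose ID carries
# the metadata the processing machines need
tmp=/tmp/trigger.$$
echo "$1 $2" > $tmp
pqinsert -f EXP -p "trigger/$1/$2" $tmp
rm -f $tmp

Each processing machine then requests only the triggers in ldmd.conf
(bigbird needs a matching ALLOW for EXP, of course):

REQUEST   EXP   "^trigger/"   bigbird.tamu.edu

and fires its processing from pqact.conf:

EXP       ^trigger/(K[A-Z][A-Z][A-Z])/([0-9]*)
          EXEC      /home/ldm/util/process_radar.sh \1 \2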

>I just want to make sure this thing will scale somewhat, or I'll be 
>doing this exercise over and over again!

It seems to me that the bottlenecks you will encounter in the future
are:

- servicing of more NFS clients from bigbird
- sending full sized products across the wire and throwing them away

Below I include an email that I sent to another user who is strongly
considering becoming a toplevel IDD relay node.  As you will see, the
note describes a cluster approach that we have been pursuing here in
the UPC.  As you read the info, please remember that we are still
learning about the cluster and _will_ be making changes to the setup in
the coming days/weeks/months/etc.  I offer the following in order to **
hopefully ** provoke a revisit to an effort you were involved in some
time back:  establishment of a toplevel IDD relay node at the Houston
GigaPop.  Please let me know what you think...

  From: Unidata Support <address@hidden>
  Date: Tue, 15 Mar 2005 18:41:01 -0700
  Subject: 20050315: IDD top level relay atm.geo.nsf.gov PSU (cont.) 
  
  re:
  >How should we proceed from here?
  
  Perhaps it would be useful if I described the setup we have been moving
  towards for our toplevel IDD relay nodes -- idd.unidata.ucar.edu and
  thelma.ucar.edu.  Let me warn you that I am not the expert in what I am
  about to say, but I think I can relate the essence of what we have been
  working on.  The real brains behind what I describe below are:
  
  John Stokes    - cluster design and implementation
  Steve Emmerson - LDM development
  Mike Schmidt   - system administration and cluster design
  Steve Chiswell - IDD design and monitoring
  
  I am sure that these guys will chime in when they see something I have
  mis-stated :-)
  
  As you know, in addition to atm.geo.nsf.gov we operate the top level
  IDD relay nodes idd.unidata.ucar.edu and thelma.ucar.edu.  Instead of
  idd.unidata and thelma.ucar being simple machines, they are part of a
  cluster that is composed of 'directors' (machines that direct IDD feed
  requests to other machines) and 'data servers' (machines that are fed
  requests by the director(s) and service those requests).  We are using
  the IP Virtual Server (IPVS) available in current versions of Linux to
  forward feed requests from 'directors' to 'data servers'.
  
  In our cluster, we are using Fedora Core 3 64-bit Linux run on a set of
  identically configured Sun SunFire V20Z 1U rackmount servers:  dual
  Opterons; 4 GB RAM; 2x36 GB 10K RPM SCSI; dual GB Ethernet interfaces.
  We got in on a Sun educational discount program and bought our 5 V20Zs
  for about $3000 each.  These machines are stellar performers for IDD
  work when running Fedora Core 3 64-bit Linux.  We tested three
  operating systems side-by-side before settling on FC3; the others were
  Sun Solaris x86 10 and FreeBSD 5.3, both of which are 64-bit.  FC3 was
  the _clear_ winner; FreeBSD was second; and Solaris x86 10 was a
  _distant_ third.  As I understand it, RedHat Enterprise WS 4 is FC3
  with full RH support.
  
  Here is a "picture" of what idd.unidata.ucar.edu and thelma.ucar.edu
  currently look like (best viewed with fixed width fonts):
  
                |<----------- directors ------------>|
  
                    +-------+            +-------+
                    |       ^            |       ^
                    V       |            V       |
                +---------------+    +---------------+
  idd.unidata   | LDM   | IPVS  |    | LDM   | IPVS  |  thelma.ucar
                +---------------+    +---------------+
                        / \    |               |   / \
                       /   \   |               |  /   \
                      /     \  +----+          | /     \
             +-------/-------\------|----------+/       \
             |      /         \     |          /         \
             |     /           \    +----------------+    \
             |    /             \            /       |     \
             V   /               \          /        V      \
          +---------------+   +---------------+   +---------------+
          |  'uni2' LDM   |   |  'uni3' LDM   |   |   'uni4' LDM  |
          +---------------+   +---------------+   +---------------+
  
          |<----------------- data servers ---------------------->|
  
  The top level indicates two 'director' machines: idd.unidata.ucar.edu
  and thelma.ucar.edu (thelma used to be a SunFire 480R SPARC III
  box).  Both of these machines are running IPVS and LDM 6.3.0
  configured on a second interface (IP).  The IPVS 'director' software
  forwards port 388 requests received on an interface configured as
  idd.unidata.ucar.edu on one machine and thelma.ucar.edu on the
  other.  The set of 'data server' backends are the same for both
  directors (at present).

  When an IDD feed request is received by idd.unidata.ucar.edu or
  thelma.ucar.edu it is relayed by the IPVS software to one of the data
  servers.  Those machines are configured to also be known internally
  as idd.unidata.ucar.edu or thelma.ucar.edu, but they do not ARP, so
  they are not seen by the outside world/routers.  The IPVS software
  keeps track of how many connections are on each of the data servers
  and forwards ("load levels") based on connection numbers (we will be
  changing this metric as we learn more about the setup).  The data
  servers are all configured identically: same RAM, same LDM queue size
  (8 GB currently), same ldmd.conf contents, etc.

  All connections from a downstream machine will always be sent to the
  same data server as long as its last connection has not died more
  than one minute ago.  This allows downstream LDMs to send an "are you
  alive" query to a server that they have not received data from in
  awhile.  Once there have been no IDD request connections by a
  downstream host for one minute, a new request will be forwarded to
  the data server that is least loaded.
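
  In ipvsadm terms, the forwarding amounts to something like the
  following (a sketch only; the addresses, scheduler, and persistence
  timeout shown here are placeholders, not our exact configuration):

  VIP=...                                # public idd.unidata/thelma address
  ipvsadm -A -t $VIP:388 -s lc -p 60     # port 388 service, least-connection
                                         #   scheduling, 60 s persistence
  ipvsadm -a -t $VIP:388 -r uni2:388 -g  # 'uni2' data server, direct routing
  ipvsadm -a -t $VIP:388 -r uni3:388 -g  # 'uni3'
  ipvsadm -a -t $VIP:388 -r uni4:388 -g  # 'uni4'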

  This design allows us to take down any of the data servers for
  whatever maintenance is needed (hardware, software, etc.) whenever we
  feel like it.  When a machine goes down, the IPVS server is informed
  that the server is no longer available, and all downstream feed
  requests are sent to the other data servers that remain up.  On top
  of that, thelma.ucar.edu and idd.unidata.ucar.edu are on different
  LANs and may soon be located in different parts of the UCAR campus.

  LDM 6.3.0 was developed to allow running the LDM on a particular
  interface (IP).  We are using this feature to run an LDM on the same
  box that is running the IPVS 'director'.  The IPVS listens on one
  interface (IP) and the LDM runs on another.  The alternate interface
  does not necessarily have to represent a different Ethernet device;
  it can be a virtual interface configured in software.  The ability to
  run LDMs on specific interfaces (IPs) allows us to run LDMs as either
  'data collectors' or as additional data servers on the same box
  running the 'director'.  By 'data collector', I mean that the LDMs on
  the 'director' machines have multiple ldmd.conf requests that bring
  data to the cluster (e.g., CONDUIT from atm, UIUC, and/or NEXRAD2
  from Purdue, HDS from here, IDS|DDPLUS from there, etc.).  The data
  server LDMs request data redundantly from the 'director' LDMs.  We
  currently do not have redundancy for the directors, but we will be
  adding that in the future.
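
  A sketch of the interface trick (the address is invented, and I am
  going from memory on exactly where the LDM's listen address gets set
  in 6.3.0, so check the documentation for your version):

  # give the 'director' box a second, virtual interface for its own LDM
  ifconfig eth0:1 192.168.10.5 netmask 255.255.255.0 up
  # then point that box's LDM at 192.168.10.5 -- in 6.3.0 this should be
  # the IP-address entry in ~ldm/etc/ldmadmin-pl.conf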

  We are just getting our feet wet with this cluster setup.  We will be
  modifying configurations as we learn more about how well the system
  works.  In stress tests run here at the UPC, we were able to
  demonstrate that one V20Z was able to handle 50% more downstream
  connections than the 480R thelma.ucar.edu without introducing
  latency.  With three data servers, we believe that we could now field
  literally every IDD feed request in the world if we had to (the
  ultimate failover site).  If the load on the data servers ever
  becomes too high, all we need do is add one or more additional boxes
  to the mix.  The ultimate limiting factor in this setup will be the
  routers and network bandwidth here in UCAR.  Luckily, we have
  excellent networking!

The cluster that is currently configured relays an average of 120 Mbps
(~1.2 TB/day) to downstream connections.  Peak rates can, however,
exceed 250 Mbps.

Please let me know what you think about the above!

Cheers,

Tom

>From address@hidden  Sun Mar 27 08:35:45 2005

Hi, Tom!

re: bigbird's LDM queue is 400 MB

>Stupid user error.  I *thought* I recalled that queues were now 
>automagically max'd, so I didn't check that.

>I'll do so...  If I correctly recall, I can now use '2G' or '4G' 
>instead of all the zeros.  If not, well, I'll know in a minute as it'll 
>fail and I'll reedit.

>Thanks for catching this... :-(

re: what is using processing capabilities

>Pretty much true, so I've distributed the gempak processing over a total 
>of 3 machines for Level II and one additional for the CONUS Level III 
>mosaic.  Another does my MADIS and IDD/DDPLUS ingest, as well as the 
>rest of the WMO feeds, HDS, etc.  So far, it's working pretty well, but 
>I suspect NFS isn't the best implementation, and I'm looking at 
>distributing processing via ssh/scp to the (well, I hesitate to refer to 
>it this way, but...) cluster of processing systems.

re: NFS mounts

>We initiate the mounts at boot time and leave 'em nailed.  However, 
>sometimes during higher loads, NFS loses lock for a few seconds to 
>minutes...

re: Are you thinking that the NFS mounts will save CPU

>Yeah.  And the Linux implementation isn't at the absolute top of the 
>heap.  In fact, knowing a little bit about NFS is the reason I'm 
>thinking about the ssh/scp route.  It should allow the data to be 
>snagged, processed, and the results returned pretty efficiently.

re: UPC experience with dual Opteron machines running FC3

>I may revamp the bird to go toward a dual Opteron implementation. 
>Interesting.

re: bigbird running FC2

>Yeah.  It is.  Don't know if you recall, or if I told you.  When we 
>started having the real nightmares with FC2 and the 2.6 kernels, I 
>talked to one of our vendors.  He didn't like the sound of the problems 
>and called an engineering contact at 3Ware.  They sent us a replacement 
>controller (we had to replace the parallel ATA drives with S-ATA 
>ourselves but we got a *good* price on 300GB drives), and told us the 
>problems with the parallel ATA controller in 2.6 were real, were their 
>fault, and possibly not fixable.  Thus the new (and well-supported) 
>hardware.

re: send product metadata to trigger actions on downstream machines

>Hadn't thought of THAT.  Interesting idea.

re: bottlenecks to be faced - servicing of more NFS clients from bigbird

>Yeah.  Solution: private network to handle NFS.

re: - sending full sized products across the wire and throwing them away

>Not a network issue if we enable the private network (which we have, 
>overall).

re: UPC direction on clusters

>I think it's certainly do-able... looks like equipment-grant time for 
>next year, unless I can snag more money from other sources... that's not 
>impossible now, as some of the work I'm doing with NWS on GIS-based 
>dissemination, including the Polygon Warning tests 
>(http://mesonet.tamu.edu/PolygonTest/ but the site's still a little 
>quirky and we're trying to fix the nagging little things) may lead to 
>some money.  If it does, first goes to a grad student to dedicate to 
>that process, the rest goes toward better hardware for the LDM/IDD relays.

>All that said, I can do a mirror almost literally over the next couple 
>of days, using mesodata3 as the 2nd feed source, and ramp up 
>availability.  We could use round-robin DNS for the time being to get 
>the connections flowing.

>Concerning connectivity, the possibility remains of placing the hardware 
>back in Houston, but we're revamping the Texas network infrastructure. 
>At this time, I'm exactly 1 router away from the LEARN POP for TAMU, and 
>I'm helping drive some of the requirements.  In fact, the LDM and Level 
>II work are being used as drivers here for current and future work.  The 
>downside of that is that at some point, I'll have to come up with some 
>funding to support our network bandwidth requirements.  I'm considering 
>that in all new funding requests.

>The LEARN connection will provide a 10Gb/sec link throughout Texas, and 
>initially a pair of OC12 (622Mb/sec) interfaces to Internet2.  Our 
>Commodity Internet capability will also ramp up to at least 1Gb/sec over 
>the next 4 months or so.  As I start consuming more bandwidth, I'll have 
>to pay (as indicated above) but that's not tomorrow, and I won't be cut 
>off or throttled.  So far, I don't think I'm bandwidth-limited by my 
>current location.

>I've CC'd James Esslinger on this.  I snagged him to work for me as an 
>admin and facilities manager in our lab (we need to contrive a meeting 
>here for you to come visit).  I'll discuss this with James over the next 
>week and we'll see what we can do to start (or restart) the process of 
>adding TAMU as a top-level redistro site.

>In the mean time, feel free to add feed requests via Internet2 (that'd 
>be NOAA, and universities, for the most part) pointed at bigbird, and 
>let's see where we taper off.

>I'll talk with James about an upgrade program for the various boxes to 
>FC3.  For what it's worth, I've been very happy with it, and the main 
>reason we haven't migrated everything there was a desire to not break 
>systems that were already working "OK" on the various other systems.  We 
>actually still have a RH 8.0 system running... despite my comments in 
>the past that "friends don't let friends run RH 8.0".

>We'll be back in touch.  Thanks for the thoughts!
>Happy Easter!
>gerry

-- 
Gerry Creager -- address@hidden
Texas Mesonet -- AATLT, Texas A&M University    
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.847.8578
Page: 979.228.0173
Office: 903A Eller Bldg, TAMU, College Station, TX 77843