
20030630: IDD feeds from LSU to any non LSU downstream sites (cont.)



>From: Robert Leche <address@hidden>
>Organization: LSU
>Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD

Bob,

>As we have switched to 'event mode' with the hurricane in the Gulf, I
>have had to drop the network investigation. Today is out, and at least
>part of tomorrow. Also, I have lost email over the last 4 days.
>Please resend the emails you sent from Friday on.
>
>Speaking of hurricanes, our computer "Hurricane" died. I am in the 
>process of rebuilding it with Gempak to bring to the Office of Emergency 
>Preparedness. Murphy's Law.. If it can fail, it will!

The most important thing I asked for in the email sent since last Friday
was for you to contact the telecomm folks at LSU and/or LANET to find
out what they did over the weekend that first made the HDS latencies
from seistan to zero.unidata.ucar.edu drop significantly starting on
Friday evening, AND then made them climb back up starting on Sunday
afternoon.  Whatever was changed holds the key to finally closing out
the feed problems being experienced by sites downstream of LSU.


Tom


Here are all of the messages I sent to you since last Friday morning:

From address@hidden Sun Jun 29 20:06:45 2003
  To: address@hidden
  cc: address@hidden, Kevin Robbins <address@hidden>
  Subject: 20030628: HDS feed to/from seistan (cont.) 
  
  >From: Unidata Support <address@hidden>
  >Organization: UCAR/Unidata
  >Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
  
  Bob,
  
  Well, after most of a weekend of pretty good HDS latencies from seistan
  to zero.unidata.ucar.edu, the feed problems reappeared.  This can be
  seen in the 'latency' plot from the real time statistics page:
  
  
http://www.unidata.ucar.edu/cgi-bin/rtstats/iddstats_nc?HDS+zero.unidata.ucar.edu
  
  The questions now are:
  
  - what changed at LSU/LANET on Saturday to make the latencies drop to near
    what they should be
  
  - what changed at LSU/LANET on Sunday afternoon to make the latencies
    climb to their previous bad levels
  
  I think a call to the LSU telecomm folks is in order.  If you can't get
  anywhere with them (please try, you should have more clout with them
  than we do), can you send along their contact information?


From address@hidden Sat Jun 28 07:53:30 2003
  To: address@hidden
  cc: address@hidden, Kevin Robbins <address@hidden>
  Subject: 20030627: HDS feed to/from seistan (cont.) 
  
  >From: Robert Leche <address@hidden>
  >Organization: LSU
  >Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
  
  Hi Bob,
  
  >take a look at the following two cases. Notice the LSU to ULM hop is
  >via the network address translation firewall:
  >dynip422.nat.ulm.edu (line 6). Interestingly, ULM does not pass through
  >the same NAT/firewall process. I believe this could offer a clue.
  >The traceroute report is missing the last 3 hops, and until the firewall at
  >ULM is opened to allow you to ping tornado we will not have a complete picture.
  
  I don't think that this has anything to do with the feed problems we
  were seeing from LSU to others.  It only explains the inability to do
  complete traceroutes to ULM.  This is/was not part of the feed problems
  we have been seeing.
  
  >Somehow two different paths are connecting ULM, and this suggests a
  >reason why it takes more time to send packets from Seistan to
  >Tornado.
  
  It does not explain the asymmetry in feed to/from UCAR.  ULM has been
  out of the picture as far as high volume data feeds from seistan for
  well over a week now.  Ever since I switched them to feed from
  CU/CIRES (rainbow.al.noaa.gov), their HDS latencies have been at or
  very near zero.
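
  For what it's worth, switching a downstream to a different upstream is
  just a matter of changing the request line in its ldmd.conf and
  restarting the LDM; an illustrative line for tornado (the pattern is
  shown only as an example):

      # ldmd.conf on tornado.geos.ulm.edu -- illustrative only
      # request <feedtype>  <pattern>  <upstream host>
      request   HDS         ".*"       rainbow.al.noaa.gov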
  
  Now, back to the problem at hand.  Something significant changed last
  night:
  
  - the HDS latencies from seistan to zero.unidata.ucar.edu dropped to
    near zero after a spike at around 7Z
  
  - for the first time since setting up the feed test from emo.unidata.ucar.edu
    to seistan and then back out to zero.unidata.ucar.edu, all HDS data was
    relayed from seistan to zero.unidata.ucar.edu
  
  - latencies for all feeds from seistan to tornado.geos.ulm.edu
    (e.g., FSL2, IDS|DDPLUS, UNIWISC, and NNEXRAD) dropped significantly
  
  Given these three observations from the real time statistics page:
  
  http://www.unidata.ucar.edu/staff/chiz/rtstats/siteindex.shtml
  
  for seistan.srcc.lsu.edu, zero.unidata.ucar.edu, and tornado.geos.ulm.edu
  I conclude that something changed in the network path out of LSU or
  in LANET.
  
  Did you receive a change notification from the LSU telecomm folks?  If
  not, will you contact them to find out exactly what was done?  A complete
  picture of what went wrong and its fix will help others if they run into
  similar problems.


From address@hidden Fri Jun 27 12:11:39 2003
  To: address@hidden
  cc: address@hidden, Kevin Robbins <address@hidden>
  Subject: 20030627: HDS feed to/from seistan (cont.) 
  
  >From: Robert Leche <address@hidden>
  >Organization: LSU
  >Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
  
  Hi Bob,
  
  re: ULM rerouted their traffic from I2 to "I1"
  
  >I did not know this happened, but it explains why ULM is able to communicate
  >with rainbow.al.noaa.gov.
  
  The ULM folks told us that during a total outage at LSU at some point in
  the past, they fed from thelma.ucar.edu and experienced no problems.  This
  predated both your and ULM's upgrades to LDM-6 by quite a bit.
  
  Here is a portion of the original note we received about the problems ULM
  was having feeding from srcc.lsu.edu:
  
    "For more than a year, we have been having serious data feed problems
    when our upstream site is at LSU (sirocco).  We have tried everything
    that we can, including contacting LSU repeatedly, but cannot seem to
    resolve the situation satisfactorily.  We have worked extensively with
    our network people and believe that the problem is at LSU.  We are
    basing this conclusion on the fact that, while sirocco was down and we
    were feeding from Unidata's thelma machine, everything was fine.  We
    received all data without significant losses.  However, once sirocco
    came on-line again and we switched over to them, we began to experience
    substantial losses of data.  Our fallback site is OU's stokes machine
    and we have used them in the past, but they are feeding so many sites
    that we tend to fall significantly behind in the data feed.
    
    Can you help us resolve this problem?"
  
  >It would be interesting to also force an I1 connection to LSU and repeat
  >the test. 
  
  I agree, running feed tests using a different route to/from LSU would
  certainly be welcome.
  
  re: "I1"
  
  >Internet one?
  
  That is what we asked.
  
  >A better question in this case is, what is I2 in the context
  >of the LANET SONET connecting ULM to LANET?
  
  Here is the route from ULM to seistan.srcc.lsu.edu:
  
                             Matt's traceroute  [v0.49]
  tornado.geos.ulm.edu                                   Fri Jun 27 10:56:14 2003
  Keys:  D - Display mode    R - Restart statistics    Q - Quit
                                             Packets               Pings
  Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
   1. 10.16.0.1                              0%   18   18     1    1    1      1
   2. 10.1.1.1                               0%   18   18     0    0    0      1
   3. 198.232.231.1                          0%   18   18     0    0    0      1
   4. laNoc-ulm.LEARN.la.net                 0%   17   17    13   13   19     76
   5. lsubr-laNoc.LEARN.la.net               0%   17   17    14   14   15     26
   6. howe-e241a-4006-dsw-1.g1.lsu.edu       0%   17   17    18   15   22     50
   7. seistan.srcc.lsu.edu                   0%   17   17    15   14   19     42
  Resolver: Received error response 2. (server failure)
  
  
  This can be compared with LSU's route from seistan to tornado.geos.ulm.edu:
  
  
                             Matt's traceroute  [v0.49]
  seistan.srcc.lsu.edu                                   Fri Jun 27 10:58:56 2003
  Keys:  D - Display mode    R - Restart statistics    Q - Quit
                                             Packets               Pings
  Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
   1. 130.39.188.1                           0%   11   11     4    1    2      5
   2. lsubr1-118-6509-dsw-1.g2.lsu.edu       0%   11   11     1    0    1      1
   3. laNoc-lsubr.LEARN.la.net               0%   11   11     2    1    2      4
   4. ulm-laNoc.LEARN.la.net                 0%   11   11    14   14   36     91
   5. 198.232.231.2                          0%   11   11    29   14   41    127
   6. dynip422.nat.ulm.edu                   0%   11   11    16   15   25     61
   7. tornado.geos.ulm.edu                   0%   10   10    15   14   16     23
  Resolver: Received error response 2. (server failure)
  
  
  >My limited understanding of
  >what I2 is, is that  traffic is I2 if it passes through Abilene's system.
  
  I believe that is correct.
  
  >That being the case, unless ULM is passing through Abilenes routers, ULM
  >is really on I1 anyway.
  
  Please see the route above.  This, at least, reflects ULM's current
  connection to LSU.  UCAR's connection to ULM, however, traverses I2
  until Houston where the bridge is made to LEARN.La.Net:
  
  zero.unidata.ucar.edu -> tornado.geos.ulm.edu:
  
                             Matt's traceroute  [v0.44]
  zero.unidata.ucar.edu                                  Fri Jun 27 12:02:58 2003
  Keys:  D - Display mode    R - Restart statistics    Q - Quit
                                             Packets               Pings
  Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
   1. flra-n140.unidata.ucar.edu             0%   71   71     0    0    0     29
   2. gin-n243-80.ucar.edu                   0%   71   71     0    0    0      6
   3. frgp-gw-1.frgp.net                     0%   71   71     1    1    2     25
   4. 198.32.11.105                          0%   71   71     1    1    1      6
   5. kscyng-dnvrng.abilene.ucaid.edu        0%   71   71    12   12   13     26
   6. hstnng-kscyng.abilene.ucaid.edu        0%   71   71    27   27   27     27
   7. laNoc-abileneHou.LEARN.La.Net          0%   71   71    33   32   33     36
   8. ulm-laNoc.LEARN.La.Net                 0%   70   70    45   45   46     71
   9. ???
  
  tornado.geos.ulm.edu -> zero.unidata.ucar.edu
  
                             Matt's traceroute  [v0.49]
  tornado.geos.ulm.edu                                   Fri Jun 27 13:04:05 2003
  Keys:  D - Display mode    R - Restart statistics    Q - Quit
                                             Packets               Pings
  Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
   1. 10.16.0.1                              0%    4    4     1    1    1      1
   2. 10.1.1.1                               0%    4    4     0    0    0      0
   3. 198.232.231.1                          0%    4    4     0    0    0      0
   4. laNoc-ulm.LEARN.la.net                 0%    4    4    13   13   13     13
   5. abileneHou-laNoc.LEARN.la.net          0%    4    4    18   18   25     45
   6. kscyng-hstnng.abilene.ucaid.edu        0%    3    3    34   34   34     34
   7. dnvrng-kscyng.abilene.ucaid.edu        0%    3    3    44   44   44     44
   8. 198.32.11.106                          0%    3    3    44   44   44     45
   9. gin.ucar.edu                           0%    3    3    46   45   45     46
  10. flrb.ucar.edu                          0%    3    3    45   45   46     46
  11. zero.unidata.ucar.edu                  0%    3    3    56   45   49     56
  Resolver: Received error response 2. (server failure)
  
  
  re: ULM rerouted away from the problematic I2 connection
  
  >LANET indicated this trouble ticket
  >has been open for "some time". We do not know what "some time" means in terms
  >of  days or months.
  
  It would be useful to know how long that trouble ticket has been open.
  
  >CRC, and retransmission errors are consistent with delays
  >in network traffic.
  
  I agree.
  
  re: is CRC and retransmission (trouble ticket at LANET) affecting LSU also
  
  >I  think the communication issue will require resolving before we will
  >know.
  
  The really strange part is the asymmetry in the problem.  We are
  feeding seistan.srcc.lsu.edu the HDS stream from emo.unidata.ucar.edu
  with no latencies, while at the same time we are _unable_ to feed the
  data back to a different machine here at the UPC, zero.unidata.ucar.edu
  (zero and emo are in the same room on the same subnet).  Given that,
  perhaps a look at the route from Unidata to seistan and back again
  would be instructive:
  
  zero.unidata.ucar.edu -> seistan.srcc.lsu.edu
  
                             Matt's traceroute  [v0.44]
  zero.unidata.ucar.edu                                  Fri Jun 27 10:16:40 2003
  Keys:  D - Display mode    R - Restart statistics    Q - Quit
                                             Packets               Pings
  Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
   1. flra-n140.unidata.ucar.edu             0%    8    8    10    0    1     10
   2. gin-n243-80.ucar.edu                   0%    8    8     0    0    0      0
   3. frgp-gw-1.frgp.net                     0%    8    8     1    1    1      2
   4. 198.32.11.105                          0%    8    8     1    1    1      1
   5. kscyng-dnvrng.abilene.ucaid.edu        0%    8    8    22   12   13     22
   6. hstnng-kscyng.abilene.ucaid.edu        0%    8    8    27   27   27     27
   7. laNoc-abileneHou.LEARN.La.Net          0%    8    8    33   33   33     33
   8. lsubr-laNoc.LEARN.La.Net               0%    8    8    34   34   34     34
   9. howe-e241a-4006-dsw-1.g2.lsu.edu       0%    8    8    39   35   37     42
  10. seistan.srcc.lsu.edu                   0%    7    7    34   34   34     35
  
  
  seistan.srcc.lsu.edu -> zero.unidata.ucar.edu
  
                             Matt's traceroute  [v0.49]
  seistan.srcc.lsu.edu                                   Fri Jun 27 11:15:53 2003
  Keys:  D - Display mode    R - Restart statistics    Q - Quit
                                             Packets               Pings
  Hostname                                %Loss  Rcv  Snt  Last Best  Avg  Worst
   1. 130.39.188.1                           0%   14   14     1    1    3     16
   2. lsubr1-118-6509-dsw-1.g2.lsu.edu       0%   14   14     0    0    1      6
   3. laNoc-lsubr.LEARN.la.net               0%   14   14     2    1    2      5
   4. abileneHou-laNoc.LEARN.la.net          0%   14   14     8    7   16     46
   5. kscyng-hstnng.abilene.ucaid.edu        0%   14   14    23   22   22     23
   6. dnvrng-kscyng.abilene.ucaid.edu        0%   14   14    33   33   36     71
   7. 198.32.11.106                          0%   14   14    34   33   36     59
   8. gin.ucar.edu                           0%   14   14    35   34   35     45
   9. flrb.ucar.edu                          0%   14   14    34   34   35     45
  10. zero.unidata.ucar.edu                  0%   13   13    34   34   36     57
  
  
  The major difference in routes that I notice is that the route from zero
  to seistan goes through howe-e241a-4006-dsw-1.g2.lsu.edu, but the
  route from seistan to zero goes through lsubr1-118-6509-dsw-1.g2.lsu.edu.
  
  Perhaps this is a big clue that we are overlooking?  Could it be
  that there is something amiss on the howe-e241a-4006-dsw-1.g2.lsu.edu
  gateway/router?
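
  To make that comparison a bit more mechanical, here is a rough sketch
  (plain Python; the hop lists are copied by hand from the two mtr reports
  above) that normalizes the router interface names and prints the hops
  seen in only one direction:

      # Rough sketch: compare the forward and reverse hop lists from the
      # two mtr reports between zero and seistan.
      forward = [            # zero.unidata.ucar.edu -> seistan.srcc.lsu.edu
          "flra-n140.unidata.ucar.edu", "gin-n243-80.ucar.edu",
          "frgp-gw-1.frgp.net", "198.32.11.105",
          "kscyng-dnvrng.abilene.ucaid.edu", "hstnng-kscyng.abilene.ucaid.edu",
          "laNoc-abileneHou.LEARN.La.Net", "lsubr-laNoc.LEARN.La.Net",
          "howe-e241a-4006-dsw-1.g2.lsu.edu", "seistan.srcc.lsu.edu",
      ]
      reverse = [            # seistan.srcc.lsu.edu -> zero.unidata.ucar.edu
          "130.39.188.1", "lsubr1-118-6509-dsw-1.g2.lsu.edu",
          "laNoc-lsubr.LEARN.la.net", "abileneHou-laNoc.LEARN.la.net",
          "kscyng-hstnng.abilene.ucaid.edu", "dnvrng-kscyng.abilene.ucaid.edu",
          "198.32.11.106", "gin.ucar.edu", "flrb.ucar.edu",
          "zero.unidata.ucar.edu",
      ]

      def normalize(hop):
          # Router interfaces are usually named for both ends of the link
          # ("laNoc-abileneHou" one way, "abileneHou-laNoc" the other), so
          # lowercase the name and sort the pieces of the first label.
          # Endpoints and bare IP hops will still show up as differences.
          hop = hop.lower()
          first, _, rest = hop.partition(".")
          return "-".join(sorted(first.split("-"))) + "." + rest

      fwd = {normalize(h) for h in forward}
      rev = {normalize(h) for h in reverse}
      print("only on the zero -> seistan path:", sorted(fwd - rev))
      print("only on the seistan -> zero path:", sorted(rev - fwd))

  Run against the reports above, the LSU campus hops stand out in the
  output (howe-... on the way in, lsubr1-... on the way out), along with
  the endpoint hops that naturally differ.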
  
  re: What did the telecomm folks have to say about the asymmetry seen moving
  data to/from srcc.lsu.edu from zero.unidata.ucar.edu?
  
  >The issue of asymmetry was not the paramount issue with telecom. Again, the
  >telecom guys want to wait and see if the communications issues are fixed, as
  >they believe the errors in the circuit are causing the problems between LSU
  >and ULM.
  
  The problem is not _just_ between LSU and ULM.  We (zero.unidata.ucar.edu)
  are seeing the exact same problem that ULM was seeing when trying to
  feed HDS from seistan.srcc.lsu.edu.  Moreover, we saw the exact same
  problem during our test of feeding the HDS stream from
  seistan.srcc.lsu.edu to the University of South Florida machine,
  metlab.cas.usf.edu.  The problem most likely exists between seistan
  and Jackson State, but we can't verify this because they are not reporting
  stats AND we do not have current contact information for them.
  
  If the LSU telecomm folks are under the impression that the only
  problem is between LSU and ULM, then they need to be contacted and made
  aware of the problems going to such diverse sites as UCAR and USF.


From address@hidden Fri Jun 27 07:34:11 2003
  To: address@hidden
  cc: Kevin Robbins <address@hidden>, address@hidden
  Subject: 20030626: 20030624: HDS feed to/from seistan (cont.) 
  
  >From: Robert Leche <address@hidden>
  >Organization: LSU
  >Keywords: 200306161954.h5GJs2Ld016710 LDM-6 IDD
  
  Hi Bob,
  
  >In talking with our telecommunications people:
  >
  >1) The Louisiana Office of Telecommunications ("LANET") was contacted with
  >the problem, and LANET reports Bell South (the state's communications
  >provider) has an open trouble ticket on the public switched SONET network
  >connecting ULM to the LANET. The trouble ticket reports: CRC, retransmission
  >errors.  This is a DS-3 Permanent Virtual Circuit (PVC) on the public
  >switched SONET network connecting ULM to the LANET.
  
  This sounds like the problem we uncovered at ULM.  They contacted their
  service provider and rerouted their traffic from I2 to "I1".  We
  never did get a reply from them as to what "I1" means.  After they
  rerouted away from their problematic I2 connection, we were able to feed
  all of HDS to them with virtually no latency.
  
  >LANET indicated this trouble ticket
  >has been open for "some time". We do not know what "some time" means in terms
  >of  days or months. CRC, and retransmission errors are consistent with delays
  >in network traffic.
  
  Is this also affecting the LSU connection?  If not, there is still a
  problem to be solved.
  
  >2) Concerning Ping (ICMP):
  >   A) LSU has limitations placed on ICMP payload sizes to limit "the
  >Ping Of Death" hacks. So it is interesting that even though LSU has this
  >policy in place, we can demonstrate large ICMP traffic correctly querying
  >systems other than ULM, but not ULM.
  
  OK.
  
  >   B) The telecommunications people pointed out that Cisco router interface
  >ping (ICMP) buffers have a hard limitation of 18,000 bytes. Unix/Linux
  >systems do not have this issue. So the theory goes....  Ping LSU's Cisco
  >border router, then LANET's Cisco border router, and problems seem apparent.
  >Yet ping a UNIX device with a large payload beyond the Cisco device and
  >travel time delays suddenly do not seem excessive.
  
  I understand.  Still, the pings with large ICMP packets from
  seistan (RedHat 7.2 Linux) to zero.unidata.ucar.edu (Sun Solaris SPARC
  5.9) show dramatic round trip time increases after the ping packet size
  exceeds 20 KB.  The 18,000 byte limit you note does seem like what we
  were seeing when trying to ping laNoc-lsubr.LEARN.la.net.
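
  For what it's worth, that payload-size sweep is easy to script rather
  than run by hand; here is a rough Python sketch (it assumes a Linux
  iputils-style ping with -c/-s flags, so it would be run from the seistan
  end):

      # Rough sketch: sweep the ICMP payload size and report the average RTT.
      # Assumes "ping -c N -s SIZE host" (Linux iputils); Solaris ping uses a
      # different syntax, so run this from the Linux side.
      import re
      import subprocess

      TARGET = "zero.unidata.ucar.edu"

      for size in (64, 1400, 8192, 16384, 20000, 24000, 32768):
          out = subprocess.run(
              ["ping", "-c", "5", "-s", str(size), TARGET],
              capture_output=True, text=True).stdout
          # iputils prints a summary line such as:
          #   rtt min/avg/max/mdev = 34.1/35.2/37.0/0.9 ms
          m = re.search(r"= [\d.]+/([\d.]+)/", out)
          print(f"{size:6d} byte payload  avg rtt: {m.group(1) if m else 'n/a'} ms")

  If the jump really does happen just above 20 KB, the output should make
  that obvious at a glance.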
  
  >3) It would be interesting to know who ULM is feeding HDS from. Chances are,
  >the communications circuit they are currently using is the same DS-3 circuit
  >that LANET uses.
  
  Right now, ULM is feeding HDS from rainbow.al.noaa.gov (this is a
  CU/CIRES lab here in Boulder).  We also fed them with no latency from
  emo.unidata.ucar.edu.
  
  >4) Limitations placed on ICMP payload sizes on any devices in a network's
  >path will cause problems in using ICMP round trip time to measure network
  >metrics.  But at this time, I do not have an alternative method to measure
  >network latencies. My network guy said network latency issues are handled
  >by the circuit provider. No help there.
  
  The ping packet size issue was just an interesting observation.  The
  real issue is the latency when feeding the HDS stream out of LSU as
  compared to virtually no latency when feeding the HDS stream _into_
  LSU.  This observation is something that the telecomm people should be
  able to use to help isolate where the throttling is occurring on or near
  the LSU campus.  The fact that we can feed ULM all of the HDS stream
  from at least two other sites, and that we can feed seistan but cannot
  feed _from_ seistan, shows us that the problem is not at ULM, but at LSU.
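
  One concrete thing we could hand the telecomm folks is a raw TCP
  throughput test, independent of the LDM, run in each direction between
  seistan and zero.  Here is a rough Python sketch (the port number is
  arbitrary; start the receiver on one host, run the sender on the other,
  then swap the roles to compare the two directions):

      # Rough sketch of a one-way TCP throughput test, independent of the LDM.
      # Save as, e.g., tput.py, then:
      #   on the receiving host:  python tput.py receiver
      #   on the sending host:    python tput.py <receiving host>
      import socket
      import sys
      import time

      PORT = 5001                       # arbitrary; any open port will do
      CHUNK = 64 * 1024
      TOTAL = 50 * 1024 * 1024          # send 50 MB per test

      def receiver():
          srv = socket.socket()
          srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          srv.bind(("", PORT))
          srv.listen(1)
          conn, peer = srv.accept()
          got, start = 0, time.time()
          while True:
              data = conn.recv(CHUNK)
              if not data:
                  break
              got += len(data)
          secs = time.time() - start
          print(f"received {got} bytes from {peer[0]} in {secs:.1f} s "
                f"({got * 8 / secs / 1e6:.2f} Mb/s)")

      def sender(host):
          conn = socket.create_connection((host, PORT))
          start, sent = time.time(), 0
          buf = b"x" * CHUNK
          while sent < TOTAL:
              conn.sendall(buf)
              sent += len(buf)
          conn.close()
          print(f"sent {sent} bytes in {time.time() - start:.1f} s")

      if __name__ == "__main__":
          if sys.argv[1] == "receiver":
              receiver()
          else:
              sender(sys.argv[1])

  If that shows the same asymmetry (fast into LSU, slow out of LSU) with
  plain TCP, the LDM is clearly off the hook and the problem can be framed
  purely as a network issue.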
  
  What did the telecomm folks have to say about the asymmetry seen moving
  data to/from srcc.lsu.edu?

Tom

>From address@hidden Mon Jun 30 12:30:54 2003
>To: Unidata Support <address@hidden>
>Subject: Re: 20030630: IDD feeds from LSU to any non LSU downstream sites 
>(cont.)

>Tom,

>thanks for sending the email to me.

>The LSU telecom folks report no changes were made to the LSU network
>configuration over the weekend. The LANET part of this remains to be
>answered, and our telecom will contact them.

>Just to let you know, we have not made any changes to Seistan either.