[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[LDM #UYH-624598]: LVS realserver switching loses data



Art,

> I think I have a definitive, reproducable, case showing data loss by the ldm
> during network outages for streams that are running behind.  One caveat --
> the below test showed the downstream test system disconnecting from realserver
> iddrs3 and reconnecting to a different realserver, idd-ingest (a.k.a. iddrs2).
> However, I feel the reconnect still should have worked without data loss.
> I'm going to run a slightly different test this morning and hopefully that
> one will stay on the same system.
> ____________________
> 
> My systems:
> 
> Test system (apps1):  RHEL WS 4U3, LDM 6.4.6.4
> Realserver iddrs3:  RHEL AS 4U3, LDM 6.4.6.4
> Ingest server/Realserver idd-ingest (a.k.a. iddrs2):  RHEL AS 4U3, LDM 6.4.5
> 
> I'm not sure why iddrs2 was running 6.4.5... I've now upgraded that.
> ____________________
> 
> Here's what I did:
> 
> Feed path (CONDUIT):
> idd.unidata.ucar.edu -> idd-ingest.meteo.psu.edu (a.k.a. iddrs2) -> to 
> realservers
> realservers:  idd-ingest.meteo.psu.edu (a.k.a. iddrs2)
> iddrs3.meteo.psu.edu
> realservers -> apps1.meteo.psu.edu (the test system)
> 
> I set up the test system (apps1) to feed all data from virtual address
> ldm.meteo.psu.edu.  No pqact's were running.  Here's the request lines:
> 
> Test system ldmd.conf request lines:
> 
> request PPS     ".*"    ldm.meteo.psu.edu
> request DDS     ".*"    ldm.meteo.psu.edu
> request IDS     ".*"    ldm.meteo.psu.edu
> request FSL2    ".*"    ldm.meteo.psu.edu
> request DIFAX   ".*"    ldm.meteo.psu.edu
> request MCIDAS  ".*"    ldm.meteo.psu.edu
> request NIMAGE  ".*"    ldm.meteo.psu.edu
> request HDS     ".*"    ldm.meteo.psu.edu
> request CONDUIT ".*"    ldm.meteo.psu.edu
> request NGRID   ".*"    ldm.meteo.psu.edu
> request SPARE   ".*"    ldm.meteo.psu.edu
> request NEXRAD  ".*"    ldm.meteo.psu.edu
> request FNEXRAD ".*"    ldm.meteo.psu.edu
> request NEXRD2  "L2-BZIP2/(KCCX|KPBZ|KDIX|KBGM|KLWX|KBUF|KCLE)" 
> ldm.meteo.psu.edu
> request NLDN    ".*"    ldm.meteo.psu.edu
> request WSI     ".*"    ldm.meteo.psu.edu
> request GEM     ".*"    ldm.meteo.psu.edu
> request NOGAPS  ".*"    ldm.meteo.psu.edu
> 
> The feed was initially from the iddrs3 realserver.  I simulated a network
> disconnect by doing a "network stop" on apps1, waiting for 2 minutes from
> 22:28Z to 22:30Z, and then doing a "network start".  Upon resumption, apps1
> reconnected to the LDM on realserver idd-ingest (a.k.a. iddrs2).  After the
> reconnect, CONDUIT data did not start coming in again until its age reached
> the time of the network restart, so all CONDUIT data from the backlog age at
> the time of the network stoppage up to the network restart time was lost.

Would you please elucidate the meaning of "CONDUIT data did not start
coming in again until its age reached the time of the network restart".

> As an additional note, I ran this test a couple days ago on only CONDUIT data
> (no other feeds were requested) and it stopped/restarted correctly with no
> data loss.  So the problem seems related to other data streams coming in along
> with CONDUIT.  It should also be noted that the CONDUIT data runs late at peak
> data times (our current latencies go up to ~1500 to 2000 seconds with a split
> feed) while the other feeds have very low latencies.
> 
> The LDM was not stopped/restarted on any systems during the test.  Only
> the network on apps1 was stopped for 2 minutes.  All systems are running 
> rpc.ldmd
> with "-m 21600 -o 21599".
> 
> ____________________
> 
> Here's what I observed from the logs:
> 
> TEST MACHINE (apps1):
> Oct 26 15:09:31 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026134122.275 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=f42c74d4fef623f5cb04d2d0d877d1fd"}}
> Oct 26 15:09:31 apps1 ldm.meteo.psu.edu[7673] NOTE: Upstream LDM-6 on 
> ldm.meteo.psu.edu is willing to be a primary feeder
> Oct 26 22:29:01 apps1 ldm.meteo.psu.edu[7673] ERROR: Couldn't get IP address 
> for upstream host ldm.meteo.psu.edu; Couldn't resolve "ldm.meteo.psu.edu" to 
> an Internet address in 0.000735 seconds; hostname lookup timeout
> Oct 26 22:29:01 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Upstream LDM died
> Oct 26 22:29:01 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:01 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:02 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:02 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:04 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:04 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:07 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:07 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:12 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:12 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:20 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:20 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:33 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:33 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:29:54 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:29:54 apps1 ldm.meteo.psu.edu[7673] ERROR: Disconnecting due to LDM 
> failure; Couldn't get IP address of host ldm.meteo.psu.edu; try again later
> Oct 26 22:30:24 apps1 ldm.meteo.psu.edu[7673] NOTE: LDM-6 desired 
> product-class: 20061026220244.592 TS_ENDT {{CONDUIT,  ".*"},{NONE,  
> "SIG=ba18ed0766a89ce4bcbde902b3059409"}}
> Oct 26 22:30:24 apps1 ldm.meteo.psu.edu[7673] NOTE: Upstream LDM-6 on 
> ldm.meteo.psu.edu is willing to be a primary feeder
> 
> 
> Original feed machine serving ldm.meteo.psu.edu virtual address:
> REALSERVER iddrs3.meteo.psu.edu:
> Oct 26 15:09:35 iddrs3 apps1.meteo.psu.edu(feed)[12331] NOTE: Starting 
> Up(6.4.6.4/6): 20061026145428.587 TS_ENDT {{CONDUIT,  ".*"}}, Primary
> Oct 26 15:09:35 iddrs3 apps1.meteo.psu.edu(feed)[12331] NOTE: topo:  
> apps1.meteo.psu.edu {{CONDUIT, (.*)}}
> Oct 26 22:29:01 iddrs3 apps1.meteo.psu.edu(feed)[12331] ERROR: Couldn't flush 
> connection; nullproc_6() failure to apps1.meteo.psu.edu: RPC: Timed out
> Oct 26 22:29:03 iddrs3 rpc.ldmd[11325] NOTE: child 12331 exited with status 3
> 
> 
> Machine that served ldm.meteo.psu.edu after resuming network connectivity:
> REALSERVER idd-ingest.meteo.psu.edu (a.k.a. iddrs2):
> Oct 26 22:30:26 iddrs2 apps1.meteo.psu.edu(feed)[24145] NOTE: Starting 
> Up(6.4.5/6): 20061026222747.466 TS_ENDT {{CONDUIT,  ".*"}}, Primary
> Oct 26 22:30:26 iddrs2 apps1.meteo.psu.edu(feed)[24145] NOTE: topo:  
> apps1.meteo.psu.edu {{CONDUIT, (.*)}}
> 
> ____________________
> 
> Here's a couple lines from a "watch" log issued as follows:
> 
> pqutil -r -v -w /usr/local/ldm/data/ldm.pq >& /usr/local/ldm/logs/watch.log &
> 
> Last CONDUIT received before network stoppage on test system:
> 
> Oct 26 22:28:01 pqutil INFO:    76284 20061026220344.592 CONDUIT 158  
> data/nccf/com/ruc/prod/ruc2a.20061026/ruc2.t21z.pgrb20f12 
> !grib/ncep/RUC2/#252/200610262100/F012/VGRD/750_mb! 000158
> 
> First CONDUIT received after network resumed on test system:
> 
> Oct 26 22:43:36 pqutil INFO:   125600 20061026222818.325 CONDUIT 001  
> data/nccf/com/gfs/prod/gfs.2006102618/gfs.t18z.pgrb2f147 
> !grib2/ncep/GFS/#000/200610261800F147/HGHT/975 Pa PRES! 000001
> 
> ____________________
> 
> Here's information from pqmon:
> 
> Queue "age" at 22:30Z as reported by pqmon:
> 
> iddrs3 -> 7285 seconds
> idd-ingest (a.k.a. iddrs2) -> 7306 seconds
> 
> ____________________
> 
> So, how would you interpret the above results?  This looks to me like a real
> problem...  unless I'm missing something somewhere.  Let me know if you need
> any further information or if you'd like me to run a different test or to set
> up the test a different way.

Thanks for sending this in.  I think it might have revealed a bug dealing
with the setting of the "from" time in the sending product-class in
response to a request that encodes the signature of the last successfully-
received product.  I'm investigating and will get back to you ASAP.

> Thanks...
> 
> Art
> 
> Arthur A. Person
> Research Assistant, System Administrator
> Penn State Department of Meteorology
> email:  address@hidden, phone:  814-863-1563

Regards,
Steve Emmerson

Ticket Details
===================
Ticket ID: UYH-624598
Department: Support LDM
Priority: Normal
Status: On Hold