
[LDM #HEQ-649192]: LDM fault tolerance


> Your clarifications help a lot. Though I still have some questions about
> machine failures and where a replacement node would 'start' at in the data.

When a downstream LDM starts, it requests products starting either from some 
time in the past (typically one hour ago) or from the last successfully received 
product (which it tracks), whichever is more recent. Thus, if the downstream 
site is offline for less than one hour or the minimum residency time of the 
upstream site's product-queue (whichever is less), then no products will be lost.
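
To make the rule above concrete, here is a rough sketch in Python. It is only a model of the behavior described in the preceding paragraph, not LDM code: the function names, the `max_backlog` parameter, and the treatment of a missing last-received record (a node that lost all of its state) as "fall back to the time offset" are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Typical look-back described above; the real value is site configuration.
ONE_HOUR = timedelta(hours=1)


def request_start(now: datetime,
                  last_received: Optional[datetime],
                  max_backlog: timedelta = ONE_HOUR) -> datetime:
    """Time from which a restarting downstream node would request products.

    Hypothetical model: the later of (now - max_backlog) and the time of
    the last successfully received product. A node with no record of a
    last product (assumed here for a fresh replacement) uses the offset.
    """
    backlog_start = now - max_backlog
    if last_received is None:
        return backlog_start
    return max(backlog_start, last_received)


def no_products_lost(outage: timedelta,
                     upstream_min_residency: timedelta,
                     max_backlog: timedelta = ONE_HOUR) -> bool:
    """True if the outage is shorter than both the look-back window and
    the minimum residency time of the upstream product-queue."""
    return outage < min(max_backlog, upstream_min_residency)
```

For example, under this model a 30-minute outage against an upstream queue with a 45-minute minimum residency time loses nothing, while a 50-minute outage against the same upstream may lose products even though it is shorter than one hour.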

> Allow me to elaborate on my use case: I'm planning to download large amounts
> of weather data, spreading the load across many 'nodes.' Nodes in this case
> are AWS instances. At the scale we're looking at, instance ('machine')
> failure with loss of any LDM state is to be expected relatively
> frequently. If there's some piece of LDM non-memory state that needs to be
> persisted between machine failures to guarantee delivery, I need to be
> aware.
> I am evaluating, if we run LDM naively what are our failure conditions when
> - we lose a machine, all its local state, and disk
> - an LDM process dies
> - network partitions or failures between or during transfers
> And how we might structure our LDM cluster to avoid any related problems.
> I'm also looking to better understand the implementation of LDM so I can be
> aware of our upstream providers' (paid universities) potential failure cases
> and how they impact our cluster's ability to always successfully receive
> and process files. (ignoring network partitions > ~45mins, failure of more
> than some set number of redundant nodes, and the data being unavailable to
> the LDM network)
> So, basically, trying to figure out how strong the processing guarantees
> that LDM provides are so I know where I need to add extra
> monitoring/coordination between redundant nodes.

Sounds like you might be interested in the section on LDM clusters in the 
reference manual.

Steve Emmerson

Ticket Details
Ticket ID: HEQ-649192
Department: Support LDM
Priority: Normal
Status: Closed
