JGroups Protocol FD ALL

From NovaOrdis Knowledge Base

External

Internal

Overview

FD_ALL is a multicast-based failure detection protocol.

Failure Detection

FD_ALL implements failure detection based on a simple heartbeat protocol. Every member periodically multicasts a heartbeat using the underlying transport, not a parallel mechanism. Every member also maintains a table of all members (minus itself). When data or a heartbeat from a peer is received, the member resets the timestamp for that peer to the current time. Periodically, the member checks for expired members and suspects those.

Once a member has been suspected, a SUSPECT event is sent up the stack.

The protocol has two asynchronous processing threads: one that periodically sends multicast heartbeats and one that periodically analyzes the suspect table and raises SUSPECT events up the stack, when the suspect table is not empty.
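The bookkeeping described in this section can be sketched in a few lines of plain Java. This is a minimal illustration, not the JGroups implementation: the class name HeartbeatTable and its methods are hypothetical, members are identified by plain strings, and the current time is passed in explicitly so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration class; it is NOT part of JGroups.
final class HeartbeatTable {
    // Last time (in ms) data or a heartbeat was seen from each peer.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMs;

    HeartbeatTable(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Register a peer (the local member is never added to its own table).
    void addMember(String member, long nowMs) {
        lastSeen.put(member, nowMs);
    }

    // Reset the peer's timestamp when a heartbeat (or data) arrives.
    // Unknown senders are ignored.
    void heartbeatReceived(String member, long nowMs) {
        lastSeen.computeIfPresent(member, (m, old) -> nowMs);
    }

    // Periodic check: return every peer whose last timestamp is older
    // than the timeout. The caller would raise a suspicion for each.
    List<String> findSuspects(long nowMs) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                suspects.add(e.getKey());
            }
        }
        return suspects;
    }
}
```

In this sketch, one periodic task would multicast the local heartbeat while a second one calls findSuspects() and reports any non-empty result, mirroring the two threads mentioned above.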

Also See

JGroups Events - Failure Detection

Configuration

JGroups standalone

  <FD_ALL interval="1000" timeout="3000"/>

interval

The interval (in ms) at which a member sends a HEARTBEAT message to the cluster and at which the response timestamps are checked. In the example above, each member sends a heartbeat and checks the response timestamps for previous heartbeats every second. If at any moment the difference between a specific HEARTBEAT event timestamp and the response timestamp from a member is larger than 'timeout', that member is suspected.

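So, for the values defined above, if heartbeat H1 doesn't receive a response, the remote member will be suspected after about 4,000 ms (3 * interval + interval): the check at 3,000 ms finds a difference that is not yet larger than 'timeout', so the member is only suspected at the following check. If the timeout is set to 2,500 ms, the member will be suspected after 3,000 ms, which is the first check where the difference exceeds the timeout.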
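The arithmetic can be reproduced with a small helper. This is a sketch under simplifying assumptions: the checks are taken to run at exact multiples of 'interval' counted from the last timestamp received, and a member is suspected only when the difference is strictly larger than 'timeout'. The class and method names are made up for illustration.

```java
// Hypothetical helper; it is NOT part of JGroups.
final class SuspicionTiming {
    // Returns the time (in ms, counted from the last timestamp received)
    // of the first periodic check at which the elapsed time is strictly
    // larger than the timeout.
    static long firstSuspicionMs(long intervalMs, long timeoutMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long checkTime = intervalMs;
        // Skip every check at which the difference does not yet exceed the timeout.
        while (checkTime <= timeoutMs) {
            checkTime += intervalMs;
        }
        return checkTime;
    }
}
```

Under these assumptions, firstSuspicionMs(1000, 3000) gives 4000 and firstSuspicionMs(1000, 2500) gives 3000, matching the two cases above. Real detection times will drift around these values, since the checks are not actually aligned with the last heartbeat.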

timeout

Timeout (in ms) after which a node is suspected by the current node if neither a heartbeat nor data was received from it. Also see the 'interval' definition above.

ergonomics

Enables ergonomics: dynamically find the best values for properties at runtime.

level

Sets the logger level (see javadocs)

msg_counts_as_heartbeat

Treat messages received from members as heartbeats. Note that this means a value in a hashmap is updated every time a message passes up the stack through FD_ALL, which is costly. Default is false.

stats

Determines whether to collect statistics (and expose them via JMX). Default is true.

Recommendations

These recommendations result from personal experimentation.

Both the FD_SOCK and FD failure detection protocols, which rely on directly pinging a neighbor, have proved unreliable under certain platform-specific circumstances. If you plan to use them, test response times with your specific JVM and platform. They may work very well on certain platforms and fail on others. I would use FD_ALL as the central failure detection protocol, since its coverage seems to be the most generic.