JGroups Protocol FD ALL
Latest revision as of 03:07, 3 March 2016
External
- User Manual FD_ALL http://www.jgroups.org/manual/html/protlist.html#FD_ALL
Internal
- JGroups
Overview
FD_ALL is a multicast-based failure detection protocol.
Failure Detection
FD_ALL implements failure detection based on a simple heartbeat protocol. Every member periodically multicasts a heartbeat using the underlying transport, not a parallel mechanism. Every member also maintains a table of all members (minus itself). When data or a heartbeat from a peer is received, the member resets the timestamp for that peer to the current time. Periodically, the member checks for expired members and suspects those.
Once a member has been suspected, a SUSPECT_MESSAGE is sent up the stack.
The protocol has two asynchronous processing threads: one that periodically sends multicast heartbeats, and one that periodically analyzes the suspect table and, when that table is not empty, raises SUSPECT events up the stack.
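The timestamp table described above can be sketched as a toy model. This is not JGroups code: the class name `HeartbeatTable`, its methods, and the injectable `clock` are hypothetical names chosen for illustration, and the strict "greater than timeout" comparison is an assumption taken from the 'interval' description on this page.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional


class HeartbeatTable:
    """Toy model of the per-member timestamp table (illustrative, not JGroups code)."""

    def __init__(
        self,
        members: Iterable[str],
        self_id: str,
        timeout_ms: float,
        clock: Optional[Callable[[], float]] = None,
    ) -> None:
        # The clock returns milliseconds; it is injectable so the sketch can be tested.
        self._clock = clock or (lambda: time.monotonic() * 1000.0)
        self._timeout_ms = timeout_ms
        now = self._clock()
        # Track every member except ourselves, starting from "seen just now".
        self._last_seen: Dict[str, float] = {m: now for m in members if m != self_id}

    def on_received(self, sender: str) -> None:
        """Reset the sender's timestamp when a heartbeat (or data) arrives from it."""
        if sender in self._last_seen:
            self._last_seen[sender] = self._clock()

    def expired(self) -> List[str]:
        """Return the members whose last timestamp is older than the timeout."""
        now = self._clock()
        return sorted(
            m for m, ts in self._last_seen.items() if now - ts > self._timeout_ms
        )


# Usage with a fake clock: B is heard from at 2,000 ms, C is never heard from.
fake_now = [0.0]
table = HeartbeatTable(["A", "B", "C"], "A", 3000, clock=lambda: fake_now[0])
fake_now[0] = 2000.0
table.on_received("B")
fake_now[0] = 4000.0
print(table.expired())  # ['C']
```

In the real protocol the periodic check runs on its own thread; here `expired()` is simply called by hand so that the behavior is deterministic.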
Also See
JGroups Events - Failure Detection
Configuration
JGroups standalone
<FD_ALL interval="1000" timeout="3000"/>
interval
The period (in ms) at which a member multicasts a HEARTBEAT message to the cluster AND at which the timestamps recorded for the other members are checked. In the example above, each member sends a heartbeat and checks the timestamps recorded for the other members every second. If at any check the time elapsed since the last timestamp recorded for a member is larger than 'timeout', that member is suspected.
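So, for the values defined above, if nothing is received from a remote member after heartbeat H1, that member will be suspected after timeout + interval, that is 3 * interval + interval (~4,000 ms): the elapsed time must be strictly larger than 'timeout', and it is only evaluated once per interval. If the timeout is set to 2,500 ms, the member will be suspected after 3,000 ms, at the first check where the elapsed time exceeds the timeout.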
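The arithmetic above can be checked with a small sketch. It assumes that the periodic checks are aligned with the moment the last heartbeat was received (time 0) and that a member is suspected only when the elapsed time is strictly larger than the timeout; the function name `first_suspecting_check` is hypothetical.

```python
def first_suspecting_check(interval_ms: int, timeout_ms: int) -> int:
    """Return the time (ms after the last heartbeat) of the first periodic
    check at which the elapsed time is strictly larger than the timeout."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    elapsed = interval_ms
    # Checks happen once per interval; skip those that do not exceed the timeout.
    while elapsed <= timeout_ms:
        elapsed += interval_ms
    return elapsed


print(first_suspecting_check(1000, 3000))  # 4000
print(first_suspecting_check(1000, 2500))  # 3000
```

In a running cluster the checks are not aligned with heartbeat arrival, so the actual delay can be shorter than these figures; treat them as the upper end of the range under the stated assumptions.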
timeout
Timeout (in ms) after which a node is suspected by the current node if neither a heartbeat nor data was received from it. Also see the 'interval' definition above.
ergonomics
Enables ergonomics: dynamically find the best values for properties at runtime.
level
Sets the logger level (see javadocs).
msg_counts_as_heartbeat
Treat messages received from members as heartbeats. Note that this means a value in a hashmap is updated every time a message passes up the stack through FD_ALL, which is costly. Default is false.
stats
Determines whether to collect statistics (and expose them via JMX). Default is true.
Recommendations
The following results from personal experimentation.
Both the FD_SOCK and FD failure detection protocols, which rely on directly pinging a neighbor, have proved unreliable under certain platform-specific circumstances. If you plan to use them, test response times with your specific JVM and platform. They may work very well on certain platforms and fail on others. I would use FD_ALL as the central failure detection protocol; its coverage seems to be the most generic.