JGroups Protocol FD ALL

From NovaOrdis Knowledge Base
=External=
* User Manual FD_ALL http://www.jgroups.org/manual/html/protlist.html#d0e5626

Revision as of 22:42, 1 March 2016

!!!Internal


|[VERIFY_SUSPECT] |[FD] |[FD_SOCK]


!!!Overview

Implements failure detection based on a simple heartbeat protocol. Every member periodically multicasts a heartbeat using the underlying transport, not a parallel mechanism. Every member also maintains a table of all members (minus itself). When data or a heartbeat from a peer is received, the member resets the timestamp for that peer to the current time. Periodically, the member checks for expired members, and suspects those.

Once a member has been suspected, a SUSPECT event is sent up the stack.

The protocol has two asynchronous processing threads: one that periodically sends multicast heartbeats and one that periodically analyzes the suspect table and raises SUSPECT events up the stack, when the suspect table is not empty.
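
The bookkeeping described above can be sketched in a few lines of Java. This is a simplified model and not the JGroups implementation: the class name HeartbeatTable and the methods touch() and findSuspects() are hypothetical, members are identified by plain strings, and the strict "larger than timeout" comparison is taken from the 'interval' description on this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the timestamp table; not JGroups code.
final class HeartbeatTable {
    // Last time (ms) a heartbeat or data message was seen from each peer.
    private final Map<String, Long> lastSeen = new HashMap<>();
    private final long timeoutMs;

    HeartbeatTable(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Called when a peer joins, or when a heartbeat (or data) arrives from it.
    void touch(String peer, long nowMs) {
        lastSeen.put(peer, nowMs);
    }

    // Called by the periodic checker: returns the peers that have been
    // silent for more than timeoutMs.
    List<String> findSuspects(long nowMs) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                suspects.add(entry.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        HeartbeatTable table = new HeartbeatTable(3000);
        table.touch("B", 0);
        table.touch("C", 0);
        table.touch("C", 2000); // C sends another heartbeat, B stays silent
        System.out.println(table.findSuspects(3000)); // no peer exceeds the timeout yet
        System.out.println(table.findSuspects(4000)); // B has been silent for 4000 ms
    }
}
```

In the real protocol the sender and the checker run on separate threads, so the table would need to be safe for concurrent access; the sketch leaves that out to keep the logic visible.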


!!!Configuration Sample

{{{
<FD_ALL interval="1000" timeout="3000"/>
}}}

!!!Configuration

!!interval

The period (in ms) with which a member sends the HEARTBEAT message to the cluster __AND__ with which the timestamps recorded for the other members are checked. In the example above, each member sends a heartbeat and checks the recorded timestamps every second. If, at check time, the difference between the current time and the timestamp last recorded for a member is larger than 'timeout', that member is suspected.

So, for the values defined above, if a member stops sending heartbeats, it is suspected at the first check that finds more than 'timeout' ms elapsed, here 3 * interval + interval (~4,000 ms). If the timeout is set to 2,500 ms, the member is suspected after 3,000 ms.
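
The arithmetic above can be checked with a small Java sketch. It assumes that checks run exactly every 'interval' ms starting from the moment of the last received heartbeat, and that a member is suspected only when the elapsed time is strictly larger than 'timeout'. The class FdAllTiming and the method detectionTimeMs() are hypothetical names used for illustration.

```java
// Hypothetical helper that reproduces the timing arithmetic; not JGroups code.
final class FdAllTiming {
    // Returns the elapsed time (ms) at the first periodic check that finds
    // more than timeoutMs since the last heartbeat.
    static long detectionTimeMs(long intervalMs, long timeoutMs) {
        if (intervalMs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long elapsed = 0;
        do {
            elapsed += intervalMs;       // next periodic check
        } while (elapsed <= timeoutMs);  // not yet "larger than timeout"
        return elapsed;
    }

    public static void main(String[] args) {
        System.out.println(detectionTimeMs(1000, 3000)); // 4000
        System.out.println(detectionTimeMs(1000, 2500)); // 3000
    }
}
```

On a live cluster the checks are not aligned with the last heartbeat, so the actual detection time falls somewhere between 'timeout' and 'timeout' + 'interval'.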


!!timeout


Timeout (in ms) after which a node is suspected by the current node if neither a heartbeat nor data was received from it. Also see the 'interval' definition above.


!!ergonomics

Enables ergonomics: dynamically finds the best values for properties at runtime.


!!level

Sets the logger level (see javadocs)

!!msg_counts_as_heartbeat

Treat messages received from members as heartbeats. Note that this means a value in a hashmap is updated every time a message passes up the stack through FD_ALL, which is costly. Default is false.

!!stats

Determines whether to collect statistics (and expose them via JMX). Default is true.

!!!Recommendations

The following is based on personal experimentation.

Both the [FD_SOCK] and [FD] failure detection protocols, which rely on directly pinging a neighbor, have proved unreliable under certain platform-specific circumstances. If you plan to use them, test response times with your specific JVM and platform. They may work very well on certain platforms and fail on others. I would use FD_ALL as the central failure detection protocol, because its coverage seems to be the most generic.

