JGroups Protocol FD
External
- User Manual FD http://www.jgroups.org/manual/index.html#FD
- JGroups Wiki http://community.jboss.org/wiki/JGroupsFD
- FD versus FD_SOCK http://community.jboss.org/wiki/FDVersusFDSOCK
Internal
Overview
FD is a failure detection protocol based on sending regular messages to a neighbor and waiting for replies.
Failure Detection
After the first missing heartbeat response, the initiating member sends max_tries more heartbeat messages, and the target member is declared suspect only after all of them go unanswered. The SUSPECT message is then sent down the stack, addressed to all members.
This is how the log reflects a missing heartbeat:
2015-06-17 15:26:21,588 DEBUG [org.jgroups.protocols.FD] (54er-4,shared=udp) server09/SchemaRepository: heartbeat missing from server26/SchemaRepository (number=1)
The number of attempts is logged as: (number=1), (number=2), ...
This is how the log reflects "suspicion":
2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-3,shared=udp) server26/sr: received no heartbeat from server24/sr for 5 times (300000 milliseconds), suspecting it
2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-2,shared=udp) server26/sr: broadcasting SUSPECT message (suspects=[server24/sr])
Note that the "received no heartbeat from [...] for [...] times ([...] milliseconds), suspecting it" log entry reports the exact amount of time the node waited to declare suspicion.
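When scanning large server logs, it can help to pull the suspected member, the number of tries and the wait time out of the suspicion entry. The following is a minimal sketch, not part of JGroups: the class name FdSuspicionLogParser and its nested Suspicion type are hypothetical, and the regular expression assumes the message wording matches the sample above, which may differ between JGroups versions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a JGroups class): extracts the suspicion details
// from an FD log line shaped like the sample in this page.
final class FdSuspicionLogParser {

    // Simple value holder for the parsed fields.
    static final class Suspicion {
        final String suspectedMember;
        final int tries;
        final long waitedMillis;

        Suspicion(String suspectedMember, int tries, long waitedMillis) {
            this.suspectedMember = suspectedMember;
            this.tries = tries;
            this.waitedMillis = waitedMillis;
        }
    }

    // Assumption: the wording is exactly as in the sample log entry.
    private static final Pattern SUSPICION = Pattern.compile(
        "received no heartbeat from (\\S+) for (\\d+) times \\((\\d+) milliseconds\\), suspecting it");

    // Returns the parsed suspicion, or empty if the line is not a suspicion entry.
    static Optional<Suspicion> parse(String line) {
        Matcher m = SUSPICION.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new Suspicion(
            m.group(1),
            Integer.parseInt(m.group(2)),
            Long.parseLong(m.group(3))));
    }

    public static void main(String[] args) {
        String line = "2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-3,shared=udp) "
            + "server26/sr: received no heartbeat from server24/sr for 5 times "
            + "(300000 milliseconds), suspecting it";
        parse(line).ifPresent(s -> System.out.println(
            s.suspectedMember + " suspected after " + s.tries + " tries, "
            + s.waitedMillis + " ms"));
    }
}
```

Dividing the reported milliseconds by the number of tries gives the interval per try that the node actually used, which is a quick way to check the log against the configured timeout.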
In the worst case, when the target member dies immediately after answering a heartbeat, the failure takes timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout milliseconds to detect.
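The worst-case formula can be evaluated for a given configuration with a few lines of Java. This is only an arithmetic sketch of the formula stated above; the class and method names (FdWorstCase, worstCaseDetectionMillis) are made up for illustration and are not part of JGroups.

```java
// Hypothetical helper (not a JGroups class): evaluates the worst-case
// detection time (max_tries + 2) * timeout from the paragraph above.
final class FdWorstCase {

    static long worstCaseDetectionMillis(long timeoutMillis, int maxTries) {
        if (timeoutMillis <= 0 || maxTries < 0) {
            throw new IllegalArgumentException("timeout must be > 0 and max_tries >= 0");
        }
        // timeout + timeout + max_tries * timeout
        return Math.multiplyExact(timeoutMillis, (long) maxTries + 2);
    }

    public static void main(String[] args) {
        // timeout=6000, max_tries=5
        System.out.println(worstCaseDetectionMillis(6000, 5));   // prints 42000
        // timeout=60000, max_tries=5
        System.out.println(worstCaseDetectionMillis(60000, 5));  // prints 420000
    }
}
```

With timeout="6000" and max_tries="5" the worst case is 42 seconds; with a timeout of 60000 it grows to 7 minutes, so the timeout value dominates how quickly a dead member is noticed.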
Once a member is suspected, the suspicion is verified by the VERIFY_SUSPECT protocol; if the member is indeed unavailable, GMS excludes it from the group.
With FD_SOCK, no heartbeats are sent. Instead, members establish TCP sockets and a member is declared dead only when a socket is closed.
Also See
Recommendations
Configuration
JGroups standalone:
<FD timeout="6000" max_tries="5"/>
WildFly:
<subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="udp">
    <stack name="udp">
        ...
        <protocol type="FD">
            <property name="timeout">60000</property>
        </protocol>
        ...
    </stack>
</subsystem>
max_tries
The default max_tries is 5 on EAP.
timeout
The timeout is 6000 ms on EAP.