JGroups Protocol FD

From NovaOrdis Knowledge Base

Latest revision as of 03:08, 3 March 2016

Overview

FD is a failure detection protocol based on sending regular messages to a neighbor and waiting for replies.

Failure Detection

FD implements a failure detection mechanism based on heartbeat messages. A member sends 'are-you-alive' messages every 'timeout' milliseconds. These are "FD" messages containing a HEARTBEAT header. The messages are sent to the neighbor to the right, identified by its position in the view. For an [A, B, C] view, A sends heartbeat messages to B, B sends heartbeat messages to C, and C sends heartbeat messages to A.
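The neighbor selection described above can be sketched as a ring lookup over the view. This is an illustrative model only, not JGroups code: the function name heartbeat_target and the representation of a view as a plain list are assumptions made for the example.

```python
def heartbeat_target(view, member):
    """Return the neighbor to the right of `member` in `view`.

    `view` is an ordered list of member names (an assumption for this
    sketch). The last member wraps around to the first one. A member that
    is alone in the view has nobody to monitor, so None is returned.
    """
    if member not in view:
        raise ValueError(f"{member!r} is not in the view")
    if len(view) < 2:
        return None
    index = view.index(member)
    return view[(index + 1) % len(view)]


if __name__ == "__main__":
    view = ["A", "B", "C"]
    for member in view:
        print(member, "sends heartbeats to", heartbeat_target(view, member))
```

For the [A, B, C] view this pairs A with B, B with C, and C with A, matching the example in the text.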

When the neighbor receives the HEARTBEAT, it replies with an "FD" message containing a HEARTBEAT_ACK header. When the HEARTBEAT_ACK is received, the sender sets its timestamp to the current time and resets its counter of missed heartbeats to 0.

After the first missing heartbeat response, the initiating member sends 'max_tries' more heartbeat messages, and the target member is declared suspect only after all of them go unanswered. The SUSPECT message is sent down the stack and is addressed to all members.
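The counting rule in the paragraph above can be modeled with a small state machine. This is a simplified sketch that follows the prose of this page (one initial miss plus 'max_tries' further misses), not the actual JGroups implementation, whose exact counting may differ between versions. The class name HeartbeatMonitor and its method names are hypothetical.

```python
class HeartbeatMonitor:
    """Simplified model of the missed-heartbeat counter (hypothetical names)."""

    def __init__(self, max_tries=5):
        self.max_tries = max_tries
        self.missed = 0  # consecutive heartbeats that went unanswered

    def on_ack(self):
        """A HEARTBEAT_ACK arrived: reset the counter to 0."""
        self.missed = 0

    def on_missed_heartbeat(self):
        """A heartbeat went unanswered. Return True if the target is now suspect.

        Following the text, the target becomes suspect only after the first
        miss plus `max_tries` further misses, i.e. more than `max_tries`
        consecutive misses in total.
        """
        self.missed += 1
        return self.missed > self.max_tries


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_tries=5)
    for attempt in range(1, 7):
        print("miss", attempt, "suspect:", monitor.on_missed_heartbeat())
```

In this model, with max_tries set to 5, the sixth consecutive miss is the one that marks the target as suspect, and any acknowledgment received before that point clears the count.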

This is how the log reflects a heartbeat missing:

2015-06-17 15:26:21,588 DEBUG [org.jgroups.protocols.FD] (54er-4,shared=udp) server09/SchemaRepository: heartbeat missing from server26/SchemaRepository (number=1)

The number of attempts is logged as: (number=1), (number=2), ...

This is how the log reflects "suspicion":

2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-3,shared=udp) server26/sr: received no heartbeat from server24/sr for 5 times (300000 milliseconds), suspecting it
2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-2,shared=udp) server26/sr: broadcasting SUSPECT message (suspects=[server24/sr])

Note that the "received no heartbeat from [...] for [...] times ([...] milliseconds), suspecting it" log entry reports the exact amount of time the node waited before declaring suspicion.
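When scanning large logs, the try count and the waited time can be pulled out of the suspicion entry with a regular expression. The sketch below is written against the log format shown on this page only; the names SUSPECT_PATTERN and parse_suspicion are hypothetical, and other JGroups versions may word the message differently.

```python
import re

# Pattern derived from the sample log line on this page (an assumption:
# other versions or log layouts may differ).
SUSPECT_PATTERN = re.compile(
    r"(?P<observer>\S+): received no heartbeat from (?P<target>\S+) "
    r"for (?P<tries>\d+) times \((?P<waited_ms>\d+) milliseconds\), suspecting it"
)


def parse_suspicion(line):
    """Return a dict describing a suspicion log entry, or None if no match."""
    match = SUSPECT_PATTERN.search(line)
    if match is None:
        return None
    return {
        "observer": match.group("observer"),
        "target": match.group("target"),
        "tries": int(match.group("tries")),
        "waited_ms": int(match.group("waited_ms")),
    }


if __name__ == "__main__":
    sample = (
        "2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] "
        "(54er-3,shared=udp) server26/sr: received no heartbeat from "
        "server24/sr for 5 times (300000 milliseconds), suspecting it"
    )
    print(parse_suspicion(sample))
```

For the sample entry above, the parsed values are 5 tries and 300000 milliseconds, with server26/sr as the observer and server24/sr as the suspected member.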

In the worst case, when the target member dies immediately after answering a heartbeat, the failure takes timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout milliseconds to detect.
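As a quick check of the worst-case formula, the arithmetic can be written out as a helper. The function name worst_case_detection_ms is hypothetical; the formula itself is the one stated above.

```python
def worst_case_detection_ms(timeout_ms, max_tries):
    """Worst-case failure detection time, per the formula on this page.

    timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout
    """
    return (max_tries + 2) * timeout_ms


if __name__ == "__main__":
    # Values taken from the configuration samples on this page.
    print(worst_case_detection_ms(6000, 5))   # 42000
    print(worst_case_detection_ms(60000, 5))  # 420000
```

With timeout="6000" and max_tries="5", the worst case is 42000 milliseconds (42 seconds). With a timeout of 60000 and the same max_tries, it grows to 420000 milliseconds (7 minutes).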

Once a member is suspected, the suspicion is verified by the VERIFY_SUSPECT protocol and, if the member is indeed unavailable, it is excluded by GMS.

If we use FD_SOCK instead, then we don't send heartbeats, but establish TCP sockets and declare a member dead only when a socket is closed.

Also See

JGroups Events - Failure Detection

Recommendations

FD_ALL Recommendations

Configuration

JGroups standalone:

    <FD timeout="6000" max_tries="5"/>

WildFly:

<subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="udp">
   <stack name="udp">
      ...
      <protocol type="FD">
         <property name="timeout">60000</property>
      </protocol>
      ...
   </stack>
</subsystem>

max_tries

The default max_tries is 5 on EAP.

timeout

The timeout is 6000 ms on EAP.