JGroups Protocol FD

From NovaOrdis Knowledge Base
!!!External


* User Manual FD http://www.jgroups.org/manual/index.html#FD
* JGroups Wiki http://community.jboss.org/wiki/JGroupsFD
* FD versus FD_SOCK http://community.jboss.org/wiki/FDVersusFDSOCK

Revision as of 22:08, 1 March 2016

!!!Internal


|[JGroups] |[VERIFY_SUSPECT] |[FD_SOCK] |[FD_ALL#Recommendations]

!!!Overview

Failure detection based on heartbeat messages.

A member sends 'are-you-alive' messages every 'timeout' milliseconds. These are "FD" messages carrying a HEARTBEAT header.

The messages are sent to the neighbor to the sender's right, as determined by its position in the view. For an [[A, B, C] view, A sends heartbeat messages to B, B sends them to C, and C wraps around and sends them to A.
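The neighbor selection described above can be sketched as follows. This is a minimal illustration, not the JGroups implementation: the class {{RingNeighbor}} and the method {{rightNeighbor}} are hypothetical names, and members are modeled as plain strings rather than JGroups addresses.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of JGroups: picks the heartbeat target
// as the member immediately to the right of 'self' in the view.
final class RingNeighbor {

    /** Returns the right-hand neighbor of 'self', wrapping around at the end
     *  of the view; returns null if 'self' is not in the view or is alone. */
    static String rightNeighbor(List<String> view, String self) {
        int idx = view.indexOf(self);
        if (idx < 0 || view.size() < 2) {
            return null;
        }
        return view.get((idx + 1) % view.size());
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("A", "B", "C");
        for (String member : view) {
            System.out.println(member + " sends heartbeats to " + rightNeighbor(view, member));
        }
    }
}
```

With the [[A, B, C] view, the sketch pairs A with B, B with C, and C with A.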

When the neighbor receives the HEARTBEAT, it replies with an "FD" message containing a HEARTBEAT_ACK header. When the HEARTBEAT_ACK is received, the timestamp is set to the current time and the counter is reset to 0.

!!!Failure Detection

After the first missing heartbeat response, the initiating member sends 'max_tries' more heartbeat messages, and the target member is declared suspect only after all of them go unanswered. The SUSPECT message is then sent down the stack, addressed to all members.
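The bookkeeping behind this (a counter of unanswered heartbeats that an ACK resets) can be sketched as below. It is an illustration under stated assumptions, not the FD source code: {{HeartbeatTracker}} and all its members are hypothetical names, and the number of misses that triggers suspicion is passed in as a parameter, since how it is derived from 'max_tries' may differ between JGroups versions.

```java
// Hypothetical sketch, not the JGroups FD class: tracks consecutive
// unanswered heartbeats for one neighbor.
final class HeartbeatTracker {
    private final int missesBeforeSuspect; // assumption: threshold derived from max_tries
    private int missed;                    // consecutive unanswered heartbeats
    private long lastAckMillis;            // time the last HEARTBEAT_ACK arrived

    HeartbeatTracker(int missesBeforeSuspect, long nowMillis) {
        this.missesBeforeSuspect = missesBeforeSuspect;
        this.lastAckMillis = nowMillis;
    }

    /** A HEARTBEAT_ACK arrived: refresh the timestamp and reset the counter. */
    void onAck(long nowMillis) {
        lastAckMillis = nowMillis;
        missed = 0;
    }

    /** A heartbeat went unanswered; returns true once the neighbor should be suspected. */
    boolean onMissedHeartbeat() {
        missed++;
        return missed >= missesBeforeSuspect;
    }

    int missed() { return missed; }

    long lastAckMillis() { return lastAckMillis; }

    public static void main(String[] args) {
        HeartbeatTracker tracker = new HeartbeatTracker(5, 0L);
        for (int i = 1; i <= 5; i++) {
            boolean suspect = tracker.onMissedHeartbeat();
            System.out.println("heartbeat missing (number=" + i + "), suspect=" + suspect);
        }
    }
}
```

A single ACK at any point clears the counter, so only an uninterrupted run of misses leads to suspicion.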

This is how the log reflects a missing heartbeat:

{{{ 2015-06-17 15:26:21,588 DEBUG [org.jgroups.protocols.FD] (54er-4,shared=udp) server09/SchemaRepository: heartbeat missing from server26/SchemaRepository (number=1) }}}

The number of attempts is logged as: {{(number=1)}}, {{(number=2)}}, ...

This is how the log reflects "suspicion":

{{{
2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-3,shared=udp) server26/sr: received no heartbeat from server24/sr for 5 times (300000 milliseconds), suspecting it
2015-06-17 15:31:06,727 DEBUG [org.jgroups.protocols.FD] (54er-2,shared=udp) server26/sr: broadcasting SUSPECT message (suspects=[server24/sr])
}}}

In the worst case, when the target member dies immediately after answering a heartbeat, the failure takes {{timeout + timeout + max_tries * timeout = (max_tries + 2) * timeout}} milliseconds to detect.
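As a worked example of the worst-case formula, the sketch below evaluates (max_tries + 2) * timeout for the two timeout values that appear in the configuration samples on this page, both with max_tries = 5. The class {{FdWorstCase}} and its method are hypothetical names used only for illustration.

```java
// Hypothetical helper: evaluates the worst-case detection time
// (max_tries + 2) * timeout stated on this page.
final class FdWorstCase {

    /** Worst-case time, in milliseconds, before a dead neighbor is suspected. */
    static long worstCaseMillis(long timeoutMillis, int maxTries) {
        return (maxTries + 2L) * timeoutMillis;
    }

    public static void main(String[] args) {
        // timeout="6000", max_tries="5"  -> 42000 ms (42 seconds)
        System.out.println(worstCaseMillis(6000L, 5));
        // timeout=60000, max_tries=5     -> 420000 ms (7 minutes)
        System.out.println(worstCaseMillis(60000L, 5));
    }
}
```

Raising the timeout from 6000 ms to 60000 ms therefore stretches the worst-case detection time from 42 seconds to 7 minutes.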

Once a member is declared suspect, it is excluded by GMS, subject to verification by the [VERIFY_SUSPECT] layer.

If FD_SOCK is used instead, no heartbeats are sent: members establish TCP sockets and declare a member dead only when a socket is closed.

!!!Configuration Sample

Plain:

{{{

   <FD timeout="6000" max_tries="5"/>

}}}

EAP:

{{{

           <subsystem xmlns="urn:jboss:domain:jgroups:1.1" default-stack="udp">
               <stack name="udp">
                   ...
                   <protocol type="FD">
                       <property name="timeout">
                           60000
                       </property>
                   </protocol>
                   ...
               </stack>
           </subsystem>

}}}


!!!Default Values

The default max_tries is 5 on EAP.

The default timeout is 6000 ms on EAP.

!!!Recommendations

See [FD_ALL Recommendations|FD_ALL#Recommendations]

__Referenced by:__\\ [{INSERT com.ecyrd.jspwiki.plugin.ReferringPagesPlugin WHERE max=20, maxwidth=50}]