TCP
External
- http://www.ietf.org/rfc/rfc793.txt
- http://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#TCPCHARACTERISTICS
- http://www.frozentux.net/iptables-tutorial/iptables-tutorial.html#TCPCONNECTIONS
- TCP Timeout and Retransmission http://repo.hackerzvoice.net/depot_madchat/ebooks/TCP-IP_Illustrated/tcp_time.htm
- TCP/IP Illustrated, Volume 1 The Protocols W. Richard Stevens http://repo.hackerzvoice.net/depot_madchat/ebooks/TCP-IP_Illustrated/
Internal
Overview
TCP (Transmission Control Protocol) resides on top of the IP protocol. It is a stateful protocol, and its primary responsibility is to make sure the data is received properly by the other host. It does so by ensuring that the data is sent and received reliably, that it is transported correctly between the Internet layer and the Application layer, that the packets reach the proper program in the application layer, and that they do so in the right order.
TCP Life Cycle
TCP looks at data as a continuous data stream with a start and a stop signal.
Handshake
A new stream is opened with the TCP three-way handshake. It starts with one packet sent with the SYN bit set. The other end then answers with either SYN/ACK or RST to let the client know whether the connection was accepted or refused, respectively.
If the client receives a SYN/ACK packet, it once again replies, this time with an ACK packet. At this point, the whole connection is established and data can be sent. During this initial handshake, all of the specific options that will be used throughout the rest of the TCP connection are also negotiated, such as ECN, SACK, etc.
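From an application's point of view the handshake is hidden inside the socket API. The following is a minimal sketch, assuming a loopback listener on an OS-assigned port; the helper names echo_once and handshake_demo are made up for illustration. The client's connect call returns only after the SYN, SYN/ACK and ACK have been exchanged by the kernel.

```python
import socket
import threading


def echo_once(listener: socket.socket) -> None:
    """Accept one connection and echo back the first chunk received."""
    # accept() hands over a connection whose handshake the kernel has completed.
    conn, _addr = listener.accept()
    with conn:
        conn.sendall(conn.recv(1024))


def handshake_demo(payload: bytes = b"hello") -> bytes:
    """Open a loopback TCP connection, send payload, return the echo."""
    # Port 0 asks the OS for any free port (an assumption of this sketch).
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(listener,))
        worker.start()
        # create_connection() sends the SYN and returns once the
        # connection is established.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(payload)
            reply = client.recv(1024)
        worker.join()
        return reply
```

Calling handshake_demo(b"ping") should return the same bytes, which only works because the connection reached the established state first.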
ESTABLISHED Connection
While the data stream is alive, TCP ensures that the packets are actually received properly by the other end. This is done using the Sequence number in each packet. The Sequence number identifies the position of the packet's first data byte in the stream, so it advances with every packet that carries data. When the other end receives the packet, it sends an ACK packet back to the data sender, acknowledging that the data was received properly. The sequence number also sees to it that the data is inserted into the stream in the correct order.
An established connection that is smoothly exchanging data is said to be in sync.
This is what the exchange looks like:
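In RFC 793 the sequence number counts bytes, and the acknowledgment number names the next byte the receiver expects. The arithmetic can be sketched as follows; the function name expected_ack is hypothetical and the code only illustrates the numbering, not a real TCP stack.

```python
SEQ_SPACE = 2 ** 32  # sequence numbers are 32 bits wide and wrap around


def expected_ack(seq: int, payload_len: int,
                 syn: bool = False, fin: bool = False) -> int:
    """Return the acknowledgment number that covers one whole segment.

    The SYN and FIN flags each consume one sequence number in addition
    to the payload bytes.
    """
    return (seq + payload_len + int(syn) + int(fin)) % SEQ_SPACE
```

For example, a segment with sequence number 1000 carrying 500 bytes is acknowledged with 1500, and a bare SYN with sequence number 0 is acknowledged with 1.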
Duration of a TCP Connection. When one side of the TCP connection wants to close the connection, it sends a FIN packet. For more details see Closing the Connection below. Until that happens, both sides will leave their socket open indefinitely, regardless of whether there is application traffic or not. More interestingly, one side may simply die without having the time to signal anything, leaving the other side to think the connection is still alive. This situation is referred to as a stale connection. If the remaining side sends traffic and does not get acknowledgments, it will eventually figure out that something is wrong. However, TCP has a mechanism to detect this situation even in the absence of application traffic.
Another interesting situation arises for a TCP connection that does not send traffic and does not have keep-alive on: the network interface on one of the sides can be shut down, replaced with another one, configured with the same address, and restarted, while the TCP connection stays "valid", but only if there is no application traffic over it in the meantime. Once this theoretical sequence completes, data sent on one side will arrive at the other side.
TCP Keep-Alive: TCP has a way of determining whether the other end of a connection is still alive or not, even if there is no application layer traffic to rely on. This mechanism is called TCP Keep-Alive. TCP Keep-Alive configuration, normal operation and exceptional cases are described here:
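Keep-alive is off by default on most systems and is switched on per socket. A minimal sketch, assuming the standard SO_KEEPALIVE socket option; the per-socket tuning options TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT exist on Linux but not on every platform, so they are applied only when Python exposes them. The helper name and the timing values are illustrative, not recommendations.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keep-alive probes for one socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Platform-specific tuning: idle time before the first probe,
    # interval between probes, and probes sent before giving up.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With these example values, a dead peer on an otherwise idle connection would be suspected after roughly idle_s + interval_s * probes seconds.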
Closing the Connection
When the connection is to be closed, this is done by sending a FIN packet from either end-point. The other end then responds by sending an ACK packet. The FIN-sending end can then no longer send any data, but the other end-point can still finish sending data. Once the second end-point wishes to close the connection totally, it sends a FIN packet back to the originally closing end-point, and that end-point replies with an ACK packet. Once this whole procedure is done, the connection is torn down properly.
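The fact that each direction is closed separately is visible in the socket API as a half-close. The sketch below, with hypothetical helper names, has the client send its FIN via shutdown(SHUT_WR) and then still receive the server's answer; it assumes a loopback listener on an OS-assigned port.

```python
import socket
import threading


def upper_until_eof(listener: socket.socket) -> None:
    """Read until the peer's FIN arrives, then reply in upper case."""
    conn, _addr = listener.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:  # empty read: the peer has sent its FIN
                break
            chunks.append(data)
        # Our own direction is still open, so we can answer now.
        conn.sendall(b"".join(chunks).upper())


def half_close_demo(payload: bytes = b"bye") -> bytes:
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        worker = threading.Thread(target=upper_until_eof, args=(listener,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(payload)
            client.shutdown(socket.SHUT_WR)  # send FIN, keep receiving
            reply = b""
            while True:
                data = client.recv(1024)
                if not data:  # the server closed its direction too
                    break
                reply += data
        worker.join()
        return reply
```

The reply is delivered after the client's FIN, which is exactly the window in which the second end-point "can still finish sending data".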
TIME_WAIT State
By default, a connection is supposed to stay in the TIME_WAIT state for twice the MSL (Maximum Segment Lifetime). Its purpose is to make sure that stray packets arriving after a connection is closed do not confuse the TCP subsystem. With an MSL of 60 seconds, the TIME_WAIT timeout is 2 minutes; the actual value depends on the implementation.
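A practical consequence of TIME_WAIT is that a restarted server may fail to bind its port while old connections are still lingering. On Unix-like systems the usual remedy is the SO_REUSEADDR socket option. The sketch below shows both the 2 x MSL arithmetic and a listener created with that option; the function names are hypothetical.

```python
import socket


def time_wait_seconds(msl_seconds: int) -> int:
    """TIME_WAIT lasts twice the maximum segment lifetime."""
    return 2 * msl_seconds


def make_listener(port: int = 0) -> socket.socket:
    """Create a loopback listener that tolerates lingering TIME_WAIT entries."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() to have any effect.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))  # port 0 lets the OS choose
    sock.listen()
    return sock
```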
TCP Segment
A TCP packet is a single unit sent over the network, containing headers and a data portion. TCP packets are also called segments.
TCP Headers
Source Port
Bits 0 - 15. The source port of the packet. The source port was originally bound directly to a process on the sending system. Currently, a hash of the IP addresses and both the source and destination ports is used to identify the connection.
Destination Port
Bits 16 - 31. The destination port of the packet.
Sequence Number
Bits 32 - 63. The sequence number field numbers the data in each TCP packet so that the TCP stream can be properly sequenced. This is the mechanism used to ensure that the packets are processed in the correct order. The receiver acknowledges the data by returning, in the Acknowledgment Number field, the sequence number of the next byte it expects.
Acknowledgment Number
Bits 64 - 95. This field is used when we acknowledge data a host has received. For example, we receive a packet with a given Sequence number, and if everything is OK with the packet, we reply with an ACK packet whose Acknowledgment number is set to the original Sequence number plus the number of data bytes received, that is, the next sequence number we expect.
Data Offset
Bits 96 - 99. This field indicates how long the TCP header is, and where the Data part of the packet actually starts. It is set with 4 bits, and measures the TCP header in 32 bit words. The header should always end at an even 32 bit boundary, even with different options set. This is possible thanks to the Padding field at the very end of the TCP header.
CWR
Bit 104. This bit was added in RFC 3168 and is used by ECN. CWR stands for Congestion Window Reduced, and is used by the data-sending party to inform the receiving party that the congestion window has been reduced. When the congestion window is reduced, we send less data per time unit, to be able to cope with the total network load.
ECE
Bit 105. This bit was also added in RFC 3168 and is used by ECN. ECE stands for ECN Echo. It is used by the TCP/IP stack on the receiving host to let the sending host know that it has received a CE (Congestion Experienced) packet.
URG
Bit 106. This field tells us whether we should use the Urgent Pointer field or not. If set to 0, do not use the Urgent Pointer; if set to 1, do use it.
ACK
Bit 107. This bit indicates that the Acknowledgment Number field is valid, that is, that the packet acknowledges data we have received. An acknowledgment is sent to confirm that a packet actually arrived and contained no errors. If this bit is set, the original data sender checks the Acknowledgment Number to see which data has been acknowledged, and then drops it from its buffers.
PSH
Bit 108. The PUSH flag tells the TCP implementation on any intermediate hosts, and on the receiving host, to pass the data on to the actual user. This pushes all data through, regardless of how much of the TCP Window has been filled so far.
RST
Bit 109. The RESET flag is set to tell the other end to tear down the TCP connection. This is done in a couple of different scenarios, the main ones being that the connection has crashed for some reason, that the connection does not exist, or that the packet is wrong in some way.
SYN
Bit 110. The SYN (or Synchronize sequence numbers) is used during the initial establishment of a connection. It is set in two instances of the connection, the initial packet that opens the connection, and the reply SYN/ACK packet. It should never be used outside of those instances.
FIN
Bit 111. The FIN bit indicates that the host that sent it has no more data to send. When the other end sees the FIN bit, it replies with an ACK. Once this is done, the host that originally sent the FIN bit can no longer send any data. However, the other end can continue to send data until it is finished, and will then send a FIN packet back and wait for the final ACK, after which the connection moves to the CLOSED state.
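The eight flag bits described above (bits 104 to 111) occupy a single byte of the header, with CWR as the most significant bit and FIN as the least significant. A small decoding sketch follows; the name decode_flags is hypothetical.

```python
# Mask of each flag within header byte 13, most significant bit first.
FLAG_BITS = (
    ("CWR", 0x80), ("ECE", 0x40), ("URG", 0x20), ("ACK", 0x10),
    ("PSH", 0x08), ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01),
)


def decode_flags(flags_byte: int) -> list:
    """Return the names of the flags set in the TCP flags byte."""
    return [name for name, mask in FLAG_BITS if flags_byte & mask]
```

For instance, decode_flags(0x12) yields ["ACK", "SYN"], the combination sent in the second step of the handshake.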
Window
Bits 112 - 127. The Window field is used by the receiving host to tell the sender how much data the receiver permits at the moment. This is done by sending an ACK back, which contains the Sequence number that we want to acknowledge, and the Window field then contains the maximum accepted sequence numbers that the sending host can use before it receives the next ACK packet. The next ACK packet will update the accepted Window which the sender may use.
Checksum
Bits 128 - 143. This field contains the checksum of the TCP header and data. It is the one's complement of the one's complement sum of each 16-bit word in the header and data. If the segment does not end on a 16-bit boundary, it is padded with zero bits for the calculation. While the checksum is calculated, the checksum field itself is set to zero. The checksum also covers a 96-bit pseudo-header containing the source address, destination address, protocol, and TCP length. This protects against misrouted segments.
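The checksum calculation can be sketched in a few lines. This is an illustrative implementation for IPv4 only, with hypothetical function names; it assumes the caller passes the full segment (header plus data) with the checksum field already zeroed.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tcp_checksum(src_ip: str, dst_ip: str, segment: bytes) -> int:
    """Checksum over the 96-bit IPv4 pseudo-header plus the TCP segment."""
    pseudo_header = struct.pack(
        "!4s4sBBH",
        socket.inet_aton(src_ip),
        socket.inet_aton(dst_ip),
        0,                      # reserved, always zero
        socket.IPPROTO_TCP,     # protocol number 6
        len(segment),           # TCP length: header plus data
    )
    return ~ones_complement_sum(pseudo_header + segment) & 0xFFFF
```

A receiver can verify a segment by running the same calculation over the segment as received, checksum included: for an intact segment the result is 0.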
Urgent Pointer
Bits 144 - 159. This is a pointer that points to the end of the data which is considered urgent. If the connection has important data that should be processed as soon as possible by the receiving end, the sender can set the URG flag and set the Urgent pointer to indicate where the urgent data ends.
Options
Bits 160 - **. The Options field is a variable-length field and contains optional headers that we may want to use. Most options consist of 3 subfields: a one-byte kind that tells us which option is used, a one-byte length that tells us how long the option is, and then the actual option data. The End of Option List and No-Operation options consist of the kind byte only. A complete listing of all the TCP Options can be found in TCP options.
Padding
Bit **. The padding field pads the TCP header until the whole header ends at a 32-bit boundary. This ensures that the data part of the packet begins on a 32-bit boundary, and no data is lost in the packet. The padding always consists of only zeros.
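The field layout described above maps directly onto a fixed 20-byte structure followed by the options. The following sketch unpacks it with Python's struct module; the TcpHeader type and parse_tcp_header function are hypothetical names introduced for illustration, and the flags are left as a raw byte.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int
    ack: int
    header_len: int   # in bytes: Data Offset * 4
    flags: int        # raw flags byte (CWR .. FIN)
    window: int
    checksum: int
    urgent_ptr: int
    options: bytes    # options plus padding, possibly empty


def parse_tcp_header(segment: bytes) -> TcpHeader:
    """Split a raw TCP segment into its header fields."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src, dst, seq, ack, offset_byte, flags,
     window, checksum, urgent) = struct.unpack("!HHIIBBHHH", segment[:20])
    # Data Offset is the upper 4 bits, counted in 32-bit words.
    header_len = (offset_byte >> 4) * 4
    if header_len < 20 or header_len > len(segment):
        raise ValueError("invalid Data Offset")
    return TcpHeader(src, dst, seq, ack, header_len, flags,
                     window, checksum, urgent, segment[20:header_len])
```

The payload of a parsed segment is then simply segment[header.header_len:], which is why the Data Offset field and the padding to a 32-bit boundary matter.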
Retransmission
Retransmission Timeout (RTO)
A retransmission timeout (RTO) occurs when acknowledgments for sent data fail to arrive before the sender's retransmission timer expires. The sender stops sending new data, and after the timeout, usually at least one second, it cautiously starts sending again: one packet at first, then two packets, and so on. As a result, an RTO typically causes a delay of at least one second on the connection.
tcp_retries1 (R1)
This integer influences the time after which TCP decides that something is wrong due to unacknowledged RTO retransmissions, and reports this suspicion to the network layer. RFC 1122 recommends at least 3 retransmissions, which is the default.
tcp_retries2 (R2)
This integer influences the timeout of an alive TCP connection, when RTO retransmissions remain unacknowledged. Given a value of N, a hypothetical TCP connection following exponential backoff with an initial RTO of TCP_RTO_MIN would retransmit N times before killing the connection at the (N+1)th RTO.
The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout. RFC 1122 recommends at least 100 seconds for the timeout, which corresponds to a value of at least 8.
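The figure of 924.6 seconds can be reproduced with a short calculation. The sketch below assumes the Linux values TCP_RTO_MIN = 200 ms and TCP_RTO_MAX = 120 s, which are not stated in the text above, and sums the N+1 exponentially growing RTOs in integer milliseconds. The function name is hypothetical.

```python
TCP_RTO_MIN_MS = 200       # assumed initial RTO (Linux TCP_RTO_MIN)
TCP_RTO_MAX_MS = 120_000   # assumed cap on the RTO (Linux TCP_RTO_MAX)


def hypothetical_timeout_ms(retries: int) -> int:
    """Sum of the first retries + 1 RTOs under exponential backoff."""
    total = 0
    rto = TCP_RTO_MIN_MS
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, TCP_RTO_MAX_MS)  # double until the cap is hit
    return total
```

Under these assumptions a value of 15 gives 924600 ms, a value of 8 gives 102200 ms (just over the 100 seconds recommended by RFC 1122), and a value of 7 gives only 51000 ms.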
This is how RFC 1122 describes interaction between R1 and R2: Excessive retransmission of the same segment by TCP indicates some failure of the remote host or the Internet path. This failure may be of short or long duration. The following procedure is used to handle excessive retransmissions of data segments:
(a) There are two thresholds R1 and R2 measuring the amount of retransmission that has occurred for the same segment. R1 and R2 might be measured in time units or as a count of retransmissions.
(b) When the number of transmissions of the same segment reaches or exceeds threshold R1, pass negative advice to the IP layer, to trigger dead-gateway diagnosis.
(c) When the number of transmissions of the same segment reaches a threshold R2 greater than R1, close the connection.
(d) An application MUST be able to set the value for R2 for a particular connection.
(e) TCP SHOULD inform the application of the delivery problem, when R1 is reached and before R2.
In Linux, these are kernel parameters:
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
More about setting Linux Kernel Parameters [Linux Kernel#LinuxKernelVariables].
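On Linux the current values can also be read programmatically, since sysctl parameters under net.ipv4 are exposed as files below /proc/sys/net/ipv4. A minimal sketch, with a hypothetical helper name, that returns None when the file is missing (for example on a non-Linux system):

```python
from pathlib import Path
from typing import Optional


def read_tcp_sysctl(name: str, root: str = "/proc/sys/net/ipv4") -> Optional[int]:
    """Read an integer parameter such as tcp_retries2, or None if unavailable."""
    try:
        return int((Path(root) / name).read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a single integer.
        return None
```

For example, read_tcp_sysctl("tcp_retries2") returns the value currently in effect on the running kernel.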