Linux Network Troubleshooting
Internal
Overview
This page needs to be reviewed and re-organized.
Organizatorium
- TCP zero windows sent when application receive buffer is empty https://access.redhat.com/solutions/2480661
Network Monitoring
On each node, run the monitor.sh script: https://access.redhat.com/articles/1311173. The script records OS network statistics at a set interval, which allows monitoring changes over time and correlating them with packet capture data.
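If the Red Hat script is not at hand, a minimal sketch of the same idea is shown below; the statistics collected, the interface name (eno16780032) and the 10-second interval are assumptions, not the actual monitor.sh contents:

 # append a timestamped snapshot of OS network stats at a fixed interval
 while true; do
     echo "=== $(date +%Y%m%d-%H%M%S) ==="
     netstat -s
     cat /proc/net/dev
     ethtool -S eno16780032
     sleep 10
 done >> /tmp/netstats-$HOSTNAME.log 2>&1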
Network Driver Error Messages
grep vmxnet3 sos_commands/kernel/dmesg

[ 5.731844] VMware vmxnet3 virtual NIC driver - version 1.1.30.0-k-NAPI
[ 5.731858] vmxnet3 0000:0b:00.0: # of Tx queues : 4, # of Rx queues : 4
[ 5.737730] vmxnet3 0000:0b:00.0: irq 72 for MSI/MSI-X
[ 5.737786] vmxnet3 0000:0b:00.0: irq 73 for MSI/MSI-X
[ 5.737860] vmxnet3 0000:0b:00.0: irq 74 for MSI/MSI-X
[ 5.737891] vmxnet3 0000:0b:00.0: irq 75 for MSI/MSI-X
[ 5.737916] vmxnet3 0000:0b:00.0: irq 76 for MSI/MSI-X
[ 5.738367] vmxnet3 0000:0b:00.0 eth0: NIC Link is Up 10000 Mbps
[ 8.186233] vmxnet3 0000:0b:00.0 eno16780032: intr type 3, mode 0, 5 vectors allocated
[ 8.187854] vmxnet3 0000:0b:00.0 eno16780032: NIC Link is Up 10000 Mbps
Kernel Network Parameters
cat etc/sysctl.conf
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.all.log_martians=1
net.ipv4.conf.default.log_martians=1
net.core.wmem_max = 12582912
net.core.rmem_max = 26214400
net.ipv4.tcp_rmem = 10240 87380 26214400
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 5000
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv6.conf.all.disable_ipv6 = 1
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
fs.suid_dumpable = 0
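The file above comes from a sosreport. On a live system, the kernel's effective runtime values can be compared against the configured ones, for example (parameter names taken from the file above):

 sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem net.core.netdev_max_backlog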
Also see: Kernel Runtime Configuration.
Inspect packet loss
ethtool -S
awk '($NF !~ "^0$") {print}' sos_commands/networking/ethtool_-S_eno16780032 | egrep -v "[u,m,b]cast|LRO pkts rx|[LR,TS]O byte(s)?|[LR,TS]O pkts|pkts linearized"

NIC statistics:
  Tx Queue#: 1
  Tx Queue#: 2
  Tx Queue#: 3
  Rx Queue#: 1
    pkts rx OOB: 45
    drv dropped rx total: 29
      err: 29
  Rx Queue#: 2
  Rx Queue#: 3
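On a live system, the same counters can be snapshotted twice and diffed to see which ones are still incrementing; the interface name and the 30-second interval below are assumptions:

 ethtool -S eno16780032 > /tmp/ethtool-S.1
 sleep 30
 ethtool -S eno16780032 > /tmp/ethtool-S.2
 # any counter that changed between the snapshots is actively incrementing
 diff /tmp/ethtool-S.1 /tmp/ethtool-S.2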
RX Drops
- ifconfig packet drops reported after upgrading to RHEL7 https://access.redhat.com/solutions/2073223
proc/net/dev
cat proc/net/dev
Inter-| Receive                                                    | Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
    lo: 727497676 2498032 0 0 0 0 0 0 727497676 2498032 0 0 0 0 0 0
eno16780032: 216193874050 702019404 0 2658277 0 0 0 87265004 195315249141 549883330 0 0 0 0 0 0
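The RX drop counter for a single interface can be pulled out directly. "drop" is the fourth receive column, i.e. the fifth whitespace-separated field once the interface name is counted; the interface name below is an assumption:

 awk '$1 == "eno16780032:" { print "rx_dropped=" $5 }' proc/net/dev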
IP and TCP Diagnostics
Review the IP and TCP protocol handler statistics maintained by the OS.
Check whether IP fragmentation is occurring. Fragmentation is normal behaviour when an application sends a datagram that exceeds the MTU (typically 1500 bytes).
Check the number of reassembly failures caused by fragment loss. TCP divides data into MSS-sized segments, which should not require IP fragmentation, so any fragmentation observed is most likely caused by UDP traffic.
netstat -s
Ip:
    702329554 total packets received
    0 forwarded
    0 incoming packets discarded
    699283912 incoming packets delivered
    550941269 requests sent out
    16 dropped because of missing route
    5 fragments dropped after timeout
    4372810 reassemblies required
    1336065 packets reassembled ok
    7 packet reassembles failed
    618990 fragments received ok
    1856970 fragments created
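To isolate just the fragmentation and reassembly counters from the output above:

 netstat -s | egrep -i "fragment|reassembl"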
Check the rate of TCP retransmissions. A low rate is a sign that the network infrastructure is healthy. Packet loss or high latency in the environment, for whatever reason, is reflected in a high rate of TCP retransmissions.
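A rough retransmission rate can be derived from the cumulative Tcp counters; the exact counter wording varies slightly between net-tools versions, so the pattern below is kept loose:

 netstat -s | egrep -i "segments (send|sent) out|retransmi"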
Check for socket buffer overflows.
Check for listen queue overflows.
netstat -s | egrep "pruned|collapsed|overflowed"
    542 packets pruned from receive queue because of socket buffer overrun
    4 packets pruned from receive queue
    1197 packets collapsed in receive queue due to low socket buffer
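On a live system, the current accept queue depth of each listening socket can also be inspected directly; for listening TCP sockets, ss reports the current queue length in Recv-Q and its limit in Send-Q:

 ss -lnt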
Packet Capture and Analysis
To process:
tcpdump -s 0 -i eno16780032 -w /tmp/$HOSTNAME.pcap
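For long-running captures, a ring-buffer variant keeps disk usage bounded; the 100 MB per-file size and 10-file count below are assumptions (-C is the per-file size, -W the number of files to keep):

 tcpdump -s 0 -i eno16780032 -w /tmp/$HOSTNAME.pcap -C 100 -W 10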
TZ=EST5EDT tshark -t a -r test.pcap -Y "tcp.port == 46162" | tail -n 20
TZ=EST5EDT tshark -n -t ad -r test.pcap -Y "frame.number == 543875"
TZ=EST5EDT tshark -n -t ad -r test.pcap -T fields -e frame.number -e frame.time -e ip.src -e ip.dst -e tcp.stream -e col.Info -Y "frame.number == 543875"
TZ=EST5EDT tshark -n -t ad -r test.pcap -Y "tcp.stream == 20 && frame.number >= 543875 && data.len == 21"
TZ=EST5EDT tshark -n -t ad -r test.pcap -Y "frame.number >= 543875 && data.len == 21 && ip.src == 10.10.10.1 && ip.dst == 10.10.10.2" | wc -l
TZ=EST5EDT tshark -n -t ad -r test.pcap -Y "frame.number >= 543875 && data.len == 21 && ip.src == 10.10.10.2 && ip.dst == 10.10.10.1" | wc -l
TZ=EST5EDT tshark -n -t ad -r test.pcap -T fields -e tcp.stream -Y "frame.number >= 543875 && data.len == 21 && ip.src == 10.10.10.1 && ip.dst == 10.10.10.2" | sort -un | wc -l
The following two command lines show, for each TCP stream (second column), how many TCP packets with a data length of 21 bytes it carries (the count in the first column). The first command line shows the traffic from 10.10.10.1 to 10.10.10.2, and the second command line shows the opposite direction:
TZ=EST5EDT tshark -n -t ad -r test.pcap -T fields -e tcp.stream -Y "frame.number >= 543875 && data.len == 21 && ip.src == 10.10.10.1 && ip.dst == 10.10.10.2" | sort -n | uniq --count 2 20 3 21 2 80 25 86 25 87 2 5634 2 5636 2 8030 14 8065 14 8162
TZ=EST5EDT tshark -n -t ad -r test.pcap -T fields -e tcp.stream -Y "frame.number >= 543875 && data.len == 21 && ip.src ==10.10.10.2 && ip.dst == 10.10.10.1" | sort -n | uniq --count 3 21 2 80 25 86 25 87 2 88 2 5634 2 5636 2 8030 15 8065 14 8162