Linux Network Troubleshooting


Internal

Overview

This page needs to be reviewed and re-organized.

Organizatorium

Packet Capture and Analysis

tcpdump -s 0 -i eno16780032 -w /tmp/$HOSTNAME.pcap
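The capture can later be read back and filtered with tcpdump itself; the host and port in the filter below are arbitrary examples, not values taken from this environment:

tcpdump -nn -r /tmp/$HOSTNAME.pcap 'host 10.1.2.3 and tcp port 8080' | head -50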

Network Monitoring

On each node, run the monitor.sh script: https://access.redhat.com/articles/1311173. The script records OS network statistics at a set interval, which makes it possible to monitor changes over time and correlate them with packet capture data.
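If the Red Hat script is not available, a minimal sampling loop along the same lines could look like this; the interval, output file and the exact set of commands sampled are arbitrary choices, not taken from the referenced article:

while true; do
    date >> /tmp/$HOSTNAME-netstats.log
    cat /proc/net/dev >> /tmp/$HOSTNAME-netstats.log
    netstat -s >> /tmp/$HOSTNAME-netstats.log
    sleep 30
done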

Network Driver Error Messages

grep vmxnet3 sos_commands/kernel/dmesg 
[    5.731844] VMware vmxnet3 virtual NIC driver - version 1.1.30.0-k-NAPI
[    5.731858] vmxnet3 0000:0b:00.0: # of Tx queues : 4, # of Rx queues : 4
[    5.737730] vmxnet3 0000:0b:00.0: irq 72 for MSI/MSI-X
[    5.737786] vmxnet3 0000:0b:00.0: irq 73 for MSI/MSI-X
[    5.737860] vmxnet3 0000:0b:00.0: irq 74 for MSI/MSI-X
[    5.737891] vmxnet3 0000:0b:00.0: irq 75 for MSI/MSI-X
[    5.737916] vmxnet3 0000:0b:00.0: irq 76 for MSI/MSI-X
[    5.738367] vmxnet3 0000:0b:00.0 eth0: NIC Link is Up 10000 Mbps
[    8.186233] vmxnet3 0000:0b:00.0 eno16780032: intr type 3, mode 0, 5 vectors allocated
[    8.187854] vmxnet3 0000:0b:00.0 eno16780032: NIC Link is Up 10000 Mbps
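On a live system the same driver messages can be pulled directly from the kernel ring buffer; the error-oriented pattern below is only an illustration of what to look for:

dmesg | grep -i vmxnet3
dmesg | egrep -i 'vmxnet3.*(error|reset|fail)'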

Kernel Network Parameters

cat etc/sysctl.conf 
net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.conf.all.accept_redirects=0
net.ipv4.conf.default.accept_redirects=0
net.ipv4.conf.all.log_martians=1
net.ipv4.conf.default.log_martians=1
net.core.wmem_max = 12582912
net.core.rmem_max = 26214400
net.ipv4.tcp_rmem = 10240 87380 26214400
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 5000
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv6.conf.all.disable_ipv6 = 1
kernel.sysrq = 0
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
fs.suid_dumpable = 0
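The values currently in effect can be queried and adjusted at runtime with sysctl (query a single parameter, set one temporarily, reload /etc/sysctl.conf); the parameters below are simply examples taken from the list above:

sysctl net.ipv4.tcp_rmem
sysctl -w net.core.netdev_max_backlog=5000
sysctl -p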

Also see

Kernel Runtime Configuration

Inspect Packet Loss

ethtool -S <interface>
awk '($NF !~ "^0$") {print}' sos_commands/networking/ethtool_-S_eno16780032 | egrep -v "[u,m,b]cast|LRO pkts rx|[LR,TS]O byte(s)?|[LR,TS]O pkts|pkts linearized"
NIC statistics:
     Tx Queue#: 1
     Tx Queue#: 2
     Tx Queue#: 3
     Rx Queue#: 1
       pkts rx OOB: 45
       drv dropped rx total: 29
          err: 29
     Rx Queue#: 2
     Rx Queue#: 3
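On a live host the same counters can be sampled repeatedly to see whether the drop and error counters are still incrementing; the interface name and interval below are examples:

watch -d -n 5 'ethtool -S eno16780032 | egrep -i "drop|err|oob"'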

RX Drops

proc/net/dev

cat proc/net/dev

Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 727497676 2498032    0    0    0     0          0         0 727497676 2498032    0    0    0     0       0          0
eno16780032: 216193874050 702019404    0 2658277    0     0          0  87265004 195315249141 549883330    0    0    0     0       0          0
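The same per-interface RX drop counter is also reported by ip -s link; sampling it twice, a minute apart, shows whether the drops are ongoing (interface name and interval are examples):

ip -s link show eno16780032; sleep 60; ip -s link show eno16780032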

IP and TCP Diagnostics

Review the OS protocol handler statistics for IP and TCP.

Check whether IP fragmentation is occurring. Fragmentation is normal behaviour when an application sends a datagram that exceeds the MTU (typically 1500 bytes).

Check the number of reassembly failures caused by fragment loss. TCP divides data into MSS-sized segments, which should not require IP fragmentation, so any fragmentation observed is most likely caused by UDP traffic.

netstat -s
Ip:
   702329554 total packets received
   0 forwarded
   0 incoming packets discarded
   699283912 incoming packets delivered
   550941269 requests sent out
   16 dropped because of missing route
   5 fragments dropped after timeout
   4372810 reassemblies required
   1336065 packets reassembled ok
   7 packet reassembles failed
   618990 fragments received ok
   1856970 fragments created
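The fragmentation and reassembly counters can be isolated with a simple filter (the pattern is illustrative):

netstat -s | egrep -i "fragment|reassembl"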


Check the rate of TCP retransmissions. A low rate is a sign that the network infrastructure is healthy. Packet loss or high latency in the environment, for any reason, is reflected in a high rate of TCP retransmissions.
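One way to check the retransmission counters (counter names vary somewhat between kernel versions):

netstat -s | grep -i retrans
nstat -az TcpRetransSegs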

Check for socket buffer overflows.

Check for listen queue overflows.

netstat -s | egrep "pruned|collapsed|overflowed" 
   542 packets pruned from receive queue because of socket buffer overrun
   4 packets pruned from receive queue
   1197 packets collapsed in receive queue due to low socket buffer
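Listen queue overflows would appear in the same output as a line like "times the listen queue of a socket overflowed". The current accept queue usage of listening sockets can be inspected with ss; for sockets in LISTEN state, Recv-Q is the current accept queue length and Send-Q is the configured backlog limit:

netstat -s | grep -i listen
ss -lnt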