Full‑Stack Network Packet‑Loss Diagnosis: From ping to tcpdump

This comprehensive guide walks operations engineers through the full stack of network packet‑loss troubleshooting on Linux, covering symptom identification, layer‑by‑layer analysis, key metrics, step‑by‑step commands, common scenarios, advanced tuning techniques, monitoring alerts and FAQs.

Ops Community

Problem Background

Operations engineers often encounter packet loss causing SSH stalls, web timeouts, database replication delays, cross‑datacenter transfer slowdown, video RTP glitches, and K8s pod communication loss.

Root Causes Across Layers

Loss can occur at the physical layer (cables, NICs, switches), the data-link layer (CRC errors, VLAN tag mismatches), the network layer (routing black holes, MTU mismatches, TTL exhaustion), the transport layer (TCP congestion, a full conntrack table), and the application layer (Nginx backlog overflow, keepalive settings).

Tools

ping, traceroute, mtr, ss, netstat, sar, ip, ethtool, dmesg, tcpdump, dropwatch, conntrack, nstat and related utilities.

Core Concepts

Physical vs Logical Loss

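Physical loss shows up in NIC counters such as RX errors and TX errors, and in switch port statistics. RX dropped belongs here too, although it can also grow for non-physical reasons such as a full receive ring.

Logical loss occurs inside the protocol stack (e.g., IP InDiscards, TCP RetransSegs, conntrack drops) and is inspected through the /proc/net/* files or through tools that read them, such as nstat.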

Recommended Investigation Order

Application layer: check business logs, metrics and connection state.

Transport layer: examine TCP state, socket buffers and conntrack usage.

Network layer: review routing, MTU and Netfilter rules.

Data-link layer: view NIC statistics with ethtool and ip -s link.

Physical layer: inspect cables, optics and port LEDs.

Do not assume a layer; verify each step.

Key Metrics

RTT (Round‑Trip Time)

Packet‑loss rate

Retransmission rate (RetransSegs / OutSegs)

Out‑of‑order rate

Duplicate ACKs

Bandwidth utilization

Step‑by‑Step Diagnosis

Step 1 – Confirm Symptom and Scope

Determine the direction (upstream or downstream), whether the loss is intermittent or continuous, whether it affects one connection or all of them, and which source and destination IPs are involved. Use:

ping -c 100 -i 0.2 <target IP>

Note the loss percentage and the RTT statistics.
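To record the loss figure over time instead of reading it by eye, the ping summary line can be parsed. The function below is a minimal sketch that assumes the iputils summary format ("... 3% packet loss ..."). The name ping_loss is illustrative, not part of any tool.

```shell
# ping_loss: read ping output on stdin and print the loss percentage.
# Assumes the iputils summary line, e.g.
#   "100 packets transmitted, 97 received, 3% packet loss, time 19912ms"
ping_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage (not executed here):
#   ping -c 100 -i 0.2 "$target" | ping_loss
```

Running it from cron against a fixed target gives a simple loss history to compare with the time of user complaints.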

Step 2 – Locate Loss with mtr

Run:

mtr -r -c 100 <target IP>

or use TCP mode to bypass ICMP rate-limiting:

mtr --tcp --port 443 -r -c 100 <target IP>

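Read the Loss% column from top to bottom. Loss that appears at one intermediate hop but disappears at later hops is usually that router rate-limiting its ICMP replies, not real loss. Real loss shows up as a jump in Loss% that persists on every subsequent hop through to the destination.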
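The rule "loss only counts if it continues to the last hop" can be applied mechanically to an mtr report. The sketch below assumes the default report layout (hop marker, host, Loss% as the third column). Options that add columns, such as AS lookup, would shift the fields. The function name mtr_loss_origin is hypothetical.

```shell
# mtr_loss_origin: read `mtr -r` output on stdin and report the first hop
# from which loss continues on every later hop up to the destination.
mtr_loss_origin() {
    awk '
        $1 ~ /^[0-9]+\.\|--$/ {
            hop = $1; sub(/\..*/, "", hop)            # "4.|--" -> "4"
            loss = $3; sub(/%/, "", loss); loss += 0  # "5.0%" -> 5
            if (loss > 0) {
                if (origin == "") origin = "hop " hop " (" $2 ")"
            } else {
                origin = ""   # a later clean hop clears the suspicion
            }
            final = loss
        }
        END {
            if (origin != "") print "loss persists from " origin ", " final "% at destination"
            else print "no loss reaches the destination"
        }
    '
}

# Usage (not executed here):
#   mtr -r -c 100 "$target" | mtr_loss_origin
```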

Step 3 – TCP/UDP Statistics

Read protocol‑stack counters:

cat /proc/net/snmp
cat /proc/net/netstat

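or use nstat -az. Important fields include TcpRetransSegs, TcpOutSegs, UdpInDatagrams and UdpInErrors (nstat prints them without an underscore; in /proc/net/snmp they are the RetransSegs and OutSegs columns of the Tcp: rows).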
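The retransmission rate (RetransSegs / OutSegs) can be computed directly from /proc/net/snmp. The sketch below looks columns up by header name rather than by position. The counters are cumulative since boot, so the result is a long-run average; for a current rate, take two snapshots and compare the differences. The function name tcp_retrans_rate is illustrative.

```shell
# tcp_retrans_rate [FILE]: print RetransSegs/OutSegs as a percentage.
# FILE defaults to /proc/net/snmp; the first "Tcp:" row is the header,
# the second holds the values.
tcp_retrans_rate() {
    awk '
        $1 == "Tcp:" && !have_header {
            for (i = 2; i <= NF; i++) col[$i] = i
            have_header = 1
            next
        }
        $1 == "Tcp:" {
            out = $(col["OutSegs"]) + 0
            retrans = $(col["RetransSegs"]) + 0
            if (out > 0) printf "%.2f\n", 100 * retrans / out
        }
    ' "${1:-/proc/net/snmp}"
}
```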

Step 4 – NIC Statistics

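Show per-interface counters:

ip -s link show <iface>

Check link settings (Speed, Duplex) and driver-level counters:

ethtool <iface>
ethtool -S <iface>

Key items: RX errors, RX dropped, TX errors, Speed and Duplex. If RX drops grow because the receive ring overflows, check the hardware maximum with ethtool -g <iface> and enlarge the ring buffer if needed: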

ethtool -G <iface> rx 4096 tx 4096
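The value 4096 only works if the NIC supports it. As a rough sketch, the output of ethtool -g can be reduced to a current/maximum pair before changing anything. This assumes the usual two-section layout ("Pre-set maximums" followed by "Current hardware settings"). The function name ring_headroom is hypothetical.

```shell
# ring_headroom: read `ethtool -g <iface>` output on stdin and print
# current/maximum ring sizes for RX and TX.
ring_headroom() {
    awk '
        /^Pre-set maximums/          { section = "max" }
        /^Current hardware settings/ { section = "cur" }
        $1 == "RX:" { rx[section] = $2 }   # "RX Mini:" etc. have $1 == "RX" and are skipped
        $1 == "TX:" { tx[section] = $2 }
        END { printf "RX %s/%s TX %s/%s\n", rx["cur"], rx["max"], tx["cur"], tx["max"] }
    '
}

# Usage (not executed here):
#   ethtool -g eth0 | ring_headroom
```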

Step 5 – Socket State

List connections and detailed socket info:

ss -tan
ss -tan -i -e

Fields such as cwnd, rtt, retrans, lost reveal per‑socket health.

Step 6 – Conntrack Table

Check current usage and limits:

cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

Growing insert_failed or drop counters in conntrack -S, or "nf_conntrack: table full, dropping packet" messages in dmesg, indicate saturation. Increase the limit:

sysctl -w net.netfilter.nf_conntrack_max=1048576
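The two /proc values are easier to judge as a ratio. A minimal sketch follows; the function name conntrack_usage is illustrative and the result is truncated to a whole percent.

```shell
# conntrack_usage COUNT MAX: print table utilisation as an integer percentage.
conntrack_usage() {
    echo $(( $1 * 100 / $2 ))
}

# Usage (not executed here):
#   conntrack_usage "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

For example, 917504 entries against a limit of 1048576 gives 87, which leaves little room for a traffic burst.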

Step 7 – IP‑Layer Drops

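Inspect IP statistics:

grep '^Ip:' /proc/net/snmp

A growing InDiscards means the kernel discarded otherwise valid packets (for example, for lack of buffer space). A growing InHdrErrors means packets were rejected for header problems such as a bad checksum or an expired TTL.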

Step 8 – Netfilter / iptables

Show the rule counters:

iptables -L -nvx

A growing pkts value on a -j DROP or -j REJECT rule indicates intentional dropping.
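On hosts with long rule sets, the listing can be filtered down to the DROP and REJECT rules that have actually matched traffic. This sketch assumes the standard -L -nvx column order (pkts, bytes, target). The function name active_drop_rules is hypothetical.

```shell
# active_drop_rules: read `iptables -L -nvx` output on stdin and print
# "<chain> <target> <pkts>" for every DROP/REJECT rule with a non-zero counter.
active_drop_rules() {
    awk '
        $1 == "Chain" { chain = $2; next }
        ($3 == "DROP" || $3 == "REJECT") && $1 + 0 > 0 { print chain, $3, $1 }
    '
}

# Usage (not executed here):
#   iptables -L -nvx | active_drop_rules
```

Running it twice a few seconds apart and comparing the output shows which rules are dropping packets right now.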

Step 9 – Kernel Ring‑Buffer Drops

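Read the softnet statistics:

cat /proc/net/softnet_stat

Each row corresponds to one CPU and every value is hexadecimal. A growing column 2 (drop count) means the per-CPU input backlog queue overflowed. A growing column 3 (time_squeeze) means the softirq ran out of budget before draining the queue. For backlog overflow, raise the limit: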

sysctl -w net.core.netdev_max_backlog=65535
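Because /proc/net/softnet_stat prints hexadecimal values, drops are easy to misread. The sketch below converts the first three columns (processed, dropped, time_squeeze) to decimal, one line per row. Treating row N as CPU N is an assumption that holds when all CPUs are online. The function name softnet_drops is illustrative.

```shell
# softnet_drops [FILE]: decode the first three hex columns of softnet_stat.
# FILE defaults to /proc/net/softnet_stat.
softnet_drops() {
    cpu=0
    while read -r processed dropped squeezed _rest; do
        # $((0x...)) converts a hex string to decimal in POSIX shell arithmetic
        echo "cpu$cpu processed=$((0x$processed)) dropped=$((0x$dropped)) time_squeeze=$((0x$squeezed))"
        cpu=$((cpu + 1))
    done < "${1:-/proc/net/softnet_stat}"
}
```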

Step 10 – Packet Capture

Capture traffic for deeper analysis:

tcpdump -i eth0 -nn -s 0 -w client.pcap host <remote IP>
tcpdump -i eth0 -nn -s 0 -w server.pcap port <service port>

Open the pcap in Wireshark and filter for retransmissions, duplicate ACKs, out‑of‑order packets, or TCP flags.

Step 11 – dropwatch (optional)

Install and run:

yum install dropwatch -y
dropwatch -l kas
dropwatch> start

Output shows kernel functions where drops occur, e.g., tcp_v4_rcv or nf_conntrack_in.

Step 12 – perf / ftrace (advanced)

Trace skb free events to pinpoint exact drop locations:

perf record -e skb:kfree_skb -a -g sleep 10
perf script

Requires kernel debugging knowledge but yields precise call stacks.

Common Scenarios

UDP Loss

Check the UDP counters:

nstat -az | grep Udp

Typical causes: a full receive buffer (UdpRcvbufErrors), slow application reads, a firewall DROP rule, or IP fragmentation.
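Most UDP counters are traffic totals, so the error counters are the ones to watch. As a small sketch, the nstat output can be filtered to the Udp*Errors counters that are non-zero. The function name udp_errors is hypothetical.

```shell
# udp_errors: read `nstat -az` output on stdin and print every
# Udp...Errors counter whose value is greater than zero.
udp_errors() {
    awk '$1 ~ /^Udp/ && $1 ~ /Errors$/ && $2 + 0 > 0 { print $1 "=" $2 }'
}

# Usage (not executed here):
#   nstat -az | udp_errors
```

If UdpRcvbufErrors rises together with UdpInErrors, the receive buffer is the likely cause.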

TCP Loss

A high retransmission rate (RetransSegs/OutSegs above roughly 5%) often stems from network congestion, socket buffer limits, a shrinking receive window on the peer, or drops at a middlebox.

Cross‑Datacenter Loss

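Cross-datacenter links combine a long RTT with a high bandwidth-delay product (BDP), so a small amount of loss can cut throughput sharply under loss-based congestion control. Enabling BBR often improves bandwidth utilization: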

sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq

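Also raise the maximum values of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem so that a single connection's window can cover the BDP.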
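How large the buffers need to be follows from the bandwidth-delay product: BDP in bytes = bandwidth in bits per second × RTT in seconds ÷ 8. With bandwidth in Mbit/s and RTT in milliseconds this reduces to Mbit/s × ms × 125. A small helper under that assumption (the name bdp_bytes is illustrative):

```shell
# bdp_bytes BANDWIDTH_MBIT RTT_MS: print the bandwidth-delay product in bytes.
# 1 Mbit/s * 1 ms = 1,000,000 / 8 / 1000 = 125 bytes.
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}
```

For a 1 Gbit/s link with 80 ms RTT, bdp_bytes 1000 80 prints 10000000, so the maximum socket buffer should be on the order of 10 MB for one connection to fill the link.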

K8s Pod Network Loss

Issues arise from veth pairs, conntrack saturation, MTU mismatch (overlay encapsulation lowers the usable MTU), or long iptables rule chains. Inspect the veth counters with:

ip -s link show type veth

then adjust nf_conntrack_max or the MTU as needed.

DNS Resolution Loss

Debug with:

dig +trace example.com
dig @8.8.8.8 example.com

Failures are usually caused by a firewall blocking UDP port 53 or by rate limiting.

Advanced Topics

TCP BBR

Enable BBR (requires kernel 4.9+):

sysctl -w net.ipv4.tcp_congestion_control=bbr

BBR often provides higher throughput on high-latency links.

TCP Buffer Auto‑Tuning

The parameters net.ipv4.tcp_rmem and net.ipv4.tcp_wmem each take three values in bytes: min, default and max. Receive-buffer auto-tuning is on by default (net.ipv4.tcp_moderate_rcvbuf=1).

IRQ Affinity & RSS

View the multi-queue configuration:

ethtool -l <iface>

Adjust CPU masks via /proc/irq/*/smp_affinity, or enable the irqbalance service.

Timestamp & Latency Breakdown

RTT consists of serialization, propagation, processing and queueing delays. Use ping for total RTT and mtr for per‑hop latency.

MTU & Fragmentation

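Detect the path MTU:

tracepath <target IP>

Or test with a non-fragmenting ping:

ping -M do -s 1472 <target IP>

The 1472-byte payload plus 8 bytes of ICMP header and 20 bytes of IPv4 header makes a 1500-byte packet. If this ping fails while a smaller one succeeds, the path MTU is below 1500 bytes.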
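When the 1472-byte probe fails, the actual path MTU can be narrowed down with a binary search over the payload size. The following is a sketch, not a replacement for tracepath. It assumes IPv4 without options (28 bytes of headers), that the smallest size tried gets through, and that a target variable named TARGET is set; TARGET, probe and find_path_mtu are illustrative names.

```shell
# probe SIZE: succeed if one don't-fragment ping with SIZE payload bytes gets a reply.
# TARGET is a hypothetical variable holding the destination address.
probe() {
    ping -M do -c 1 -W 1 -s "$1" "$TARGET" >/dev/null 2>&1
}

# find_path_mtu LOW HIGH: binary-search the largest payload in [LOW, HIGH]
# that passes, then print payload + 28 (20-byte IPv4 + 8-byte ICMP header).
# Assumes LOW itself passes.
find_path_mtu() {
    lo=$1 hi=$2
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo $(( lo + 28 ))
}

# Usage (not executed here):
#   TARGET=192.0.2.10 find_path_mtu 500 1472
```

A single lost probe is treated as "too big", so on a lossy path the result can come out too low. Repeat the run, or raise the ping count in probe, before changing any interface MTU.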

Monitoring & Alerting

Prometheus alerts on NIC drops, TCP retransmission rate, conntrack usage and softnet drops can watch the key indicators listed above continuously.

FAQ

Q1: ping shows no loss but TCP times out – ICMP may be allowed while TCP is filtered or the conntrack table is full.

Q2: Is a 1 % loss rate serious? It depends on the workload: real‑time media is sensitive, bulk HTTP less so, and financial transactions require near‑zero loss.

Q3: Different loss reports on each side – caused by capture location, middle‑box buffering, or asymmetric routing.

Q4: iperf OK but application slow – indicates application‑layer bottlenecks, TLS handshake overhead, or many small requests.

Q5: Distinguish physical vs network loss – check NIC error counters for physical loss; check /proc/net/snmp InDiscards for network‑layer loss.

Summary

Packet‑loss troubleshooting proceeds from application symptoms down through transport, network, data‑link and physical layers, using a defined toolchain (ping, mtr, ss, ethtool, conntrack, tcpdump, dropwatch, perf). Key metrics, common scenarios, advanced tuning (BBR, buffers, IRQ) and monitoring alerts are provided to locate and remediate loss efficiently.

