Why a Healthy Frontend Still Returns 504 Errors: An MTU Mismatch Case Study

A production incident showed that despite flawless frontend health metrics and no logged errors, a subset of users experienced 504 Gateway Timeout errors caused by an MTU mismatch in the network path, highlighting the need for end‑to‑end connectivity checks beyond application monitoring.

Xiao Liu Lab

Even when services appear healthy, with no error logs and no alerts, some users may still be unable to load pages. This article recounts a real-world production incident in which a SaaS web console (HTTPS + Nginx + backend API) served about 95% of traffic normally, while about 5% of users, concentrated in one ISP region, repeatedly received 504 Gateway Timeout errors.

Fault Background

Architecture: User → CDN → Alibaba Cloud SLB (public) → Nginx cluster → application services. All monitoring metrics (CPU, memory, request count, error rate) were normal, and the Nginx access logs contained no entries for the failing requests. The requests never reached the Nginx servers, so the problem had to lie before the SLB or somewhere on the network path.

Phase 1 – Eliminate Application Layer (L7)

A local curl -v https://console.example.com/login succeeded in 300 ms. The same test from cloud instances in Beijing and Shanghai also succeeded, but on a physical machine in a Shenzhen IDC it hung and timed out after 10 seconds. The problem was therefore tied to the client's network environment rather than to server code or configuration.
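When a request hangs like this, curl's timing variables can show which phase never completed. The following is a minimal sketch, not the commands used during the incident: classify_stall and probe_https are hypothetical helper names, and the sketch assumes that curl reports 0 for a phase it never reached.

```shell
#!/bin/bash
# classify_stall CONNECT_SECS APPCONNECT_SECS
# Prints "tcp" if the TCP connect never completed, "tls" if the TLS
# handshake never completed, and "ok" otherwise.
classify_stall() {
    local connect=$1 appconnect=$2
    if awk -v t="$connect" 'BEGIN { exit !(t + 0 == 0) }'; then
        echo "tcp"
    elif awk -v t="$appconnect" 'BEGIN { exit !(t + 0 == 0) }'; then
        echo "tls"
    else
        echo "ok"
    fi
}

# probe_https URL
# Requests URL with a 10 s limit and classifies where it stalled.
# time_connect and time_appconnect are standard curl --write-out variables.
probe_https() {
    local url=$1 timings connect appconnect
    timings=$(curl -s -o /dev/null --max-time 10 \
        -w '%{time_connect} %{time_appconnect}' "$url")
    read -r connect appconnect <<< "$timings"
    classify_stall "$connect" "$appconnect"
}
```

Running probe_https https://console.example.com/login from both a healthy and an affected host gives a quick comparison before reaching for a packet capture.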

Phase 2 – Check Transport Layer (L4)

On the Shenzhen IDC host, tcpdump -i eth0 host console.example.com -nn captured repeated SYN packets with no SYN‑ACK responses, confirming that the TCP three‑way handshake was stuck at the first step.
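To keep such a capture readable, the filter can be narrowed to packets that carry the SYN flag, which covers both the client's SYN and the server's SYN-ACK. This is a sketch; handshake_filter is a hypothetical helper that only builds the filter expression.

```shell
#!/bin/bash
# handshake_filter HOST
# Prints a pcap filter expression matching only packets to or from HOST
# that have the SYN flag set (SYN and SYN-ACK).
handshake_filter() {
    echo "host $1 and tcp[tcpflags] & (tcp-syn) != 0"
}

# Example (needs root and a real interface name):
#   tcpdump -i eth0 -nn "$(handshake_filter console.example.com)"
```

If only outgoing SYNs appear, the handshake is failing; if SYN-ACKs do come back, the stall is happening later in the connection.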

Phase 3 – Inspect Intermediate Devices

Traceroute/MTR revealed the first 12 hops were normal, but the 13th hop (a provincial backbone node) exhibited up to 80% packet loss, especially for large packets.

Insight – MTU Mismatch

The suspicion was that a link on the path had an MTU smaller than the standard 1500 bytes, so that full-size packets carrying the DF flag were dropped instead of being fragmented.

Phase 4 – Verify MTU

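On the Shenzhen IDC host, ping -M do -s 1472 console.example.com succeeded (1472 bytes of ICMP payload + 8 bytes of ICMP header + 20 bytes of IP header = 1500 bytes), while ping -M do -s 1473 failed with "Message too long". The same test on a Beijing ECS succeeded, which pointed to a smaller MTU on the problematic path. When reading such results, keep in mind that "Message too long" is raised by the local kernel when a packet exceeds the interface MTU or a cached path MTU; a hop that silently drops oversized packets produces a timeout instead.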

Phase 5 – Why HTTPS Was Affected

The TLS ClientHello can exceed 1400 bytes. The client's TCP stack sets the DF (Don't Fragment) flag on its segments, so when the path MTU is smaller than the packet and Path MTU Discovery (PMTUD) cannot work, typically because the ICMP "Fragmentation Needed" replies are filtered somewhere along the path, the packet is silently dropped and the TLS handshake stalls until a timeout fires. Small packets such as the SYN are not affected by the MTU limit itself; only full-size segments are. Plain HTTP requests are smaller and usually fit within a single small packet, so they are less likely to trigger the issue.
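On a Linux host, two kernel settings show how the local stack handles path MTU: net.ipv4.ip_no_pmtu_disc and net.ipv4.tcp_mtu_probing. The sketch below only reads and explains them; describe_mtu_probing and show_pmtud_settings are hypothetical helper names, and the value descriptions follow the kernel's ip-sysctl documentation.

```shell
#!/bin/bash
# describe_mtu_probing VALUE
# Explains a net.ipv4.tcp_mtu_probing value.
describe_mtu_probing() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "enabled after a black hole is detected" ;;
        2) echo "always enabled" ;;
        *) echo "unknown" ;;
    esac
}

# show_pmtud_settings
# Prints the current PMTUD-related kernel settings (Linux only).
show_pmtud_settings() {
    local probing
    probing=$(sysctl -n net.ipv4.tcp_mtu_probing)
    echo "tcp_mtu_probing=$probing ($(describe_mtu_probing "$probing"))"
    echo "ip_no_pmtu_disc=$(sysctl -n net.ipv4.ip_no_pmtu_disc)"
}
```

With tcp_mtu_probing at 0, a TCP connection depends entirely on ICMP feedback to learn a smaller path MTU, which is exactly what a filtering hop takes away.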

Solutions

Temporary workaround (for verification only, not recommended as a fix): lower the client MTU, e.g. ip link set dev eth0 mtu 1400.

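Recommended fix: enable MSS clamping on the SLB or Nginx layer so that TCP segments fit the smallest MTU on the path. The TCPMSS target is only valid in the mangle table, so the rules need -t mangle. On a host that forwards traffic:

iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Or set a specific MSS value for connections that terminate on the host itself:

iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300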
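The MSS value follows from the MTU by subtracting the IP and TCP headers. A small sketch of the arithmetic, assuming headers without options (20 bytes of TCP header, and 20 bytes for IPv4 or 40 for IPv6); mss_for_mtu and mtu_for_mss are hypothetical helper names.

```shell
#!/bin/bash
TCP_HEADER=20   # bytes, without TCP options

# mss_for_mtu MTU [IP_HEADER_BYTES]
# Largest TCP payload that fits in one packet of size MTU.
mss_for_mtu() {
    local mtu=$1 ip_header=${2:-20}
    echo $(( mtu - ip_header - TCP_HEADER ))
}

# mtu_for_mss MSS [IP_HEADER_BYTES]
# Smallest path MTU that a given MSS is safe for.
mtu_for_mss() {
    local mss=$1 ip_header=${2:-20}
    echo $(( mss + ip_header + TCP_HEADER ))
}

mss_for_mtu 1500   # standard Ethernet over IPv4
mtu_for_mss 1300   # path MTU tolerated by --set-mss 1300
```

By this arithmetic an MSS of 1300 stays safe on IPv4 paths with an MTU as low as 1340 bytes. After adding a rule, iptables -t mangle -S lists what is installed.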

Key Lessons

"Service healthy" ≠ "User reachable": monitoring must cover the end-to-end experience.

504 errors can stem from network‑level issues, not just slow backends.

MTU problems are common in hybrid‑cloud or multi‑ISP scenarios, especially across old or cross‑border links.

Path MTU Discovery fails whenever the ICMP "Fragmentation Needed" messages it depends on are filtered along the path; proactive MSS clamping provides a reliable safety net.

Quick MTU Diagnostic Script

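#!/bin/bash
# Probe a path with non-fragmentable ICMP echoes of decreasing size (Linux ping).
HOST=${1:-example.com}

echo "Testing MTU path to $HOST..."

# Payload sizes in bytes; add 28 (20 IP + 8 ICMP) for the on-wire packet size.
for size in 1472 1400 1300 1200; do
    echo -n "Ping size $size: "
    # -M do sets DF, -W 2 waits at most 2 s for a reply, -c 1 sends one probe.
    if ping -W 2 -c 1 -M do -s "$size" "$HOST" >/dev/null 2>&1; then
        echo "OK"
    else
        echo "FAIL"
    fi
done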

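Save the script as mtu_check.sh and run ./mtu_check.sh console.example.com to see which of the tested payload sizes traverse the path without fragmentation. The largest size that reports OK, plus 28 bytes of headers, is a lower bound on the path MTU.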
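The fixed list of sizes gives only a rough bracket. A binary search over the payload size narrows it to the exact byte. The following is a sketch under stated assumptions, not part of the original script: find_path_mtu and ping_probe are hypothetical names, the search range covers MTUs from 576 to 1500, and the probe function can be swapped out so the search logic can be checked without network access.

```shell
#!/bin/bash
ICMP_OVERHEAD=28   # 20-byte IPv4 header + 8-byte ICMP header

# ping_probe HOST PAYLOAD
# Succeeds if one non-fragmentable echo of PAYLOAD bytes gets a reply.
ping_probe() {
    ping -W 2 -c 1 -M do -s "$2" "$1" >/dev/null 2>&1
}

# find_path_mtu HOST [PROBE_FN]
# Binary-searches the largest payload that PROBE_FN accepts and prints
# the corresponding MTU. Returns 1 if even the smallest probe fails.
find_path_mtu() {
    local host=$1 probe=${2:-ping_probe}
    local lo=548 hi=1472 mid   # payload bounds for MTU 576..1500
    if ! "$probe" "$host" "$lo"; then
        echo "no reply even at payload $lo" >&2
        return 1
    fi
    while (( lo < hi )); do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$host" "$mid"; then
            lo=$mid            # mid fits: search above it
        else
            hi=$(( mid - 1 ))  # mid too large: search below it
        fi
    done
    echo $(( lo + ICMP_OVERHEAD ))
}

# Example: find_path_mtu console.example.com
```

Because each probe is a single ping, one lost packet on a lossy path can skew the result; repeating the run a few times and taking the largest value is a reasonable precaution.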

Effective operations require not only fixing known errors but also uncovering hidden issues that appear error‑free on the surface.
