Why a Healthy Frontend Still Returns 504 Errors: An MTU Mismatch Case Study
A production incident showed that despite flawless frontend health metrics and no logged errors, a subset of users experienced 504 Gateway Timeout errors caused by an MTU mismatch in the network path, highlighting the need for end‑to‑end connectivity checks beyond application monitoring.
Even when services appear healthy, with no error logs and no alerts, some users may still be unable to load pages. This article recounts a real-world production outage in which a SaaS web console (HTTPS + Nginx + backend API) served about 95% of traffic normally while roughly 5% of users, concentrated in one ISP region, repeatedly received 504 Gateway Timeout errors.
Fault Background
Architecture: User → CDN → Alibaba Cloud SLB (public) → Nginx cluster → application services. All monitoring metrics (CPU, memory, request count, error rate) were normal, and Nginx access logs showed that the problematic requests never reached the server, indicating the issue lay before the SLB or within the network path.
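One quick way to confirm that the failing requests never arrived is to count hits from an affected client across every Nginx node; a minimal sketch, assuming the default log path and using placeholder node names and a placeholder client IP:

# Placeholder node names and client IP; adjust to your environment
for node in nginx-01 nginx-02 nginx-03; do
    echo "== $node =="
    ssh "$node" "grep -c '203.0.113.45' /var/log/nginx/access.log"
done

Zero matches on every node is consistent with the requests being lost before they ever reach Nginx.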
Phase 1 – Eliminate the Application Layer (Layer 7)
A local curl -v https://console.example.com/login succeeded in about 300 ms. Remote tests from Beijing and Shanghai cloud instances also succeeded, while a physical machine in a Shenzhen IDC hung and timed out after 10 seconds. The problem was therefore tied to the client's network environment rather than to server code or configuration.
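For a finer-grained view of where a request stalls, the same curl test can print per-stage timings; a sketch using the URL above and a 10-second cap to mirror the observed timeout:

# Prints how long DNS, TCP connect, TLS handshake, and the full request took
curl -sS -o /dev/null --max-time 10 \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s total=%{time_total}s\n' \
  https://console.example.com/login

On the healthy paths all four stages complete quickly; on the broken path one of the stages simply never finishes before the cap.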
Phase 2 – Check the Transport Layer (Layer 4)
On the Shenzhen IDC host, tcpdump -i eth0 host console.example.com -nn captured repeated SYN packets with no SYN‑ACK responses, confirming that the TCP three‑way handshake was stuck at the first step.
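A narrower capture filter makes the one-sided handshake easier to see; this sketch (same interface and host) keeps only packets with the SYN flag set, so a healthy path shows both outbound SYNs and returning SYN-ACKs:

# Only handshake packets: outgoing SYNs and any SYN-ACK replies
tcpdump -i eth0 -nn 'host console.example.com and tcp[tcpflags] & tcp-syn != 0'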
Phase 3 – Inspect Intermediate Devices
Traceroute/MTR revealed the first 12 hops were normal, but the 13th hop (a provincial backbone node) exhibited up to 80% packet loss, especially for large packets.
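Size-dependent loss at a single hop is easier to demonstrate by running mtr twice, once with the default probe size and once near the full MTU (the --psize flag may vary by mtr version):

# Small probes: baseline loss per hop
mtr --report --report-cycles 50 console.example.com
# Near-MTU probes: heavy loss appearing only here points at an MTU problem
mtr --report --report-cycles 50 --psize 1472 console.example.com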
Insight – MTU Mismatch
The suspicion was that a link in the path had an MTU smaller than the standard 1500 bytes, so oversized packets were being silently dropped rather than fragmented.
Phase 4 – Verify MTU
Using ping -M do -s 1472 console.example.com (a 1472-byte payload plus 28 bytes of ICMP and IP headers = 1500 bytes on the wire) succeeded, while ping -M do -s 1473 failed with "Message too long" on the Shenzhen IDC host. The same test from a Beijing ECS succeeded, confirming a smaller effective MTU on the problematic path.
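tracepath gives a complementary view, reporting the path MTU it discovers hop by hop and without needing root:

# Reports "pmtu" as it walks the path; a drop below 1500 flags the constricting link
tracepath -n console.example.com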
Phase 5 – Why HTTPS Was Affected
The TLS ClientHello packet often exceeds 1400 bytes. Modern TCP stacks set the DF (Don't Fragment) flag, so if the path MTU is smaller than the packet and Path MTU Discovery does not work on that path, the packet is silently dropped, the TLS handshake stalls, and the connection attempt eventually times out. Plain HTTP requests are smaller and usually fit within a single packet, so they are far less likely to trigger the issue.
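PMTUD depends on the constricting hop sending back an ICMP "fragmentation needed" message (type 3, code 4). If nothing like the following shows up while the failure is being reproduced, the path is a PMTUD black hole:

# Watch for "fragmentation needed" replies during a failing request
tcpdump -i eth0 -nn 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'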
Solutions
Temporary workaround (not recommended): lower the client-side MTU, e.g. ip link set dev eth0 mtu 1400, and use it only to verify the diagnosis.
Recommended fix: enable MSS clamping at the SLB or Nginx layer so the negotiated TCP segment size respects the smallest MTU in the path:
iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu

Or set a specific MSS value:
iptables -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300
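To check that the clamp is actually applied, watch the MSS option advertised in SYN packets leaving the clamping box; with the rules above in place the value should not exceed the configured ceiling (interface and port are illustrative):

# -v prints TCP options; look for the "mss" value in SYN and SYN-ACK packets
tcpdump -i eth0 -nn -v 'tcp port 443 and tcp[tcpflags] & tcp-syn != 0'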
"Service healthy" ≠ "User reachable"– monitoring must cover end‑to‑end experience.
504 errors can stem from network‑level issues, not just slow backends.
MTU problems are common in hybrid‑cloud or multi‑ISP scenarios, especially across old or cross‑border links.
Path MTU Discovery often fails; proactive MSS clamping provides a reliable safety net.
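Where the endpoints are under your control, the Linux kernel's packetization-layer PMTUD (RFC 4821) is another hedge worth considering alongside MSS clamping; a sketch of enabling it on a Linux host:

# 0 = off, 1 = probe after a black hole is suspected, 2 = always probe
sysctl -w net.ipv4.tcp_mtu_probing=1
# Persist across reboots (file name is arbitrary)
echo 'net.ipv4.tcp_mtu_probing = 1' > /etc/sysctl.d/99-tcp-mtu-probing.conf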
Quick MTU Diagnostic Script
#!/bin/bash
# Probe the path MTU by sending non-fragmentable pings of decreasing payload size.
HOST=${1:-example.com}

echo "Testing MTU path to $HOST..."
for size in 1472 1400 1300 1200; do
    echo -n "Ping size $size: "
    # -M do sets the DF bit; -s is the ICMP payload (add 28 bytes of headers for the on-wire size)
    if ping -W 2 -c 1 -M do -s "$size" "$HOST" >/dev/null 2>&1; then
        echo "OK"
    else
        echo "FAIL"
    fi
done

Run it with ./mtu_check.sh console.example.com to quickly identify the largest payload that traverses the path without fragmentation.
Effective operations require not only fixing known errors but also uncovering hidden issues that appear error‑free on the surface.
Xiao Liu Lab
An operations lab passionate about server tinkering 🔬 Sharing automation scripts, high-availability architecture, alert optimization, and incident reviews. Using technology to reduce overtime and experience to avoid major pitfalls. Follow me for easier, more reliable operations!