Nginx Troubleshooting Handbook: Analyzing 502, 504 and Connection Timeouts Step by Step
This guide walks through a systematic, four‑layer analysis of Nginx 502, 504 and connection‑timeout failures, showing how to split the request path, collect logs and metrics, verify upstream health, adjust timeouts, and apply best‑practice configurations to quickly locate and resolve production issues.
Overview
Nginx often serves as the entry and reverse‑proxy layer. HTTP 502 usually means the upstream returned an invalid response, refused or reset the connection, or the upstream process crashed. HTTP 504 means Nginx gave up waiting for the upstream, most often because the request reached the upstream but no response arrived within proxy_read_timeout. Connection‑timeout problems may involve the client‑Nginx link, the Nginx‑upstream link, or kernel queues.
Instead of guessing from the error code, split the request path into four segments: client → Nginx → upstream → kernel/network. Identifying the failing segment speeds up troubleshooting.
Detailed Steps
1. Preparation
Verify Nginx processes are running and configuration syntax is correct.
Determine which error code dominates (502, 504, or timeout).
Check whether the issue affects all sites or a specific server / location.
Identify whether all upstream nodes fail or only a subset.
Inspect connection counts, retransmissions, and backlog for anomalies.
date
hostname -f
nginx -V
nginx -t
systemctl status nginx --no-pager
ss -lntp | grep ':80\|:443'
ss -s
curl -I -m 3 http://127.0.0.1/
tail -n 50 /var/log/nginx/error.log
tail -n 50 /var/log/nginx/access.log
2. Locate the Failing Segment
Compare a local request to Nginx with a direct request to the upstream:
curl -sS -o /dev/null -w 'code=%{http_code} connect=%{time_connect} start=%{time_starttransfer} total=%{time_total}\n' http://127.0.0.1/api/health
curl -sS -o /dev/null -w 'code=%{http_code} connect=%{time_connect} start=%{time_starttransfer} total=%{time_total}\n' http://<upstream-ip>:<port>/health
ss -ant state syn-recv
ss -ant state time-wait | wc -l
Interpretation:
If the local request succeeds but external requests time out, check firewall, L4 load balancer, certificates, or network ACLs.
If Nginx returns 502 and the direct upstream request also fails (connection refused, reset, or an error response), the problem lies in the upstream.
If Nginx returns 504 while direct upstream is slow, the upstream processing time is the issue.
If Nginx is slow but direct upstream is normal, suspect Nginx configuration, DNS, connection reuse, or kernel queues.
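The interpretation rules above can be sketched as a small decision helper. This is only an illustration: the function name classify_segment, the default 1.0 s "slow" threshold, and the output wording are assumptions, not part of the original procedure. It takes the code and total time reported by the two curl commands (curl prints code=000 when no HTTP response was received).

```shell
#!/usr/bin/env bash
# classify_segment NGINX_CODE NGINX_TIME UPSTREAM_CODE UPSTREAM_TIME [SLOW_SECONDS]
# Hypothetical helper: maps the two curl results to the most likely segment.
classify_segment() {
  local ncode="$1" ntime="$2" ucode="$3" utime="$4" slow="${5:-1.0}"
  local nslow uslow
  # awk performs the floating-point comparison that bash arithmetic cannot.
  nslow=$(awk -v t="$ntime" -v s="$slow" 'BEGIN { print (t > s) ? 1 : 0 }')
  uslow=$(awk -v t="$utime" -v s="$slow" 'BEGIN { print (t > s) ? 1 : 0 }')

  if [ "$ncode" = "502" ] && { [ "$ucode" = "000" ] || [ "$ucode" -ge 500 ]; }; then
    echo "upstream: failing or refusing connections"
  elif [ "$ncode" = "504" ] && [ "$uslow" = "1" ]; then
    echo "upstream: processing time exceeds the proxy timeout"
  elif [ "$nslow" = "1" ] && [ "$uslow" = "0" ]; then
    echo "nginx: check configuration, DNS, connection reuse, kernel queues"
  elif [ "$ncode" -ge 200 ] && [ "$ncode" -lt 400 ] && [ "$nslow" = "0" ]; then
    echo "local path healthy: check firewall, L4 load balancer, certificates, ACLs"
  else
    echo "inconclusive: collect logs"
  fi
}
```

For example, classify_segment 504 15.0 200 20.1 points at upstream processing time, which matches the third rule in the list.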
3. Check Upstream Configuration and Timeouts
Key parameters that often cause incidents:
upstream order_service {
    least_conn;
    server 10.10.20.31:8080 max_fails=3 fail_timeout=10s;
    server 10.10.20.32:8080 max_fails=3 fail_timeout=10s;
    keepalive 128;
}
server {
    listen 80 reuseport backlog=65535;
    server_name api.example.com;
    access_log /var/log/nginx/api.access.log main_ext;
    error_log /var/log/nginx/api.error.log warn;
    location / {
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_connect_timeout 2s;
        proxy_send_timeout 10s;
        proxy_read_timeout 15s;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_pass http://order_service;
    }
}
Important notes:
proxy_connect_timeout – time allowed to establish a connection to the upstream; keep it small (a few seconds) so that dead nodes fail fast.
proxy_read_timeout – maximum time between two successive reads from the upstream, not the total response time; align it with the SLA of each API.
keepalive – maximum number of idle keepalive connections to the upstream cached per worker process; too small increases connection cost, too large can exhaust upstream sockets.
backlog – listen queue length; a small value drops connections under traffic spikes, and the kernel caps it at net.core.somaxconn.
proxy_next_upstream_tries – maximum number of attempts per request; keep it low (e.g., 2) to prevent snowballing failures.
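To review which of these parameters are actually in effect, one option is to filter the full configuration dump that nginx -T prints. The sketch below is an assumption about how such a check could look; the function name list_proxy_timeouts is hypothetical, and it only lists directives, it does not resolve inheritance between http, server and location blocks.

```shell
#!/usr/bin/env bash
# list_proxy_timeouts: reads an Nginx configuration dump on stdin and prints
# each proxy timeout, retry, and upstream keepalive directive with its line
# number in the dump. Commented-out lines are skipped.
list_proxy_timeouts() {
  awk '
    /^[ \t]*#/ { next }
    /^[ \t]*(proxy_(connect|send|read)_timeout|proxy_next_upstream_tries|keepalive)[ \t]/ {
      sub(/^[ \t]+/, "")   # drop leading indentation
      sub(/;.*$/, "")      # drop the trailing semicolon and anything after it
      printf "%d: %s\n", NR, $0
    }'
}
```

Typical use would be nginx -T 2>/dev/null | list_proxy_timeouts, which makes it easy to spot a single proxy_read_timeout applied to every endpoint.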
4. Log‑Based Diagnosis
Search error logs for key patterns to pinpoint the layer:
grep -iE "connect\(\) failed|upstream timed out|no live upstreams|recv\(\) failed|connection refused|broken pipe|reset by peer" /var/log/nginx/error.log | tail -100
connect() failed (111: Connection refused) – upstream not listening, process crashed, or firewall rejected the connection.
upstream timed out (110: Connection timed out) – the upstream did not accept the connection or return data within the configured timeout; the trailing "while connecting to upstream" or "while reading response header from upstream" shows which phase timed out.
no live upstreams – all nodes in the upstream group are currently marked unavailable, typically after exceeding max_fails within fail_timeout.
recv() failed (104: Connection reset by peer) – upstream closed the connection, often because of thread‑pool exhaustion or FD limits.
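Counting how often each signature appears is often more telling than reading the last 100 matches. A minimal sketch, assuming the standard error-log wording listed above; the function name summarize_upstream_errors and the category labels are illustrative only.

```shell
#!/usr/bin/env bash
# summarize_upstream_errors: reads Nginx error-log lines on stdin and prints
# "<category> <count>" for each signature that occurs, sorted by category.
summarize_upstream_errors() {
  awk '
    /connect\(\) failed/            { n["connect_failed"]++ }
    /upstream timed out/            { n["upstream_timed_out"]++ }
    /no live upstreams/             { n["no_live_upstreams"]++ }
    /recv\(\) failed|reset by peer/ { n["recv_failed_or_reset"]++ }
    END { for (k in n) printf "%s %d\n", k, n[k] }' | sort
}
```

It can be fed a time window, for example tail -n 5000 /var/log/nginx/error.log | summarize_upstream_errors. A dominant connect_failed count points at the upstream process or port, while upstream_timed_out points at slow processing.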
5. Real‑World Cases
Case 1 – 502 Caused by Upstream Process Crash
Scenario: a payment gateway saw the 502 error rate jump from 0.02 % to 6 %.
error.log contained many connect() failed (111: Connection refused) entries.
Port 8080 stopped listening intermittently.
Upstream JVM was killed by the OOM killer.
Actions:
Remove the faulty upstream from the load balancer.
Restart the service and verify the port is listening.
Add monitoring for JVM heap, connection pool, and thread pool.
Result: 502 rate dropped from 6.1 % to 0.05 % within five minutes, and the payment callback backlog was gradually cleared.
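An OOM kill like the one in this case can usually be confirmed from the kernel log on the upstream host. The sketch below assumes the common "Killed process <pid> (<name>)" wording; the exact message varies between kernel versions, and the function name oom_victims is hypothetical.

```shell
#!/usr/bin/env bash
# oom_victims: reads kernel log lines on stdin and prints "<pid> <name>" for
# every process the OOM killer terminated.
# ASSUMPTION: the kernel logs "Killed process <pid> (<name>)".
oom_victims() {
  sed -n -E 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

On a systemd host this could be run as journalctl -k | oom_victims, or as dmesg | oom_victims elsewhere.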
Case 2 – 504 Caused by Slow SQL
Scenario: during a marketing campaign, product‑list API returned many 504s. proxy_read_timeout was set to 15 s; the business team wanted to increase it to 60 s.
Analysis:
error.log showed upstream timed out.
Direct upstream health check also took 18‑25 s.
MySQL showed full‑table scans and long lock‑wait times.
Remediation:
Rate‑limit the heavy interface.
Kill the slow SQL statements and add missing indexes.
Result: 504 rate fell to near zero and P99 latency dropped from 22.7 s to 320 ms.
6. Best Practices and Pitfalls
Log everything: include $request_time, $upstream_response_time, $upstream_status and $upstream_addr in the access log to quickly locate the offending upstream.
Separate timeout settings per API: do not use a single proxy_read_timeout for all endpoints; payment, search, export, and upload have very different latency profiles.
Limit retry attempts: set proxy_next_upstream_tries to 2 or less; higher values amplify snowball effects during upstream slowness.
Secure the status page: expose stub_status only to internal IP ranges.
Guard against large requests and slow clients: configure client_max_body_size, client_header_timeout, client_body_timeout and send_timeout.
Audit config changes: version‑control Nginx configs and record reload timestamps, change tickets, and owners.
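Once the upstream fields are in the access log, finding the offending node becomes a one-pass aggregation. The sketch below assumes a key=value log layout with fields literally named upstream_addr= and upstream_status=; that layout is an assumption, since the main_ext format is not shown in this guide. The function name count_5xx_by_upstream is hypothetical, and lines where Nginx retried (comma-separated upstream lists) are not split.

```shell
#!/usr/bin/env bash
# count_5xx_by_upstream: reads access-log lines on stdin and prints
# "<count> <upstream_addr>" for upstream 5xx responses, highest count first.
# ASSUMPTION: each line carries upstream_addr=<ip:port> upstream_status=<code>.
count_5xx_by_upstream() {
  awk '
    {
      addr = ""; status = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^upstream_addr=/)   addr = substr($i, 15)    # text after "upstream_addr="
        if ($i ~ /^upstream_status=/) status = substr($i, 17)  # text after "upstream_status="
      }
      if (addr != "" && status ~ /^5[0-9][0-9]$/) bad[addr]++
    }
    END { for (a in bad) printf "%d %s\n", bad[a], a }' | sort -rn
}
```

A node that dominates the output is the first candidate to remove from rotation.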
7. Monitoring and Alerting
Key metrics to watch:
nginx_http_requests_total and 5xx ratio.
nginx_connections_active/reading/writing/waiting.
Upstream response‑time P99.
Active connections vs. worker_connections and FD usage.
SYN‑RECV count – rising values hint at backlog shortage or attacks.
File‑descriptor usage – alert when >85 % of the limit.
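The file-descriptor threshold above can be spot-checked by hand on a Linux host. This is a sketch under assumptions: the function names are hypothetical, the limit used is the soft "Max open files" value from /proc/<pid>/limits, and reading another process's /proc/<pid>/fd typically requires root or the same user.

```shell
#!/usr/bin/env bash
# fd_usage_pct USED LIMIT: integer percentage of the FD limit in use.
fd_usage_pct() {
  local used="$1" limit="$2"
  [ "$limit" -gt 0 ] || { echo "limit must be positive" >&2; return 1; }
  echo $(( used * 100 / limit ))
}

# nginx_fd_usage PID: prints "<pid> <used> <limit> <pct>%" for one process.
# Linux only, because it reads /proc.
nginx_fd_usage() {
  local pid="$1" used limit
  used=$(ls "/proc/$pid/fd" | wc -l)
  # Fourth field of the "Max open files" row is the soft limit.
  limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
  echo "$pid $used $limit $(fd_usage_pct "$used" "$limit")%"
}
```

A loop such as for pid in $(pgrep nginx); do nginx_fd_usage "$pid"; done covers the master and all workers; any value above 85% matches the alert threshold in the list.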
# Example Prometheus rule for 502 rate
groups:
  - name: nginx-5xx
    rules:
      - alert: Nginx502RateHigh
        expr: |
          sum(rate(nginx_http_requests_total{status="502"}[1m])) by (instance) /
          sum(rate(nginx_http_requests_total[1m])) by (instance) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Nginx 502 error rate high"
          description: "{{ $labels.instance }} 502 ratio > 1% for 2 minutes"
8. Backup and Recovery
Before any configuration change, back up the entire /etc/nginx directory:
#!/usr/bin/env bash
set -euo pipefail
TARGET="/var/backups/nginx-config-$(date +%F_%H%M%S).tar.gz"
tar czf "$TARGET" /etc/nginx
echo "backup saved to $TARGET"
Recovery steps:
Pause further changes (e.g., pause CI/CD releases).
Restore the backup or roll back via Git.
Run nginx -t to verify syntax and certificates.
Reload with nginx -s reload and re‑check 200/502/504 metrics.
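The last two recovery steps can be combined so that a reload never happens with a broken configuration. A minimal sketch: the function name safe_reload and the NGINX_BIN override (handy for dry runs with a stub) are assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# safe_reload: reloads Nginx only when the configuration test passes.
# NGINX_BIN is a hypothetical override; it defaults to "nginx".
safe_reload() {
  local bin="${NGINX_BIN:-nginx}"
  if ! "$bin" -t; then
    echo "config test failed; reload skipped" >&2
    return 1
  fi
  "$bin" -s reload && echo "reloaded"
}
```

After a successful reload, re-check the 200/502/504 metrics as described in the final step.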
9. Conclusion
502 and 504 have different root causes – connection failures vs. upstream timeouts.
Always split the request path into client, Nginx, upstream, and kernel layers before diagnosing.
Log keywords such as connect() failed, upstream timed out, and reset by peer are essential evidence.
Do not blindly increase timeout values; fix upstream performance, database queries, or thread‑pool limits instead.
Connection‑timeout problems often involve worker_connections, somaxconn, tcp_max_syn_backlog and FD limits.
Comprehensive log fields and metric collection dramatically reduce mean‑time‑to‑resolution.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
