
How to Diagnose 502, 504 and Connection Reset Errors in Nginx‑Powered Services

This guide explains how to distinguish the root causes of 502 Bad Gateway, 504 Gateway Timeout, and Connection Reset errors in Nginx reverse‑proxy deployments and provides a step‑by‑step, four‑segment troubleshooting workflow with concrete log patterns, shell commands, and configuration tweaks.

MaGe Linux Operations

Problem Background

In production, 502, 504 and Connection Reset are the three most common error types seen behind an Nginx reverse proxy. Although they are often lumped together as "backend down", each error points to a distinct failure mode:

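502 Bad Gateway: Nginx could not connect to the backend, or received an invalid or incomplete response from it.

504 Gateway Timeout: the backend accepted the connection but did not respond within the configured timeout.

Connection Reset: the TCP connection was aborted with an RST by the backend, by Nginx, or by a device in between.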

Using the wrong troubleshooting path (e.g., applying a 502 checklist to a 504 case) wastes time, so the article adopts Nginx as the default reverse‑proxy scenario and covers the full method from error‑signature identification to segment‑by‑segment diagnosis.

1. Distinguishing the Three Errors

1.1 Communication Chain

The request flow is:

Client → Nginx (reverse proxy) → upstream (backend service)
          ↑                ↑
   problem occurs   problem occurs
   client→Nginx      Nginx→upstream

1.2 Error‑Feature Comparison

Error            | Log keyword                 | error.log example                               | Direct cause
-----------------|-----------------------------|-------------------------------------------------|----------------------------------------------
502              | connect() failed            | connect() failed (111: Connection refused)      | Nginx cannot connect to upstream
502              | no live upstreams           | no live upstreams while connecting to upstream  | All upstreams unavailable
502              | upstream prematurely closed | upstream prematurely closed connection          | Upstream closed before finishing its response
504              | upstream timed out          | upstream timed out (110: Connection timed out)  | Upstream response timeout
Connection Reset | recv() failed               | recv() failed (104: Connection reset by peer)   | Upstream actively resets
Connection Reset | Connection reset by peer    | readv() failed (104: Connection reset by peer)  | Nginx or upstream actively closes

1.3 Quick Identification

Do not rely solely on the browser’s status code because browsers may cache or misreport. Use the following commands on the Nginx host:

# 1. Show actual status codes from access.log
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10
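# 2. Show recent 502 lines with their upstream_* fields (needs the extended log format from section 6.2;
#    upstream_status shows "-" when no upstream was contacted)
tail -100 /var/log/nginx/access.log | grep ' 502 ' | head -5
# 3. Search error.log for key phrases (parentheses must be escaped in an extended regex)
grep -E "connect\(\) failed|upstream timed out|upstream prematurely closed|recv\(\) failed|Connection reset|no live upstreams" /var/log/nginx/error.log | tail -20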

Note that $status is the code Nginx returned to the client, while $upstream_status is the code received from the upstream; they can differ. For example, after a retry through proxy_next_upstream, $upstream_status may read "502, 200" while $status is 200.
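
The keyword-to-error mapping from section 1.2 can also be tallied in one pass. The following is a minimal sketch, assuming the error.log wording shown in that table; the function name classify_nginx_errors and the label names are hypothetical, not part of Nginx.

```shell
#!/usr/bin/env bash
# classify_nginx_errors: count error.log lines per key phrase (section 1.2).
# Reads log text on stdin. Function and label names are illustrative only.
classify_nginx_errors() {
  awk '
    /connect\(\) failed/          { n["connect_failed"]++ }
    /no live upstreams/           { n["no_live_upstreams"]++ }
    /upstream timed out/          { n["upstream_timed_out"]++ }
    /upstream prematurely closed/ { n["prematurely_closed"]++ }
    /Connection reset by peer/    { n["connection_reset"]++ }
    END {
      # Fixed output order so two runs are easy to compare.
      split("connect_failed no_live_upstreams upstream_timed_out prematurely_closed connection_reset", keys, " ")
      for (i = 1; i <= 5; i++) printf "%s=%d\n", keys[i], n[keys[i]]
    }'
}

# Usage:
#   classify_nginx_errors < /var/log/nginx/error.log
```

A dominant connect_failed or no_live_upstreams count points at the 502 checklist, upstream_timed_out at the 504 checklist, and connection_reset at the Connection Reset checklist.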

2. Four‑Segment Link Diagnosis

The request path is split into four segments. The troubleshooting principle is to verify each segment in turn and narrow the fault down to the smallest failing segment:

Segment 1: Client → Nginx network
Segment 2: Nginx itself
Segment 3: Nginx → upstream network
Segment 4: Upstream (backend service)

2.1 Local Curl Test (Segment 1)

Run curl on the Nginx machine to rule out client‑to‑Nginx network issues:

# Request Nginx locally (bypassing external network)
curl -sS -o /dev/null -w 'http_code=%{http_code} time_total=%{time_total}s time_connect=%{time_connect}s time_starttransfer=%{time_starttransfer}s\n' http://127.0.0.1/health

If the local curl succeeds but external access fails → problem in Segment 1.

If the local curl also fails → problem in Segments 2‑4.

2.2 Direct Upstream Access (Segment 3‑4)

Skip Nginx and contact the backend directly:

# Direct upstream request from the Nginx host
curl -sS -o /dev/null -w "http_code=%{http_code} time_total=%{time_total}s\n" http://10.0.1.10:8080/health
# If multiple upstreams, test each
for ip in 10.0.1.10 10.0.1.11 10.0.1.12; do
  echo -n "$ip: "
  curl -sS -o /dev/null -w "code=%{http_code} total=%{time_total}s\n" --connect-timeout 3 --max-time 5 "http://$ip:8080/health"
done
done

If Nginx fails but direct upstream succeeds → problem in Segment 2‑3 (Nginx config or Nginx‑to‑upstream network).

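If both fail → problem in Segment 3‑4. Repeat the curl on the backend host itself: if it still fails there, the backend (Segment 4) is at fault; if it succeeds, suspect the network between Nginx and the upstream (Segment 3).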
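
The decision rules from sections 2.1 and 2.2 can be written down as a small function. This is a sketch under stated assumptions: the names http_ok and suspect_segment are hypothetical, and the three arguments are the HTTP codes printed by curl -w '%{http_code}' for an external request, a local request through Nginx, and a direct request to the upstream (curl prints 000 when it received no HTTP response). A direct request that fails from the Nginx host cannot by itself separate the network from the backend, so that case is reported as segment 3-4.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: map three curl results to the suspected segment.

# http_ok CODE: succeed for 2xx/3xx, fail for anything else (including 000).
http_ok() {
  case "$1" in
    2??|3??) return 0 ;;
    *)       return 1 ;;
  esac
}

# suspect_segment EXTERNAL_CODE VIA_NGINX_CODE DIRECT_CODE
suspect_segment() {
  local external=$1 via_nginx=$2 direct=$3
  if http_ok "$via_nginx"; then
    if http_ok "$external"; then
      echo "no fault reproduced"
    else
      echo "segment 1: client -> Nginx network"
    fi
  elif http_ok "$direct"; then
    echo "segment 2-3: Nginx config or Nginx -> upstream network"
  else
    echo "segment 3-4: upstream network or backend"
  fi
}

# Usage with the addresses used in this article:
#   via=$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1/health)
#   direct=$(curl -s -o /dev/null -w '%{http_code}' http://10.0.1.10:8080/health)
#   suspect_segment 502 "$via" "$direct"
```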

3. Dedicated 502 Bad Gateway Checklist

3.1 Log‑Based Localization

# Find recent 502 entries in access.log
grep " 502 " /var/log/nginx/access.log | tail -5 | awk '{print $1,$4,$7}'
# Search error.log for the exact cause
grep "connect() failed" /var/log/nginx/error.log | tail -10

3.2 Common Causes & Checks

Cause A: Backend process not started or crashed

# Verify process existence
ps aux | grep -E "java|python|node|php-fpm" | grep -v grep
# Verify listening port
ss -lntp | grep 8080
# Check OOM kill
dmesg -T | grep -i "oom\|killed" | tail -5

Cause B: PHP‑FPM pool exhausted

# PHP‑FPM status page (if enabled)
curl http://127.0.0.1/status
# Or inspect PHP‑FPM logs
tail -50 /var/log/php-fpm/www-error.log

Cause C: Firewall or security‑group blocking

# Test connectivity to upstream port
telnet 10.0.1.10 8080
nc -zv 10.0.1.10 8080
# Check iptables rules
iptables -L -n | grep 8080
# Cloud provider security groups must be verified in the console

Cause D: FastCGI buffer insufficient

# Look for buffer‑related errors
grep "upstream sent too big header" /var/log/nginx/error.log
# Fix by increasing buffers
location ~ \.php$ {
  fastcgi_buffer_size 32k;
  fastcgi_buffers 8 32k;
  fastcgi_busy_buffers_size 64k;
}

3.3 502 Checklist

# 1. Is the backend process running?
ps aux | grep backend
# 2. Is the port listening?
ss -lntp
# 3. Is the firewall blocking?
iptables -L -n
# 4. Is proxy_pass correct?
grep "proxy_pass\|fastcgi_pass" /etc/nginx/conf.d/default.conf
# 5. Can the upstream hostname resolve?
nslookup backend.example.com
# 6. Does the backend expose a health‑check endpoint?
curl -I http://127.0.0.1:8080/health

4. Dedicated 504 Gateway Timeout Checklist

4.1 Timeout Configuration Overview

location /api/ {
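  proxy_connect_timeout 5s;   # Timeout for establishing the connection to the upstream
  proxy_send_timeout    10s;  # Max gap between two successive writes of the request to the upstream
  proxy_read_timeout    30s;  # Max gap between two successive reads of the response (most common trigger)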
  proxy_pass http://backend;
}

The most frequent trigger is a proxy_read_timeout shorter than the backend's real response time. All three timeouts default to 60 s, which is rarely the right value for every endpoint in production.

4.2 Step‑by‑Step Diagnosis

Step 1: Measure real upstream response time

# Use curl -w to capture timings
curl -sS -o /dev/null -w "
 time_namelookup=%{time_namelookup}s
 time_connect=%{time_connect}s
 time_starttransfer=%{time_starttransfer}s
 time_total=%{time_total}s
" http://10.0.1.10:8080/api/slow-endpoint

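If time_starttransfer is large (e.g., > 30 s), the backend is slow to produce its first byte. Either speed up the backend or, if the latency is legitimate for that endpoint, raise proxy_read_timeout above it.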

Step 2: Look for timeout entries in error.log

grep "upstream timed out" /var/log/nginx/error.log | tail -5

Step 3: Split timeout settings per API type

# Quick API – 5 s timeout
location /api/quick/ { proxy_read_timeout 5s; proxy_pass http://backend; }
# Export – up to 120 s
location /api/export/ { proxy_read_timeout 120s; proxy_pass http://backend; }
# Long‑poll / SSE – very long timeout, disable buffering
location /api/poll/ { proxy_read_timeout 3600s; proxy_buffering off; proxy_pass http://backend; }
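
Choosing per‑endpoint values such as the 5 s and 120 s above is easier with a latency percentile in hand. The following sketch computes a nearest‑rank percentile from one number per line on stdin; the function name percentile and the input file in the usage comment are hypothetical, and how you extract the response times depends on your log format.

```shell
#!/usr/bin/env bash
# percentile P: nearest-rank percentile (P is an integer from 1 to 100).
# Reads one numeric value per line on stdin, e.g. response times in seconds.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                 # no samples, nothing to report
      rank = int((p * NR + 99) / 100)     # ceil(p/100 * N) for integer p and N
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Usage (times.txt is a hypothetical file of upstream response times):
#   percentile 99 < times.txt
```

A proxy_read_timeout set somewhat above the observed P99 of an endpoint keeps normal requests from being cut off while still bounding runaway ones.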

4.3 Common Backend Root Causes for Slow Responses

Slow SQL – check database slow‑query log.

External dependency timeout – verify external API calls have proper timeout protection.

Thread‑pool queue – monitor backend thread‑pool metrics.

Deadlock – analyze thread dumps.

Full GC – review JVM GC logs.

5. Dedicated Connection Reset Checklist

5.1 Nature of the Error

Connection Reset (104: Connection reset by peer) means one side aborted the TCP connection with an RST instead of completing the normal four‑step FIN/ACK close. In Nginx this usually indicates:

Upstream actively closes the connection (most common – backend under pressure).

Nginx actively closes (timeout or connection‑pool recycle).

Intermediate network device (firewall, load balancer) drops idle connections.

5.2 Diagnosis Steps

Step 1: Identify which side reset the connection

# Search error.log for reset messages
grep "Connection reset by peer" /var/log/nginx/error.log

If the log mentions recv() failed, the reset arrived from the upstream side. If no reset appears in error.log but the client reports one, the reset most likely came from Nginx or from a device between the client and Nginx.

Step 2: Check upstream file‑descriptor (fd) usage and thread‑pool saturation

# On the upstream host
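# fd limit and current fd count (pidof -s returns a single PID)
grep "open files" /proc/$(pidof -s java)/limits
ls /proc/$(pidof -s java)/fd | wc -l
# TCP connection states
ss -ant | grep -E 'SYN-RECV|TIME-WAIT' | wc -l
# Listen-queue overflows and dropped SYNs
netstat -s | grep -iE "listen|overflow"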

Step 3: Verify Nginx worker_connections

# View Nginx status (stub_status must be enabled)
curl http://127.0.0.1/nginx_status
# Example output
Active connections: 65300
Reading: 0 Writing: 128 Waiting: 45

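If active connections approach worker_connections × worker_processes, Nginx itself is saturated. Because worker_connections also counts connections to upstreams, each proxied request uses two of them, so a reverse proxy saturates at roughly half that number of clients.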
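
The comparison against worker_connections × worker_processes can be scripted. This is a minimal sketch, assuming the stub_status text format whose first line reads "Active connections: N"; the function name nginx_conn_usage is hypothetical, and the two arguments must be taken from your own configuration (for example from the output of nginx -T).

```shell
#!/usr/bin/env bash
# nginx_conn_usage WORKER_PROCESSES WORKER_CONNECTIONS
# Reads stub_status output on stdin and prints active connections,
# theoretical capacity, and usage as a truncated integer percentage.
nginx_conn_usage() {
  local workers=$1 conns=$2
  awk -v cap="$((workers * conns))" '
    /^Active connections:/ {
      printf "active=%d capacity=%d usage=%d%%\n", $3, cap, ($3 * 100) / cap
    }'
}

# Usage:
#   curl -s http://127.0.0.1/nginx_status | nginx_conn_usage 4 1024
```

Since a proxied request holds both a client and an upstream connection, a usage figure near 50% can already be a warning sign on a pure reverse proxy.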

Step 4: Inspect TCP backlog overflow

# Listen queue overflow count
netstat -s | grep -i "listen"
# Current backlog size
ss -lnt 'sport = :80'
# Example: LISTEN 0 511 ... (for a listening socket, Recv-Q = 0 is the current accept queue, Send-Q = 511 is the backlog)

If overflow is frequent, increase backlog and kernel limits:

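# nginx.conf (server block): request a larger accept queue
listen 80 backlog=65535;
# Shell: the kernel caps the listen backlog at net.core.somaxconn, so raise both
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535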
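
Before raising limits, it helps to see which listeners are actually close to their backlog. The sketch below filters ss -lnt output, relying on the Linux convention that for LISTEN sockets Recv-Q is the current accept-queue length and Send-Q is the configured backlog; the function name listen_queue_pressure and the 80% threshold in the usage comment are assumptions.

```shell
#!/usr/bin/env bash
# listen_queue_pressure PERCENT
# Reads `ss -lnt` output on stdin and prints listeners whose accept queue
# is at or above PERCENT of the backlog.
listen_queue_pressure() {
  awk -v pct="$1" '
    $1 == "LISTEN" && $3 > 0 && ($2 * 100) / $3 >= pct {
      print $4, "queued=" $2, "backlog=" $3
    }'
}

# Usage:
#   ss -lnt | listen_queue_pressure 80
```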

6. Nginx Configuration Optimisation Reference

6.1 Reasonable Timeout & Buffer Settings

upstream backend {
  least_conn;
  server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
  server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
  keepalive 128;
  keepalive_requests 10000;
  keepalive_timeout 60s;
}
server {
  listen 80 backlog=65535;
  server_name api.example.com;
  proxy_connect_timeout 5s;
  proxy_send_timeout    10s;
  proxy_read_timeout    30s;
  proxy_buffer_size 4k;
  proxy_buffers 8 4k;
  proxy_busy_buffers_size 8k;
  client_max_body_size 10m;
  client_body_buffer_size 128k;
  location /api/ {
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_pass http://backend;
  }
}

6.2 Log Format Including Upstream Information

log_format main_ext '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" upstream_addr=$upstream_addr upstream_status=$upstream_status upstream_response_time=$upstream_response_time request_time=$request_time';
access_log /var/log/nginx/access.log main_ext;

The variables $upstream_addr, $upstream_status, $upstream_response_time and $request_time are essential for pinpointing the failure segment.
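
With the main_ext format above in place, slow upstream calls can be pulled straight from access.log. The following is a sketch, not a definitive parser: the function name slow_upstream_requests is hypothetical, it assumes the field layout of main_ext (status in field 9, URI in field 7), and when a request was retried it looks only at the first attempt in the comma‑separated list.

```shell
#!/usr/bin/env bash
# slow_upstream_requests THRESHOLD_SECONDS
# Reads main_ext-formatted access.log lines on stdin and prints
# "<status> <upstream_addr> <upstream_response_time> <uri>" for slow requests.
slow_upstream_requests() {
  awk -v limit="$1" '
    {
      t = "-"; addr = "-"
      for (i = 1; i <= NF; i++) {
        if (index($i, "upstream_response_time=") == 1)
          t = substr($i, length("upstream_response_time=") + 1)
        else if (index($i, "upstream_addr=") == 1)
          addr = substr($i, length("upstream_addr=") + 1)
      }
      # Retries yield lists such as "0.003, 30.001"; keep the first attempt only.
      sub(/,$/, "", t); sub(/,$/, "", addr)
      # A value of "-" converts to 0 and is never reported.
      if (t + 0 > limit + 0) print $9, addr, t, $7
    }'
}

# Usage:
#   slow_upstream_requests 5 < /var/log/nginx/access.log
```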

6.3 Failover & Degradation

upstream backend {
  server 10.0.1.10:8080 max_fails=3 fail_timeout=30s;
  server 10.0.1.11:8080 max_fails=3 fail_timeout=30s;
  server 10.0.1.12:8080 backup; # standby node
}
server {
  location /api/ {
    proxy_pass http://backend;
    proxy_next_upstream error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries 2;   # limit retries to avoid snowball effect
    proxy_next_upstream_timeout 5s;
    error_page 502 504 = @fallback;
  }
  location @fallback {
    internal;
    default_type application/json;
    return 200 '{"status":"degraded","message":"Service temporarily unavailable, please retry later"}';
  }
}

Do not set proxy_next_upstream_tries too high (2‑3 is recommended) to prevent overload amplification.

7. Quick‑Reference Scenarios

Occasional 502 – log shows connect() failed (111); likely backend restart. Verify startup and health‑check.

Persistent 502 – log shows no live upstreams; all upstreams down. Check every backend node.

Periodic 502 – same keyword at fixed intervals; suspect cron‑job causing load spikes. Review scheduled tasks.

Partial 504 – upstream timed out on specific API. Analyse P99 latency of that endpoint.

All 504 – backend overall overload or DB connection‑pool exhaustion. Check CPU, connection pool, slow queries.

Intermittent Connection Reset – recv() failed (104); upstream fd shortage or thread‑pool saturation. Monitor fd and thread‑pool.

Massive Connection Reset – same keyword; backend OOM or crash‑restart. Look at dmesg and backend logs.

502/504 alternating – both logs appear; backend overloaded, some requests rejected, others timed out. Inspect GC, thread‑pool, connection‑pool.
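
Telling occasional, persistent, and periodic errors apart is easier with a per‑minute count. This is a minimal sketch, assuming the default access.log layout in which field 4 is the bracketed timestamp and field 9 is the status; the function name errors_per_minute is hypothetical.

```shell
#!/usr/bin/env bash
# errors_per_minute: count 502 and 504 responses per minute.
# Reads access.log lines on stdin; prints "<minute> <status> <count>".
errors_per_minute() {
  awk '
    $9 == 502 || $9 == 504 {
      minute = substr($4, 2, 17)   # "[01/Jan/2024:10:15:42" -> "01/Jan/2024:10:15"
      count[minute " " $9]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

# Usage:
#   errors_per_minute < /var/log/nginx/access.log
```

The output is sorted as text, which is chronological within a single day. Spikes that recur at the same minute of each hour suggest a scheduled task, while a continuous run suggests that all upstreams are down.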

8. Production‑Environment Best Practices

Backup configuration before any change.

cp -a /etc/nginx /etc/nginx.$(date +%F_%H%M%S).bak

Reload configuration instead of doing a full restart: nginx -t && nginx -s reload. Restarting (e.g., systemctl restart nginx) interrupts active connections.

Avoid global timeout settings; tailor proxy_read_timeout per API SLA.

Disable proxy_buffering for long‑polling, SSE, etc., otherwise data is held until the buffer fills.

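Be cautious with proxy_next_upstream on POST requests: retries can cause duplicate submissions if the backend is not idempotent. Since version 1.9.13 Nginx does not retry non‑idempotent methods (POST, LOCK, PATCH) once the request has been sent to a server unless the non_idempotent parameter is set, so enable that parameter only for endpoints that tolerate replays.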

A 499 status means the client closed the connection before Nginx sent a response, usually because a browser or front‑end timeout fired first.

9. Conclusion

Diagnosing 502, 504 and Connection Reset is not about "restarting everything"; it relies on log‑first analysis and segment‑by‑segment verification:

Log first – error.log tells you the error type and which segment failed.

Layered validation – local curl → direct upstream → confirm the faulty segment.

Treat the symptom with the right remedy:

502 – check backend liveness and firewall.

504 – examine backend response time and timeout settings.

Connection Reset – inspect upstream fd limits and thread‑pool saturation.

All of this presumes that Nginx logs contain the enriched fields $upstream_addr, $upstream_status and $upstream_response_time. Without them, troubleshooting efficiency drops dramatically.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
