Mastering Nginx 502/504 Errors: A Complete Troubleshooting Guide with Scripts
This comprehensive guide explains the differences between Nginx 502 and 504 errors, provides step‑by‑step troubleshooting procedures, detailed configuration examples, one‑click diagnostic scripts, real‑world case studies, best‑practice optimizations, monitoring setups, and advanced learning paths to help you quickly resolve gateway issues and improve server reliability.
Overview
502 and 504 are the two most common Nginx gateway errors. 502 Bad Gateway means the backend service is unavailable, crashed, or returned invalid data. 504 Gateway Timeout means the backend responded too slowly and exceeded Nginx's timeout limits.
Applicable Scenarios
PHP‑FPM (LNMP stack)
Java / Go / Python services
Load‑balanced upstreams
WebSocket long‑connection scenarios
Environment Requirements
Nginx ≥ 1.14 (mainstream versions)
OS: CentOS 7+ / Ubuntu 18.04+
Backend: PHP‑FPM, Tomcat, custom services (examples focus on PHP‑FPM)
502 Diagnosis
Step 1 – Check Backend Service
# PHP‑FPM status
systemctl status php-fpm
ps aux | grep php-fpm | grep -v grep
# Verify listening socket (default 9000 for PHP‑FPM)
ss -tlnp | grep 9000
# If using a Unix socket
ls -la /run/php-fpm/www.sock
Step 2 – Inspect Nginx error.log
# Real‑time view
tail -f /var/log/nginx/error.log
# Filter 502‑related messages
grep -E "502|upstream|connect|failed" /var/log/nginx/error.log
Common error messages and meanings:
connect() failed (111: Connection refused) – backend not started or wrong port.
connect() failed (113: No route to host) – network unreachable.
upstream prematurely closed connection – backend closed the connection (OOM, crash).
no live upstreams – all upstream nodes are down.
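The error messages listed above can be tallied so that the dominant failure mode is visible at a glance. The following is a minimal sketch, not a definitive tool: the function name classify_upstream_errors and the short labels (refused, no_route, closed_early, no_live_upstreams) are invented for illustration, and only the four message patterns from the list are matched.

```shell
#!/usr/bin/env bash
# Count Nginx error.log lines per upstream failure signature.
# Reads log lines on stdin; prints one "label=count" line per signature.
# The labels are hypothetical names chosen for this sketch.
classify_upstream_errors() {
  awk '
    /connect\(\) failed \(111:/    { n["refused"]++ }            # backend not listening
    /connect\(\) failed \(113:/    { n["no_route"]++ }           # network unreachable
    /upstream prematurely closed/  { n["closed_early"]++ }       # backend dropped the connection
    /no live upstreams/            { n["no_live_upstreams"]++ }  # every upstream marked down
    END {
      count = split("refused no_route closed_early no_live_upstreams", keys, " ")
      for (i = 1; i <= count; i++) printf "%s=%d\n", keys[i], n[keys[i]]
    }'
}
```

Typical usage would be classify_upstream_errors < /var/log/nginx/error.log. A high refused count points at a stopped backend or wrong port, while closed_early suggests crashes or OOM kills on the backend side.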
Step 3 – PHP‑FPM Specific Checks (LNMP)
# View PHP‑FPM status page (must be enabled)
curl http://127.0.0.1/php-fpm-status
# Count active processes
ps aux | grep "php-fpm: pool" | grep -v grep | wc -l
# Check PHP‑FPM logs
tail -100 /var/log/php-fpm/www-error.log
Enable the status page in /etc/php-fpm.d/www.conf:
pm.status_path = /php-fpm-status
Step 4 – Connection & Resource Limits
# Open file descriptors
cat /proc/sys/fs/file-nr
# System-wide socket summary (totals by state, not limited to Nginx workers)
ss -s
# Connections on a specific port (e.g., 9000)
ss -ant | grep :9000 | wc -l
# System limits
ulimit -n
grep -v "^#" /etc/security/limits.conf
504 Diagnosis
Step 1 – Verify Timeout Settings
# Show all timeout directives in Nginx config
grep -r "timeout" /etc/nginx/ | grep -v "#"
# Typical timeout parameters (default 60s)
proxy_connect_timeout 60s;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
fastcgi_connect_timeout 60s;
fastcgi_read_timeout 60s;
fastcgi_send_timeout 60s;
Step 2 – Analyse Backend Response Time
# Enable detailed logging in nginx.conf
log_format detailed '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';
access_log /var/log/nginx/access.log detailed;
Key fields:
$request_time – total time Nginx spent on the request.
$upstream_connect_time – time to connect to the backend.
$upstream_header_time – time to receive response headers.
$upstream_response_time – time to receive the full response.
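To see where a request spent its time, the four fields above can be pulled out of each log line and shown side by side. This is a sketch that assumes the access log uses the detailed format defined above (rt=, uct=, uht=, urt= tokens, with the request path as the seventh whitespace-separated field). The function name timing_breakdown and the output labels are hypothetical.

```shell
#!/usr/bin/env bash
# Print path plus the four timing fields for each access-log line on stdin.
# Assumes the "detailed" log_format; a field that is absent is reported as "-".
timing_breakdown() {
  awk '{
    rt = uct = uht = urt = "-"
    for (i = 1; i <= NF; i++) {
      v = $i
      if      (v ~ /^rt=/)  rt  = substr(v, 4)
      else if (v ~ /^uct=/) uct = substr(v, 5)
      else if (v ~ /^uht=/) uht = substr(v, 5)
      else if (v ~ /^urt=/) urt = substr(v, 5)
    }
    # The upstream values are logged inside double quotes; strip them.
    gsub(/"/, "", uct); gsub(/"/, "", uht); gsub(/"/, "", urt)
    print $7, "rt=" rt, "connect=" uct, "first_byte=" uht, "full=" urt
  }'
}
```

For example, tail -100 /var/log/nginx/access.log | timing_breakdown. A large first_byte with a small connect value means the backend accepted the connection quickly but was slow to produce a response, which is the usual 504 pattern.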
Step 3 – Identify Slow Backend Causes
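# Find requests taking >5 seconds (requires the rt= field from the detailed log format)
awk -F'rt=' 'NF>1 {split($2,a," "); if (a[1]+0 > 5) print}' /var/log/nginx/access.log | tail -20
# List the 20 slowest requests as "seconds URL", slowest first
awk '{for(i=1;i<=NF;i++) if($i ~ /^rt=/){t=substr($i,4); if(t+0>5) print t, $7}}' /var/log/nginx/access.log | sort -nr | head -20
Typical reasons for 504: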
Slow SQL queries (enable MySQL slow‑query log).
External API latency.
Infinite loops or blocking operations in application code.
Resource lock contention.
Timeout values too short.
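To tie slow requests back to one of the causes above, it helps to group them by URL rather than read them line by line: a single endpoint that is always slow suggests a query or external call in that code path, while slowness spread across all URLs suggests resource contention. The sketch below assumes the detailed log format from Step 2. The function name slow_urls and its default threshold of 3 seconds are illustrative choices.

```shell
#!/usr/bin/env bash
# slow_urls [THRESHOLD_SECONDS]
# Reads access-log lines on stdin and prints "count max_seconds url" for every
# URL with at least one request slower than the threshold, most frequent first.
slow_urls() {
  local threshold="${1:-3}"
  awk -v t="$threshold" '{
    for (i = 1; i <= NF; i++) if ($i ~ /^rt=/) {
      rt = substr($i, 4) + 0
      if (rt > t) { n[$7]++; if (rt > max[$7]) max[$7] = rt }
    }
  } END { for (u in n) printf "%d %.3f %s\n", n[u], max[u], u }' | sort -k1,1nr -k2,2nr
}
```

Usage: slow_urls 5 < /var/log/nginx/access.log lists the URLs that exceeded five seconds, with the number of slow hits and the worst observed time for each.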
Sample Configuration
Nginx Optimisation
# /etc/nginx/nginx.conf
user nginx;
worker_processes auto; # match CPU cores
worker_rlimit_nofile 65535;
error_log /var/log/nginx/error.log warn;
pid /run/nginx.pid;
events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
}
http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" rt=$request_time uct="$upstream_connect_time" uht="$upstream_header_time" urt="$upstream_response_time"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;
    gzip on;
    gzip_min_length 1k;
    gzip_comp_level 4;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    upstream php_backend {
        server unix:/run/php-fpm/www.sock;
        keepalive 16; # for FastCGI this only takes effect with fastcgi_keep_conn on
    }
    upstream api_backend {
        least_conn;
        server 192.168.1.10:8080 weight=5 max_fails=3 fail_timeout=30s;
        server 192.168.1.11:8080 weight=5 max_fails=3 fail_timeout=30s;
        keepalive 32; # requires proxy_http_version 1.1 and an empty Connection header
    }
    include /etc/nginx/conf.d/*.conf;
}
PHP‑FPM Optimisation
# /etc/php-fpm.d/www.conf
[www]
user = nginx
group = nginx
listen = /run/php-fpm/www.sock
listen.owner = nginx
listen.group = nginx
listen.mode = 0660
pm = dynamic
pm.max_children = 100
pm.start_servers = 20
pm.min_spare_servers = 10
pm.max_spare_servers = 30
pm.max_requests = 500
pm.process_idle_timeout = 10s
pm.status_path = /php-fpm-status
ping.path = /php-fpm-ping
ping.response = pong
slowlog = /var/log/php-fpm/www-slow.log
request_slowlog_timeout = 3s
request_terminate_timeout = 120s
php_admin_value[error_log] = /var/log/php-fpm/www-error.log
php_admin_flag[log_errors] = on
One‑Click Diagnosis Script
#!/bin/bash
# nginx_diagnose.sh – quick 502/504 check
# Usage: bash nginx_diagnose.sh
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

echo -e "${YELLOW}=== Nginx status ===${NC}"
if systemctl is-active --quiet nginx; then
    echo -e "${GREEN}[OK] Nginx running${NC}"
else
    echo -e "${RED}[ERROR] Nginx not running${NC}"
fi
nginx -t 2>&1 | head -5

echo -e "${YELLOW}=== Backend status ===${NC}"
if command -v php-fpm >/dev/null; then
    if systemctl is-active --quiet php-fpm; then
        echo -e "${GREEN}[OK] PHP‑FPM running${NC}"
        fpm_count=$(ps aux | grep "php-fpm: pool" | grep -v grep | wc -l)
        echo " PHP‑FPM processes: $fpm_count"
    else
        echo -e "${RED}[ERROR] PHP‑FPM not running${NC}"
    fi
fi
for port in 9000 8080 3000 5000; do
    ss -tlnp | grep -q ":$port" && echo -e "${GREEN}[OK] Port $port listening${NC}"
done

echo -e "${YELLOW}=== Recent 502/504 errors ===${NC}"
if [ -f /var/log/nginx/error.log ]; then
    # grep -c already prints 0 when nothing matches, so only swallow its exit status
    error_count=$(grep -cE "502|504|upstream" /var/log/nginx/error.log 2>/dev/null || true)
    echo "Matching error lines: ${error_count:-0}"
    echo "Last 10 errors:"
    grep -E "502|504|upstream|connect" /var/log/nginx/error.log | tail -10
fi

echo -e "${YELLOW}=== Connection statistics ===${NC}"
ss -s
ss -ant | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn

echo -e "${YELLOW}=== System resources ===${NC}"
echo "CPU usage:"; top -bn1 | head -5
echo "Memory usage:"; free -h
echo "Disk usage:"; df -h | grep -v tmpfs

echo -e "${YELLOW}=== Nginx timeout settings ===${NC}"
grep -r "timeout" /etc/nginx/ | grep -v "#" | head -20

echo -e "${YELLOW}=== Upstream snippets ===${NC}"
grep -r "upstream" /etc/nginx/ | grep -v "#" | head -20

echo -e "${YELLOW}=== Slow requests (>3s) ===${NC}"
if [ -f /var/log/nginx/access.log ]; then
    awk -F'rt=' '{if(NF>1){split($2,a," ");if(a[1]>3)print $0}}' /var/log/nginx/access.log | tail -10
fi

echo "========================================"
echo -e "${GREEN}Diagnosis complete. Review highlighted items above.${NC}"
echo "========================================"
Best Practices & Caveats
Do not set timeouts excessively long. A 5‑minute proxy_read_timeout hides performance problems; adjust to realistic limits.
A larger process count is not automatically better. Each PHP‑FPM process consumes 20‑50 MiB; over‑provisioning pm.max_children can exhaust memory and trigger the OOM killer.
Use nginx -s reload for configuration changes. Reload preserves existing connections; restart drops them.
Always test configuration with nginx -t before reloading.
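The process-count guidance above comes down to simple arithmetic: at 20‑50 MiB per process, the sample pm.max_children = 100 can need between 2000 and 5000 MiB in the worst case. The sketch below estimates a ceiling from measured memory instead of a guess. Both function names are hypothetical, and the memory budget passed in (how much RAM you are willing to give PHP‑FPM) is an assumption you must supply yourself.

```shell
#!/usr/bin/env bash
# avg_rss_mib: read per-process RSS values in KiB on stdin (one per line)
# and print the average in whole MiB (0 if there is no input).
avg_rss_mib() {
  awk '{ sum += $1; n++ } END { if (n) printf "%d\n", sum / n / 1024; else print 0 }'
}

# suggest_max_children BUDGET_MIB PER_PROCESS_MIB
# Integer division: how many workers of the given size fit in the budget.
suggest_max_children() {
  local budget_mib="$1" per_process_mib="$2"
  if [ "$per_process_mib" -le 0 ]; then
    echo "per-process size must be positive" >&2
    return 1
  fi
  echo $(( budget_mib / per_process_mib ))
}
```

On a Linux host the per-process figure could be measured with ps -o rss= -C php-fpm | avg_rss_mib (the process name may differ by distribution), and the result fed to suggest_max_children together with the budget, for example suggest_max_children 4096 50.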
Monitoring & Alerting
Enable stub_status and query http://127.0.0.1/nginx_status to monitor active, reading, writing, and waiting connections.
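The stub_status page returns a small plain-text report, so its counters are easy to extract for a cron job or a custom check. The following sketch parses that report from stdin. The function name parse_stub_status and the key=value output format are illustrative choices, and the URL in the usage note is the one given above.

```shell
#!/usr/bin/env bash
# Parse the plain-text stub_status report on stdin into key=value lines.
# Only the four connection gauges are extracted; the cumulative
# accepts/handled/requests counters are ignored in this sketch.
parse_stub_status() {
  awk '
    /^Active connections:/ { print "active=" $3 }
    /^Reading:/ {
      print "reading=" $2
      print "writing=" $4
      print "waiting=" $6
    }'
}
```

Usage: curl -s http://127.0.0.1/nginx_status | parse_stub_status. A steadily growing writing value alongside rising response times usually means workers are tied up waiting on slow upstreams.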
Prometheus alert examples for high 5xx rate and slow requests (P95 > 3 s):
# High 5xx error rate
- alert: NginxHighErrorRate
  expr: sum(rate(nginx_http_requests_total{status=~"5.."}[5m])) / sum(rate(nginx_http_requests_total[5m])) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Nginx 5xx error rate too high"
    description: "5xx error rate exceeds 5% (current: {{ $value }})"
# Slow requests (P95 > 3s)
- alert: NginxSlowRequests
  expr: histogram_quantile(0.95, sum by (le) (rate(nginx_http_request_duration_seconds_bucket[5m]))) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Nginx request latency high"
    description: "P95 request latency > 3 seconds"
Quick Reference Commands
nginx -t – test configuration.
nginx -s reload – reload without dropping connections.
systemctl status php-fpm – check PHP‑FPM service.
tail -f /var/log/nginx/error.log – live error log.
grep -E "502|504|upstream|connect" /var/log/nginx/error.log – filter gateway errors.
awk '$9==502 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10 – top 502 URLs.
ss -s – connection summary.
ulimit -n – file‑descriptor limit.
Summary
502 = backend unavailable or exhausted. Typical causes: service crash, PHP‑FPM process pool full, socket permission errors, connection‑limit exhaustion.
504 = backend response exceeds Nginx timeout. Typical causes: slow SQL queries, external API latency, infinite loops in application code, lock contention, timeout values too short.
Root‑cause analysis using error.log, access.log, backend status, and timeout settings resolves the majority of incidents without blind restarts or arbitrary timeout increases.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
