Master Nginx Troubleshooting: From 502 Errors to Performance Optimization
This article walks you through ten real-world Nginx failure cases—covering 502 errors, SSL expiration, high concurrency bottlenecks, cache misconfigurations, log rotation issues, load‑balancing mistakes, security gaps, reverse‑proxy quirks, URL rewrite conflicts, and monitoring—while teaching a systematic diagnostic methodology for ops engineers.
From 502 to Troubleshooting: Common Nginx Failure Cases
As an operations engineer, have you ever been woken up by a 502 alarm in the middle of the night? This article uses real cases to guide you from a novice to an expert in Nginx fault diagnosis.
Introduction: Nginx Pitfalls We’ve All Encountered
In the career of an internet‑company ops engineer, Nginx failures are among the most frequent and frustrating problems, ranging from simple misconfigurations to complex performance bottlenecks.
With eight years of experience, the author shares a proven troubleshooting methodology through ten real cases.
Case 1: Classic 502 – Upstream Service Unreachable
Symptom
An e‑commerce site experiences massive 502 errors during a promotion, preventing users from placing orders.
Investigation Steps
Step 1: Check Nginx error log
# View the latest error log
tail -f /var/log/nginx/error.log
# Typical 502 log entry
2024/09/15 14:30:25 [error] 12345#0: *67890 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.100, server: shop.example.com, request: "POST /api/order HTTP/1.1", upstream: "http://192.168.1.200:8080/api/order", host: "shop.example.com"
Step 2: Verify upstream service status
# Check if backend service is running
netstat -tulpn | grep 8080
ps aux | grep java
# Test connectivity
curl -I http://192.168.1.200:8080/health
telnet 192.168.1.200 8080
Step 3: Analyze Nginx configuration
upstream backend_servers {
server 192.168.1.200:8080 weight=1 max_fails=3 fail_timeout=30s;
server 192.168.1.201:8080 weight=1 max_fails=3 fail_timeout=30s backup;
}
server {
listen 80;
server_name shop.example.com;
location /api/ {
proxy_pass http://backend_servers;
proxy_connect_timeout 5s;
proxy_read_timeout 60s;
proxy_send_timeout 60s;
}
}
Root Cause
The primary server (192.168.1.200) crashed under high load, and the backup server’s configuration prevented it from taking over.
Solution
# 1. Restart the failed application
systemctl restart tomcat
# 2. Fix backup server config (remove backup flag)
upstream backend_servers {
server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}
# 3. Reload Nginx
nginx -t && nginx -s reload
Prevention
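Configure upstream health checks (for example max_fails and fail_timeout) so that failed nodes leave the rotation quickly.
Choose a load‑balancing strategy that spreads traffic across all healthy nodes instead of relying on a single primary.
Set up monitoring and alerting on 502 rates and upstream availability.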
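To make the monitoring item concrete, the sketch below tallies upstream connection failures per backend from error-log lines in the format shown in Step 1. The function name count_upstream_failures is hypothetical, and this is a minimal illustration under that log-format assumption, not a replacement for a real alerting pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally "connect() failed" errors per upstream host:port.
# Reads Nginx error-log lines on stdin and prints "<count> <host:port>",
# busiest upstream first.
#
# Typical use (path as in the article):
#   count_upstream_failures < /var/log/nginx/error.log
count_upstream_failures() {
    grep 'connect() failed' \
        | sed -n 's/.*upstream: "\([^"]*\)".*/\1/p' \
        | sed -E 's#^[a-z]+://([^/]+).*#\1#' \
        | sort | uniq -c | sort -rn \
        | awk '{print $1, $2}'
}
```

A backend that dominates this list during an incident is the first candidate to check with the connectivity tests from Step 2.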
Case 2: SSL Certificate Expiration Causing Service Outage
Symptom
A financial website shows "Your connection is not private" errors.
Investigation
Check SSL certificate status
# View certificate expiration
openssl x509 -in /etc/nginx/ssl/domain.crt -noout -dates
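# Online check (echo closes stdin so s_client exits after the handshake; -servername sends SNI)
echo | openssl s_client -connect finance.example.com:443 -servername finance.example.com 2>/dev/null | openssl x509 -noout -dates
# View Nginx SSL config
nginx -T | grep -A 10 -B 5 ssl_certificate
Sample SSL configuration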
server {
listen 443 ssl http2;
server_name finance.example.com;
ssl_certificate /etc/nginx/ssl/domain.crt;
ssl_certificate_key /etc/nginx/ssl/domain.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphers off;
add_header Strict-Transport-Security "max-age=31536000" always;
}
Solution
# 1. Generate a new certificate (e.g., Let's Encrypt)
certbot --nginx -d finance.example.com
# 2. Update SSL paths
ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem;
# 3. Test and reload
nginx -t && nginx -s reload
# 4. Verify
curl -I https://finance.example.com
Automation
# Certificate renewal cron (entries in /etc/cron.d need a user field, here root)
cat > /etc/cron.d/certbot <<'EOF'
0 12 * * * root /usr/bin/certbot renew --quiet --post-hook "nginx -s reload"
EOF
# Simple SSL monitor script
cat > /usr/local/bin/ssl_check.sh <<'EOF'
#!/bin/bash
DOMAIN="finance.example.com"
DAYS=30
EXPIRY_DATE=$(echo | openssl s_client -connect "$DOMAIN:443" -servername "$DOMAIN" 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))
if [ "$DAYS_LEFT" -lt "$DAYS" ]; then
    echo "SSL certificate for $DOMAIN expires in $DAYS_LEFT days!"
fi
EOF
chmod +x /usr/local/bin/ssl_check.sh
Case 3: High Concurrency Performance Bottleneck
Symptom
A video site experiences slow responses and playback failures during peak hours.
Performance Tools
# Check Nginx connections
curl http://localhost/nginx_status
# System load
htop
# Established TCP connection count (ss -tuln would list only listening sockets)
ss -tan state established | tail -n +2 | wc -l
netstat -an | grep :80 | wc -l
Nginx Status Page Configuration
server {
listen 80;
server_name localhost;
location /nginx_status {
stub_status on;
access_log off;
allow 127.0.0.1;
deny all;
}
}
Performance Optimizations
# Main worker settings (worker_connections is only valid inside the events block)
worker_processes auto;
worker_rlimit_nofile 65535;
events {
use epoll;
multi_accept on;
worker_connections 65535;
}
http {
# Enable gzip
gzip on;
gzip_vary on;
gzip_min_length 1000;
gzip_types text/plain text/css application/json application/javascript;
# File cache
open_file_cache max=100000 inactive=20s;
open_file_cache_valid 30s;
open_file_cache_min_uses 2;
open_file_cache_errors on;
# Connection tuning
keepalive_timeout 65;
keepalive_requests 100;
# Buffer tuning
client_body_buffer_size 128k;
client_max_body_size 50m;
client_header_buffer_size 1k;
large_client_header_buffers 4 4k;
}
System‑Level Tweaks
# sysctl parameters
cat >> /etc/sysctl.conf <<'EOF'
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000
fs.file-max = 1000000
EOF
sysctl -p
Case 4: Cache Misconfiguration Issues
Symptom
A news site shows stale content even after clearing browser cache.
Cache Configuration Analysis
server {
listen 80;
server_name news.example.com;
# Static assets long‑term cache
location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
expires 1y;
add_header Cache-Control "public, immutable";
add_header Pragma public;
}
# Dynamic content
location / {
proxy_pass http://backend;
# Wrong: proxy_cache_valid has no effect here because no proxy_cache zone is enabled
proxy_cache_valid 200 302 10m;
proxy_cache_valid 404 1m;
add_header X-Cache-Status $upstream_cache_status;
}
}
Problem Diagnosis
# Inspect cache directory
ls -la /var/cache/nginx/
# View cache config
nginx -T | grep -A 20 proxy_cache
# Test cache status header
curl -I http://news.example.com/article/123 | grep X-Cache-Status
Correct Cache Configuration
http {
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;
server {
listen 80;
server_name news.example.com;
# API – no cache
location /api/ {
proxy_pass http://backend;
proxy_cache off;
add_header Cache-Control "no-cache, no-store, must-revalidate";
}
# News articles – cache for 5 minutes
location /article/ {
proxy_pass http://backend;
proxy_cache my_cache;
proxy_cache_valid 200 5m;
proxy_cache_use_stale error timeout updating;
add_header X-Cache-Status $upstream_cache_status;
}
# Static assets – long‑term cache
location ~* \.(jpg|jpeg|png|gif|ico)$ {
expires 1y;
add_header Cache-Control "public, immutable";
}
location ~* \.(css|js)$ {
expires 1d;
add_header Cache-Control "public";
}
}
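}
Cache Management Tools
# Purge specific URL (works only if a purge handler is configured, e.g. the third-party ngx_cache_purge module or NGINX Plus proxy_cache_purge)
curl -X PURGE http://news.example.com/article/123
# Bulk purge cache files older than 7 days (Nginx names cache files by MD5 hash, with no extension)
find /var/cache/nginx -type f -mtime +7 -delete
# Cache statistics script
cat > /usr/local/bin/cache_stats.sh <<'EOF'
#!/bin/bash
CACHE_DIR="/var/cache/nginx"
echo "Cache directory size: $(du -sh "$CACHE_DIR")"
echo "Cache files count: $(find "$CACHE_DIR" -type f | wc -l)"
# A count of log lines containing HIT, not a rate; meaningful only if $upstream_cache_status is part of log_format
echo "Cache HIT entries: $(grep -c HIT /var/log/nginx/access.log)"
EOF
Case 5: Log Rotation Failure Leading to Disk Exhaustion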
Symptom
The server becomes unresponsive because the disk is 100% full due to oversized Nginx logs.
Diagnosis
# Check disk usage
df -h
# Find large files in log directory (-a lists files, not just directories)
du -ah /var/log/nginx/ | sort -hr | head
# Inspect logrotate config
cat /etc/logrotate.d/nginx
Fix and Optimization
# Truncate current logs (emergency)
> /var/log/nginx/access.log
> /var/log/nginx/error.log
# Reopen logs
nginx -s reopen
Improved Logrotate Configuration
/var/log/nginx/*.log {
daily
missingok
rotate 14
compress
delaycompress
notifempty
create 640 nginx nginx
sharedscripts
postrotate
if [ -f /var/run/nginx.pid ]; then
kill -USR1 `cat /var/run/nginx.pid`
fi
endscript
}
Log Format Optimization
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'rt=$request_time uct="$upstream_connect_time" '
'uht="$upstream_header_time" urt="$upstream_response_time"';
map $status $loggable {
~^[23] 0;
default 1;
}
server {
access_log /var/log/nginx/access.log main if=$loggable;
location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ { access_log off; expires 1y; }
}
}
Disk Monitoring Script
# Disk usage monitor
cat > /usr/local/bin/disk_monitor.sh <<'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ $USAGE -gt $THRESHOLD ]; then
echo "Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
find /var/log/nginx -name "*.log.*" -mtime +7 -delete
fi
EOF
Case 6: Load‑Balancing Configuration Errors
Symptom
Traffic is unevenly distributed across backend servers, causing overload on some nodes.
Load‑Balancing Strategies Comparison
# Round‑robin (default)
upstream backend_round_robin {
server 192.168.1.10:8080;
server 192.168.1.11:8080;
server 192.168.1.12:8080;
}
# Weighted round‑robin
upstream backend_weighted {
server 192.168.1.10:8080 weight=3;
server 192.168.1.11:8080 weight=2;
server 192.168.1.12:8080 weight=1;
}
# IP hash
upstream backend_ip_hash {
ip_hash;
server 192.168.1.10:8080;
server 192.168.1.11:8080;
server 192.168.1.12:8080;
}
# Least connections
upstream backend_least_conn {
least_conn;
server 192.168.1.10:8080;
server 192.168.1.11:8080;
server 192.168.1.12:8080;
}
Health‑Check Configuration
upstream backend_with_health {
server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
keepalive 32;
}
server {
location / {
proxy_pass http://backend_with_health;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 5s;
proxy_http_version 1.1;
proxy_set_header Connection "";
}
}
Backend Health‑Check Script
#!/bin/bash
# /usr/local/bin/backend_health_check.sh
SERVERS=("192.168.1.10:8080" "192.168.1.11:8080" "192.168.1.12:8080")
for server in "${SERVERS[@]}"; do
    # -f makes curl return a non-zero status on HTTP errors
    if curl -sf "http://$server/health" > /dev/null; then
        echo "$server: OK"
    else
        echo "$server: FAILED"
    fi
done
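Uneven distribution is easier to confirm with numbers than by watching server load. The sketch below summarizes how requests were spread across backends. It assumes the access log carries a field of the form ua="$upstream_addr", which is not part of the log formats shown in this article and would have to be added to log_format; the function name is hypothetical as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show how requests are spread across upstream servers.
# Assumes each access-log line contains ua="<upstream_addr>" (an assumed
# log_format addition). When a request was retried, $upstream_addr lists
# several servers separated by commas; only the first one is counted here.
# Output: "<host:port> <requests> <share>%" sorted by address.
upstream_distribution() {
    grep -o 'ua="[^"]*"' \
        | cut -d'"' -f2 \
        | sed 's/,.*//' \
        | sort | uniq -c \
        | awk '{ count[$2] = $1; total += $1 }
               END { for (u in count)
                         printf "%s %d %.1f%%\n", u, count[u], 100 * count[u] / total }' \
        | sort
}
```

With weighted round‑robin the shares should roughly follow the configured weights; with ip_hash a few large NAT gateways can legitimately skew them.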
Case 7: Security Configuration Vulnerabilities
Symptom
Security scans reveal multiple Nginx vulnerabilities.
Hardening Settings
# Rate-limit zones belong in the http context; limit_req_zone is not allowed inside a server block
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=1r/s;
server {
listen 80;
server_name secure.example.com;
# Hide version
server_tokens off;
# Requires the third-party headers-more module
more_set_headers "Server: WebServer";
# Security headers
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "no-referrer-when-downgrade" always;
add_header Content-Security-Policy "default-src 'self' http: https: data: blob: 'unsafe-inline'" always;
# Restrict methods
if ($request_method !~ ^(GET|HEAD|POST)$) { return 405; }
# Block access to hidden files such as .git or .htaccess
location ~ /\. { deny all; access_log off; log_not_found off; }
# Limit upload size
client_max_body_size 10M;
# Rate limiting (uses the zones defined in the http context above)
location /api/ { limit_req zone=api burst=20 nodelay; proxy_pass http://backend; }
location /login { limit_req zone=login burst=5 nodelay; proxy_pass http://backend; }
}
Fail2Ban Protection
# /etc/fail2ban/filter.d/nginx-4xx.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*" (404|403|400) .*$
ignoreregex =
# /etc/fail2ban/jail.local
[nginx-4xx]
enabled = true
port = http,https
filter = nginx-4xx
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 60
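Before enabling a jail like this, it helps to see which clients it would catch. The sketch below is a rough dry run over an access log in the combined format; the function name and the default threshold are hypothetical, and unlike Fail2Ban it ignores the findtime window and simply counts over whatever input it is given.

```shell
#!/usr/bin/env bash
# Hypothetical dry run for the nginx-4xx jail: list client IPs with at least
# MIN responses of 400, 403 or 404 (MIN defaults to 10, matching maxretry).
# Reads combined-format access-log lines on stdin. Field 9 is the status code
# only when the request line has the usual three parts (method, URI, protocol),
# so malformed requests are not counted.
#
# Typical use: top_4xx_clients 10 < /var/log/nginx/access.log
top_4xx_clients() {
    local min="${1:-10}"
    awk -v min="$min" '
        $9 ~ /^(400|403|404)$/ { hits[$1]++ }
        END { for (ip in hits) if (hits[ip] >= min) print hits[ip], ip }
    ' | sort -rn
}
```

If legitimate crawlers or internal health checks show up in the output, add them to ignoreip before turning the jail on.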
Case 8: Reverse‑Proxy Real‑IP Loss
Symptom
Backend services cannot obtain the client’s real IP when Nginx acts as a reverse proxy.
Solution
server {
listen 80;
server_name api.example.com;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_redirect off;
proxy_connect_timeout 30s;
proxy_send_timeout 30s;
proxy_read_timeout 30s;
proxy_buffering on;
proxy_buffer_size 4k;
proxy_buffers 8 4k;
proxy_busy_buffers_size 8k;
}
}
WebSocket Support
map $http_upgrade $connection_upgrade { default upgrade; '' close; }
server {
listen 80;
server_name ws.example.com;
location /websocket {
proxy_pass http://backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection $connection_upgrade;
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
proxy_read_timeout 86400;
}
}
Case 9: URL Rewrite Conflicts
Symptom
Complex rewrite rules cause redirect loops and 404 errors.
Optimized Rewrite Rules
server {
listen 80;
server_name example.com www.example.com;
# Force primary domain
if ($host != 'example.com') { return 301 https://example.com$request_uri; }
# SEO‑friendly rewrites
location / {
try_files $uri $uri/ @rewrites;
}
location @rewrites {
rewrite ^/product/([0-9]+)$ /product.php?id=$1 last;
rewrite ^/category/([a-zA-Z0-9-]+)$ /category.php?name=$1 last;
rewrite ^/user/([a-zA-Z0-9]+)$ /profile.php?username=$1 last;
return 404;
}
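# Prevent PHP redirect loops
location ~ \.php$ {
try_files $uri =404;
fastcgi_pass 127.0.0.1:9000;
fastcgi_index index.php;
include fastcgi_params;
# The stock fastcgi_params file does not set SCRIPT_FILENAME, which PHP-FPM needs to locate the script
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}
}
Debugging Rewrite Rules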
# Enable rewrite logging
error_log /var/log/nginx/rewrite.log notice;
rewrite_log on;
# Test rule (no "break" flag, so the return below still runs and echoes the rewritten query string)
location /test {
rewrite ^/test/(.*)$ /debug?param=$1;
return 200 "Rewrite test: $args\n";
}
Case 10: Performance Monitoring and Tuning
Symptom
A comprehensive Nginx performance monitoring system is needed to detect issues early.
Monitoring Script
#!/bin/bash
# /usr/local/bin/nginx_monitor.sh
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"
STATUS=$(curl -s "$NGINX_STATUS_URL")
ACTIVE_CONN=$(echo "$STATUS" | grep "Active connections" | awk '{print $3}')
# Reading/Writing/Waiting are on the fourth line of stub_status output
READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')
echo "$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING" >> "$LOG_FILE"
# Default to 0 so the test does not fail when the status page is unreachable
if [ "${ACTIVE_CONN:-0}" -gt 1000 ]; then
    echo "High connection count: $ACTIVE_CONN" | logger -t nginx_monitor
fi
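The stub_status page returns four lines: the active connection count, a header line, the accepts/handled/requests totals, and the Reading/Writing/Waiting counters. Picking fields by line number is easy to get wrong, so the sketch below matches each line by its label instead. The function name parse_stub_status is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parse stub_status output from stdin by label rather
# than by line number, and print the four gauges on one line.
# Counters that are missing from the input are reported as 0.
#
# Typical use: curl -s http://localhost/nginx_status | parse_stub_status
parse_stub_status() {
    awk '
        /^Active connections:/ { active = $3 }
        /^Reading:/            { reading = $2; writing = $4; waiting = $6 }
        END { printf "active=%d reading=%d writing=%d waiting=%d\n",
                     active, reading, writing, waiting }
    '
}
```

The single-line key=value output is convenient to append to a log file or to feed into an existing metrics collector.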
Comprehensive Tuning Configuration
worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 100000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;
events {
use epoll;
worker_connections 10240;
multi_accept on;
accept_mutex off;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" $request_time $upstream_response_time';
# Performance tweaks
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
keepalive_requests 1000;
gzip on;
gzip_vary on;
gzip_min_length 1000;
gzip_comp_level 6;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
open_file_cache max=100000 inactive=20s;
open_file_cache_valid 30s;
open_file_cache_min_uses 2;
open_file_cache_errors on;
# Security
server_tokens off;
client_header_timeout 10;
client_body_timeout 10;
reset_timedout_connection on;
send_timeout 10;
# Rate limiting
limit_req_zone $binary_remote_addr zone=global:10m rate=100r/s;
limit_conn_zone $binary_remote_addr zone=addr:10m;
include /etc/nginx/conf.d/*.conf;
}
Fault‑Diagnosis Methodology Summary
1. Standardized Investigation Process
Collect fault information: confirm symptoms, impact scope, timing.
Check logs: error.log, access.log, system logs.
Inspect configuration files: syntax and logic checks.
Validate network connectivity: port status, connectivity tests.
Analyze performance metrics: CPU, memory, network, disk.
Identify root cause: deep analysis to find the true reason.
Implement solution: temporary fix followed by permanent resolution.
Verify remediation: functional and performance testing.
Document lessons learned: update docs and improve processes.
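The first few steps of this process lend themselves to a small triage wrapper that runs each check, reports the result, and keeps going even when one check fails. The sketch below shows one way to do that; run_check is a hypothetical name, and the example commands in the comments reuse addresses from Case 1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one named check, print PASS or FAIL, and return
# the result without aborting the rest of the triage run.
#
# Example calls (commands taken from the cases above):
#   run_check "config syntax"   nginx -t
#   run_check "upstream health" curl -sf http://192.168.1.200:8080/health
#   run_check "status page"     curl -sf http://localhost/nginx_status
run_check() {
    local name="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $name"
    else
        echo "FAIL $name"
        return 1
    fi
}
```

Keeping every check to one line makes the output easy to paste into an incident channel while the deeper analysis continues.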
2. Common Diagnostic Tools
Log analysis: tail, grep, awk, sed.
Network utilities: curl, wget, telnet, netstat, ss.
Performance monitoring: htop, iotop, iftop, nginx‑status.
System diagnosis: strace, lsof, tcpdump.
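As an example of what the log-analysis tools can do together, the sketch below counts 5xx responses per minute from an access log in the combined format, which is often the quickest way to see when an incident started and whether it is over. The function name errors_per_minute is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count 5xx responses per minute.
# Reads combined-format access-log lines on stdin and prints
# "<dd/Mon/yyyy:HH:MM> <count>" in lexical order of the timestamp.
# Field 9 is the status code when the request line has three parts.
#
# Typical use: errors_per_minute < /var/log/nginx/access.log
errors_per_minute() {
    awk '$9 ~ /^5[0-9][0-9]$/ {
            # $4 looks like "[15/Sep/2024:14:30:25"; keep date plus HH:MM
            minute = substr($4, 2, 17)
            count[minute]++
         }
         END { for (m in count) print m, count[m] }' | sort
}
```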
3. Preventive Measures
Establish a complete monitoring and alert system.
Regularly back up configuration files.
Adopt automated operations tools.
Define standardized operating procedures.
Conduct periodic fault‑drill exercises.
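Configuration backups are the easiest of these measures to automate. The sketch below archives a configuration directory into a timestamped tarball. The function name, the default paths and the archive naming scheme are assumptions for illustration, not something prescribed by Nginx.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive a config directory into a timestamped tarball
# and print the archive path. Defaults (/etc/nginx, /var/backups/nginx) are
# assumptions; pass other paths as the first and second argument.
#
# Typical use before editing the config: backup_config
backup_config() {
    local src="${1:-/etc/nginx}"
    local dest="${2:-/var/backups/nginx}"
    local stamp archive
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/nginx-config-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C keeps paths inside the archive relative to the parent directory
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}
```

Running it from cron and before every manual change gives a known-good copy to diff against or restore when a reload goes wrong.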
Conclusion
Nginx fault diagnosis is an essential skill for operations engineers, requiring solid theoretical knowledge and extensive hands‑on experience.
MaGe Linux Operations
