
Master Nginx Troubleshooting: From 502 Errors to Performance Optimization

This article walks you through ten real-world Nginx failure cases—covering 502 errors, SSL expiration, high concurrency bottlenecks, cache misconfigurations, log rotation issues, load‑balancing mistakes, security gaps, reverse‑proxy quirks, URL rewrite conflicts, and monitoring—while teaching a systematic diagnostic methodology for ops engineers.

MaGe Linux Operations

From 502 to Troubleshooting: Common Nginx Failure Cases

As an operations engineer, have you ever been woken up by a 502 alarm in the middle of the night? This article uses real cases to guide you from a novice to an expert in Nginx fault diagnosis.

Introduction: Nginx Pitfalls We’ve All Encountered

For ops engineers at internet companies, Nginx failures are among the most frequent and frustrating problems, ranging from simple misconfigurations to complex performance bottlenecks.

With eight years of experience, the author shares a proven troubleshooting methodology through ten real cases.

Case 1: Classic 502 – Upstream Service Unreachable

Symptom

An e‑commerce site experiences massive 502 errors during a promotion, preventing users from placing orders.

Investigation Steps

Step 1: Check Nginx error log

# View the latest error log
tail -f /var/log/nginx/error.log

# Typical 502 log entry
2024/09/15 14:30:25 [error] 12345#0: *67890 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.1.100, server: shop.example.com, request: "POST /api/order HTTP/1.1", upstream: "http://192.168.1.200:8080/api/order", host: "shop.example.com"
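
When the error log is noisy, it can help to tally connection failures per upstream address before checking individual hosts. The following is a minimal sketch rather than part of the original case: the function name count_upstream_errors is hypothetical, and it assumes the standard error-log layout shown above, where each entry carries an upstream: "..." field.

```shell
# Tally "while connecting to upstream" errors per upstream host:port.
# Reads an Nginx error log on stdin and prints "host:port count",
# most frequent first. count_upstream_errors is a hypothetical name.
count_upstream_errors() {
  grep 'while connecting to upstream' \
    | sed -n 's/.*upstream: "\([^"]*\)".*/\1/p' \
    | sed 's#^[a-z]*://\([^/]*\).*#\1#' \
    | sort | uniq -c | sort -rn \
    | awk '{print $2, $1}'
}
```

Usage: count_upstream_errors < /var/log/nginx/error.log. A single address dominating the output points at one failed backend rather than a network-wide problem.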

Step 2: Verify upstream service status

# Check if backend service is running
netstat -tulpn | grep 8080
ps aux | grep java

# Test connectivity
curl -I http://192.168.1.200:8080/health
telnet 192.168.1.200 8080

Step 3: Analyze Nginx configuration

upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 192.168.1.201:8080 weight=1 max_fails=3 fail_timeout=30s backup;
}

server {
    listen 80;
    server_name shop.example.com;

    location /api/ {
        proxy_pass http://backend_servers;
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;
    }
}

Root Cause

The primary server (192.168.1.200) crashed under high load. The backup flag on 192.168.1.201 kept that server out of normal rotation, so it carried no traffic until the primary had already been marked as failed.

Solution

# 1. Restart the failed application
systemctl restart tomcat

# 2. Fix backup server config (remove backup flag)
upstream backend_servers {
    server 192.168.1.200:8080 weight=1 max_fails=2 fail_timeout=10s;
    server 192.168.1.201:8080 weight=1 max_fails=2 fail_timeout=10s;
}

# 3. Reload Nginx
nginx -t && nginx -s reload

Prevention

Configure health‑check mechanisms.

Set appropriate load‑balancing strategies.

Establish a robust monitoring and alert system.

Case 2: SSL Certificate Expiration Causing Service Outage

Symptom

A financial website shows "Your connection is not private" errors.

Investigation

Check SSL certificate status

# View certificate expiration
openssl x509 -in /etc/nginx/ssl/domain.crt -noout -dates

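# Check the certificate the server is actually serving
# (echo closes stdin so that s_client exits instead of waiting for input)
echo | openssl s_client -connect example.com:443 2>/dev/null | openssl x509 -noout -dates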

# View Nginx SSL config
nginx -T | grep -A 10 -B 5 ssl_certificate

Sample SSL configuration

server {
    listen 443 ssl http2;
    server_name finance.example.com;

    ssl_certificate /etc/nginx/ssl/domain.crt;
    ssl_certificate_key /etc/nginx/ssl/domain.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;
    add_header Strict-Transport-Security "max-age=31536000" always;
}

Solution

# 1. Generate a new certificate (e.g., Let's Encrypt)
certbot --nginx -d finance.example.com

# 2. Update SSL paths
ssl_certificate /etc/letsencrypt/live/finance.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/finance.example.com/privkey.pem;

# 3. Test and reload
nginx -t && nginx -s reload

# 4. Verify
curl -I https://finance.example.com

Automation

# Certificate renewal cron
cat > /etc/cron.d/certbot <<'EOF'
0 12 * * * root /usr/bin/certbot renew --quiet --post-hook "nginx -s reload"
EOF

# Simple SSL monitor script
cat > /usr/local/bin/ssl_check.sh <<'EOF'
#!/bin/bash
DOMAIN="finance.example.com"
DAYS=30
EXPIRY_DATE=$(echo | openssl s_client -connect $DOMAIN:443 2>/dev/null | openssl x509 -noout -enddate | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY_DATE" +%s)
CURRENT_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - CURRENT_EPOCH) / 86400 ))
if [ $DAYS_LEFT -lt $DAYS ]; then
  echo "SSL certificate for $DOMAIN expires in $DAYS_LEFT days!"
fi
EOF

Case 3: High Concurrency Performance Bottleneck

Symptom

A video site experiences slow responses and playback failures during peak hours.

Performance Tools

# Check Nginx connections
curl http://localhost/nginx_status

# System load
htop

# Established TCP connection count (tail skips the header line)
ss -tan state established | tail -n +2 | wc -l
netstat -an | grep ':80 ' | grep -c ESTABLISHED

Nginx Status Page Configuration

server {
    listen 80;
    server_name localhost;
    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1;
        deny all;
    }
}

Performance Optimizations

# Main context: worker settings
# (worker_connections is only valid inside the events block below)
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    use epoll;
    multi_accept on;
    worker_connections 65535;
}

http {
    # Enable gzip
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_types text/plain text/css application/json application/javascript;

    # File cache
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    # Connection tuning
    keepalive_timeout 65;
    keepalive_requests 100;

    # Buffer tuning
    client_body_buffer_size 128k;
    client_max_body_size 50m;
    client_header_buffer_size 1k;
    large_client_header_buffers 4 4k;
}
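
As a rough sanity check on the worker settings above, the connection ceiling can be estimated from worker_processes and worker_connections. The sketch below is an illustration under stated assumptions: the function name estimate_max_clients is hypothetical, and halving the figure in proxy mode is a rule of thumb based on each proxied client also occupying one upstream connection, since worker_connections counts both kinds.

```shell
# estimate_max_clients WORKERS WORKER_CONNECTIONS [proxy]
# Prints a rough upper bound on simultaneous clients.
# In "proxy" mode each client is assumed to use two connections
# (client side + upstream side), so the total is halved.
estimate_max_clients() {
  workers=$1
  conns=$2
  mode=${3:-static}
  total=$(( workers * conns ))
  if [ "$mode" = "proxy" ]; then
    total=$(( total / 2 ))
  fi
  echo "$total"
}
```

For example, estimate_max_clients 4 65535 proxy gives the bound for four workers acting as a reverse proxy. The real limit is also capped by worker_rlimit_nofile and the operating system's open-file limits.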

System‑Level Tweaks

# sysctl parameters
cat >> /etc/sysctl.conf <<'EOF'
net.core.somaxconn = 65535
net.core.netdev_max_backlog = 5000
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000
fs.file-max = 1000000
EOF
sysctl -p

Case 4: Cache Misconfiguration Issues

Symptom

A news site shows stale content even after clearing browser cache.

Cache Configuration Analysis

server {
    listen 80;
    server_name news.example.com;

    # Static assets long‑term cache
    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
        add_header Pragma public;
    }

    # Dynamic content
    location / {
        proxy_pass http://backend;
        # Wrong: no proxy_cache zone is set, so these proxy_cache_valid lines have no effect
        proxy_cache_valid 200 302 10m;
        proxy_cache_valid 404 1m;
        add_header X-Cache-Status $upstream_cache_status;
    }
}

Problem Diagnosis

# Inspect cache directory
ls -la /var/cache/nginx/

# View cache config
nginx -T | grep -A 20 proxy_cache

# Test cache status header
curl -sI http://news.example.com/article/123 | grep -i x-cache-status

Correct Cache Configuration

http {
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=my_cache:10m max_size=10g inactive=60m use_temp_path=off;

    server {
        listen 80;
        server_name news.example.com;

        # API – no cache
        location /api/ {
            proxy_pass http://backend;
            proxy_cache off;
            add_header Cache-Control "no-cache, no-store, must-revalidate";
        }

        # News articles – cache for 5 minutes
        location /article/ {
            proxy_pass http://backend;
            proxy_cache my_cache;
            proxy_cache_valid 200 5m;
            proxy_cache_use_stale error timeout updating;
            add_header X-Cache-Status $upstream_cache_status;
        }

        # Static assets – long‑term cache
        location ~* \.(jpg|jpeg|png|gif|ico)$ {
            expires 1y;
            add_header Cache-Control "public, immutable";
        }
        location ~* \.(css|js)$ {
            expires 1d;
            add_header Cache-Control "public";
        }
    }
}

Cache Management Tools

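# Purge a specific URL (needs NGINX Plus proxy_cache_purge or the third-party ngx_cache_purge module; the configuration above does not enable either)
curl -X PURGE http://news.example.com/article/123

# Bulk purge older cache files (Nginx names cache files by MD5 hash, with no extension)
find /var/cache/nginx -type f -mtime +7 -delete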

# Cache statistics script
cat > /usr/local/bin/cache_stats.sh <<'EOF'
#!/bin/bash
CACHE_DIR="/var/cache/nginx"
echo "Cache directory size: $(du -sh $CACHE_DIR)"
echo "Cache files count: $(find $CACHE_DIR -type f | wc -l)"
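# This is a count, not a rate, and is only meaningful if log_format includes $upstream_cache_status
echo "Cache HIT lines in access log: $(grep -c HIT /var/log/nginx/access.log)"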
EOF
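
To get an actual hit ratio rather than a raw count, the cache status has to be written to the access log first. The sketch below assumes a log format (not shown in the original configuration) whose last field is $upstream_cache_status; the function name cache_hit_ratio is hypothetical.

```shell
# Reads access-log lines on stdin and prints "hits/total (percent)".
# ASSUMPTION: the last field of every line is $upstream_cache_status
# (HIT, MISS, EXPIRED, ...), or "-" when the request bypassed the cache.
cache_hit_ratio() {
  awk '
    $NF != "-"   { total++ }   # ignore requests that never touched the cache
    $NF == "HIT" { hits++ }
    END {
      if (total == 0) { print "no cacheable requests"; exit }
      printf "%d/%d (%.1f%%)\n", hits, total, 100 * hits / total
    }'
}
```

Usage: cache_hit_ratio < /var/log/nginx/access.log. Counting only lines with a cache status keeps uncached locations such as /api/ from dragging the ratio down.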

Case 5: Log Rotation Failure Leading to Disk Exhaustion

Symptom

The server becomes unresponsive because the disk is 100% full due to oversized Nginx logs.

Diagnosis

# Check disk usage
df -h

# Find large files in log directory
du -ah /var/log/nginx/ | sort -hr | head -n 20

# Inspect logrotate config
cat /etc/logrotate.d/nginx

Fix and Optimization

# Truncate current logs (emergency)
> /var/log/nginx/access.log
> /var/log/nginx/error.log

# Reopen logs
nginx -s reopen

Improved Logrotate Configuration

/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 640 nginx nginx
    sharedscripts
    postrotate
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 `cat /var/run/nginx.pid`
        fi
    endscript
}

Log Format Optimization

http {
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for" '
                      'rt=$request_time uct="$upstream_connect_time" '
                      'uht="$upstream_header_time" urt="$upstream_response_time"';

    map $status $loggable {
        ~^[23] 0;
        default 1;
    }

    server {
        access_log /var/log/nginx/access.log main if=$loggable;
        location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ { access_log off; expires 1y; }
    }
}
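
The timing fields in this log format make it possible to pull slow requests straight out of the access log. The following is a minimal sketch, assuming lines are written in exactly the main format above (rt= immediately followed by uct=); the function name slow_requests is hypothetical.

```shell
# slow_requests THRESHOLD_SECONDS
# Reads access-log lines on stdin and prints "request_time request"
# for every request slower than the threshold.
# ASSUMPTION: the log uses the format above, i.e. ' rt=<secs> uct="..."'.
slow_requests() {
  awk -v limit="$1" '
    match($0, / rt=[0-9.]+ uct=/) {
      # strip the leading " rt=" (4 chars) and trailing " uct=" (5 chars)
      rt = substr($0, RSTART + 4, RLENGTH - 9)
      if (rt + 0 > limit + 0) {
        split($0, q, "\"")        # q[2] is the quoted $request
        printf "%s %s\n", rt, q[2]
      }
    }'
}
```

Usage: slow_requests 1 < /var/log/nginx/access.log lists requests that took longer than one second. Anchoring the match on the neighboring uct= field avoids false hits on query strings such as ?sort=1.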

Disk Monitoring Script

# Disk usage monitor
cat > /usr/local/bin/disk_monitor.sh <<'EOF'
#!/bin/bash
THRESHOLD=80
USAGE=$(df / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ $USAGE -gt $THRESHOLD ]; then
  echo "Disk usage is ${USAGE}%, exceeding threshold of ${THRESHOLD}%"
  find /var/log/nginx -name "*.log.*" -mtime +7 -delete
fi
EOF

Case 6: Load‑Balancing Configuration Errors

Symptom

Traffic is unevenly distributed across backend servers, causing overload on some nodes.

Load‑Balancing Strategies Comparison

# Round‑robin (default)
upstream backend_round_robin {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

# Weighted round‑robin
upstream backend_weighted {
    server 192.168.1.10:8080 weight=3;
    server 192.168.1.11:8080 weight=2;
    server 192.168.1.12:8080 weight=1;
}

# IP hash
upstream backend_ip_hash {
    ip_hash;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

# Least connections
upstream backend_least_conn {
    least_conn;
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    server 192.168.1.12:8080;
}

Health‑Check Configuration

upstream backend_with_health {
    server 192.168.1.10:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8080 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8080 max_fails=3 fail_timeout=30s backup;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend_with_health;
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}

Backend Health‑Check Script

# Backend health-check script
cat > /usr/local/bin/backend_health_check.sh <<'EOF'
#!/bin/bash
SERVERS=("192.168.1.10:8080" "192.168.1.11:8080" "192.168.1.12:8080")
for server in "${SERVERS[@]}"; do
  # --max-time keeps a hung backend from stalling the whole loop
  if curl -sf --max-time 3 "http://$server/health" > /dev/null; then
    echo "$server: OK"
  else
    echo "$server: FAILED"
  fi
done
EOF

Case 7: Security Configuration Vulnerabilities

Symptom

Security scans reveal multiple Nginx vulnerabilities.

Hardening Settings

# http context: rate-limit zones must be declared outside any server block
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=login:10m rate=1r/s;

server {
    listen 80;
    server_name secure.example.com;

    # Hide version
    server_tokens off;
    # Requires the third-party headers-more module
    more_set_headers "Server: WebServer";

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "no-referrer-when-downgrade" always;
    add_header Content-Security-Policy "default-src 'self' http: https: data: blob: 'unsafe-inline'" always;

    # Restrict methods
    if ($request_method !~ ^(GET|HEAD|POST)$) { return 405; }

    # Block access to hidden files (e.g. .git, .htaccess)
    location ~ /\. { deny all; access_log off; log_not_found off; }

    # Limit upload size
    client_max_body_size 10M;

    # Rate limiting (uses the zones declared in the http context)
    location /api/ { limit_req zone=api burst=20 nodelay; proxy_pass http://backend; }
    location /login { limit_req zone=login burst=5 nodelay; proxy_pass http://backend; }
}

Fail2Ban Protection

# /etc/fail2ban/filter.d/nginx-4xx.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST).*" (404|403|400) .*$
ignoreregex =

# /etc/fail2ban/jail.local
[nginx-4xx]
enabled = true
port = http,https
filter = nginx-4xx
logpath = /var/log/nginx/access.log
maxretry = 10
bantime = 3600
findtime = 60

Case 8: Reverse‑Proxy Real‑IP Loss

Symptom

Backend services cannot obtain the client’s real IP when Nginx acts as a reverse proxy.

Solution

server {
    listen 80;
    server_name api.example.com;
    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_redirect off;
        proxy_connect_timeout 30s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;
    }
}

WebSocket Support

map $http_upgrade $connection_upgrade { default upgrade; '' close; }

server {
    listen 80;
    server_name ws.example.com;
    location /websocket {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
        proxy_read_timeout 86400;
    }
}

Case 9: URL Rewrite Conflicts

Symptom

Complex rewrite rules cause redirect loops and 404 errors.

Optimized Rewrite Rules

server {
    listen 80;
    server_name example.com www.example.com;

    # Force primary domain
    if ($host != 'example.com') { return 301 https://example.com$request_uri; }

    # SEO‑friendly rewrites
    location / {
        try_files $uri $uri/ @rewrites;
    }
    location @rewrites {
        rewrite ^/product/([0-9]+)$ /product.php?id=$1 last;
        rewrite ^/category/([a-zA-Z0-9-]+)$ /category.php?name=$1 last;
        rewrite ^/user/([a-zA-Z0-9]+)$ /profile.php?username=$1 last;
        return 404;
    }

    # Prevent PHP redirect loops
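    location ~ \.php$ {
        try_files $uri =404;
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_index index.php;
        include fastcgi_params;
        # The stock fastcgi_params file does not set SCRIPT_FILENAME; PHP-FPM needs it to locate the script
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }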
}

Debugging Rewrite Rules

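# Enable rewrite logging (rewrite_log writes to error_log at the notice level)
error_log /var/log/nginx/rewrite.log notice;
rewrite_log on;

# Test rule: rewrite has no flag here, because "break" would stop
# rewrite-module processing and the return below would never run
location /test {
    rewrite ^/test/(.*)$ /debug?param=$1;
    return 200 "Rewrite test: $args\n";
}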

Case 10: Performance Monitoring and Tuning

Symptom

A comprehensive Nginx performance monitoring system is needed to detect issues early.

Monitoring Script

# Monitoring script
cat > /usr/local/bin/nginx_monitor.sh <<'EOF'
#!/bin/bash
NGINX_STATUS_URL="http://localhost/nginx_status"
LOG_FILE="/var/log/nginx_monitor.log"

STATUS=$(curl -s "$NGINX_STATUS_URL")
ACTIVE_CONN=$(echo "$STATUS" | awk '/Active connections/ {print $3}')
# Reading/Writing/Waiting are on the fourth line of stub_status output
READING=$(echo "$STATUS" | awk 'NR==4 {print $2}')
WRITING=$(echo "$STATUS" | awk 'NR==4 {print $4}')
WAITING=$(echo "$STATUS" | awk 'NR==4 {print $6}')

echo "$(date): Active:$ACTIVE_CONN, Reading:$READING, Writing:$WRITING, Waiting:$WAITING" >> "$LOG_FILE"

# Default to 0 so the test does not fail when the status page is unreachable
if [ "${ACTIVE_CONN:-0}" -gt 1000 ]; then
  echo "High connection count: $ACTIVE_CONN" | logger -t nginx_monitor
fi
EOF

Comprehensive Tuning Configuration

worker_processes auto;
worker_cpu_affinity auto;
worker_rlimit_nofile 100000;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    use epoll;
    worker_connections 10240;
    multi_accept on;
    accept_mutex off;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" $request_time $upstream_response_time';

    # Performance tweaks
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;
    gzip on;
    gzip_vary on;
    gzip_min_length 1000;
    gzip_comp_level 6;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
    open_file_cache max=100000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    # Security
    server_tokens off;
    client_header_timeout 10;
    client_body_timeout 10;
    reset_timedout_connection on;
    send_timeout 10;

    # Rate limiting
    limit_req_zone $binary_remote_addr zone=global:10m rate=100r/s;
    limit_conn_zone $binary_remote_addr zone=addr:10m;

    include /etc/nginx/conf.d/*.conf;
}

Fault‑Diagnosis Methodology Summary

1. Standardized Investigation Process

Collect fault information: confirm symptoms, impact scope, timing.

Check logs: error.log, access.log, system logs.

Inspect configuration files: syntax and logic checks.

Validate network connectivity: port status, connectivity tests.

Analyze performance metrics: CPU, memory, network, disk.

Identify root cause: deep analysis to find the true reason.

Implement solution: temporary fix followed by permanent resolution.

Verify remediation: functional and performance testing.

Document lessons learned: update docs and improve processes.

2. Common Diagnostic Tools

Log analysis: tail, grep, awk, sed.

Network utilities: curl, wget, telnet, netstat, ss.

Performance monitoring: htop, iotop, iftop, the Nginx stub_status page.

System diagnosis: strace, lsof, tcpdump.
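
As one illustration of the log-analysis tools, a status-code breakdown of the access log is often the quickest first look during an incident. This is a minimal sketch: the function name status_breakdown is hypothetical, and it assumes a combined-style log format in which the status code is the ninth whitespace-separated field.

```shell
# Reads access-log lines on stdin and prints "status count",
# most frequent status first.
# ASSUMPTION: $status is the 9th whitespace-separated field, as in
# formats that start with: $remote_addr - $remote_user [$time_local] "$request" $status
status_breakdown() {
  awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' \
    | sort -k2,2nr -k1,1
}
```

Usage: tail -n 10000 /var/log/nginx/access.log | status_breakdown. Malformed request lines shift the field positions, so treat unexpected values in the first column as noise.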

3. Preventive Measures

Establish a complete monitoring and alert system.

Regularly back up configuration files.

Adopt automated operations tools.

Define standardized operating procedures.

Conduct periodic fault‑drill exercises.

Conclusion

Nginx fault diagnosis is an essential skill for operations engineers, requiring solid theoretical knowledge and extensive hands‑on experience.
