Nginx Log Analysis: Debugging Request Timeouts and 4xx/5xx Errors
This guide explains how to interpret Nginx access and error logs, understand the meaning of each log field, configure timeout directives across client, Nginx, upstream, and FastCGI layers, troubleshoot common 4xx and 5xx status codes, and use practical command‑line tools and analysis pipelines to quickly locate and resolve performance and connectivity issues.
Log Field Overview
The default combined format logs client IP, user, timestamp, request line, status, bytes sent, referer and user‑agent. A production‑ready format adds $request_time, $upstream_connect_time, $upstream_header_time, $upstream_response_time, $request_id, $http_x_forwarded_for, $scheme, $host, $request_uri and other useful fields.
Typical values: $request_time usually falls between 0.001 and 0.5 s (treat anything above 1 s as slow); $upstream_response_time is at most $request_time; $upstream_connect_time is normally 0-0.005 s (investigate anything above 0.05 s).
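The slow-request threshold above can be checked with a small filter. This is a minimal sketch that assumes a hypothetical log_format appending rt=$request_time as a key=value token, so the value can be found regardless of how many spaces the user-agent contains; the function name slow_requests is also illustrative.

```shell
# slow_requests THRESHOLD < access.log
# Prints "request_time uri" for every request slower than THRESHOLD seconds.
# Assumption: log_format appends a token of the form rt=$request_time and the
# request URI is field 7, as in the combined format.
slow_requests() {
  awk -v limit="$1" '
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^rt=[0-9.]+$/) {
          rt = substr($i, 4) + 0      # strip the "rt=" prefix, force numeric
          if (rt > limit) print rt, $7
          break
        }
      }
    }'
}
```

A typical call would be slow_requests 1 < /var/log/nginx/access.log | sort -rn | head -20.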
Error‑Log Keywords
Typical strings to search for in error.log are connect() failed, upstream timed out, no live upstreams, client sent invalid header line, SSL_do_handshake(), worker_connections are not enough, and accept() failed (24: Too many open files).
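To see which of these keywords dominates an incident, the strings can be counted in one pass per keyword. A rough sketch, assuming the error log path is passed as the first argument; the function name error_keyword_counts is illustrative.

```shell
# error_keyword_counts FILE
# Prints how many lines of FILE contain each well-known error string.
error_keyword_counts() {
  log_file="$1"
  for pattern in \
    'connect() failed' \
    'upstream timed out' \
    'no live upstreams' \
    'client sent invalid header line' \
    'SSL_do_handshake()' \
    'worker_connections are not enough' \
    'Too many open files'
  do
    # grep -F treats the pattern as a fixed string; grep -c exits non-zero
    # when the count is 0, so the status is deliberately ignored.
    count=$(grep -c -F -- "$pattern" "$log_file" || true)
    printf '%6d  %s\n' "$count" "$pattern"
  done
}
```

For example, error_keyword_counts /var/log/nginx/error.log | sort -rn lists the most frequent keyword first.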
Timeout Configuration
Timeouts are split into four layers:
Client layer – client_body_timeout, client_header_timeout, send_timeout, keepalive_timeout.
Nginx layer – proxy_connect_timeout, proxy_send_timeout, proxy_read_timeout, fastcgi_connect_timeout, fastcgi_send_timeout, fastcgi_read_timeout, uwsgi_read_timeout, keepalive_timeout (socket).
FastCGI/uwsgi layer – request_terminate_timeout (PHP‑FPM).
Socket layer – backlog, deferred, reuseport.
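Risk: an overly permissive proxy_next_upstream can amplify failures. With the default (error timeout), Nginx does not pass non-idempotent requests (POST, LOCK, PATCH) to the next server once they have been sent upstream; adding non_idempotent allows such requests to be replayed, and broad conditions such as http_502 or http_504 multiply load on a backend that is already struggling.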
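Before changing any of these directives it helps to see which timeouts are actually in effect. The sketch below filters configuration text for *_timeout directives; it is meant to be fed the output of nginx -T (which dumps the full configuration), and the function name list_timeouts is illustrative. PHP-FPM settings such as request_terminate_timeout live in the FPM pool configuration and will not appear here.

```shell
# list_timeouts: reads nginx configuration text on stdin and prints every
# active (non-commented) *_timeout directive with its value, de-duplicated.
list_timeouts() {
  grep -E '^[[:space:]]*[a-z_]+_timeout[[:space:]]' \
    | sed -E 's/^[[:space:]]+//; s/[[:space:]]*;.*$//' \
    | sort -u
}

# Usage (requires permission to read the configuration):
#   nginx -T 2>/dev/null | list_timeouts
```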
4xx Status Code Diagnosis
Common causes and checks:
400 Bad Request : malformed headers, oversized body, URL-encoding errors, HTTP/0.9 usage, SSL handshake failures. Use awk '$9 == 400 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head to find hot URLs.
401 Unauthorized : missing or malformed Authorization header, expired token, bad Basic auth. Filter with awk '$9 == 401 {print $7}' ….
403 Forbidden : directory listing disabled, IP blacklist, file permission issues, location-level deny all, limit_req mis-triggered, SELinux denials, client certificate failures. Use awk '$9 == 403' /var/log/nginx/access.log | tail -1 to see the latest entry (a plain grep '403' also matches byte counts and URLs), and search error.log case-insensitively for “permission denied”.
404 Not Found : wrong path, missing static file, upstream missing route, bad try_files, missing $request_uri in proxy. Find top missing URLs with awk '$9 == 404 {print $7}' ….
405 Method Not Allowed : POST to static file, upstream not supporting method, limit_except restrictions.
408 Request Timeout : client slow request or keep‑alive idle timeout. Search error.log for “client timed out”.
413 Payload Too Large : request body exceeds client_max_body_size. Increase the directive.
414 URI Too Long : URL exceeds large_client_header_buffers. Adjust buffer size.
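429 Too Many Requests : limit_req or limit_conn triggered with limit_req_status or limit_conn_status set to 429 (the default rejection status is 503). Look for “limiting requests” or “limiting connections” in error.log.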
499 Client Closed Request : client aborted connection (timeout, user stop, app network loss). Correlate with high $request_time to see if slow response caused abort.
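The per-status awk one-liners above all follow the same pattern, so they can be folded into one helper. A minimal sketch, assuming the combined format's field layout (status in field 9, URI in field 7); the function name top_urls_by_status is illustrative.

```shell
# top_urls_by_status STATUS [N] < access.log
# Prints the N most frequent request URIs (default 10) that returned STATUS.
top_urls_by_status() {
  awk -v code="$1" '$9 == code { print $7 }' \
    | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For instance, top_urls_by_status 404 20 < /var/log/nginx/access.log lists the twenty most requested missing URLs, and the same call with 499 shows which endpoints clients give up on.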
5xx Status Code Diagnosis
Typical causes and checks:
500 Internal Server Error : upstream returned 500, bad return syntax, malformed if, module load failure. Check upstream logs.
502 Bad Gateway : upstream unreachable, firewall block, upstream down, wrong IP/port, upstream listening only on 127.0.0.1, overloaded upstream accept queue. Use awk '$9 == 502 {print $7,$1}' … and search error.log for “connect() failed”.
503 Service Unavailable : all upstreams marked down (max_fails), upstream overload, maintenance mode. Check upstream health and max_fails settings.
504 Gateway Timeout : Nginx timed out waiting for upstream response. Usually proxy_read_timeout too short. Verify $upstream_connect_time and $upstream_response_time values.
Distinguish 502 vs 504 by examining $upstream_connect_time (‑ for 502) and $upstream_response_time (≈ proxy_read_timeout for 504).
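The 502 versus 504 rule of thumb can be applied mechanically. The sketch below assumes a hypothetical log_format that appends uct=$upstream_connect_time and urt=$upstream_response_time as key=value tokens, and it takes the configured proxy_read_timeout in seconds as an argument (60 is the Nginx default). Requests retried across several upstreams log comma-separated values, which this simple version does not handle.

```shell
# classify_gateway_errors [READ_TIMEOUT_SECONDS] < access.log
# Annotates each 502/504 line with a hint derived from the upstream timings.
classify_gateway_errors() {
  awk -v read_timeout="${1:-60}" '
    $9 == 502 || $9 == 504 {
      uct = "-"; urt = "-"
      for (i = 10; i <= NF; i++) {
        if ($i ~ /^uct=/) uct = substr($i, 5)
        if ($i ~ /^urt=/) urt = substr($i, 5)
      }
      if (uct == "-")
        hint = "no upstream connection was established"
      else if (urt + 0 >= read_timeout * 0.95)
        hint = "upstream response time is near the read timeout"
      else
        hint = "connected, but the upstream failed or closed early"
      print $9, $7, "uct=" uct, "urt=" urt, "->", hint
    }'
}
```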
Real‑World Cases
Examples include:
Tomcat slow query causing 504 : increase proxy_read_timeout.
Accept-queue overflow leading to 502 : tune somaxconn and tcp_max_syn_backlog.
Worker-connections exhaustion : adjust worker_processes and worker_connections.
Keep-alive misuse : add keepalive in the upstream block.
Proxy buffer mis-size causing truncated responses : increase proxy_buffer_size and proxy_buffers.
Disk full causing 502 : clean logs, expand storage.
Stale DNS cache causing 502 : add resolver.
send_timeout vs keepalive conflict : increase send_timeout.
limit_req mis-configuration causing 503 : adjust burst and nodelay.
Log Analysis Commands
Common one‑liners:
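Status code distribution: awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
Share of 4xx responses: awk '$9 ~ /^4/ {c4++} END {print c4/NR*100 "%"}' /var/log/nginx/access.log
Number of distinct 404 URLs: awk '$9 == 404 {print $7}' /var/log/nginx/access.log | sort -u | wc -l
Slowest 20 requests: awk '{print $14,$7}' /var/log/nginx/access.log | sort -rn | head -20 (this assumes $request_time lands in field 14, which depends on your log_format and on the user-agent containing no spaces)
Tools: goaccess for interactive terminal dashboards, ELK/Opensearch for centralized analysis, PLG (Promtail+Loki+Grafana) for lightweight log aggregation.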
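The percentage one-liner above generalizes to all status classes at once. A small sketch, assuming the status code is field 9 as in the combined format; the function name status_class_share is illustrative.

```shell
# status_class_share < access.log
# Prints one line per status class present: "<class>xx <count> <percent>%".
status_class_share() {
  awk '
    $9 ~ /^[1-5][0-9][0-9]$/ { total++; class[substr($9, 1, 1)]++ }
    END {
      if (total == 0) { print "no requests"; exit }
      for (c = 1; c <= 5; c++)
        if (class[c] > 0)
          printf "%dxx %d %.1f%%\n", c, class[c], class[c] * 100 / total
    }'
}
```

Lines whose ninth field is not a three-digit status (for example malformed requests) are skipped rather than counted, so the percentages are relative to parseable lines only.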
Log Rotation & Archiving
Standard /etc/logrotate.d/nginx rotates daily, keeps 30 files, compresses, and uses kill -USR1 $(cat /var/run/nginx.pid) to reopen logs without dropping connections. Put the signal in a postrotate script and set sharedscripts so it runs once for all matched log files. Monitor for “worker_connections are not enough” during rotation.
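The rotation policy described above could be written out as follows. This is a sketch rather than any distribution's shipped file: the helper name print_nginx_logrotate_conf is illustrative, and delaycompress and missingok are additions not mentioned in the text (both are standard logrotate options).

```shell
# print_nginx_logrotate_conf: writes a logrotate stanza to stdout.
# Redirect it to /etc/logrotate.d/nginx after reviewing paths for your system.
print_nginx_logrotate_conf() {
  cat <<'EOF'
/var/log/nginx/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    sharedscripts
    postrotate
        if [ -f /var/run/nginx.pid ]; then
            kill -USR1 "$(cat /var/run/nginx.pid)"
        fi
    endscript
}
EOF
}
```

Running logrotate -d on the resulting file performs a dry run, which is a safe way to check the stanza before the first real rotation.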
Monitoring & Alerting
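Expose stub_status and scrape it with nginx-prometheus-exporter. Key metrics: nginx_http_requests_total, nginx_http_5xx_responses_total, nginx_http_request_duration_seconds_bucket, nginx_connections_*, nginx_upstream_peers. Note that stub_status itself reports only connection and request counters, so per-status, latency, and upstream metrics have to come from another source such as the NGINX Plus API or a log-based exporter. Useful Prometheus alert rules cover a high 5xx rate, a high 4xx rate, connection saturation, and upstream down.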
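For a quick look without Prometheus, the stub_status page can be parsed directly. A minimal sketch, assuming the page is served at the /nginx_status location used elsewhere in this guide; the function name parse_stub_status is illustrative.

```shell
# parse_stub_status: reads stub_status output on stdin, prints key=value lines.
parse_stub_status() {
  awk '
    /^Active connections:/ { print "active=" $3 }
    /^ *[0-9]+ +[0-9]+ +[0-9]+/ {
      print "accepts=" $1
      print "handled=" $2
      print "requests=" $3
    }
    /^Reading:/ {
      print "reading=" $2
      print "writing=" $4
      print "waiting=" $6
    }'
}

# Usage: curl -s http://localhost/nginx_status | parse_stub_status
```

If accepts is larger than handled, connections were dropped because a resource limit such as worker_connections was reached.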
Configuration Templates
A generic reverse-proxy template should include an upstream block, a tuned client_max_body_size, gzip, timeouts, buffer sizes, a health-check location, and SSL hardening (TLSv1.2/1.3, strong ciphers, HSTS, CSP). A static-resource template should add long expires and access_log off for assets.
Performance Tuning
Tune system-level settings (net.core.somaxconn, tcp_max_syn_backlog, net.ipv4.tcp_tw_reuse), file-descriptor limits (nofile, LimitNOFILE), CPU affinity (worker_cpu_affinity auto), buffer paths, Gzip level, open_file_cache, and connection-pool sizing. Adjust worker_rlimit_nofile and worker_connections to avoid “worker_connections are not enough”.
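As a rough sizing aid: worker_connections counts every connection a worker holds, including those to upstreams, so a proxied request typically consumes two. Under that assumption, the approximate client capacity of a reverse proxy is worker_processes × worker_connections / 2. The helper below only does this arithmetic; its name is illustrative, and real capacity also depends on keep-alive behavior and file-descriptor limits.

```shell
# max_proxy_clients WORKER_PROCESSES WORKER_CONNECTIONS
# Rule-of-thumb upper bound on concurrent clients when every request is proxied.
max_proxy_clients() {
  echo $(( $1 * $2 / 2 ))
}

# Example: 4 workers with worker_connections 1024
#   max_proxy_clients 4 1024
```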
Common Pitfalls
Increasing worker_connections alone does not solve CPU or memory bottlenecks.
Large keepalive values consume memory and may exceed upstream limits.
Setting proxy_read_timeout too high can keep slow upstream connections open and exhaust resources.
Verbose log formats increase I/O; balance needed fields with performance.
Blindly assuming 503 means upstream down ignores rate‑limit or limit_req triggers.
Rollback & Deployment
Always back up the config, run nginx -t, then nginx -s reload. If the reload fails, revert to the backup and reload again. Use split_clients or Lua for canary releases, and keep if blocks minimal.
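The test, reload, and rollback sequence can be wrapped in one function. This is a sketch under a few assumptions: the backup was taken before the config was edited, the edited file is part of the configuration that nginx -t validates, and NGINX_BIN is a hypothetical override introduced here so the binary can be swapped out (for example in tests).

```shell
# safe_reload CONF_FILE BACKUP_FILE
# Validates and reloads; on failure restores BACKUP_FILE and reloads that.
# Returns 0 only if the new configuration was reloaded successfully.
safe_reload() {
  conf="$1"
  backup="$2"
  nginx_bin="${NGINX_BIN:-nginx}"   # hypothetical override, defaults to nginx

  if "$nginx_bin" -t && "$nginx_bin" -s reload; then
    echo "reloaded"
    return 0
  fi

  echo "reload failed, restoring $backup" >&2
  if cp -p "$backup" "$conf" && "$nginx_bin" -t; then
    "$nginx_bin" -s reload || echo "reload of restored config also failed" >&2
  fi
  return 1
}
```

A typical sequence would be cp -p nginx.conf nginx.conf.bak, then editing, then safe_reload nginx.conf nginx.conf.bak.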
Advanced Topics
Topics beyond the scope of this guide include integration with OpenResty for Lua-based custom logging, per-vhost log splitting, conditional logging with map, async syslog output (NGINX Plus/Tengine), WebSocket proxying (upgrade headers, long timeouts), gRPC support (HTTP/2), rate-limiting with limit_req_zone, security hardening (method, UA, Referer, CC protection, Slowloris timeouts, hiding the version), CDN back-origin validation, WAF via ModSecurity, GeoIP filtering, Nginx Ingress Controller for Kubernetes, sidecar deployment, dynamic upstreams via Redis/Lua, and CI/CD linting with gixy and nginx -t.
Checklist for Troubleshooting
Locate access.log and error.log.
Verify Nginx process is running (systemctl status nginx).
Confirm configuration file path and syntax (nginx -t).
Check listening ports (ss -ltn).
Inspect connection counts (ss -s, curl localhost/nginx_status).
Identify recent 5xx entries with awk '$9 ~ /^5/ {print $9,$7,$NF}' access.log.
Find recent 504 latency (awk '$9 == 504 {print $14,$18}' access.log; the field numbers depend on your log_format).
Ensure upstream services are healthy (health-check endpoint, process list).
Check disk space (df -h) and inode usage (df -i).
Validate file-descriptor limits (ulimit -n, lsof).
Monitor CPU interrupt load (top).
Review network stats (sar -n DEV 1, iftop).
Search error log for key strings (connect failures, timeouts, no live upstreams).
Confirm recent reload timestamps (systemctl status nginx).
Calculate recent QPS and error rate (line count, awk '$9 ~ /^5/').
Working through this checklist in order narrows down the majority of Nginx-related incidents.
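The checklist items on recent 5xx entries and error rate can be combined into a per-minute count, which makes the start of an incident easy to spot. A minimal sketch, assuming the combined format's timestamp in field 4 and status in field 9; the function name errors_per_minute is illustrative.

```shell
# errors_per_minute < access.log
# Prints "<count> <dd/Mon/yyyy:HH:MM>" for each minute that saw 5xx responses.
# Relies on access-log entries being written in chronological order.
errors_per_minute() {
  awk '$9 ~ /^5[0-9][0-9]$/ { print substr($4, 2, 17) }' | uniq -c
}
```

Piping tail -n 100000 /var/log/nginx/access.log into errors_per_minute keeps the scan limited to recent traffic on busy servers.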