Step‑by‑Step Debugging of a Slow Website: From Nginx to the Database
When a website’s response time jumps from 200 ms to over 10 seconds, this guide walks through a layered investigation: confirming the scope, checking Nginx and upstream health, analyzing application logs, inspecting MySQL processes, slow queries, and locks, and examining server CPU, memory, disk I/O, and network. Along the way it provides concrete commands, expected outputs, and root-cause patterns for effective troubleshooting and preventive monitoring.
Problem Background
One afternoon, users reported page load times climbing from the normal 200 ms to more than 10 seconds, prompting the on-call engineer to start a systematic investigation.
Applicable Scenarios
Overall website slowdown with no clear bottleneck.
New engineers needing a standardized troubleshooting workflow.
DevOps teams that must trace full‑stack performance issues.
Investigation Principles
Confirm whether the issue is global or local: are all users affected, or only a subset?
Identify whether the problem lies in the network or on the server side: the Nginx layer vs. the upstream layer?
Determine which service is the bottleneck: high CPU in the application process, or slow database responses?
Phase 1: Determine Scope
Initial Check – Global vs. Local
# From multiple locations, compare response times
# If you have a CDN or monitoring platform, first check each node
for i in {1..5}; do
curl -o /dev/null -s -w "Time: %{time_total}s, HTTP: %{http_code}\n" \
-H "Host: www.example.com" \
http://123.45.67.89/api/homepage
done
If response times vary widely across locations, the issue may be network-related; if they are similar, the problem is likely on the server side.
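To see where the time goes on a single request, curl can break the response down by phase; a quick sketch using the same placeholder Host header and IP as above:
# Slow dns/connect points at the network; a large gap before ttfb (time to first byte) points at the server side
curl -o /dev/null -s \
-w "dns: %{time_namelookup}s  connect: %{time_connect}s  ttfb: %{time_starttransfer}s  total: %{time_total}s\n" \
-H "Host: www.example.com" \
http://123.45.67.89/api/homepage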
Confirm Time Range and Impact
# Count HTTP status codes and average response time from Nginx access_log
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn
# Count 5xx errors in the incident window (adjust the timestamp pattern to match your log_format)
grep "2024-03-15T14:" /var/log/nginx/access.log | \
awk '{if($9>=500) print $0}' | wc -l
# List requests taking longer than 5 seconds (requires $request_time in log_format)
awk '{if($NF>5) print $0}' /var/log/nginx/access.log | head -20
Initial Diagnostic Commands
echo "=== 1. Nginx process status ==="
ps aux | grep nginx | grep -v grep
nginx -v 2>&1
echo "=== 2. Nginx connection status ==="
ss -s
netstat -an | awk '/:80\s/ {s[$NF]++} END {for(k in s) print k, s[k]}'
echo "=== 3. System load ==="
uptime
free -h
df -h /
echo "=== 4. CPU usage ==="
top -bn1 | head -20
echo "=== 5. Disk I/O ==="
iostat -x 1 1 2>/dev/null || echo "iostat not available"
Phase 2: Nginx Layer
Check Nginx as Bottleneck
Typical Nginx‑level problems: high CPU/memory in worker processes, connection limits reached, or upstream timeout settings too short.
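For reference, connection capacity is governed by the events block together with the worker file-descriptor limit; a minimal nginx.conf sketch (values are illustrative, not recommendations):
worker_processes auto;
worker_rlimit_nofile 65535;      # per-worker open file limit
events {
    worker_connections 10240;    # max simultaneous connections per worker
}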
# CPU & memory per Nginx worker
ps -eo pid,ppid,comm,%cpu,%mem,rss | grep nginx
# Current concurrent connections
ss -tn | grep :8080 | wc -l
# Or using netstat
netstat -an | grep :8080 | grep ESTABLISHED | wc -l
# Verify worker_connections limit
grep "worker_connections" /etc/nginx/nginx.conf
# File descriptor usage per worker
ls /proc/$(pgrep -f "nginx: worker" | head -1)/fd | wc -l
Check Nginx error_log
# Look for 502/504/timeout related entries
tail -n 200 /var/log/nginx/error.log | grep -E "502|504|upstream|timeout|connect"
# Verify error_log level (debug produces massive logs)
grep "error_log" /etc/nginx/nginx.confCheck Upstream Response Time
If $upstream_response_time is logged, you can quickly see whether the delay originates from the upstream service.
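If these fields are missing, they can be added to the access log; a minimal log_format sketch (the format name and field labels here are illustrative):
log_format timing '$remote_addr - [$time_local] "$request" $status '
                  'rt=$request_time upstream_response_time=$upstream_response_time';
access_log /var/log/nginx/access.log timing;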
# Count upstream responses >5 s (the field label must match your log_format)
awk -F 'upstream_response_time=' '{if(NF>1 && $2>5) print $0}' /var/log/nginx/access.log | head -20
# Average upstream response time
awk -F 'upstream_response_time=' '{if(NF>1 && $2!="-") print $2}' /var/log/nginx/access.log | \
awk '{sum+=$1; count++} END {if(count) print "Avg upstream time:", sum/count "s"}'
Check Upstream Timeout Configuration
# Proxy timeout settings (short values cause frequent 504)
location /api/ {
proxy_pass http://backend;
proxy_connect_timeout 5s; # default 60s; 5s is normally enough to establish a connection
proxy_read_timeout 60s; # requests longer than this return 504
proxy_send_timeout 60s;
}
# Verify effective timeout values
grep -E "proxy_.*timeout" /etc/nginx/conf.d/*.confPhase 3: Upstream Application Layer
Confirm Upstream Service Health
# Check if upstream ports are listening
ss -tlnp | grep -E "8080|3000|5000|9000"
# Test upstream health endpoint locally
curl -s -o /dev/null -w "HTTP: %{http_code}, Time: %{time_total}s" http://127.0.0.1:8080/health
# Verify upstream process status (Java, Node, Python, PHP)
ps aux | grep -E "java|node|python|php" | grep -v grep
# Docker containers
docker ps -a | grep -E "backend|api"
Test Upstream Response Time
# Direct access vs. through Nginx
time curl -s http://127.0.0.1:8080/api/data > /dev/null
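# Through Nginx for comparison (hypothetical example; adjust the Host header and listen port to your vhost)
time curl -s -H "Host: www.example.com" http://127.0.0.1/api/data > /dev/null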
# If direct access is fast but Nginx is slow, the bottleneck is in Nginx
# If both are slow, the problem lies in the upstream or database
Packet Capture for Latency Distribution
# Capture traffic (use cautiously in production)
tcpdump -i lo -w /tmp/nginx_trace.pcap port 8080 &
sleep 5
curl http://127.0.0.1:8080/api/data
sleep 2
kill %1
# Analyze with tshark (requires wireshark-cli)
tshark -r /tmp/nginx_trace.pcap -Y "http.request" -T fields -e frame.time_relative -e http.request.uri | head -20
Application Logs
# Spring Boot logs
tail -n 100 /var/log/app/application.log | grep -E "ERROR|WARN|Exception"
# Node.js logs
tail -n 100 /var/log/app/node.log | grep -E "Error|timeout|warning"
# Python/Flask logs
tail -n 100 /var/log/app/flask.log | grep -E "ERROR|Traceback|slow"
# Systemd‑managed service logs
journalctl -u app-backend --since "10 minutes ago" | grep -E "ERROR|Exception|timeout"
Phase 4: Database Layer
Check MySQL Process Status
# MySQL process
ps aux | grep mysqld | grep -v grep
# Verify MySQL port
ss -tlnp | grep 3306
# Simple connection test
mysql -u app_user -p'password' -h 127.0.0.1 -e "SELECT 1"
# MySQL error log
tail -n 100 /var/log/mysql/error.log | grep -E "ERROR|WARNING|Abort"
Inspect Running SQL
# Show all active queries
SHOW FULL PROCESSLIST;
# More detail: join the process list with InnoDB transaction info (information_schema)
SELECT id, user, host, db, command, time, LEFT(info,100) AS current_sql,
trx_started, trx_rows_locked, trx_is_read_only, trx_state
FROM information_schema.PROCESSLIST p
LEFT JOIN information_schema.INNODB_TRX t ON p.id = t.trx_mysql_thread_id
WHERE p.command != 'Sleep'
ORDER BY p.time DESC;
Lock Wait Analysis
# MySQL 5.7 lock waits
SELECT r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query,
b.trx_started AS blocking_trx_started
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON w.blocking_trx_id = b.trx_id
JOIN information_schema.INNODB_TRX r ON w.requesting_trx_id = r.trx_id;
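# The sys schema offers a pre-joined, human-readable view of the same data (MySQL 5.7+; a convenient alternative)
SELECT * FROM sys.innodb_lock_waits;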
# MySQL 8.0 lock waits
SELECT * FROM performance_schema.data_lock_waits;
InnoDB Status
# Overall InnoDB status (includes deadlocks, lock info, buffer hit rate)
SHOW ENGINE INNODB STATUS\G
Slow Query Analysis
# Verify slow query settings
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
SHOW VARIABLES LIKE 'log_output';
# Enable slow query log if disabled
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1; # log queries >1 s
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
# Show recent slow queries from mysql.slow_log (requires log_output to include TABLE)
SELECT start_time, query_time, lock_time, rows_sent, rows_examined,
LEFT(sql_text,200)
FROM mysql.slow_log
ORDER BY start_time DESC
LIMIT 10;
# Or from file system
cat /var/log/mysql/slow.log | grep -E "^# Time:|^# Query_time:" | head -50
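# mysqldumpslow (bundled with MySQL) aggregates the slow log; sort by query time, show the top 10
mysqldumpslow -s t -t 10 /var/log/mysql/slow.log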
# Find longest queries
awk '/^# Query_time:/ {if($3>1) print $3, $0}' /var/log/mysql/slow.log | \
sort -rn | head -20
EXPLAIN Analysis
# Basic EXPLAIN
EXPLAIN SELECT u.id, u.username, o.order_id, o.total_amount
FROM users u JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01' AND o.status = 'pending';
# JSON format (MySQL 5.6+)
EXPLAIN FORMAT=JSON SELECT ...;
# Show table structure and indexes
SHOW CREATE TABLE users\G
SHOW CREATE TABLE orders\G
# List indexes used by a table
SHOW INDEX FROM orders;
Index Optimization
# Add a composite index for the typical query
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
Buffer Pool Hit Rate
# Buffer pool size
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
# Buffer pool usage statistics
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';
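# The hit rate can also be computed directly from these counters (a sketch; assumes performance_schema is enabled, MySQL 5.7+)
SELECT ROUND(100 * (1 -
    (SELECT VARIABLE_VALUE FROM performance_schema.global_status
      WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
    (SELECT VARIABLE_VALUE FROM performance_schema.global_status
      WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')), 2) AS buffer_pool_hit_pct;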
# Calculate hit rate: (read_requests - reads) / read_requests
Connection Count
# Max connections
SHOW VARIABLES LIKE 'max_connections';
# Current connections
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
# If Max_used_connections approaches max_connections, increase the limit
Phase 5: Server Resources
CPU Investigation
# Top CPU‑consuming processes
top -bn1 | head -20
# Or sorted list
ps aux --sort=-%cpu | head -10
# CPU core count and load average
nproc
uptime
# Per‑process CPU details for MySQL
pidstat -p $(pgrep -f "mysqld" | head -1) 1 5
# Fallback if pidstat missing
top -p $(pgrep -f "mysqld" | head -1)
Memory Investigation
# Overall memory usage
free -h
# Processes consuming most memory
ps -eo pid,comm,%mem,rss --sort=-%mem | head -10
# Check for OOM killer events (common cause of MySQL disappearance)
dmesg | grep -i "out of memory"
dmesg | grep -iE "oom|mysql|killed"
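# On systemd hosts the kernel log is also kept by journald (useful if the dmesg ring buffer has rotated)
journalctl -k --since "1 hour ago" | grep -iE "out of memory|oom"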
# MySQL internal memory settings
mysql -u root -p -e "SELECT @@innodb_buffer_pool_size;"
mysql -u root -p -e "SELECT @@key_buffer_size;"
mysql -u root -p -e "SELECT @@query_cache_size;"Disk I/O Investigation
# Disk usage
df -h
# Identify processes with heavy I/O
iotop -b -o -n 1 2>/dev/null || pidstat -d 1 3 2>/dev/null || echo "install iotop or sysstat"
# I/O statistics
iostat -x 1 3
# MySQL data and log directories
SHOW VARIABLES LIKE '%dir';
# Check large tables
SELECT table_schema, table_name, (data_length+index_length)/1024/1024 AS MB, table_rows
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql','information_schema','performance_schema')
ORDER BY (data_length+index_length) DESC LIMIT 10;
Network Investigation
# Connection summary
ss -s
# Count connection states on port 80 (watch for excessive TIME_WAIT)
netstat -an | awk '/:80\s/ {print $NF}' | sort | uniq -c | sort -rn
# Bandwidth usage per interface
cat /proc/net/dev | grep eth0   # replace eth0 with your interface name
# Identify potential abusive IPs
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20
# Nginx requests per second (peak seconds; shorten the cut fields for per-minute or per-hour granularity)
awk '{print $4}' /var/log/nginx/access.log | \
awk -F'[' '{print $2}' | cut -d: -f1-4 | uniq -c | sort -rn | head -10
Phase 6: Root-Cause Identification
Typical patterns observed during the layered investigation:
Scenario 1 – Nginx upstream timeout: Direct upstream requests are fast, Nginx is slow, and error_log shows 504 errors. Root cause: proxy_read_timeout set too short, or upstream processing exceeding the limit. Fix: Increase proxy_read_timeout (e.g., to 300s).
Scenario 2 – Database connection-pool exhaustion: Application processes exist, but SHOW PROCESSLIST shows many waiting or long-running queries. Root cause: Connection pool too small, or slow queries holding connections. Fix: Kill long-running queries (see the sketch after this list) and increase the pool size (e.g., maximum-pool-size: 50 in Spring Boot).
Scenario 3 – Slow query causing high CPU: SHOW PROCESSLIST reveals queries doing full-table scans; EXPLAIN shows type=ALL and Handler_read_rnd_next is high. Root cause: Missing index or inefficient SQL. Fix: Add appropriate indexes (e.g., CREATE INDEX idx_orders_user_status ON orders(user_id, status);).
Scenario 4 – Low InnoDB buffer-pool hit rate: SHOW ENGINE INNODB STATUS reports a hit rate below 95% and memory usage is high. Root cause: Buffer pool too small or heavy random I/O. Fix: Increase innodb_buffer_pool_size to 60-80% of RAM and adjust innodb_buffer_pool_instances.
Scenario 5 – Memory pressure and swap thrashing: free -h shows swap in use; dmesg contains OOM messages; MySQL responses become extremely slow. Root cause: Physical memory exhausted. Fix: Add RAM or reduce the buffer pool size; disable swap temporarily only in non-production environments.
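A minimal sketch for Scenario 2, assuming MySQL 5.7+ and that queries running longer than 60 seconds are safe to terminate (verify before killing anything in production):
# Find candidate queries that have been running for more than 60 seconds
SELECT id, user, time, LEFT(info, 100) AS query
FROM information_schema.PROCESSLIST
WHERE command = 'Query' AND time > 60
ORDER BY time DESC;
# Kill a specific connection by id (12345 is a placeholder taken from the query above)
KILL 12345;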
Post‑mortem
After resolving the incident, record a timeline (e.g., when the first complaint arrived, when CPU spiked, when the slow query was identified, when the index was added, and when normal response times resumed). Document the exact root cause, why alerts did not fire earlier (e.g., slow query log was disabled), and list both immediate fixes and longer‑term preventive measures such as index creation, connection‑pool tuning, and enhanced monitoring.
Preventive Monitoring Recommendations
Nginx 502/504 error rate > 1 % (monitor via access_log).
Upstream P99 response time > 3 s (Prometheus or access_log).
MySQL slow‑query count > 10 per minute (slow_query_log).
MySQL connection usage > 80 % of max_connections (SHOW GLOBAL STATUS).
MySQL CPU usage > 80 % (system monitor).
InnoDB buffer‑pool hit rate < 95 % (SHOW ENGINE INNODB STATUS).
Server memory usage > 85 % or swap usage > 10 % (free -h).
Disk I/O utilization > 80 % (iostat).
Example Prometheus scrape configuration for the MySQL exporter and common PromQL queries (QPS, slow‑query rate, connection usage, buffer‑pool hit rate, lock wait count) are provided in the original article.