How a Single Slow SQL Crashed Redis Pools and Triggered a Microservice Avalanche
In a microservice‑based e‑commerce platform, a seemingly harmless slow MySQL query exhausted the database connection pool, blocked Redis connections, and caused a chain reaction that filled the Redis connection pool, leading to widespread service timeouts and a full‑scale system avalanche during a high‑traffic promotion.
Introduction
With microservice architectures becoming ubiquitous, system complexity grows exponentially. A seemingly trivial database slow query can act like a butterfly effect, gradually evolving into a catastrophic incident that overwhelms the entire system. This article dissects a real production incident: how a single slow SQL exhausted the Redis connection pool and caused a microservice cluster avalanche, offering a comprehensive test of ops engineers' troubleshooting and performance‑optimisation skills.
Technical Background
Connection‑Pool Mechanism in Microservices
In microservice architectures, services communicate over the network, while databases and caches serve as core infrastructure. Connection pools reuse connections to avoid the overhead of frequent creation and destruction. Typical pool parameters include:
minIdle : minimum idle connections kept in the pool
maxTotal : maximum number of connections allowed
maxWaitMillis : maximum wait time for a connection
connectTimeout : timeout for establishing a connection
soTimeout : data‑transfer timeout
When the pool is exhausted, new requests wait; if the wait exceeds the timeout, a connection‑acquisition failure is thrown. Under normal conditions this works well, but performance problems in upstream dependencies (e.g., the database) can trigger cascading failures.
Redis’s Role in Microservices
Redis, as a high‑performance in‑memory store, fulfills multiple roles:
Cache layer : caches hot data to relieve database pressure
Session store : manages distributed sessions
Message queue : provides async pub/sub communication
Distributed lock : ensures consistency in distributed environments
Rate limiter : controls API access frequency
Because Redis underpins so many critical functions, exhausting its connection pool can have disastrous consequences for the entire microservice ecosystem.
Chain Reaction of a Database Slow Query
A slow query is more than a single SQL taking long to execute; in microservices its impact multiplies:
Connection occupation : the slow query holds a DB connection for an extended period
Connection‑pool exhaustion : many requests pile up, draining the DB pool
Thread blocking : application threads wait for DB responses
Redis timeout : business logic first queries the DB then updates Redis; the DB delay keeps the Redis connection open
Redis connection‑pool exhaustion : all Redis connections become occupied
Service avalanche : dependent services also slow down, propagating the failure
Core Content: Full‑Scope Failure Analysis
Symptoms
# Service response time monitoring
Service-A: avg_response_time = 15000ms (normal < 200ms)
Service‑B: avg_response_time = 8000ms (normal < 150ms)
Service‑C: avg_response_time = 12000ms (normal < 300ms)
# Redis connection‑pool alert
Redis Connection Pool Exhausted: 100/100 connections in use
Wait time: 5000ms+
# MySQL connection‑pool status
MySQL Connection Pool: 98/100 connections active
Active transactions: 156Step 1: System Performance Analysis
Login to the application server and check overall load:
# Check system load
uptime
# Example output: 15:23:45 up 45 days, 3:12, 2 users, load average: 8.45, 7.23, 6.89
# CPU usage
top -bn1 | head -20
# Memory usage
free -h
# Total Used: 48G, Free: 2.1G, etc.
# Network connections
netstat -antp | grep ESTABLISHED | wc -l
# Output: 8234 (abnormally high)
# TCP connection distribution
ss -s
# TCP: 8456 (estab 8234, closed 156, ...)The load is high, but CPU is not saturated, suggesting I/O or network wait.
Step 2: Redis Connection‑Pool Analysis
# Redis client count
redis-cli info clients
# connected_clients:892
# blocked_clients:45
# Slow‑query log
redis-cli slowlog get 10
# Example entry shows a dangerous KEYS command taking 15234 µs
# Connection details
redis-cli client list | head -20
# id=12345 addr=192.168.1.100:45678 fd=8 age=234 idle=0 flags=N db=0 ...Many connections have idle=0, indicating they are actively used; the number of blocked clients is abnormal.
Redis Configuration Optimisation
# Edit /etc/redis/redis.conf
maxclients 10000 # increase max clients
timeout 300 # client idle timeout (seconds)
tcp-keepalive 300
tcp-backlog 511
maxmemory 8gb
maxmemory-policy allkeys-lru
# Disable RDB persistence, enable AOF
save ""
appendonly yes
appendfsync everysec
slowlog-log-slower-than 10000
slowlog-max-len 128
# Rename dangerous commands
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
rename-command CONFIG "CONFIG_ADMIN"
systemctl restart redisApplication‑Level Redis Pool Settings (Jedis example)
# application.yml
spring.redis.host=192.168.1.50
spring.redis.port=6379
spring.redis.password=your_password
spring.redis.jedis.pool.max-active=200 # increase from 50
spring.redis.jedis.pool.max-idle=50
spring.redis.jedis.pool.min-idle=10
spring.redis.jedis.pool.max-wait=3000
spring.redis.timeout=5000
spring.redis.jedis.pool.time-between-eviction-runs=30000
spring.redis.jedis.pool.min-evictable-idle-time=60000
# Lettuce alternative
spring.redis.lettuce.pool.max-active=200
spring.redis.lettuce.pool.max-idle=50
spring.redis.lettuce.pool.min-idle=10
spring.redis.lettuce.pool.max-wait=3000
spring.redis.lettuce.shutdown-timeout=100Step 3: Database Slow‑Query Analysis
# Show running queries
mysql -uroot -p -e "SHOW FULL PROCESSLIST;" | grep -v "Sleep"
# Example: Query running 45 s on orders table
# Check slow‑query log settings
mysql -uroot -p -e "SHOW VARIABLES LIKE 'slow_query_log%';"
# slow_query_log = ON, file = /var/log/mysql/slow-query.log
# Analyse slow‑query log
tail -100 /var/log/mysql/slow-query.log
mysqldumpslow -s t -t 10 /var/log/mysql/slow-query.log
# Example: SELECT * FROM orders WHERE user_id=N AND status IN (N,N,N) AND create_time > 'S'
# Inspect table indexes
mysql -uroot -p -e "USE production; SHOW INDEX FROM orders;"
# Explain problematic SQL
mysql -uroot -p production -e "EXPLAIN SELECT * FROM orders WHERE user_id=12345 AND status IN (1,2,3) AND create_time > '2024-01-01';"
# Result shows full table scanThe orders table lacks suitable indexes, causing massive full‑table scans.
MySQL Performance Optimisation
# Edit /etc/my.cnf
[mysqld]
port = 3306
max_connections = 500 # raise from 100
connect_timeout = 10
wait_timeout = 600
interactive_timeout = 600
max_allowed_packet = 64M
# InnoDB buffer pool
innodb_buffer_pool_size = 40G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
# Disable query cache (MySQL 5.7‑)
query_cache_type = 0
query_cache_size = 0
# Slow‑query log
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow-query.log
long_query_time = 1
log_queries_not_using_indexes = 1Index Optimisation
# Analyse current indexes
USE production;
SHOW INDEX FROM orders;
# Create composite index for the slow query
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);
# Verify index usage
EXPLAIN SELECT * FROM orders WHERE user_id=12345 AND status IN (1,2,3) AND create_time > '2024-01-01';
# Expect type=range, rows dramatically reduced
# Additional indexes
CREATE INDEX idx_order_no ON orders(order_no);
CREATE INDEX idx_create_time ON orders(create_time);
CREATE INDEX idx_user_status ON orders(user_id, status);
# Analyse index size and redundancy
SELECT TABLE_NAME, INDEX_NAME, ROUND(STAT_VALUE*@innodb_page_size/1024/1024,2) AS 'Size (MB)'
FROM mysql.innodb_index_stats
WHERE TABLE_NAME='orders';
# Detect redundant indexes
SELECT a.TABLE_SCHEMA, a.TABLE_NAME, a.INDEX_NAME AS 'Redundant Index', a.COLUMN_NAME
FROM information_schema.STATISTICS a
WHERE EXISTS (
SELECT 1 FROM information_schema.STATISTICS b
WHERE a.TABLE_SCHEMA=b.TABLE_SCHEMA AND a.TABLE_NAME=b.TABLE_NAME
AND a.INDEX_NAME!=b.INDEX_NAME AND a.COLUMN_NAME=b.COLUMN_NAME AND a.SEQ_IN_INDEX=1);Database Connection‑Pool Optimisation (HikariCP example)
# application.properties
spring.datasource.url=jdbc:mysql://192.168.1.50:3306/production?useUnicode=true&characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai
spring.datasource.username=app_user
spring.datasource.password=your_password
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
# HikariCP settings
spring.datasource.hikari.minimum-idle=20
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.connection-test-query=SELECT 1
spring.datasource.hikari.leak-detection-threshold=60000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.pool-name=HikariPool-Orders
spring.datasource.hikari.register-mbeans=true
# MySQL‑specific optimisations
spring.datasource.hikari.data-source-properties.cachePrepStmts=true
spring.datasource.hikari.data-source-properties.prepStmtCacheSize=250
spring.datasource.hikari.data-source-properties.prepStmtCacheSqlLimit=2048
spring.datasource.hikari.data-source-properties.useServerPrepStmts=true
spring.datasource.hikari.data-source-properties.rewriteBatchedStatements=trueSystem‑Level I/O Optimisation
# Check disk scheduler
cat /sys/block/sda/queue/scheduler
# Switch to deadline for DB workloads
echo deadline > /sys/block/sda/queue/scheduler
# Persist via GRUB
vim /etc/default/grub # add elevator=deadline
grub2-mkconfig -o /boot/grub2/grub.cfg
# Mount options for MySQL data dir
vim /etc/fstab
# /dev/sda1 /var/lib/mysql ext4 noatime,nodiratime,data=writeback 0 0
mount -o remount /var/lib/mysql
# Kernel network tweaks
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w vm.swappiness=10
sysctl -pPractical Case: Full Recovery Process
Background
An internet finance platform with >100 service instances processes >5 million orders daily. During a promotion, massive timeouts occurred.
Timeline
14:00 Promotion starts, traffic rises 14:15 Some services show increased latency 14:23 Redis pool alert – 100% usage 14:25 Core services unavailable, P0 incident 14:35 Root cause identified as DB slow query 15:10 Index created, services recover 15:30 System fully restored
Emergency Steps
Stage 1: Quick Stop‑Bleed (14:28‑14:40)
# Expand Redis pool temporarily
kubectl set env deployment/order-service REDIS_MAX_ACTIVE=300
# Enable degradation for non‑critical features
curl -X POST http://admin-api/config/degrade -d "service=order-query&enabled=true&threshold=50"
# Kill long‑running SQLs
mysql -uroot -p -e "SELECT CONCAT('KILL ',id,';') FROM information_schema.processlist WHERE user='app_user' AND time>30 AND command='Query';" | mysql -uroot -p
# Restart part of the service to release connections
kubectl rollout restart deployment/order-serviceStage 2: Root‑Cause Diagnosis (14:40‑15:00)
# Collect slow‑query log
scp mysql-server:/var/log/mysql/slow-query.log /tmp/
# Analyse top slow queries
mysqldumpslow -s t -t 20 /tmp/slow-query.log > /tmp/slow-analysis.txt
# Explain problematic SQL
mysql -uroot -p production <<EOF
EXPLAIN SELECT o.*, oi.* FROM orders o LEFT JOIN order_items oi ON o.id=oi.order_id WHERE o.user_id=12345 AND o.status IN (1,2,3) AND o.create_time>'2024-01-01' ORDER BY o.create_time DESC LIMIT 20;
EOF
# Review table structure and indexes
mysql -uroot -p -e "USE production; SHOW CREATE TABLE orders\G"
mysql -uroot -p -e "USE production; SHOW INDEX FROM orders\G"Stage 3: Fundamental Fix (15:00‑15:20)
# Create index on replica first (offline test)
USE production;
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);
# Verify effect
EXPLAIN SELECT * FROM orders WHERE user_id=12345 AND status IN (1,2,3) AND create_time>'2024-01-01';
# Rows drop from ~890k to ~150
# Apply to primary using pt‑online‑schema‑change
pt-online-schema-change --alter "ADD INDEX idx_user_status_time (user_id, status, create_time)" --execute D=production,t=orders --host=192.168.1.50 --user=admin --password=your_password --chunk-size=5000
# Additional indexes for order_no and create_time
CREATE INDEX idx_order_no ON orders(order_no);
CREATE INDEX idx_create_time ON orders(create_time);
# Analyse index fragmentation
ANALYZE TABLE orders;
OPTIMIZE TABLE orders; # may lock table, run during low trafficStage 4: Config Optimisation & Verification (15:20‑15:30)
# Update MySQL config (as shown earlier) and perform graceful restart via master‑slave switch
mysql -uroot -p -h 192.168.1.51 -e "STOP SLAVE; RESET SLAVE ALL;"
# Switch application datasource to replica, restart, then switch back after MySQL restart
systemctl restart mysqld
# Update application pool sizes
kubectl set env deployment/order-service HIKARI_MAXIMUM_POOL_SIZE=100 HIKARI_MINIMUM_IDLE=20 REDIS_MAX_ACTIVE=200
kubectl rollout restart deployment/order-service
# Verify metrics
watch -n5 "mysql -uroot -p -e 'SHOW FULL PROCESSLIST' | wc -l"
watch -n5 "redis-cli info clients | grep connected_clients"
curl http://prometheus-api/api/v1/query?query=avg_response_time_msPost‑mortem & Improvements
Root‑Cause Analysis
Direct cause : missing composite index on orders leading to slow query
Indirect cause : undersized DB and Redis connection pools
Deep causes :
Lack of capacity planning and load testing
Inadequate alert thresholds
Missing automated slow‑query detection
No circuit‑breaker or degradation mechanism
Long‑Term Measures
# Deploy slow‑query monitoring (pt‑query‑digest)
0 */4 * * * pt-query-digest /var/log/mysql/slow-query.log > /tmp/slow-digest-$(date +%Y%m%d).txt
# Prometheus MySQL exporter
docker run -d --name mysqld-exporter -p 9104:9104 -e DATA_SOURCE_NAME="exporter:password@(192.168.1.50:3306)/" prom/mysqld-exporter
# Redis exporter
docker run -d --name redis-exporter -p 9121:9121 oliver006/redis_exporter --redis.addr=redis://192.168.1.50:6379
# Alert rules (MySQL slow queries, high connections, Redis pool exhaustion)
# (YAML omitted for brevity)Daily/Weekly/Monthly Ops Checklist
Daily: review slow‑query logs, connection‑pool usage, disk I/O, error logs
Weekly: analyse slow‑query trends, index efficiency, table growth, config changes
Monthly: performance baseline testing, capacity planning, backup‑restore drills, emergency‑plan rehearsals
SQL coding standards: avoid SELECT *, always use WHERE, no functions on indexed columns, limit result sets, avoid large transactions
Conclusion & Outlook
This article dissected a real‑world production incident, showing how a database slow query can exhaust Redis connections and trigger a microservice avalanche. It highlights the tight inter‑dependencies in microservice ecosystems and underscores the importance of monitoring, indexing, connection‑pool sizing, capacity planning, and robust emergency response.
Key Takeaways
Monitoring first : comprehensive observability is essential for rapid detection and diagnosis.
Index optimisation : proper DB index design is critical for performance.
Connection‑pool configuration : appropriately sized pools safeguard system stability.
Capacity planning : regular load testing uncovers bottlenecks before they cause outages.
Emergency response : well‑defined runbooks enable swift mitigation.
Future Trends
With cloud‑native evolution, databases and caches are moving toward automatic scaling, AI‑driven observability, serverless execution, and multi‑model support, demanding ops engineers continuously upgrade skills and refine practices to maintain high availability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
