Databases 34 min read

How a Single Slow SQL Crashed Redis Pools and Triggered a Microservice Avalanche

In a microservice‑based e‑commerce platform, a seemingly harmless slow MySQL query exhausted the database connection pool, blocked Redis connections, and caused a chain reaction that filled the Redis connection pool, leading to widespread service timeouts and a full‑scale system avalanche during a high‑traffic promotion.

MaGe Linux Operations
MaGe Linux Operations
MaGe Linux Operations
How a Single Slow SQL Crashed Redis Pools and Triggered a Microservice Avalanche

Introduction

With microservice architectures becoming ubiquitous, system complexity grows exponentially. A seemingly trivial database slow query can act like a butterfly effect, gradually evolving into a catastrophic incident that overwhelms the entire system. This article dissects a real production incident: how a single slow SQL exhausted the Redis connection pool and caused a microservice cluster avalanche, offering a comprehensive test of ops engineers' troubleshooting and performance‑optimisation skills.

Technical Background

Connection‑Pool Mechanism in Microservices

In microservice architectures, services communicate over the network, while databases and caches serve as core infrastructure. Connection pools reuse connections to avoid the overhead of frequent creation and destruction. Typical pool parameters include:

minIdle : minimum idle connections kept in the pool

maxTotal : maximum number of connections allowed

maxWaitMillis : maximum wait time for a connection

connectTimeout : timeout for establishing a connection

soTimeout : data‑transfer timeout

When the pool is exhausted, new requests wait; if the wait exceeds the timeout, a connection‑acquisition failure is thrown. Under normal conditions this works well, but performance problems in upstream dependencies (e.g., the database) can trigger cascading failures.

Redis’s Role in Microservices

Redis, as a high‑performance in‑memory store, fulfills multiple roles:

Cache layer : caches hot data to relieve database pressure

Session store : manages distributed sessions

Message queue : provides async pub/sub communication

Distributed lock : ensures consistency in distributed environments

Rate limiter : controls API access frequency

Because Redis underpins so many critical functions, exhausting its connection pool can have disastrous consequences for the entire microservice ecosystem.

Chain Reaction of a Database Slow Query

A slow query is more than a single SQL taking long to execute; in microservices its impact multiplies:

Connection occupation : the slow query holds a DB connection for an extended period

Connection‑pool exhaustion : many requests pile up, draining the DB pool

Thread blocking : application threads wait for DB responses

Redis timeout : business logic first queries the DB then updates Redis; the DB delay keeps the Redis connection open

Redis connection‑pool exhaustion : all Redis connections become occupied

Service avalanche : dependent services also slow down, propagating the failure

Core Content: Full‑Scope Failure Analysis

Symptoms

# Service response time monitoring
Service-A: avg_response_time = 15000ms (normal < 200ms)
Service‑B: avg_response_time = 8000ms (normal < 150ms)
Service‑C: avg_response_time = 12000ms (normal < 300ms)

# Redis connection‑pool alert
Redis Connection Pool Exhausted: 100/100 connections in use
Wait time: 5000ms+

# MySQL connection‑pool status
MySQL Connection Pool: 98/100 connections active
Active transactions: 156

Step 1: System Performance Analysis

Login to the application server and check overall load:

# Check system load
uptime
# Example output: 15:23:45 up 45 days, 3:12, 2 users, load average: 8.45, 7.23, 6.89

# CPU usage
top -bn1 | head -20

# Memory usage
free -h
# Total Used: 48G, Free: 2.1G, etc.

# Network connections
netstat -antp | grep ESTABLISHED | wc -l
# Output: 8234 (abnormally high)

# TCP connection distribution
ss -s
# TCP: 8456 (estab 8234, closed 156, ...)

The load is high, but CPU is not saturated, suggesting I/O or network wait.

Step 2: Redis Connection‑Pool Analysis

# Redis client count
redis-cli info clients
# connected_clients:892
# blocked_clients:45

# Slow‑query log
redis-cli slowlog get 10
# Example entry shows a dangerous KEYS command taking 15234 µs

# Connection details
redis-cli client list | head -20
# id=12345 addr=192.168.1.100:45678 fd=8 age=234 idle=0 flags=N db=0 ...

Many connections have idle=0, indicating they are actively used; the number of blocked clients is abnormal.

Redis Configuration Optimisation

# Edit /etc/redis/redis.conf
maxclients 10000            # increase max clients
timeout 300                # client idle timeout (seconds)
tcp-keepalive 300
tcp-backlog 511

maxmemory 8gb
maxmemory-policy allkeys-lru

# Disable RDB persistence, enable AOF
save ""
appendonly yes
appendfsync everysec

slowlog-log-slower-than 10000
slowlog-max-len 128

# Rename dangerous commands
rename-command FLUSHDB ""
rename-command FLUSHALL ""
rename-command KEYS ""
rename-command CONFIG "CONFIG_ADMIN"

systemctl restart redis

Application‑Level Redis Pool Settings (Jedis example)

# application.yml
spring.redis.host=192.168.1.50
spring.redis.port=6379
spring.redis.password=your_password

spring.redis.jedis.pool.max-active=200   # increase from 50
spring.redis.jedis.pool.max-idle=50
spring.redis.jedis.pool.min-idle=10
spring.redis.jedis.pool.max-wait=3000
spring.redis.timeout=5000
spring.redis.jedis.pool.time-between-eviction-runs=30000
spring.redis.jedis.pool.min-evictable-idle-time=60000

# Lettuce alternative
spring.redis.lettuce.pool.max-active=200
spring.redis.lettuce.pool.max-idle=50
spring.redis.lettuce.pool.min-idle=10
spring.redis.lettuce.pool.max-wait=3000
spring.redis.lettuce.shutdown-timeout=100

Step 3: Database Slow‑Query Analysis

# Show running queries
mysql -uroot -p -e "SHOW FULL PROCESSLIST;" | grep -v "Sleep"
# Example: Query running 45 s on orders table

# Check slow‑query log settings
mysql -uroot -p -e "SHOW VARIABLES LIKE 'slow_query_log%';"
# slow_query_log = ON, file = /var/log/mysql/slow-query.log

# Analyse slow‑query log
tail -100 /var/log/mysql/slow-query.log
mysqldumpslow -s t -t 10 /var/log/mysql/slow-query.log
# Example: SELECT * FROM orders WHERE user_id=N AND status IN (N,N,N) AND create_time > 'S'

# Inspect table indexes
mysql -uroot -p -e "USE production; SHOW INDEX FROM orders;"

# Explain problematic SQL
mysql -uroot -p production -e "EXPLAIN SELECT * FROM orders WHERE user_id=12345 AND status IN (1,2,3) AND create_time > '2024-01-01';"
# Result shows full table scan

The orders table lacks suitable indexes, causing massive full‑table scans.

MySQL Performance Optimisation

# Edit /etc/my.cnf
[mysqld]
port = 3306
max_connections = 500          # raise from 100
connect_timeout = 10
wait_timeout = 600
interactive_timeout = 600
max_allowed_packet = 64M

# InnoDB buffer pool
innodb_buffer_pool_size = 40G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000

# Disable query cache (MySQL 5.7‑)
query_cache_type = 0
query_cache_size = 0

# Slow‑query log
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow-query.log
long_query_time = 1
log_queries_not_using_indexes = 1

Index Optimisation

# Analyse current indexes
USE production;
SHOW INDEX FROM orders;

# Create composite index for the slow query
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);

# Verify index usage
EXPLAIN SELECT * FROM orders WHERE user_id=12345 AND status IN (1,2,3) AND create_time > '2024-01-01';
# Expect type=range, rows dramatically reduced

# Additional indexes
CREATE INDEX idx_order_no ON orders(order_no);
CREATE INDEX idx_create_time ON orders(create_time);
CREATE INDEX idx_user_status ON orders(user_id, status);

# Analyse index size and redundancy
SELECT TABLE_NAME, INDEX_NAME, ROUND(STAT_VALUE*@innodb_page_size/1024/1024,2) AS 'Size (MB)'
FROM mysql.innodb_index_stats
WHERE TABLE_NAME='orders';

# Detect redundant indexes
SELECT a.TABLE_SCHEMA, a.TABLE_NAME, a.INDEX_NAME AS 'Redundant Index', a.COLUMN_NAME
FROM information_schema.STATISTICS a
WHERE EXISTS (
  SELECT 1 FROM information_schema.STATISTICS b
  WHERE a.TABLE_SCHEMA=b.TABLE_SCHEMA AND a.TABLE_NAME=b.TABLE_NAME
    AND a.INDEX_NAME!=b.INDEX_NAME AND a.COLUMN_NAME=b.COLUMN_NAME AND a.SEQ_IN_INDEX=1);

Database Connection‑Pool Optimisation (HikariCP example)

# application.properties
spring.datasource.url=jdbc:mysql://192.168.1.50:3306/production?useUnicode=true&characterEncoding=utf8&useSSL=false&serverTimezone=Asia/Shanghai
spring.datasource.username=app_user
spring.datasource.password=your_password
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver

# HikariCP settings
spring.datasource.hikari.minimum-idle=20
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=600000
spring.datasource.hikari.max-lifetime=1800000
spring.datasource.hikari.connection-test-query=SELECT 1
spring.datasource.hikari.leak-detection-threshold=60000
spring.datasource.hikari.auto-commit=true
spring.datasource.hikari.pool-name=HikariPool-Orders
spring.datasource.hikari.register-mbeans=true

# MySQL‑specific optimisations
spring.datasource.hikari.data-source-properties.cachePrepStmts=true
spring.datasource.hikari.data-source-properties.prepStmtCacheSize=250
spring.datasource.hikari.data-source-properties.prepStmtCacheSqlLimit=2048
spring.datasource.hikari.data-source-properties.useServerPrepStmts=true
spring.datasource.hikari.data-source-properties.rewriteBatchedStatements=true

System‑Level I/O Optimisation

# Check disk scheduler
cat /sys/block/sda/queue/scheduler
# Switch to deadline for DB workloads
echo deadline > /sys/block/sda/queue/scheduler

# Persist via GRUB
vim /etc/default/grub   # add elevator=deadline
grub2-mkconfig -o /boot/grub2/grub.cfg

# Mount options for MySQL data dir
vim /etc/fstab
# /dev/sda1 /var/lib/mysql ext4 noatime,nodiratime,data=writeback 0 0
mount -o remount /var/lib/mysql

# Kernel network tweaks
sysctl -w net.core.somaxconn=65535
sysctl -w net.ipv4.tcp_max_syn_backlog=65535
sysctl -w vm.swappiness=10
sysctl -p

Practical Case: Full Recovery Process

Background

An internet finance platform with >100 service instances processes >5 million orders daily. During a promotion, massive timeouts occurred.

Timeline

14:00 Promotion starts, traffic rises 14:15 Some services show increased latency 14:23 Redis pool alert – 100% usage 14:25 Core services unavailable, P0 incident 14:35 Root cause identified as DB slow query 15:10 Index created, services recover 15:30 System fully restored

Emergency Steps

Stage 1: Quick Stop‑Bleed (14:28‑14:40)

# Expand Redis pool temporarily
kubectl set env deployment/order-service REDIS_MAX_ACTIVE=300

# Enable degradation for non‑critical features
curl -X POST http://admin-api/config/degrade -d "service=order-query&enabled=true&threshold=50"

# Kill long‑running SQLs
mysql -uroot -p -e "SELECT CONCAT('KILL ',id,';') FROM information_schema.processlist WHERE user='app_user' AND time>30 AND command='Query';" | mysql -uroot -p

# Restart part of the service to release connections
kubectl rollout restart deployment/order-service

Stage 2: Root‑Cause Diagnosis (14:40‑15:00)

# Collect slow‑query log
scp mysql-server:/var/log/mysql/slow-query.log /tmp/

# Analyse top slow queries
mysqldumpslow -s t -t 20 /tmp/slow-query.log > /tmp/slow-analysis.txt

# Explain problematic SQL
mysql -uroot -p production <<EOF
EXPLAIN SELECT o.*, oi.* FROM orders o LEFT JOIN order_items oi ON o.id=oi.order_id WHERE o.user_id=12345 AND o.status IN (1,2,3) AND o.create_time>'2024-01-01' ORDER BY o.create_time DESC LIMIT 20;
EOF

# Review table structure and indexes
mysql -uroot -p -e "USE production; SHOW CREATE TABLE orders\G"
mysql -uroot -p -e "USE production; SHOW INDEX FROM orders\G"

Stage 3: Fundamental Fix (15:00‑15:20)

# Create index on replica first (offline test)
USE production;
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);

# Verify effect
EXPLAIN SELECT * FROM orders WHERE user_id=12345 AND status IN (1,2,3) AND create_time>'2024-01-01';
# Rows drop from ~890k to ~150

# Apply to primary using pt‑online‑schema‑change
pt-online-schema-change --alter "ADD INDEX idx_user_status_time (user_id, status, create_time)" --execute D=production,t=orders --host=192.168.1.50 --user=admin --password=your_password --chunk-size=5000

# Additional indexes for order_no and create_time
CREATE INDEX idx_order_no ON orders(order_no);
CREATE INDEX idx_create_time ON orders(create_time);

# Analyse index fragmentation
ANALYZE TABLE orders;
OPTIMIZE TABLE orders;  # may lock table, run during low traffic

Stage 4: Config Optimisation & Verification (15:20‑15:30)

# Update MySQL config (as shown earlier) and perform graceful restart via master‑slave switch
mysql -uroot -p -h 192.168.1.51 -e "STOP SLAVE; RESET SLAVE ALL;"
# Switch application datasource to replica, restart, then switch back after MySQL restart
systemctl restart mysqld

# Update application pool sizes
kubectl set env deployment/order-service HIKARI_MAXIMUM_POOL_SIZE=100 HIKARI_MINIMUM_IDLE=20 REDIS_MAX_ACTIVE=200
kubectl rollout restart deployment/order-service

# Verify metrics
watch -n5 "mysql -uroot -p -e 'SHOW FULL PROCESSLIST' | wc -l"
watch -n5 "redis-cli info clients | grep connected_clients"
curl http://prometheus-api/api/v1/query?query=avg_response_time_ms

Post‑mortem & Improvements

Root‑Cause Analysis

Direct cause : missing composite index on orders leading to slow query

Indirect cause : undersized DB and Redis connection pools

Deep causes :

Lack of capacity planning and load testing

Inadequate alert thresholds

Missing automated slow‑query detection

No circuit‑breaker or degradation mechanism

Long‑Term Measures

# Deploy slow‑query monitoring (pt‑query‑digest)
0 */4 * * * pt-query-digest /var/log/mysql/slow-query.log > /tmp/slow-digest-$(date +%Y%m%d).txt

# Prometheus MySQL exporter
docker run -d --name mysqld-exporter -p 9104:9104 -e DATA_SOURCE_NAME="exporter:password@(192.168.1.50:3306)/" prom/mysqld-exporter

# Redis exporter
docker run -d --name redis-exporter -p 9121:9121 oliver006/redis_exporter --redis.addr=redis://192.168.1.50:6379

# Alert rules (MySQL slow queries, high connections, Redis pool exhaustion)
# (YAML omitted for brevity)

Daily/Weekly/Monthly Ops Checklist

Daily: review slow‑query logs, connection‑pool usage, disk I/O, error logs

Weekly: analyse slow‑query trends, index efficiency, table growth, config changes

Monthly: performance baseline testing, capacity planning, backup‑restore drills, emergency‑plan rehearsals

SQL coding standards: avoid SELECT *, always use WHERE, no functions on indexed columns, limit result sets, avoid large transactions

Conclusion & Outlook

This article dissected a real‑world production incident, showing how a database slow query can exhaust Redis connections and trigger a microservice avalanche. It highlights the tight inter‑dependencies in microservice ecosystems and underscores the importance of monitoring, indexing, connection‑pool sizing, capacity planning, and robust emergency response.

Key Takeaways

Monitoring first : comprehensive observability is essential for rapid detection and diagnosis.

Index optimisation : proper DB index design is critical for performance.

Connection‑pool configuration : appropriately sized pools safeguard system stability.

Capacity planning : regular load testing uncovers bottlenecks before they cause outages.

Emergency response : well‑defined runbooks enable swift mitigation.

Future Trends

With cloud‑native evolution, databases and caches are moving toward automatic scaling, AI‑driven observability, serverless execution, and multi‑model support, demanding ops engineers continuously upgrade skills and refine practices to maintain high availability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Connection PoolDatabase Performanceslow-query
MaGe Linux Operations
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.