How a 3 AM MySQL Crash Taught Me Essential Ops Lessons
This article recounts a 3 AM MySQL outage, analyzes its root causes, and shares comprehensive operational strategies—including index optimization, connection‑pool tuning, slow‑query fixing, replication lag handling, monitoring metrics, automation scripts, performance tuning, security hardening, and future trends—to help DBAs prevent and resolve similar incidents.
MySQL Ops Blood and Tears: A 3 AM Production Incident
Introduction: The Slow Query That Kept Me Up All Night
At 3 AM, a frantic phone call warned that the system was dead, CPU at 100%, and QPS at zero. Logging into the server revealed a flood of "Sending data" queries, a classic symptom of a missing index.
Background: Why MySQL Ops Matters
MySQL powers over 80% of internet companies, from startups to giants. Poorly run MySQL exposes the business to four kinds of risk:
Business interruption : Service unavailability directly impacts user experience and revenue.
Data loss : Irreplaceable data disappears.
Performance collapse : A single slow query can bring down the whole system.
Security risk : SQL injection, improper permission management, and other vulnerabilities.
Core Experience Sharing: Pitfalls We’ve Hit
1. Index Optimization: More Isn’t Always Better
Common Misconception : Many beginners think adding more indexes always speeds up queries.
Pitfall Example : In an e‑commerce project, adding over 20 indexes to a product table cut insert throughput from 1,000 rows per second to just 50, because every insert also has to update every index on the table.
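A cheap check before adding yet another index is whether an existing one already covers it. In a B-tree index, a lookup on a leftmost prefix of the indexed columns can use the longer index, so a non-unique single-column index is usually redundant once a composite index starts with the same column. The following is a minimal Python sketch of that check, assuming the index definitions have already been loaded into a dict. The function name `find_redundant_indexes` is illustrative and not from the original project; the index names are borrowed from this section's example.

```python
def find_redundant_indexes(indexes):
    """Return {redundant_index: covering_index} for non-unique B-tree indexes.

    `indexes` maps an index name to its ordered tuple of column names.
    An index is flagged when its columns are a leftmost prefix of another index.
    """
    redundant = {}
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            if other == name:
                continue
            is_prefix = other_cols[:len(cols)] == cols
            # For exact duplicates, flag only one of the pair (the later name).
            if is_prefix and (len(cols) < len(other_cols) or name > other):
                redundant[name] = other
                break
    return redundant


product_indexes = {
    "idx_category_id": ("category_id",),
    "idx_brand_id": ("brand_id",),
    "idx_category_brand_time": ("category_id", "brand_id", "create_time"),
}
print(find_redundant_indexes(product_indexes))
# {'idx_category_id': 'idx_category_brand_time'}
```

Note that `idx_brand_id` is not flagged: `brand_id` is the second column of the composite index, so a query filtering on `brand_id` alone cannot use it.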
Best Practice :
-- Bad example: over‑indexing
CREATE INDEX idx_create_time ON products(create_time);
CREATE INDEX idx_update_time ON products(update_time);
CREATE INDEX idx_category_id ON products(category_id);
CREATE INDEX idx_brand_id ON products(brand_id);
-- ... many single‑column indexes
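-- Good example: one composite index that matches the common filter pattern.
-- It serves queries on category_id, on (category_id, brand_id), and on all three columns,
-- but not queries that filter on brand_id or create_time alone.
CREATE INDEX idx_category_brand_time ON products(category_id, brand_id, create_time);
2. Connection Pool Configuration: Avoid “Starving” or “Over‑Loading”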
A project set the application pool size to 500 while MySQL max_connections was only 100, causing connection‑timeout errors under high concurrency.
Correct Configuration Idea :
# Application connection pool
spring.datasource.hikari.maximum-pool-size=50
spring.datasource.hikari.minimum-idle=10
# MySQL server settings
max_connections = 200
max_connect_errors = 100000
Formula : MySQL max_connections ≥ (number of app servers × pool size) × 1.2
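The formula is easy to apply in both directions: sizing `max_connections` for a known fleet, or checking how many app servers an existing limit can carry. Below is a small Python sketch under the assumption that the 1.2 factor is a flat 20% headroom; the function names are illustrative. Integer arithmetic is used so the rounding is exact.

```python
def required_max_connections(app_servers, pool_size, headroom_pct=20):
    """Smallest max_connections satisfying: servers * pool size * (1 + headroom)."""
    total = app_servers * pool_size * (100 + headroom_pct)
    return (total + 99) // 100  # integer ceiling of total / 100


def max_app_servers(max_connections, pool_size, headroom_pct=20):
    """Largest number of app servers a given max_connections can support."""
    return (max_connections * 100) // (pool_size * (100 + headroom_pct))


# Three app servers with the pool size of 50 shown above
print(required_max_connections(app_servers=3, pool_size=50))   # 180
# The max_connections = 200 setting shown above
print(max_app_servers(max_connections=200, pool_size=50))      # 3
# The failing setup: one server with a pool of 500
print(required_max_connections(app_servers=1, pool_size=500))  # 600
```

The last line shows why the failing project ran into trouble: a single pool of 500 would need `max_connections` of about 600, six times what the server allowed.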
3. Slow Query Optimization: Solve the Root Cause
The problematic query scanned a 5 million‑row orders table without an appropriate index:
SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;
Optimization Steps :
Analyze execution plan :
EXPLAIN SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;
Create appropriate index :
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);
-- Because status is matched with IN (...), MySQL may still sort the matching rows (Using filesort);
-- that is cheap once only a few hundred rows qualify.
Validate improvement :
-- Before: scanned 5 million rows, took 15 s
-- After: scanned 200 rows, took 0.01 s
4. Master‑Slave Replication: Prevent Lag Bombs
In a financial project, replication lag caused users to see stale balances after a transfer, leading to complaints.
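To make the failure mode concrete, here is a toy Python model of a primary and a replica that applies writes after a fixed delay. It is only a simulation with an explicit clock, not a real replication client; the class name `LaggingReplicaPair` and the key `balance:42` are made up for illustration.

```python
class LaggingReplicaPair:
    """Toy model: a primary plus one replica that applies writes after a delay."""

    def __init__(self, lag_seconds):
        self.lag_seconds = lag_seconds
        self.primary = {}
        self.replica = {}
        self._pending = []  # (apply_at, key, value), in write order

    def write(self, key, value, now):
        # Writes always go to the primary and reach the replica later.
        self.primary[key] = value
        self._pending.append((now + self.lag_seconds, key, value))

    def _apply_due(self, now):
        still_pending = []
        for apply_at, key, value in self._pending:
            if apply_at <= now:
                self.replica[key] = value
            else:
                still_pending.append((apply_at, key, value))
        self._pending = still_pending

    def read(self, key, now, from_primary=False):
        if from_primary:
            return self.primary.get(key)
        self._apply_due(now)
        return self.replica.get(key)


pair = LaggingReplicaPair(lag_seconds=5)
pair.write("balance:42", 100, now=0)
pair.write("balance:42", 70, now=10)  # the user transfers 30 out

print(pair.read("balance:42", now=11))                     # 100 (stale)
print(pair.read("balance:42", now=11, from_primary=True))  # 70
print(pair.read("balance:42", now=15))                     # 70 (replica caught up)
```

One second after the transfer the replica still returns the old balance, while the primary is already correct. Reads that must reflect a write the user just made are the ones worth routing to the primary.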
Solution :
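# Force read from master.
# read_from_master is assumed to be the project's own routing decorator, not a library API.
@read_from_master
def get_user_balance_after_transaction(user_id):
    return UserBalance.objects.get(user_id=user_id)
-- The same idea as a SQL comment hint. READ_FROM_MASTER is not a MySQL optimizer hint;
-- it is assumed to be interpreted by the read/write-splitting middleware in front of MySQL.
SELECT /*+ READ_FROM_MASTER */ balance FROM user_balance WHERE user_id = ?;
Monitoring System: Make Problems Nowhere to Hide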
Key Monitoring Metrics
QPS/TPS : Queries and transactions per second.
Connection usage : current connections / max connections.
Slow query count : number of long‑running SQL statements.
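Replication lag : Seconds_Behind_Master (named Seconds_Behind_Source in MySQL 8.0.22 and later).
Buffer pool hit rate : 1 − Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests (reads that had to go to disk, divided by all logical read requests).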
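Most of these metrics are ratios or rates derived from raw counters, which is where mistakes tend to creep in. The sketch below shows the arithmetic in Python, assuming the output of SHOW GLOBAL STATUS and SHOW VARIABLES has already been collected into dicts of strings. The helper names and the sample numbers are illustrative only.

```python
def buffer_pool_hit_rate(status):
    """Fraction of logical reads served from memory (None if no reads yet)."""
    requests = int(status["Innodb_buffer_pool_read_requests"])
    disk_reads = int(status["Innodb_buffer_pool_reads"])
    if requests == 0:
        return None
    return 1 - disk_reads / requests


def connection_usage(status, variables):
    """Current connections as a fraction of max_connections."""
    return int(status["Threads_connected"]) / int(variables["max_connections"])


def per_second(prev, curr, counter, interval_seconds):
    """Rate of a cumulative counter between two samples (e.g. QPS from Questions)."""
    return (int(curr[counter]) - int(prev[counter])) / interval_seconds


# Two made-up samples taken 60 seconds apart
prev = {"Questions": "1000", "Slow_queries": "12"}
curr = {
    "Questions": "4000",
    "Slow_queries": "18",
    "Threads_connected": "160",
    "Innodb_buffer_pool_read_requests": "1000000",
    "Innodb_buffer_pool_reads": "5000",
}
variables = {"max_connections": "200"}

print(f"hit rate: {buffer_pool_hit_rate(curr):.3f}")
print(f"connection usage: {connection_usage(curr, variables):.0%}")
print(f"QPS: {per_second(prev, curr, 'Questions', 60):.1f}")
print(f"slow queries/s: {per_second(prev, curr, 'Slow_queries', 60):.2f}")
```

Counters such as Questions and Slow_queries only ever grow, so a single reading says little. The difference between two samples is what indicates current load.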
Example Prometheus alert rules:
# Prometheus alert rule example
# Metric names depend on the exporter in use; mysqld_exporter, for example,
# exposes replication lag as mysql_slave_status_seconds_behind_master.
- alert: MySQLSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "MySQL slow query rate too high"
    description: "{{ $labels.instance }} slow query rate > 10/sec"
- alert: MySQLReplicationLag
  expr: mysql_slave_lag_seconds > 30
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MySQL replication lag too high"
    description: "Replication lag exceeds 30 seconds"
Automation: Liberating Hands
1. Automated Backup Script
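#!/bin/bash
# mysql_backup.sh
set -o pipefail  # make the pipeline fail when mysqldump fails, not only when gzip fails
BACKUP_DIR="/data/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="your_database"
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.sql.gz"
ALERT_EMAIL="dba@example.com"  # placeholder; replace with your team's address
# A password on the command line is visible in the process list;
# a [client] section in ~/.my.cnf is a safer alternative.
# --master-data is named --source-data from MySQL 8.0.26 onward.
if mysqldump -u backup_user -p'backup_password' \
    --single-transaction \
    --routines \
    --triggers \
    --master-data=2 "$DB_NAME" | gzip > "$BACKUP_FILE"; then
    # Clean backups older than 7 days, only after a successful dump
    find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete
    echo "Backup succeeded: ${DB_NAME}_${DATE}.sql.gz" | mail -s "MySQL backup success" "$ALERT_EMAIL"
else
    rm -f "$BACKUP_FILE"  # do not keep a truncated dump
    echo "Backup failed!" | mail -s "MySQL backup failure" "$ALERT_EMAIL"
fi
2. Health‑Check Automation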
import pymysql
from datetime import datetime

def check_mysql_health():
    conn = None
    try:
        conn = pymysql.connect(host='localhost', user='monitor_user',
                               password='monitor_password', database='information_schema')
        with conn.cursor() as cursor:
            cursor.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
            current = int(cursor.fetchone()[1])
            cursor.execute("SHOW VARIABLES LIKE 'max_connections'")
            max_conn = int(cursor.fetchone()[1])
            usage = current / max_conn * 100
            if usage > 80:
                send_alert(f"MySQL connection usage high: {usage:.1f}%")
            # Slow_queries is a cumulative counter; compare it with the previous run to get a rate
            cursor.execute("SHOW GLOBAL STATUS LIKE 'Slow_queries'")
            slow = int(cursor.fetchone()[1])
            print(f"[{datetime.now()}] Slow_queries total: {slow}")
    except Exception as e:
        send_alert(f"MySQL health check failed: {e}")
    finally:
        # Close the connection even when a check raises
        if conn is not None:
            conn.close()

def send_alert(message):
    # Placeholder: prints to stdout; wire this to email, chat, or a pager in production
    print(f"[{datetime.now()}] ALERT: {message}")

if __name__ == "__main__":
    check_mysql_health()
Performance Tuning
1. InnoDB Parameter Tuning
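# my.cnf core parameters
[mysqld]
innodb_buffer_pool_size = 8G
innodb_buffer_pool_instances = 8
# From MySQL 8.0.30, innodb_redo_log_capacity replaces the two redo log settings below
innodb_log_file_size = 1G
innodb_log_files_in_group = 3
# 2 flushes the redo log to disk about once per second: faster, but up to about
# one second of commits can be lost if the OS crashes. Use 1 for full durability.
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
2. Query Cache Trade‑offs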
Important Note : MySQL 8.0 removed the query cache because it becomes a bottleneck under high concurrency.
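# MySQL 5.7 and below: disable the query cache explicitly.
# Omit both lines on 8.0, where these variables no longer exist.
query_cache_type = 0
query_cache_size = 0
Security Hardening
1. Permission Management Best Practices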
-- Create dedicated account, avoid root
CREATE USER 'app_user'@'192.168.1.%' IDENTIFIED BY 'StrongPassword123!';
GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app_user'@'192.168.1.%';
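-- Backup account: TRIGGER for --triggers, RELOAD and REPLICATION CLIENT for
-- --single-transaction with --master-data, PROCESS for tablespace information on MySQL 8.0.21+
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'BackupPassword456!';
GRANT SELECT, LOCK TABLES, SHOW VIEW, TRIGGER, RELOAD, PROCESS, REPLICATION CLIENT ON *.* TO 'backup_user'@'localhost';
-- Monitoring account
CREATE USER 'monitor_user'@'localhost' IDENTIFIED BY 'MonitorPassword789!';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'monitor_user'@'localhost';
2. SQL Injection Protection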
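# Bad example: string concatenation lets user input rewrite the SQL
def get_user_bad(cursor, user_id):
    sql = f"SELECT * FROM users WHERE id = {user_id}"
    cursor.execute(sql)
    return cursor.fetchone()

# Good example: parameterized query; the driver escapes user_id
def get_user_good(cursor, user_id):
    sql = "SELECT * FROM users WHERE id = %s"
    cursor.execute(sql, (user_id,))
    return cursor.fetchone()
Future Outlook: MySQL’s Road Ahead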
1. Cloud‑Native MySQL
Operators for Kubernetes, DBaaS services (e.g., Alibaba Cloud RDS, Tencent Cloud CDB), and serverless databases are reshaping deployment and ops.
2. Emerging Storage Engines
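MyRocks (RocksDB‑based) and MariaDB ColumnStore (columnar, for OLAP) offer performance trade-offs that differ from InnoDB's. TokuDB (high compression) filled a similar niche, but Percona has since deprecated it in favor of MyRocks.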
3. AI‑Powered DB Ops
Intelligent index recommendation, anomaly detection via machine learning, and automatic parameter tuning are becoming mainstream.
4. Multi‑Active Architecture Evolution
Traditional master‑slave → master‑master → multi‑region active‑active → unitized architecture
Conclusion: From Beginner to Mastery
Looking back on years of MySQL ops, I see it as both a technical skill and an art. Each incident is a chance to grow, and each optimization adds to accumulated experience. The joy comes from turning a slow query into a ten‑fold QPS boost, catching failures early with monitoring, and automating repetitive work.
MaGe Linux Operations
