MySQL Ops Nightmares: Lessons from a 3 AM Production Crash
A seasoned DBA recounts a 3 AM MySQL outage caused by an unindexed query, then shares practical strategies for indexing, connection‑pool sizing, slow‑query tuning, replication lag handling, monitoring, automation, performance tuning, security hardening, and future‑proofing MySQL deployments.
Story of the 3 AM Incident
At 3 AM a phone call woke the on‑call engineer to a frozen system: CPU at 100 %, QPS at zero, and SHOW PROCESSLIST showing thousands of threads stuck in the "Sending data" state. The culprit was a full‑table scan of a multi‑million‑row orders table caused by a missing index.
Why MySQL Operations Matter
MySQL powers over 80 % of internet services, from startups to large enterprises. Poor operational practices can lead to business interruption, data loss, performance collapse, and security exposure.
Common Pitfalls and Solutions
1. Index Optimization
Many newcomers assume more indexes always improve performance, but excessive single‑column indexes hurt insert speed. A balanced set of composite indexes is preferred.
-- Bad example: too many single‑column indexes
CREATE INDEX idx_create_time ON products(create_time);
CREATE INDEX idx_update_time ON products(update_time);
CREATE INDEX idx_category_id ON products(category_id);
CREATE INDEX idx_brand_id ON products(brand_id);
-- ... many more
-- Good example: a well‑designed composite index
CREATE INDEX idx_category_brand_time ON products(category_id, brand_id, create_time);
Rule of thumb: keep the total number of indexes per table between 5 and 8.
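A composite index can only serve queries that filter on a leftmost prefix of its columns: category_id alone can use the index above, but brand_id alone cannot. A toy model of that rule (the usable_prefix helper is illustrative, not MySQL code, and only models equality filters):

```python
# Sketch of MySQL's leftmost-prefix rule for composite indexes.
# A query can use the index only for the leading columns it filters on;
# the first missing column stops index usage for everything after it.

def usable_prefix(index_columns, query_columns):
    """Return the leading index columns a query's equality filters can use."""
    prefix = []
    for col in index_columns:
        if col not in query_columns:
            break  # a gap in the prefix ends index usage
        prefix.append(col)
    return prefix

# idx_category_brand_time(category_id, brand_id, create_time)
idx = ["category_id", "brand_id", "create_time"]

print(usable_prefix(idx, {"category_id", "brand_id"}))  # two-column prefix usable
print(usable_prefix(idx, {"brand_id"}))                 # no prefix -> index unusable
```

This is also why one composite index can replace several of the single‑column indexes in the "bad" example, as long as queries lead with category_id.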
2. Connection‑Pool Configuration
Setting the application pool size larger than MySQL's max_connections leads to connection‑timeout errors under load.
# Application pool (Spring Hikari)
spring.datasource.hikari.maximum-pool-size=50
spring.datasource.hikari.minimum-idle=10
# MySQL server settings
max_connections = 200
max_connect_errors = 100000
Formula: MySQL max_connections ≥ (app servers × pool size) × 1.2.
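The sizing formula above can be turned into a quick sanity check (a sketch; the 1.2 factor is the 20 % headroom the text recommends):

```python
import math

# Check the text's rule: max_connections >= app_servers * pool_size * 1.2

def min_max_connections(app_servers, pool_size, headroom=1.2):
    """Smallest max_connections value satisfying the headroom rule."""
    return math.ceil(app_servers * pool_size * headroom)

# 3 app servers with a Hikari pool of 50 each need at least 180 connections,
# so the example's max_connections = 200 leaves a safe margin.
print(min_max_connections(3, 50))  # 180
```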
3. Slow‑Query Tuning
The original failing query:
SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;
Without an index this scans 5 M rows and takes 15 seconds.
1. Analyze the execution plan – run EXPLAIN to spot full‑table scans.
2. Create an appropriate index – a composite index on (user_id, status, create_time) covers the WHERE and ORDER BY clauses.
3. Validate the improvement – re‑run the query and compare rows scanned and latency.
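The analysis step can be automated by scanning EXPLAIN output for the access type ALL, which marks a full‑table scan. A sketch, with illustrative row data shaped like MySQL's tabular EXPLAIN columns:

```python
# Flag full-table scans in MySQL EXPLAIN output.
# Each row is a dict keyed by EXPLAIN column names (table, type, key, rows).

def full_scan_tables(explain_rows):
    """Return tables whose access type is ALL (a full table scan)."""
    return [r["table"] for r in explain_rows if r.get("type") == "ALL"]

# Illustrative EXPLAIN rows before and after adding the composite index
before = [{"table": "orders", "type": "ALL", "key": None, "rows": 5_000_000}]
after = [{"table": "orders", "type": "range", "key": "idx_user_status_time", "rows": 200}]

print(full_scan_tables(before))  # ['orders'] -> needs an index
print(full_scan_tables(after))   # [] -> scan eliminated
```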
EXPLAIN SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);
-- Before: 5 M rows scanned, 15 s
-- After: 200 rows scanned, 0.01 s
4. Replication Lag
In a financial app, reading from a lagging replica returned stale balances, causing customer complaints. The fix is to force reads from the primary or use read‑write splitting hints.
# Force read from master (Python example)
@read_from_master
def get_user_balance_after_transaction(user_id):
    return UserBalance.objects.get(user_id=user_id)
# Or a SQL hint interpreted by read‑write‑splitting middleware (MySQL itself ignores unknown hints)
SELECT /*+ READ_FROM_MASTER */ balance FROM user_balance WHERE user_id = ?;
Monitoring Essentials
QPS/TPS – queries and transactions per second.
Connection usage – current connections / max connections.
Slow‑query count – number of queries exceeding a latency threshold.
Replication lag – Seconds_Behind_Master metric.
InnoDB buffer‑pool hit rate – 1 − (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests), since Innodb_buffer_pool_reads counts the requests that missed the pool and went to disk.
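The hit‑rate formula can be computed directly from SHOW GLOBAL STATUS counters; a sketch with illustrative counter values:

```python
# InnoDB buffer-pool hit rate from SHOW GLOBAL STATUS counters.
# Innodb_buffer_pool_reads counts logical reads that had to go to disk.

def buffer_pool_hit_rate(read_requests, disk_reads):
    """Hit rate as a percentage; 100 means every read was served from memory."""
    if read_requests == 0:
        return 100.0
    return (1 - disk_reads / read_requests) * 100

# Illustrative counters: 10M logical reads, 50k of which went to disk
print(buffer_pool_hit_rate(10_000_000, 50_000))  # ~99.5 -> healthy
```

As a rough operational guideline, sustained values well below 99 % on an OLTP workload suggest the buffer pool is undersized.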
# Prometheus alert examples
- alert: MySQLSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "MySQL slow query rate too high"
    description: "{{ $labels.instance }} slow query rate > 10/sec"
- alert: MySQLReplicationLag
  expr: mysql_slave_lag_seconds > 30
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MySQL replication lag high"
Automation Scripts
Backup Script
#!/bin/bash
BACKUP_DIR="/data/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="your_database"
mysqldump -u backup_user -p'backup_password' \
--single-transaction \
--routines \
--triggers \
--master-data=2 \
$DB_NAME | gzip > $BACKUP_DIR/${DB_NAME}_${DATE}.sql.gz
# Remove backups older than 7 days
find $BACKUP_DIR -name "*.sql.gz" -mtime +7 -delete
Health‑Check Script
import pymysql
from datetime import datetime
def check_mysql_health():
    try:
        conn = pymysql.connect(host='localhost', user='monitor_user',
                               password='monitor_password', db='information_schema')
        cursor = conn.cursor()
        cursor.execute("SHOW STATUS LIKE 'Threads_connected'")
        current = int(cursor.fetchone()[1])
        cursor.execute("SHOW VARIABLES LIKE 'max_connections'")
        max_conn = int(cursor.fetchone()[1])
        usage = current / max_conn * 100
        if usage > 80:
            send_alert(f"MySQL connection usage high: {usage:.1f}%")
        cursor.execute("SHOW STATUS LIKE 'Slow_queries'")
        slow = int(cursor.fetchone()[1])
        # Additional checks (e.g. slow-query growth rate) could be added here
        conn.close()
    except Exception as e:
        send_alert(f"MySQL health check failed: {str(e)}")

def send_alert(message):
    print(f"[{datetime.now()}] ALERT: {message}")

if __name__ == "__main__":
    check_mysql_health()
Performance Tuning
InnoDB Parameters
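A common rule of thumb (an assumption added here, not a claim from the original) sizes innodb_buffer_pool_size at roughly 50–75 % of RAM on a dedicated database server, keeping it a multiple of innodb_buffer_pool_chunk_size × innodb_buffer_pool_instances (default chunk: 128 MB). A quick calculator:

```python
# Buffer-pool sizing rule of thumb (~50-75% of RAM on a dedicated server).
# MySQL rounds innodb_buffer_pool_size to a multiple of
# innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances (default chunk 128 MB),
# so this helper snaps the target down to the nearest valid multiple.

CHUNK_MB = 128

def buffer_pool_size_mb(ram_gb, fraction=0.7, instances=8):
    """Suggested innodb_buffer_pool_size in MB (illustrative, not official guidance)."""
    target_mb = int(ram_gb * 1024 * fraction)
    unit = CHUNK_MB * instances
    return max(unit, (target_mb // unit) * unit)

# 12 GB of RAM at a 70% target with 8 instances
print(buffer_pool_size_mb(12))  # 8192 MB, i.e. the 8G used in the config below
```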
[mysqld]
innodb_buffer_pool_size = 8G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 1G
innodb_log_files_in_group = 3
innodb_flush_log_at_trx_commit = 2  # flushes the redo log ~once per second; can lose up to 1 s of commits on an OS crash, use 1 for strict durability
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
Query Cache
MySQL 8.0 removed the query cache because it becomes a bottleneck under high concurrency. For 5.7 and earlier, disable it for write‑heavy workloads:
query_cache_type = 0
query_cache_size = 0
Security Hardening
Principle of Least Privilege
-- Dedicated application user
CREATE USER 'app_user'@'192.168.1.%' IDENTIFIED BY 'StrongPassword123!';
GRANT SELECT,INSERT,UPDATE,DELETE ON app_db.* TO 'app_user'@'192.168.1.%';
-- Backup user
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'BackupPassword456!';
GRANT SELECT, LOCK TABLES, SHOW VIEW ON *.* TO 'backup_user'@'localhost';
-- Monitoring user
CREATE USER 'monitor_user'@'localhost' IDENTIFIED BY 'MonitorPassword789!';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'monitor_user'@'localhost';
SQL‑Injection Prevention
# Bad: string concatenation
def get_user_bad(user_id):
    sql = f"SELECT * FROM users WHERE id = {user_id}"
    cursor.execute(sql)
# Good: parameterized query
def get_user_good(user_id):
    sql = "SELECT * FROM users WHERE id = %s"
    cursor.execute(sql, (user_id,))
Future Directions
Cloud‑native MySQL: operators for Kubernetes, DBaaS (RDS, CDB), serverless on‑demand instances.
Emerging storage engines: MyRocks (RocksDB‑based), TokuDB (high compression, though since deprecated by Percona), ColumnStore (OLAP‑optimized).
AI‑assisted ops: automatic index recommendation, anomaly detection, dynamic parameter tuning.
Multi‑active architectures evolving from master‑slave to distributed active‑active clusters.
Tool Recommendations
Monitoring: Prometheus + Grafana, Percona Monitoring and Management (PMM), MySQLTuner.
Management: phpMyAdmin, Navicat, Percona Toolkit.
Backup: Percona XtraBackup, mydumper, Binlog2SQL.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.