
How a 3 AM MySQL Crash Taught Me Essential Ops Lessons

This article recounts a 3 AM MySQL outage, analyzes its root causes, and shares comprehensive operational strategies—including index optimization, connection‑pool tuning, slow‑query fixing, replication lag handling, monitoring metrics, automation scripts, performance tuning, security hardening, and future trends—to help DBAs prevent and resolve similar incidents.

MaGe Linux Operations

MySQL Ops Blood and Tears: A 3 AM Production Incident

Introduction: The Slow Query That Kept Me Up All Night

At 3 AM, a frantic phone call warned that the system was down: CPU at 100% and QPS at zero. Logging into the server revealed a process list flooded with queries in the "Sending data" state, a classic symptom of a missing index.

Background: Why MySQL Ops Matters

MySQL is one of the most widely deployed databases at internet companies, from startups to giants. Poor MySQL operations can cause business interruption, data loss, performance collapse, and security risks.

Business interruption: Service unavailability directly impacts user experience and revenue.

Data loss: Irreplaceable data disappears.

Performance collapse: A single slow query can bring down the whole system.

Security risk: SQL injection, improper permission management, and other vulnerabilities.

Core Lessons: Pitfalls We’ve Hit

1. Index Optimization: More Isn’t Always Better

Common Misconception: Many beginners think adding more indexes always speeds up queries.

Pitfall Example: In an e‑commerce project, adding over 20 indexes to a product table cut insert throughput from 1,000 rows per second to just 50, because every insert has to update every index on the table.

Best Practice:

-- Bad example: over‑indexing
CREATE INDEX idx_create_time ON products(create_time);
CREATE INDEX idx_update_time ON products(update_time);
CREATE INDEX idx_category_id ON products(category_id);
CREATE INDEX idx_brand_id ON products(brand_id);
-- ... many single‑column indexes

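-- Good example: one composite index matched to the real query pattern
-- (it serves filters on category_id, or on category_id + brand_id, via the leftmost-prefix rule)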
CREATE INDEX idx_category_brand_time ON products(category_id, brand_id, create_time);
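
A composite index only helps queries that filter on a leading prefix of its columns. The sketch below illustrates that leftmost-prefix rule in Python. It is a simplification that assumes equality filters only (range conditions and skip scans are ignored), and the function name usable_prefix is hypothetical.

```python
def usable_prefix(index_columns, filtered_columns):
    """Return the leading index columns that equality filters can use."""
    filtered = set(filtered_columns)
    prefix = []
    for column in index_columns:
        if column not in filtered:
            break  # the first missing column ends the usable prefix
        prefix.append(column)
    return prefix


idx = ["category_id", "brand_id", "create_time"]
print(usable_prefix(idx, ["category_id", "brand_id"]))  # ['category_id', 'brand_id']
print(usable_prefix(idx, ["brand_id"]))                 # []
```

A query filtering only on brand_id gets no help from this index, which is why the column order should follow the most common filters.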

2. Connection Pool Configuration: Avoid “Starving” or “Over‑Loading”

A project set the application pool size to 500 while MySQL max_connections was only 100. Under high concurrency the pool could not obtain connections beyond the server limit, and requests failed with connection‑timeout errors.

Recommended Configuration:

# Application connection pool
spring.datasource.hikari.maximum-pool-size=50
spring.datasource.hikari.minimum-idle=10

# MySQL server settings
max_connections = 200
max_connect_errors = 100000

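Formula: MySQL max_connections ≥ (number of app servers × pool size) × 1.2. With the settings above, max_connections = 200 covers up to three app servers (3 × 50 × 1.2 = 180).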
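
The formula is easy to check before a deployment. A minimal Python sketch follows; the function names are hypothetical, and the 1.2 headroom factor is expressed as an integer percentage to avoid floating-point rounding.

```python
def required_max_connections(app_servers, pool_size, headroom_percent=120):
    """Smallest max_connections that satisfies servers x pool x 1.2."""
    needed = app_servers * pool_size * headroom_percent
    return (needed + 99) // 100  # ceiling division by 100


def pool_fits(app_servers, pool_size, max_connections):
    """True if the server limit covers every pool plus headroom."""
    return max_connections >= required_max_connections(app_servers, pool_size)


print(required_max_connections(3, 50))  # 180
print(pool_fits(3, 50, 200))            # True
print(pool_fits(4, 50, 200))            # False: four servers need 240
```

Adding a fourth app server with the same pool size would already exceed max_connections = 200, so the check is worth rerunning whenever the fleet grows.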

3. Slow Query Optimization: Solve the Root Cause

The problematic query scanned a 5 million‑row orders table without an appropriate index:

SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;

Optimization Steps:

Analyze execution plan:

EXPLAIN SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;

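Create appropriate index (because status is matched with IN, MySQL may still need a filesort for ORDER BY create_time; confirm with EXPLAIN):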

CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);

Validate improvement:

-- Before: scanned 5 M rows, took 15 s
-- After: scanned 200 rows, took 0.01 s
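
The EXPLAIN check in the steps above can be automated. The sketch below flags plan rows that read a whole table without an index. It assumes the rows arrive as dictionaries keyed by the standard EXPLAIN columns (table, type, key, rows), as a dictionary-style cursor would return them. The function name, the threshold, and the sample rows are illustrative, not taken from a real server.

```python
def find_full_scans(explain_rows, row_threshold=100_000):
    """Flag EXPLAIN rows that scan a large table without using an index."""
    findings = []
    for row in explain_rows:
        rows_examined = row.get("rows") or 0
        is_full_scan = row.get("type") == "ALL" and row.get("key") is None
        if is_full_scan and rows_examined >= row_threshold:
            findings.append(f"{row['table']}: full scan of ~{rows_examined} rows")
    return findings


# Illustrative plan rows before and after adding idx_user_status_time.
before = [{"table": "orders", "type": "ALL", "key": None, "rows": 5_000_000}]
after = [{"table": "orders", "type": "range", "key": "idx_user_status_time", "rows": 200}]

print(find_full_scans(before))  # ['orders: full scan of ~5000000 rows']
print(find_full_scans(after))   # []
```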

4. Master‑Slave Replication: Prevent Lag Bombs

In a financial project, replication lag caused users to see stale balances after a transfer, leading to complaints.

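Solution: read from the master immediately after a write.

# Force read from master
# read_from_master is assumed to be a project-defined decorator (not a Django or
# MySQL built-in) that pins the wrapped call to the master connection
@read_from_master
def get_user_balance_after_transaction(user_id):
    return UserBalance.objects.get(user_id=user_id)

-- READ_FROM_MASTER is not a MySQL optimizer hint; this form only works when the
-- read/write-splitting middleware in front of MySQL recognizes it
SELECT /*+ READ_FROM_MASTER */ balance FROM user_balance WHERE user_id = ?;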
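
One way to decide when a read must go to the master is to pin a user's reads there for a short window after that user's last write. The sketch below shows the idea with an in-process dictionary. The class name, the method names, and the 5-second window are assumptions for illustration; a multi-process deployment would need shared state (for example a cache or a session flag), and the window should exceed the replication lag you actually observe.

```python
import time


class ReadRouter:
    """Route a user's reads to the master for a short time after their writes."""

    def __init__(self, sticky_seconds=5.0, clock=time.monotonic):
        self.sticky_seconds = sticky_seconds
        self.clock = clock          # injectable so the logic can be tested
        self._last_write = {}       # user_id -> time of the last write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.sticky_seconds:
            return "master"
        return "replica"


now = [0.0]  # fake clock so the example is deterministic
router = ReadRouter(sticky_seconds=5.0, clock=lambda: now[0])
router.record_write(12345)
print(router.target_for_read(12345))  # master
now[0] = 10.0
print(router.target_for_read(12345))  # replica
```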

Monitoring System: Leave Problems Nowhere to Hide

Key Monitoring Metrics

QPS/TPS: Queries and transactions per second.

Connection usage: current connections / max connections (Threads_connected / max_connections).

Slow query count: number of statements that ran longer than long_query_time (the Slow_queries counter).

Replication lag: Seconds_Behind_Master (Seconds_Behind_Source in MySQL 8.0.22 and later).

Buffer pool hit rate: 1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests.
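
The ratios above can be derived from raw status counters. A minimal Python sketch follows, with hypothetical function names and made-up sample values. The hit rate is computed as one minus the miss ratio, where Innodb_buffer_pool_reads counts the requests that had to go to disk.

```python
def connection_usage(threads_connected, max_connections):
    """Fraction of the connection limit currently in use."""
    return threads_connected / max_connections


def buffer_pool_hit_rate(read_requests, disk_reads):
    """1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests."""
    if read_requests == 0:
        return 1.0  # nothing was requested, so nothing missed
    return 1.0 - disk_reads / read_requests


def per_second(counter_before, counter_after, interval_seconds):
    """Rate of a cumulative counter (e.g. Questions for QPS) over an interval."""
    return (counter_after - counter_before) / interval_seconds


print(f"{connection_usage(160, 200):.0%}")               # 80%
print(f"{buffer_pool_hit_rate(1_000_000, 2_500):.2%}")   # 99.75%
print(per_second(1_000, 4_000, 60))                      # 50.0
```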

Example Prometheus alert rules:

# Prometheus alert rule example
- alert: MySQLSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Too many MySQL slow queries"
    description: "{{ $labels.instance }} slow query rate > 10/sec"

- alert: MySQLReplicationLag
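  # mysql_slave_lag_seconds is assumed to come from a recording rule; the raw
  # mysqld_exporter metric is mysql_slave_status_seconds_behind_master
  expr: mysql_slave_lag_seconds > 30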
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MySQL replication lag too high"
    description: "Replication lag exceeds 30 seconds"

Automation: Free Your Hands

1. Automated Backup Script

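#!/bin/bash
# mysql_backup.sh
set -o pipefail  # a mysqldump failure must fail the pipeline instead of being masked by gzip
BACKUP_DIR="/data/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="your_database"
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${DATE}.sql.gz"
# A password on the command line is insecure; prefer a client option file such as ~/.my.cnf
# --master-data is deprecated from MySQL 8.0.26 in favor of --source-data
if mysqldump -u backup_user -p'backup_password' \
    --single-transaction \
    --routines \
    --triggers \
    --master-data=2 "$DB_NAME" | gzip > "$BACKUP_FILE"; then
    # Clean backups older than 7 days, only after a successful dump
    find "$BACKUP_DIR" -name "*.sql.gz" -mtime +7 -delete
    echo "Backup succeeded: ${DB_NAME}_${DATE}.sql.gz" | mail -s "MySQL backup success" "[email protected]"
else
    rm -f "$BACKUP_FILE"  # do not keep a partial dump
    echo "Backup failed!" | mail -s "MySQL backup failure" "[email protected]"
fi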
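
A backup file that exists is not necessarily a backup that can be restored. As a cheap first check, the sketch below reads a .sql.gz file end to end to confirm that the gzip stream is intact. It does not validate the SQL inside, and the function name is hypothetical.

```python
import gzip
import zlib


def gzip_is_readable(path, chunk_size=1024 * 1024):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data; only the integrity matters here
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and missing files, EOFError a truncated file
        return False
```

A call such as gzip_is_readable("/data/backup/mysql/your_database_20240101_030000.sql.gz") (an example path) could run right after the dump; a periodic test restore remains the only full verification.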

2. Health‑Check Automation

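import pymysql
from datetime import datetime

def fetch_value(cursor, statement):
    """Run a SHOW statement that returns one (name, value) row; return the value as int."""
    cursor.execute(statement)
    return int(cursor.fetchone()[1])

def check_mysql_health():
    conn = None
    try:
        conn = pymysql.connect(host='localhost', user='monitor_user', password='monitor_password', database='information_schema')
        with conn.cursor() as cursor:
            current = fetch_value(cursor, "SHOW GLOBAL STATUS LIKE 'Threads_connected'")
            max_conn = fetch_value(cursor, "SHOW VARIABLES LIKE 'max_connections'")
            usage = current / max_conn * 100
            if usage > 80:
                send_alert(f"MySQL connection usage high: {usage:.1f}%")
            # GLOBAL is required: plain SHOW STATUS returns this session's own counter.
            # Slow_queries is cumulative, so compare successive samples to get a rate.
            slow = fetch_value(cursor, "SHOW GLOBAL STATUS LIKE 'Slow_queries'")
            print(f"[{datetime.now()}] Slow_queries total: {slow}")
            # Additional analysis could be added here
    except Exception as e:
        send_alert(f"MySQL health check failed: {str(e)}")
    finally:
        # Close the connection even when a query fails
        if conn is not None:
            conn.close()

def send_alert(message):
    print(f"[{datetime.now()}] ALERT: {message}")

if __name__ == "__main__":
    check_mysql_health()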
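
Because Slow_queries is a cumulative counter, a single reading says little; the useful number is how fast it grows between two health checks. A minimal sketch of that calculation, with a hypothetical function name and sample values, that also tolerates the counter restarting from zero:

```python
def slow_query_rate(prev_count, curr_count, interval_seconds):
    """Slow queries per second between two samples of the Slow_queries counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # The counter went backwards, so it was reset (for example by a server
        # restart); count only what has accumulated since the reset.
        delta = curr_count
    return delta / interval_seconds


print(slow_query_rate(1200, 1500, 60))  # 5.0
print(slow_query_rate(1500, 30, 60))    # 0.5
```

The previous sample has to be stored between runs, for instance in a small state file; that part is left out here.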

Performance Tuning

1. InnoDB Parameter Tuning

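# my.cnf core parameters (example values; size them to your own hardware)
[mysqld]
innodb_buffer_pool_size = 8G
innodb_buffer_pool_instances = 8
# From MySQL 8.0.30, innodb_redo_log_capacity replaces the two redo log settings below
innodb_log_file_size = 1G
innodb_log_files_in_group = 3
# 2 = flush the redo log to disk about once per second: faster, but an OS crash or
# power loss can lose roughly the last second of commits; use 1 for full durability
innodb_flush_log_at_trx_commit = 2
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000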
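
MySQL adjusts innodb_buffer_pool_size upward to a multiple of innodb_buffer_pool_chunk_size × innodb_buffer_pool_instances, so the value in my.cnf is not always the value in effect. The sketch below reproduces that rounding, assuming the default chunk size of 128 MiB; the function name is hypothetical.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def adjusted_buffer_pool_size(requested_bytes, instances=8, chunk_bytes=128 * MIB):
    """Round the requested size up to a multiple of instances x chunk size."""
    unit = instances * chunk_bytes
    multiples = -(-requested_bytes // unit)  # ceiling division
    return max(multiples, 1) * unit


print(adjusted_buffer_pool_size(8 * GIB) // GIB)              # 8
print(adjusted_buffer_pool_size(9 * GIB + 512 * MIB) // GIB)  # 10
```

With 8 instances the unit is 1 GiB, so the 8G setting above is applied as written, while a 9.5 GiB request would take 10 GiB of memory.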

2. Query Cache Trade‑offs

Important Note: MySQL 8.0 removed the query cache (it was already deprecated in 5.7.20) because it becomes a bottleneck under high concurrency.

# MySQL 5.7 and earlier: disable the query cache
query_cache_type = 0
query_cache_size = 0

Security Hardening

1. Permission Management Best Practices

-- Create dedicated account, avoid root
CREATE USER 'app_user'@'192.168.1.%' IDENTIFIED BY 'StrongPassword123!';
GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app_user'@'192.168.1.%';

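-- Backup account: mysqldump --master-data needs RELOAD and REPLICATION CLIENT,
-- --triggers needs TRIGGER, and MySQL 8.0.21+ needs PROCESS unless --no-tablespaces is used
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'BackupPassword456!';
GRANT SELECT, RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT, SHOW VIEW, TRIGGER ON *.* TO 'backup_user'@'localhost';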

-- Monitoring account
CREATE USER 'monitor_user'@'localhost' IDENTIFIED BY 'MonitorPassword789!';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'monitor_user'@'localhost';

2. SQL Injection Protection

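# cursor is assumed to be an open DB-API cursor (for example from pymysql)

# Bad example: string concatenation lets the input rewrite the SQL
def get_user_bad(cursor, user_id):
    sql = f"SELECT * FROM users WHERE id = {user_id}"
    cursor.execute(sql)
    return cursor.fetchone()

# Good example: parameterized query, the driver escapes the value
def get_user_good(cursor, user_id):
    sql = "SELECT * FROM users WHERE id = %s"
    cursor.execute(sql, (user_id,))
    return cursor.fetchone()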
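
The difference is easy to see in a runnable demonstration. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; note that sqlite3 uses ? as its placeholder where pymysql uses %s. The table contents and the malicious input are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", [(1, "alice"), (2, "bob")])

malicious_input = "1 OR 1=1"

# Concatenation: the input becomes part of the SQL and the condition matches every row.
leaked = conn.execute(f"SELECT name FROM users WHERE id = {malicious_input}").fetchall()

# Parameter binding: the input is treated as a single value and matches no id.
safe = conn.execute("SELECT name FROM users WHERE id = ?", (malicious_input,)).fetchall()

print(len(leaked), len(safe))  # 2 0
```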

Future Outlook: MySQL’s Road Ahead

1. Cloud‑Native MySQL

Operators for Kubernetes, DBaaS services (e.g., Alibaba Cloud RDS, Tencent Cloud CDB), and serverless databases are reshaping deployment and ops.

2. Emerging Storage Engines

MyRocks (RocksDB‑based), TokuDB (high compression, though Percona has since deprecated it), and ColumnStore (MariaDB's columnar engine for OLAP) offer different performance characteristics.

3. AI‑Powered DB Ops

Intelligent index recommendation, anomaly detection via machine learning, and automatic parameter tuning are becoming mainstream.

4. Multi‑Active Architecture Evolution

Traditional master‑slave → master‑master → multi‑region active‑active → unitized architecture

Conclusion: From Beginner to Mastery

Looking back on years of MySQL ops, I have come to see it as both a technical skill and an art. Each incident is a chance to grow, and each optimization adds to your experience. The joy comes from turning a slow query into a ten‑fold QPS boost, catching failures early through monitoring, and automating repetitive work.

Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
