MySQL Ops Nightmares: Lessons from a 3 AM Production Crash
A seasoned DBA recounts a 3 AM MySQL outage caused by an unindexed query, then shares practical strategies for indexing, connection‑pool sizing, slow‑query tuning, replication lag handling, monitoring, automation, performance tuning, security hardening, and future‑proofing MySQL deployments.
Story of the 3 AM Incident
At 3 AM a phone call woke the on‑call engineer to a frozen system: CPU at 100 %, QPS at zero, and SHOW PROCESSLIST showing thousands of threads stuck in the "Sending data" state. The culprit was a full‑table scan of a multi‑million‑row orders table caused by a missing index.
Why MySQL Operations Matter
MySQL powers over 80 % of internet services, from startups to large enterprises. Poor operational practices can lead to business interruption, data loss, performance collapse, and security exposure.
Common Pitfalls and Solutions
1. Index Optimization
Many newcomers assume more indexes always improve performance, but excessive single‑column indexes hurt insert speed. A balanced set of composite indexes is preferred.
-- Bad example: too many single‑column indexes
CREATE INDEX idx_create_time ON products(create_time);
CREATE INDEX idx_update_time ON products(update_time);
CREATE INDEX idx_category_id ON products(category_id);
CREATE INDEX idx_brand_id ON products(brand_id);
-- ... many more
-- Good example: a well‑designed composite index
CREATE INDEX idx_category_brand_time ON products(category_id, brand_id, create_time);
Rule of thumb: keep the total number of indexes per table between 5 and 8.
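A composite index can only serve queries that filter on a leftmost prefix of its columns: category_id alone can use the index above, but brand_id alone cannot. A toy model of that rule (the usable_prefix helper is illustrative, not MySQL code, and only models equality filters):

```python
# Sketch of MySQL's leftmost-prefix rule for composite indexes.
# A query can use the index only for the leading columns it filters on;
# the first missing column stops index usage for everything after it.

def usable_prefix(index_columns, query_columns):
    """Return the leading index columns a query's equality filters can use."""
    prefix = []
    for col in index_columns:
        if col not in query_columns:
            break  # a gap in the prefix ends index usage
        prefix.append(col)
    return prefix

# idx_category_brand_time(category_id, brand_id, create_time)
idx = ["category_id", "brand_id", "create_time"]

print(usable_prefix(idx, {"category_id", "brand_id"}))  # two-column prefix usable
print(usable_prefix(idx, {"brand_id"}))                 # no prefix -> index unusable
```

This is also why one composite index can replace several of the single‑column indexes in the "bad" example, as long as queries lead with category_id.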
2. Connection‑Pool Configuration
Setting the application pool size larger than MySQL's max_connections leads to connection‑timeout errors under load.
# Application pool (Spring Hikari)
spring.datasource.hikari.maximum-pool-size=50
spring.datasource.hikari.minimum-idle=10
# MySQL server settings
max_connections = 200
max_connect_errors = 100000
Formula: MySQL max_connections ≥ (app servers × pool size) × 1.2.
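The sizing formula above can be turned into a quick sanity check (a sketch; the 1.2 factor is the 20 % headroom the text recommends):

```python
import math

# Check the text's rule: max_connections >= app_servers * pool_size * 1.2

def min_max_connections(app_servers, pool_size, headroom=1.2):
    """Smallest max_connections value satisfying the headroom rule."""
    return math.ceil(app_servers * pool_size * headroom)

# 3 app servers with a Hikari pool of 50 each need at least 180 connections,
# so the example's max_connections = 200 leaves a safe margin.
print(min_max_connections(3, 50))  # 180
```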
3. Slow‑Query Tuning
The original failing query:
SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;
Without an index this scans 5 M rows and takes 15 seconds.
1. Analyze the execution plan – run EXPLAIN to spot full‑table scans.
2. Create an appropriate index – a composite index on (user_id, status, create_time) covers the WHERE and ORDER BY clauses.
3. Validate the improvement – re‑run the query and compare rows scanned and latency.
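The analysis step can be automated by scanning EXPLAIN output for the access type ALL, which marks a full‑table scan. A sketch, with illustrative row data shaped like MySQL's tabular EXPLAIN columns:

```python
# Flag full-table scans in MySQL EXPLAIN output.
# Each row is a dict keyed by EXPLAIN column names (table, type, key, rows).

def full_scan_tables(explain_rows):
    """Return tables whose access type is ALL (a full table scan)."""
    return [r["table"] for r in explain_rows if r.get("type") == "ALL"]

# Illustrative EXPLAIN rows before and after adding the composite index
before = [{"table": "orders", "type": "ALL", "key": None, "rows": 5_000_000}]
after = [{"table": "orders", "type": "range", "key": "idx_user_status_time", "rows": 200}]

print(full_scan_tables(before))  # ['orders'] -> needs an index
print(full_scan_tables(after))   # [] -> scan eliminated
```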
EXPLAIN SELECT * FROM orders WHERE user_id = 12345 AND status IN ('pending','processing') ORDER BY create_time DESC;
CREATE INDEX idx_user_status_time ON orders(user_id, status, create_time);
-- Before: 5 M rows scanned, 15 s
-- After: 200 rows scanned, 0.01 s
4. Replication Lag
In a financial app, reading from a lagging replica returned stale balances, causing customer complaints. The fix is to force reads from the primary or use read‑write splitting hints.
# Force read from master (Python example)
@read_from_master
def get_user_balance_after_transaction(user_id):
    return UserBalance.objects.get(user_id=user_id)
# Or a SQL hint interpreted by read‑write‑splitting middleware (MySQL itself ignores unknown hints)
SELECT /*+ READ_FROM_MASTER */ balance FROM user_balance WHERE user_id = ?;
Monitoring Essentials
QPS/TPS – queries and transactions per second.
Connection usage – current connections / max connections.
Slow‑query count – number of queries exceeding a latency threshold.
Replication lag – Seconds_Behind_Master metric.
InnoDB buffer‑pool hit rate – 1 − (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests), since Innodb_buffer_pool_reads counts the requests that missed the pool and went to disk.
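The hit‑rate formula can be computed directly from SHOW GLOBAL STATUS counters; a sketch with illustrative counter values:

```python
# InnoDB buffer-pool hit rate from SHOW GLOBAL STATUS counters.
# Innodb_buffer_pool_reads counts logical reads that had to go to disk.

def buffer_pool_hit_rate(read_requests, disk_reads):
    """Hit rate as a percentage; 100 means every read was served from memory."""
    if read_requests == 0:
        return 100.0
    return (1 - disk_reads / read_requests) * 100

# Illustrative counters: 10M logical reads, 50k of which went to disk
print(buffer_pool_hit_rate(10_000_000, 50_000))  # ~99.5 -> healthy
```

As a rough operational guideline, sustained values well below 99 % on an OLTP workload suggest the buffer pool is undersized.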
# Prometheus alert examples
- alert: MySQLSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "MySQL slow query rate too high"
    description: "{{ $labels.instance }} slow query rate > 10/sec"
- alert: MySQLReplicationLag
  expr: mysql_slave_lag_seconds > 30
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MySQL replication lag high"
Automation Scripts
Backup Script
#!/bin/bash
BACKUP_DIR="/data/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="your_database"
mysqldump -u backup_user -p'backup_password' \
--single-transaction \
--routines \
--triggers \
--master-data=2 \
$DB_NAME | gzip > $BACKUP_DIR/${DB_NAME}_${DATE}.sql.gz
# Remove backups older than 7 days
find $BACKUP_DIR -name "*.sql.gz" -mtime +7 -delete
Health‑Check Script
import pymysql
from datetime import datetime
def check_mysql_health():
    try:
        conn = pymysql.connect(host='localhost', user='monitor_user',
                               password='monitor_password', db='information_schema')
        cursor = conn.cursor()
        cursor.execute("SHOW STATUS LIKE 'Threads_connected'")
        current = int(cursor.fetchone()[1])
        cursor.execute("SHOW VARIABLES LIKE 'max_connections'")
        max_conn = int(cursor.fetchone()[1])
        usage = current / max_conn * 100
        if usage > 80:
            send_alert(f"MySQL connection usage high: {usage:.1f}%")
        cursor.execute("SHOW STATUS LIKE 'Slow_queries'")
        slow = int(cursor.fetchone()[1])
        # Additional checks (e.g. slow-query growth rate) could be added here
        conn.close()
    except Exception as e:
        send_alert(f"MySQL health check failed: {str(e)}")

def send_alert(message):
    print(f"[{datetime.now()}] ALERT: {message}")

if __name__ == "__main__":
    check_mysql_health()
Performance Tuning
InnoDB Parameters
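A common rule of thumb (an assumption added here, not a claim from the original) sizes innodb_buffer_pool_size at roughly 50–75 % of RAM on a dedicated database server, keeping it a multiple of innodb_buffer_pool_chunk_size × innodb_buffer_pool_instances (default chunk: 128 MB). A quick calculator:

```python
# Buffer-pool sizing rule of thumb (~50-75% of RAM on a dedicated server).
# MySQL rounds innodb_buffer_pool_size to a multiple of
# innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances (default chunk 128 MB),
# so this helper snaps the target down to the nearest valid multiple.

CHUNK_MB = 128

def buffer_pool_size_mb(ram_gb, fraction=0.7, instances=8):
    """Suggested innodb_buffer_pool_size in MB (illustrative, not official guidance)."""
    target_mb = int(ram_gb * 1024 * fraction)
    unit = CHUNK_MB * instances
    return max(unit, (target_mb // unit) * unit)

# 12 GB of RAM at a 70% target with 8 instances
print(buffer_pool_size_mb(12))  # 8192 MB, i.e. the 8G used in the config below
```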
[mysqld]
innodb_buffer_pool_size = 8G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 1G
innodb_log_files_in_group = 3
innodb_flush_log_at_trx_commit = 2  # flushes the redo log ~once per second; can lose up to 1 s of commits on an OS crash, use 1 for strict durability
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
Query Cache
MySQL 8.0 removed the query cache because it becomes a bottleneck under high concurrency. For 5.7 and earlier, disable it for write‑heavy workloads:
query_cache_type = 0
query_cache_size = 0
Security Hardening
Principle of Least Privilege
-- Dedicated application user
CREATE USER 'app_user'@'192.168.1.%' IDENTIFIED BY 'StrongPassword123!';
GRANT SELECT,INSERT,UPDATE,DELETE ON app_db.* TO 'app_user'@'192.168.1.%';
-- Backup user
CREATE USER 'backup_user'@'localhost' IDENTIFIED BY 'BackupPassword456!';
GRANT SELECT, LOCK TABLES, SHOW VIEW ON *.* TO 'backup_user'@'localhost';
-- Monitoring user
CREATE USER 'monitor_user'@'localhost' IDENTIFIED BY 'MonitorPassword789!';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'monitor_user'@'localhost';
SQL‑Injection Prevention
# Bad: string concatenation
def get_user_bad(user_id):
    sql = f"SELECT * FROM users WHERE id = {user_id}"
    cursor.execute(sql)
# Good: parameterized query
def get_user_good(user_id):
    sql = "SELECT * FROM users WHERE id = %s"
    cursor.execute(sql, (user_id,))
Future Directions
Cloud‑native MySQL: operators for Kubernetes, DBaaS (RDS, CDB), serverless on‑demand instances.
Emerging storage engines: MyRocks (RocksDB‑based), TokuDB (high compression, though since deprecated by Percona), ColumnStore (OLAP‑optimized).
AI‑assisted ops: automatic index recommendation, anomaly detection, dynamic parameter tuning.
Multi‑active architectures evolving from master‑slave to distributed active‑active clusters.
Tool Recommendations
Monitoring: Prometheus + Grafana, Percona Monitoring and Management (PMM), MySQLTuner.
Management: phpMyAdmin, Navicat, Percona Toolkit.
Backup: Percona XtraBackup, mydumper, Binlog2SQL.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.