How a Startup Solved Midnight MySQL Timeouts: Slow‑SQL Diagnosis & Caching

During its nightly traffic peak, a social-e-commerce startup suffered hour-long outages caused by MySQL timeouts. By analyzing traffic spikes, CPU usage, and slow-SQL logs, the team traced the problem to an uncached ranking query and a homepage cache refresh that ended up running on a 20-minute cycle, then fixed it with targeted caching, a slow-query kill script, and a static fallback page.

ITPUB

Jan 19, 2022

Introduction

A social-e-commerce startup suffered a one-hour outage every night between 22:00 and 23:00. All web and app requests timed out because the MySQL database became unresponsive. This article walks through the analysis, the root causes, and the remediation steps.

System Overview

The system runs on a public cloud with Nginx as the front‑end gateway. Multiple micro‑services handle business logic, and data is stored in a single MySQL instance with Memcached used for front‑end caching. Data is not strictly isolated per service, a design choice to accommodate rapid business changes.

Symptom Observation

Monitoring showed that during the peak hour the entire site was inaccessible, while outside that window the system recovered automatically, indicating the problem was tied to high request volume rather than a permanent service crash.

Initial Analysis

1. Traffic Spike: Log analysis revealed that the 22:00-23:00 window coincides with the highest user activity for content-driven apps.

2. CPU Utilization: MySQL CPU usage hit 100% during the outage, a classic sign of slow-SQL execution.

3. Slow-SQL Logs: Examination of the slow-SQL log highlighted a particularly heavy query used for a “viral-ranking” feature.
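The slow-SQL step can be approximated with a short script. The sketch below assumes the standard text format of the MySQL slow query log, in which each entry carries a `# Query_time: ...` header line followed by the statement. The function names `normalize` and `summarize_slow_log` are hypothetical, and this is not the tooling the team used.

```python
import re
from collections import defaultdict

# Header line that MySQL writes before the statement of each slow-log entry.
QUERY_TIME = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")


def normalize(sql):
    """Replace literals with '?' so one statement with different values groups together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted strings
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare integers
    return re.sub(r"\s+", " ", sql).strip().rstrip(";")


def summarize_slow_log(text):
    """Return [(statement, total_seconds, count)], largest total time first."""
    entries = []  # one [seconds, [statement lines]] pair per log entry
    for raw in text.splitlines():
        line = raw.strip()
        match = QUERY_TIME.match(line)
        if match:
            entries.append([float(match.group(1)), []])
        elif not entries or not line or line.startswith("#"):
            continue  # preamble, blank line, or another header line
        elif line.lower().startswith(("set timestamp", "use ")):
            continue  # bookkeeping statements the server adds to each entry
        else:
            entries[-1][1].append(line)

    totals = defaultdict(lambda: [0.0, 0])
    for seconds, lines in entries:
        if lines:
            bucket = totals[normalize(" ".join(lines))]
            bucket[0] += seconds
            bucket[1] += 1
    ranked = [(sql, total, count) for sql, (total, count) in totals.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Ranking by total time rather than by the single slowest run tends to surface queries that are both slow and frequent, which is the pattern described for the ranking feature.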

Identified Problem

The problematic query (shown below) aggregates follower counts without any caching, causing massive load during traffic peaks.

select fo.FollowId as vid, count(fo.id) as vcounts
from follow fo, user_info ui
where fo.userid = ui.userid
  and fo.CreateTime between str_to_date(?, '%Y-%m-%d %H:%i:%s')
  and str_to_date(?, '%Y-%m-%d %H:%i:%s')
  and fo.IsDel = 0
  and ui.UserState = 0
group by vid
order by vcounts desc
limit 0,10

After adding a cache for this ranking, the query disappeared from the slow‑SQL log, but the outage persisted.
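A cache in front of the ranking query can be sketched as follows. The system described here uses Memcached; in this sketch an in-process dictionary stands in for it so the example runs on its own. `TTLCache`, `get_or_load`, and the `loader` callback are hypothetical names, and the 600-second lifetime in the usage note simply mirrors the 10-minute cache mentioned in the optimization steps.

```python
import time


class TTLCache:
    """Cache-aside helper: return a cached value until it expires, then reload it."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested without sleeping
        self._store = {}            # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]           # still fresh: the database is not touched
        value = loader()            # expired or missing: run the expensive query once
        self._store[key] = (now + self.ttl, value)
        return value
```

A caller would create `TTLCache(ttl_seconds=600)` once and wrap the ranking query as the `loader`, so that at most one execution per key reaches MySQL in each 10-minute window per process. This sketch does not guard against several requests reloading the same expired key at once, which a real deployment would likely need to handle.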

Further CPU monitoring revealed a regular 20-minute wave in utilization that did not correlate with request volume. Investigation showed that a single homepage cache refresh took up to 15 minutes. Because a run could not finish within the 10-minute refresh interval, the following run was pushed back to the next slot, so refreshes effectively started every 20 minutes, and each one produced a CPU spike that overwhelmed MySQL.
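The arithmetic behind the 20-minute wave can be checked with a small simulation. It assumes a scheduler that fires on fixed ticks and skips a tick while the previous run is still active; the source does not describe the scheduler, so this behavior and the function name `refresh_start_times` are assumptions.

```python
def refresh_start_times(interval_min, duration_min, horizon_min):
    """Start times of a periodic job that is skipped while the previous run is active."""
    starts = []
    busy_until = 0
    for tick in range(0, horizon_min, interval_min):
        if tick >= busy_until:          # previous run has finished, so this tick fires
            starts.append(tick)
            busy_until = tick + duration_min
    return starts


# A 15-minute refresh on a 10-minute schedule only starts on every second tick.
print(refresh_start_times(10, 15, 60))   # [0, 20, 40]
# Once the refresh fits inside the interval, every tick fires again.
print(refresh_start_times(10, 8, 60))    # [0, 10, 20, 30, 40, 50]
```

Under this assumption, shortening the refresh below the interval length is what restores the intended 10-minute cadence.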

Optimization Steps

Cache the Ranking: Implemented a 10-minute cache for the ranking query, eliminating its load.

Improve Cache Refresh: Reduced the refresh time and staggered the schedule to avoid overlapping with traffic peaks.

Deploy a Slow-SQL Killer: Added a script that runs every minute and kills any query that has been running for longer than one minute, so that a single slow query cannot block the database.

Static Fallback Page: Configured Nginx to serve a minimal static homepage when the dynamic homepage times out, ensuring basic navigation remains available.

Long-Term Recommendations: The team recommended data isolation per micro-service, separate read replicas for analytics, and more aggressive cache-eviction policies.
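The selection logic of the slow-SQL killer might look like the sketch below. It assumes the rows come from MySQL's `SHOW FULL PROCESSLIST` (columns such as `Id`, `Command`, `Time`, `Info`) and that the resulting `KILL QUERY` statements are executed through whatever database driver the script uses. Fetching and executing are left out so the example stays self-contained, and `kill_statements` is a hypothetical name rather than the team's script.

```python
def kill_statements(processlist, max_seconds=60):
    """Build KILL QUERY statements for statements running longer than max_seconds.

    processlist: iterable of dicts shaped like SHOW FULL PROCESSLIST rows.
    """
    statements = []
    for row in processlist:
        # Only threads that are actively executing a statement are candidates;
        # idle connections report "Sleep" and are left alone.
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append(f"KILL QUERY {int(row['Id'])}")
    return statements
```

`KILL QUERY` ends the running statement but keeps the connection open, which is gentler on application connection pools than killing the whole connection. Run from a scheduler once per minute, this matches the one-minute threshold described above; a production version would probably also want an allow-list for known long-running jobs.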

Results

Post‑deployment monitoring showed the CPU utilization curve returned to normal, and the nightly outage disappeared. The combination of caching, proactive query termination, and a static fallback restored system stability.

Lessons Learned

Always evaluate the data volume and execution cost of SQL statements before deployment.

Cache heavy‑read queries, especially those displayed on high‑traffic pages.

Implement automated monitoring and remediation for slow queries.

Design a graceful degradation path (static fallback) for critical front‑end services.
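The degradation path in the last point can also be illustrated at the application level. The team implemented it in Nginx; the sketch below only mirrors the idea in Python, and `homepage_with_fallback`, `STATIC_HOMEPAGE`, and the 2-second default timeout are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical minimal page that keeps basic navigation available.
STATIC_HOMEPAGE = '<html><body><a href="/">Home</a></body></html>'

# Shared worker pool so a slow render does not block the calling thread.
_executor = ThreadPoolExecutor(max_workers=4)


def homepage_with_fallback(render_dynamic, timeout_seconds=2.0):
    """Return the dynamic homepage, or the static one if rendering takes too long."""
    future = _executor.submit(render_dynamic)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The render keeps running in the background; the user gets the static page now.
        return STATIC_HOMEPAGE
```

Because the timed-out render still occupies a worker, this approach limits user-facing latency but does not by itself reduce database load, which is why it complements the caching and query-kill measures rather than replacing them.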

Further Improvements

Future work includes stricter data isolation, moving non‑critical queries to read‑only replicas, and upgrading to a master‑slave MySQL architecture to further reduce the impact of heavy read workloads.
