Designing a System That Can Survive Sudden Spikes of One Million QPS
The article analyzes why simply adding Redis nodes cannot handle a sudden million‑QPS surge, then presents three practical solutions—key sharding, multi‑level caching with hot‑key detection, and distributed‑lock‑based fallback—to build a resilient high‑concurrency backend.
When a news event such as a celebrity breakup goes viral, the associated service can experience an instantaneous surge of up to one million queries per second (QPS). The article examines how to architect a backend that can absorb such spikes without collapsing.
1. Conventional thinking – adding more Redis nodes
Redis is a core component of high-concurrency systems, and a single Redis instance can handle roughly 100k QPS. The naive idea is to scale the Redis cluster horizontally to 20 machines, for a nominal aggregate capacity of about 2 million QPS. However, Redis Cluster maps each key to one of 16,384 hash slots, and each slot is owned by a single shard, so all traffic for a hot key still lands on one node while the others sit nearly idle. That node cannot sustain a million QPS when its ceiling is roughly 100k; once it fails, the load falls through to the downstream MySQL and can take down the whole system.
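The fixed key-to-shard routing can be illustrated with a minimal sketch. It assumes the standard Redis Cluster slot formula (CRC16 of the key, modulo 16384) and ignores hash tags; the helper `node_for_slot` and its even split of slots over 20 nodes are hypothetical, since a real cluster assigns slot ranges through its own configuration.

```python
import binascii

SLOT_COUNT = 16384  # Redis Cluster divides the key space into 16384 hash slots


def key_slot(key: str) -> int:
    """Return the hash slot for a key: CRC16 (XMODEM variant) modulo 16384.

    Hash tags ({...}) are not handled, to keep the sketch short.
    """
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT


def node_for_slot(slot: int, node_count: int) -> int:
    """Hypothetical layout: split the slot range evenly over node_count nodes."""
    return slot * node_count // SLOT_COUNT


# Every request for the same key resolves to the same slot, hence the same node,
# no matter how many nodes the cluster has.
slots = {key_slot("hot_key") for _ in range(1000)}
nodes = {node_for_slot(slot, 20) for slot in slots}
```

Both sets end up with exactly one element, which is why adding nodes leaves the hot key's load where it was.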
2. Data‑sharding solution
Because a hot key is bound to a single shard, the article proposes splitting it into many physical copies, e.g., hot_key_1 … hot_key_100. On each read the client appends a random number from 1 to 100 to the original key, so requests are spread across the sub-keys and, through them, across the cluster instead of concentrating on one node. This approach requires up-front design to make sure the 100 sub-keys actually land evenly on the Redis nodes, and it introduces consistency challenges: an update to the logical hot key has to modify all 100 physical keys.
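A rough sketch of the read and write paths under this scheme follows. A plain dictionary stands in for the Redis cluster, and the function names are illustrative only. In a real deployment the 100 writes are separate commands that are not atomic as a group, so readers may briefly see a mix of old and new values.

```python
import random

SHARD_COUNT = 100  # number of physical sub-keys, as in hot_key_1 ... hot_key_100

store = {}  # plain dict standing in for the Redis cluster (assumption)


def sub_key(logical_key: str, index: int) -> str:
    """Build the physical key name for one copy of the logical key."""
    return f"{logical_key}_{index}"


def write_hot_key(logical_key: str, value: str) -> None:
    """Update every physical copy; this fan-out is the consistency cost."""
    for index in range(1, SHARD_COUNT + 1):
        store[sub_key(logical_key, index)] = value


def read_hot_key(logical_key: str):
    """Read one randomly chosen copy so that load spreads over all sub-keys."""
    index = random.randint(1, SHARD_COUNT)  # inclusive on both ends
    return store.get(sub_key(logical_key, index))


write_hot_key("hot_key", "v1")
value = read_hot_key("hot_key")
```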
3. Multi‑level cache solution
A mature design avoids sending all traffic to Redis by introducing a multi‑level cache. The first level is a local in‑process cache (e.g., Caffeine). When a request arrives, the service first checks the local cache; a hit returns the data immediately, shielding Redis from the bulk of the load. Only a small fraction of requests miss the local cache and fall through to Redis.
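The lookup order described above can be sketched as follows. This is not Caffeine's API: `TwoLevelCache` is a hypothetical class, a dictionary stands in for Redis, and the three-second default TTL is only an example value.

```python
import time


class TwoLevelCache:
    """Local in-process cache in front of a remote store (a dict stands in for Redis)."""

    def __init__(self, remote: dict, ttl_seconds: float = 3.0, clock=time.monotonic):
        self._remote = remote
        self._ttl = ttl_seconds
        self._clock = clock
        self._local = {}        # key -> (value, expires_at)
        self.remote_reads = 0   # how many lookups fell through to the remote store

    def get(self, key):
        entry = self._local.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]     # local hit: the remote store is not touched
        self.remote_reads += 1  # local miss or expired entry
        value = self._remote.get(key)
        if value is not None:
            self._local[key] = (value, self._clock() + self._ttl)
        return value


cache = TwoLevelCache({"hot_key": "v1"})
for _ in range(1000):
    cache.get("hot_key")
```

Within one TTL window only the first of the 1,000 lookups reaches the remote store; the rest are served from process memory.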
Because sudden spikes are unpredictable, large-scale systems deploy a hot-key detection framework. If accesses to a key exceed a threshold (e.g., 1,000 within one second), the system marks the key as hot, pushes it via a message queue (MQ) to all service nodes, and pre-loads it into each node's local cache. This proactive propagation prevents the hot key from overwhelming Redis.
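One way to implement the threshold check is a per-key sliding window, sketched below. `HotKeyDetector` is a hypothetical name, and the MQ broadcast is left to the caller: `record` simply returns True at the moment a key first crosses the threshold. A production detector would more likely sample or aggregate counts across nodes than keep every timestamp.

```python
import time
from collections import defaultdict, deque


class HotKeyDetector:
    """Marks a key as hot once it exceeds `threshold` accesses within `window_seconds`."""

    def __init__(self, threshold: int = 1000, window_seconds: float = 1.0,
                 clock=time.monotonic):
        self._threshold = threshold
        self._window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of recent accesses
        self.hot_keys = set()

    def record(self, key: str) -> bool:
        """Record one access; return True only when the key is first marked hot."""
        now = self._clock()
        hits = self._hits[key]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= now - self._window:
            hits.popleft()
        if len(hits) > self._threshold and key not in self.hot_keys:
            self.hot_keys.add(key)
            return True  # the caller would publish the key to the MQ here
        return False


detector = HotKeyDetector()
```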
Local caching introduces consistency concerns. Common remedies are a very short TTL (e.g., three seconds) so that stale data expires quickly, or an invalidation message broadcast through MQ. During a cache rebuild, a distributed lock ensures that only one caller across all service instances queries the database while the others wait or receive a degraded response, preventing a thundering-herd effect on MySQL.
For non‑core traffic, a global circuit‑breaker (e.g., Sentinel) can downgrade or reject requests, guaranteeing that core traffic receives priority processing under extreme load.
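The prioritization idea can be illustrated with a simplified admission gate. This is not Sentinel's API; `PriorityGate` and its two limits are hypothetical, and the sketch only shows non-core requests being rejected before core requests as concurrency rises.

```python
import threading


class PriorityGate:
    """Admits core requests up to `capacity`, non-core only up to `non_core_limit`."""

    def __init__(self, capacity: int, non_core_limit: int):
        self._capacity = capacity
        self._non_core_limit = non_core_limit
        self._lock = threading.Lock()
        self.in_flight = 0

    def try_enter(self, core: bool) -> bool:
        """Return True if the request is admitted, False if it should be degraded."""
        limit = self._capacity if core else self._non_core_limit
        with self._lock:
            if self.in_flight >= limit:
                return False
            self.in_flight += 1
            return True

    def leave(self) -> None:
        """Call when an admitted request finishes."""
        with self._lock:
            self.in_flight = max(0, self.in_flight - 1)


gate = PriorityGate(capacity=10, non_core_limit=6)
```

With these example limits, non-core requests are turned away once six requests are in flight, leaving the remaining capacity for core traffic.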
Summary
Adding more Redis nodes does not help when a hot key is fixed to a single shard.
A hot‑key detection mechanism is required to quickly identify and isolate spikes.
Multi‑level caching offloads the majority of traffic to local memory, reducing Redis pressure.
Distributed locks protect the database during cache miss storms.
Circuit‑breaker degradation prioritizes core requests over non‑core ones.
Lobster Programming
Sharing insights on technical analysis and exchange, making life better through technology.
