How to Scale a Flash‑Sale System from Zero to 1 Million QPS: A Step‑by‑Step Architecture Guide
This article dissects the evolution of a flash‑sale system from a simple monolithic controller to a cloud‑native, micro‑service architecture that can handle over one million requests per second, detailing traffic‑shaping, multi‑level caching, async processing, and inventory‑consistency techniques.
1 Background
A flash sale is a high-concurrency scenario in which millions of users attempt to purchase a limited-stock item within seconds. A typical event may generate over 1 million requests in one second while only a few thousand items are sold, so the system must be both stable and strictly consistent about inventory.
2 Challenges and Design Principles
2.1 Business Pain Points
Instant traffic spikes: QPS can jump from a normal level of about 500 to 800 k or higher.
Oversell risk: non‑atomic read‑modify‑write can cause sales beyond stock.
Read‑heavy, write‑light: >99.5% of requests are queries.
Malicious traffic: bot scripts generate massive invalid requests.
Chain‑reaction failures: a single component fault can cascade to a full outage.
2.2 Core Design Principles
Traffic‑tiered interception – filter invalid requests at client, access, application, and data layers.
Resource isolation – deploy flash‑sale service, cache, and database independently.
Asynchronous processing – off‑load non‑critical paths to a message queue.
Consistency first – guarantee no oversell even at the cost of lower sales.
Monitoring & disaster recovery – real‑time metrics, circuit‑breaker, rapid failover.
3 Five Evolution Stages
3.1 Stage 1 – Monolithic (QPS < 1 000)
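Implementation: a single controller directly executes UPDATE stock SET num = num - 1 WHERE id = ? against the database. Note that without an extra num > 0 condition this statement can also drive stock below zero.
Result: the connection pool saturates at about 800 QPS, CPU hits 100 %, and all requests time out.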
3.2 Stage 2 – Cache Introduction (QPS < 10 000)
Introduce Redis as a cache layer, preload hot inventory, and use the atomic DECR command for stock deduction. After a successful decrement, update the database asynchronously.
Pre‑load inventory into Redis before the event.
Read stock from Redis, bypassing the database.
Use DECR for atomic stock decrement.
Asynchronously persist the change to the database.
Database pressure drops >90 %, response time improves from 500 ms to < 50 ms, QPS reaches 5 000‑10 000. Issues: no rate limiting, cache can still be overwhelmed, single‑point application server.
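One subtlety of the DECR-based deduction described above is that DECR keeps counting below zero once stock runs out, so the caller has to check the returned value and compensate. The sketch below illustrates that check. `FakeRedis` is a hypothetical in-memory stand-in that only mimics the return values of DECR and INCR so the example runs without a server, and `try_deduct` and `run_demo` are illustrative names, not part of any client library.

```python
import threading


class FakeRedis:
    """In-memory stand-in for a Redis client (assumption: only SET/GET/DECR/INCR are modeled)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()  # mimics Redis executing one command at a time

    def set(self, key, value):
        with self._lock:
            self._data[key] = int(value)

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def decr(self, key):
        with self._lock:
            self._data[key] = self._data.get(key, 0) - 1
            return self._data[key]

    def incr(self, key):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + 1
            return self._data[key]


def try_deduct(client, stock_key):
    """Return True if one unit was reserved, False if the item is sold out."""
    remaining = client.decr(stock_key)
    if remaining < 0:
        # DECR went below zero: undo it so the counter does not drift negative.
        client.incr(stock_key)
        return False
    return True


def run_demo(stock=100, buyers=1000, workers=8):
    """Simulate concurrent buyers; return (units sold, final counter value)."""
    client = FakeRedis()
    client.set("stock:sku1", stock)
    successes = []

    def worker(attempts):
        for _ in range(attempts):
            if try_deduct(client, "stock:sku1"):
                successes.append(1)

    threads = [
        threading.Thread(target=worker, args=(buyers // workers,))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(successes), client.get("stock:sku1")
```

The counter may dip below zero for a moment between the DECR and the compensating INCR, but no buyer is told "success" unless the value returned to them was zero or higher.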
3.3 Stage 3 – Clustering & Front‑end Optimization (QPS < 100 000)
Deploy the application servers as a cluster behind an Nginx load balancer, serve static pages from a CDN, add IP-level rate limiting, move Redis to a cluster, and set up MySQL master-slave replication.
Application servers clustered with Nginx load balancer.
Static pages served from CDN to offload origin servers.
Nginx implements IP-level rate limiting plus blacklists and whitelists.
Redis clustered for higher availability and performance.
MySQL master‑slave: reads from replicas, writes to master.
Achieves 50 k‑100 k QPS, page load 10× faster, 99.9 % availability. Problems: rate‑limit logic scattered, no message‑queue peak‑shaving.
3.4 Stage 4 – Microservices & Asynchrony (QPS < 500 000)
Decouple flash‑sale service from the main site, introduce an API gateway for unified traffic control, and use a message queue to turn synchronous order requests into asynchronous processing.
Flash‑sale service fully decoupled and independently deployed.
API gateway provides unified routing, rate limiting, and circuit breaking.
Message queue enables async order creation.
Distributed global rate limiting based on user‑ID and activity‑ID.
Risk‑control system blocks bots and fraudulent requests.
This stage reaches 300 k‑500 k QPS, response < 20 ms, order‑failure rate < 0.1 %. New bottlenecks: hot Redis keys become a pressure point; message backlog must be monitored.
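The circuit breaking mentioned for the gateway can be illustrated with a small state machine. This is a minimal sketch under stated assumptions, not the behavior of any specific gateway product: the class name `CircuitBreaker`, the thresholds, and the injectable `clock` parameter (used only to make the example testable) are all hypothetical.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling dependency.
                return fallback()
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately (for example a "busy, try again" response), which keeps a failing downstream service from dragging the whole request chain down with it.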
3.5 Stage 5 – Cloud‑Native Full‑Link Optimization (QPS > 1 000 000)
Deploy containers on Kubernetes with auto‑scaling, add a local Caffeine cache as a second‑level cache, segment inventory across multiple Redis nodes, implement full‑link observability, and practice chaos engineering.
Kubernetes for container orchestration and elastic scaling.
Caffeine local cache mitigates hot‑key issues.
Inventory segmented across several Redis nodes.
Full‑link monitoring for request tracing and performance analysis.
Chaos engineering exercises validate fault tolerance.
Peak QPS exceeds 1 M, availability rises to 99.99 %, multiple products can flash‑sale concurrently, and recovery time shrinks to seconds.
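Inventory segmentation can be sketched as follows. The list slots stand in for stock counters that would live on separate Redis nodes, and the routing rule (hash the user ID, then probe the other segments when the home segment is empty) is one possible policy chosen for illustration, not necessarily the one used in the system described. All names here are hypothetical.

```python
import zlib


def split_stock(total, segments):
    """Divide total stock as evenly as possible across the given number of segments."""
    base, extra = divmod(total, segments)
    return [base + (1 if i < extra else 0) for i in range(segments)]


class SegmentedInventory:
    """Each list slot stands in for a stock counter on a separate Redis node."""

    def __init__(self, total, segments):
        self.segments = split_stock(total, segments)

    def _home_segment(self, user_id):
        # crc32 is stable across processes, unlike the built-in hash() for strings.
        return zlib.crc32(str(user_id).encode()) % len(self.segments)

    def deduct(self, user_id):
        """Try the user's home segment first, then probe the others in order.

        Returns the index of the segment that served the request, or None if sold out.
        In a real deployment each check-and-decrement would have to be atomic on its node.
        """
        start = self._home_segment(user_id)
        n = len(self.segments)
        for offset in range(n):
            idx = (start + offset) % n
            if self.segments[idx] > 0:
                self.segments[idx] -= 1
                return idx
        return None

    def remaining(self):
        return sum(self.segments)
```

Spreading one hot counter over several keys trades a single hot spot for a little extra work near the end of the sale, when requests may need to probe more than one segment before finding stock.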
4 Core Technical Details
4.1 Multi‑Level Cache
Multi‑level caching brings hot data closer to the user, reducing backend load.
Browser cache: static resources and pages.
CDN cache: global edge nodes serve static assets.
Nginx cache: caches static pages and hot API responses.
Application local cache: Caffeine stores frequently accessed data in-process.
Redis distributed cache: stores inventory, user purchase rights, and other core data.
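The lookup order for the two innermost levels (in-process cache, then the distributed cache, then the origin) can be sketched as below. This is a Python analogue for illustration only: the article's stack uses Caffeine and Redis, whereas here a plain dict with a short TTL plays the local cache, another dict plays Redis, and `loader` is a hypothetical callable standing in for a database query.

```python
import time


class TwoLevelCache:
    """Local in-process cache in front of a shared remote cache."""

    def __init__(self, remote, loader, local_ttl=1.0, clock=time.monotonic):
        self._local = {}        # key -> (value, expires_at)
        self._remote = remote   # dict standing in for the distributed cache
        self._loader = loader   # callable standing in for the database
        self._ttl = local_ttl   # keep short so local copies do not go stale for long
        self._clock = clock

    def get(self, key):
        """Return (value, source) where source names the level that answered."""
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None and entry[1] > now:
            return entry[0], "local"
        if key in self._remote:
            value, source = self._remote[key], "remote"
        else:
            value, source = self._loader(key), "origin"
            self._remote[key] = value
        # Refresh the local copy whichever deeper level supplied the value.
        self._local[key] = (value, now + self._ttl)
        return value, source
```

A short local TTL is a deliberate trade-off: each application instance may serve slightly stale data for up to that interval, which is usually acceptable for display data such as product details but not for the authoritative stock count.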
4.2 Traffic Control & Rate Limiting
Rate limiting is applied at multiple layers.
Front‑end: button disabled after click, max 3 requests per second per user.
Nginx: limit_req module limits each IP to 10 req/s.
Gateway: Redis + Lua sliding‑window implements a global QPS ceiling.
Application: Sentinel provides service‑level and interface‑level limits as a safety net.
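To make the sliding-window idea concrete, here is a single-process sketch. It is not the Redis + Lua version used at the gateway: the per-key deque of timestamps stands in for the per-key data a shared store would hold, and the class name and `clock` parameter are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key within any `window_seconds` span."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self._clock = clock
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self._clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The key could be a user ID, an IP address, or a user-ID and activity-ID pair, depending on which layer applies the limit.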
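Algorithm comparison:
Fixed window – simple, but a burst straddling a window boundary can let through up to twice the limit.
Sliding window – mitigates boundary spikes at the cost of more state and complexity.
Token bucket – bounds the average rate while allowing bursts up to the bucket capacity; more involved to implement.
Leaky bucket – strict, constant outflow rate; cannot absorb bursts.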
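As one concrete instance from the comparison above, a token bucket can be sketched in a few lines. This is a single-process illustration with hypothetical names; a distributed limiter would keep the token count and timestamp in a shared store. The `clock` parameter is injected only so the behavior can be checked deterministically.

```python
import time


class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request spends tokens."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Add the tokens earned since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `rate=2` and `capacity=5`, five back-to-back requests pass, the sixth is rejected, and one more request is admitted after half a second.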
4.3 Oversell Prevention & Inventory Consistency
Solution comparison (performance / consistency / suitable scenario):
Database pessimistic lock (SELECT … FOR UPDATE): poor performance, high consistency, low‑concurrency.
Database optimistic lock (UPDATE … WHERE version = ?): moderate performance, high consistency, medium concurrency.
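Redis distributed lock (SETNX + EXPIRE, or preferably a single SET key value NX PX ttl, since separate SETNX and EXPIRE calls are not atomic): good performance, high consistency, high concurrency.
Redis + Lua script: excellent performance, high consistency, ideal for extreme concurrency.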
Recommended approach: Redis + Lua script.
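-- Stock key
local stockKey = KEYS[1]
-- Key of the set of users who have already purchased
local userKey = KEYS[2]
-- User ID
local userId = ARGV[1]
-- Reject the request if the user has already purchased
if redis.call('sismember', userKey, userId) == 1 then
  return -1 -- user has already purchased
end
-- Check remaining stock; a missing or non-numeric key is treated as sold out
local stock = tonumber(redis.call('get', stockKey))
if not stock or stock <= 0 then
  return -2 -- insufficient stock
end
-- Deduct stock
redis.call('decr', stockKey)
-- Record that the user has purchased
redis.call('sadd', userKey, userId)
return 1 -- deduction succeeded

Final consistency guarantees: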
Database layer re‑checks inventory when creating an order.
Periodic reconciliation task compares Redis and DB stock every minute.
Oversell compensation rolls back Redis and notifies the user if DB deduction fails.
All operations are designed to be idempotent.
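The database re-check and the idempotency requirement can be combined in one transaction, sketched here with the standard-library `sqlite3` module. The schema, the unique constraint on (user_id, item_id), and the return values are assumptions for illustration; the point is that a redelivered request for the same user and item cannot deduct stock twice, and stock can never drop below zero.

```python
import sqlite3


def init_db(stock):
    """Create an in-memory database with one item and an empty orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock (item_id INTEGER PRIMARY KEY, num INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE orders (user_id TEXT, item_id INTEGER, UNIQUE (user_id, item_id))"
    )
    conn.execute("INSERT INTO stock (item_id, num) VALUES (1, ?)", (stock,))
    conn.commit()
    return conn


def create_order(conn, user_id, item_id):
    """Insert the order and deduct stock atomically; safe to call repeatedly."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            # The unique constraint rejects a repeated (user, item) pair.
            conn.execute(
                "INSERT INTO orders (user_id, item_id) VALUES (?, ?)",
                (user_id, item_id),
            )
            # Final re-check at the database layer: only deduct while stock remains.
            cur = conn.execute(
                "UPDATE stock SET num = num - 1 WHERE item_id = ? AND num > 0",
                (item_id,),
            )
            if cur.rowcount == 0:
                raise LookupError("sold out")
    except sqlite3.IntegrityError:
        return "duplicate"   # already processed; treat as success upstream
    except LookupError:
        return "sold_out"    # the order insert above was rolled back
    return "created"
```

A caller that receives "sold_out" here is the case where the Redis reservation would need to be rolled back and the user notified, as described in the list above.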
4.4 Asynchrony & Message Queue
The message queue smooths traffic peaks, decouples services, and ensures reliable processing.
Peak‑shaving: converts bursty traffic into a steady stream.
System decoupling: flash‑sale service and order service can scale independently.
Async handling: non‑critical steps run asynchronously, improving response speed.
Retry mechanism: failed messages are redelivered (at-least-once delivery), so consumers must be idempotent.
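The roles listed above can be illustrated in-process with a bounded queue. This sketch uses Python's standard `queue` module as a stand-in for a real broker, so it shows the shape of the flow rather than any broker's API; `submit`, `drain`, and the in-memory dead-letter list are hypothetical names.

```python
import queue


def submit(order_queue, order):
    """Producer side: enqueue and return immediately; reject when the buffer is full."""
    try:
        order_queue.put_nowait(order)
        return "queued"
    except queue.Full:
        return "rejected"


def drain(order_queue, handler, max_attempts=3):
    """Consumer side: process at its own pace, retry failures, park poison messages."""
    done, dead_letters = [], []
    while True:
        try:
            order = order_queue.get_nowait()
        except queue.Empty:
            break
        for attempt in range(1, max_attempts + 1):
            try:
                handler(order)
                done.append(order)
                break
            except Exception:
                if attempt == max_attempts:
                    # Give up and set the message aside for manual inspection.
                    dead_letters.append(order)
    return done, dead_letters
```

The producer's response time depends only on the enqueue, while the consumer's throughput sets the pace at which orders actually reach the database.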
Queue candidates:
RocketMQ – supports transactional, delayed messages and retries, suited for e‑commerce.
Kafka – high throughput, ideal for log collection and big‑data pipelines.
RabbitMQ – versatile messaging patterns for complex business flows.
5 End‑to‑End Design and Real‑World Case
5.1 Complete Architecture Diagram
[Figure: complete architecture diagram (image not available in this text)]
5.2 Case Study: 618 Flash‑Sale of an E‑Commerce Platform
Background: 3 000 limited-edition phones, expected traffic of 1.2 M QPS, with the goal of zero oversell and a good user experience.
Optimization measures:
Multi-level cache: CDN static pages, Redis warm-up, Caffeine local cache.
Traffic control: front-end sliding captcha, gateway total QPS limit of 800 k, per-user limit of 5 req/s and a maximum of 3 purchases.
Inventory: Redis + Lua atomic decrement, inventory segmented across 10 Redis nodes, final verification in the DB.
Async processing: order creation and payment handled asynchronously via RocketMQ with retry and dead-letter queues; the front end polls for the result.
Results:
Peak QPS: 1.12 M
Average response: 18 ms (99th percentile: 35 ms)
Order success rate: 99.8 %
No oversell observed
System remained stable without any failures
6 Summary
The evolution of the flash-sale system boils down to two actions: block excess traffic at multiple layers, and convert the remaining spike into asynchronous processing. Only a tiny fraction of requests reach the data layer, and the async queue smooths the load. These two mechanisms decide whether the system survives under extreme load.
Architecture & Thinking
🍭 Frontline tech director and chief architect at top-tier companies 🥝 Years of deep experience in internet, e‑commerce, social, and finance sectors 🌾 Committed to publishing high‑quality articles covering core technologies of leading internet firms, application architecture, and AI breakthroughs.
