How Ctrip Scaled Its Ticket Booking System for Flash‑Sale Events
This article analyzes the challenges Ctrip faced when handling massive traffic during ticket flash‑sale events and details the architectural upgrades, caching strategies, database optimizations, supplier integration safeguards, and traffic‑control mechanisms that enabled stable, fast, and consistent booking experiences.
Background
In the post‑pandemic era the travel industry recovered rapidly, and high‑traffic flash‑sale promotions became frequent. Ctrip’s ticket reservation system must handle billions of requests while guaranteeing a smooth booking experience for domestic and international users.
Flash‑Sale Characteristics
Flash‑sale scenarios (e.g., Double‑11, 618, train‑ticket rushes, concert tickets) share three core traits: massive concurrent traffic, strict time sensitivity, and the need for strong consistency with multi‑dimensional purchase limits. Representative events on the ticket platform:
2020‑08‑08 to 2020‑09‑01: "HuiYou Hubei" event, traffic 45× normal (hundreds of thousands of QPS).
2021‑09‑14: Beijing Universal Studios opening, highest sales among competitors.
2023‑09‑15: Wuhan Zoo opening, stable ordering despite supplier failures.
2024‑04‑10: IU global concert, tickets sold out in 10 seconds.
System Goals
Stability: uninterrupted service under peak load.
Accuracy: strong transactional consistency.
Speed: fluid booking experience with rapid confirmation.
Stability Challenges
Redis overload & cache hot‑key
Horizontal scaling alone cannot fix hot keys: each key lives on a single Redis node, so requests for it concentrate CPU load on that node no matter how many nodes are added. The solution is a multi‑level cache with automatic hot‑key detection.
Hot‑key detection promotes any key accessed more than 10 times per second on a single node to a higher‑level cache or local memory, reducing Redis load and latency.
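The promotion rule can be sketched in a few lines of Python. This is a minimal illustration, not Ctrip's implementation: `HotKeyCache`, `backend_get`, and the injectable `clock` are hypothetical names, and the 10-per-second threshold is the only figure taken from the text.

```python
import time
from collections import defaultdict, deque


class HotKeyCache:
    """Local cache in front of a slower store; promotes keys that become hot."""

    def __init__(self, backend_get, threshold=10, window_s=1.0, clock=time.monotonic):
        self.backend_get = backend_get   # hypothetical callable standing in for a Redis GET
        self.threshold = threshold       # accesses per window that trigger promotion
        self.window_s = window_s
        self.clock = clock               # injectable so the logic can be tested
        self.hits = defaultdict(deque)   # key -> timestamps of recent accesses
        self.local = {}                  # promoted hot keys, served from local memory
        self.backend_calls = 0

    def get(self, key):
        if key in self.local:
            return self.local[key]       # hot key: no backend round trip
        now = self.clock()
        stamps = self.hits[key]
        stamps.append(now)
        # Drop accesses that have fallen out of the sliding window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        self.backend_calls += 1
        value = self.backend_get(key)
        if len(stamps) > self.threshold:
            self.local[key] = value      # promote to the local level
            del self.hits[key]
        return value
```

A production version would also need expiry and invalidation for promoted entries; this sketch leaves them out to keep the detection logic visible.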
Large cache keys
Oversized keys cause memory pressure, network blockage, and slower queries.
Trim redundant fields.
Apply higher‑ratio compression.
Split large keys into smaller ones (after evaluating the extra I/O that multiple reads introduce).
Establish a weekly scan to clean up big keys.
After optimization, query latency dropped from ~300 µs to ~100 µs.
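The first two measures, trimming fields and compressing the value, might look like the following sketch. The field names and the choice of JSON plus zlib are assumptions for illustration; the source does not say which serialization or compression codec is used.

```python
import json
import zlib


def pack(record, keep_fields):
    """Keep only the fields readers need, then compress the compact JSON."""
    trimmed = {k: record[k] for k in keep_fields if k in record}
    raw = json.dumps(trimmed, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level=9)   # level 9 favors ratio over CPU time


def unpack(blob):
    """Inverse of pack(): decompress and parse the cached value."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Higher compression levels trade CPU time for smaller values, so the level is worth benchmarking against real payloads before it is fixed.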
Database overload
Cache‑miss storms during flash sales create DB pressure. The original cache‑eviction listener deleted keys whenever data changed, so the reads that followed all missed the cache and fell through to the database, overloading it.
Cache‑cover update: update cache values directly instead of deleting them.
Message aggregation: batch rapid change events into a single update.
Asynchronous cache refresh: queue update tasks for background processing.
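The three measures fit together naturally, as this sketch shows. `CoverUpdater`, `on_change`, and `flush` are hypothetical names; a plain dict stands in for the cache, and `load_from_db` stands in for the database read.

```python
import threading


class CoverUpdater:
    """Aggregates change events and overwrites cache entries in the background."""

    def __init__(self, cache, load_from_db):
        self.cache = cache                # dict standing in for the cache
        self.load_from_db = load_from_db  # hypothetical loader for the fresh value
        self.pending = set()              # keys with unprocessed changes
        self.lock = threading.Lock()

    def on_change(self, key):
        # Message aggregation: repeated events for one key collapse into one entry.
        with self.lock:
            self.pending.add(key)

    def flush(self):
        # Asynchronous refresh: meant to be called by a background worker.
        with self.lock:
            keys, self.pending = self.pending, set()
        for key in keys:
            # Cache-cover update: overwrite in place, never delete,
            # so readers keep hitting the cache while the refresh runs.
            self.cache[key] = self.load_from_db(key)
        return len(keys)
```

Until `flush` runs, readers see the previous value rather than a miss. That is the trade this approach makes: brief staleness in exchange for protecting the database.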
Supplier system instability
Supplier APIs may become slow or rate‑limited under load, jeopardizing order flow.
Peak‑shaving buffer pool: use a message queue to decouple order intake from supplier calls.
Automatic disable‑sale: monitor supplier health and temporarily suspend sales for affected suppliers.
Retry mechanism: periodically retry failed orders with adaptive intervals.
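A compact sketch of how the three safeguards interact is shown below. Everything here is illustrative: `SupplierGateway`, the consecutive-failure threshold, and the exponential backoff in `retry_delay` are assumptions, since the source does not describe the health metric or the retry schedule.

```python
from collections import deque


def retry_delay(attempt, base=1.0, cap=60.0):
    """Assumed adaptive interval: exponential backoff capped at `cap` seconds."""
    return min(cap, base * 2 ** attempt)


class SupplierGateway:
    def __init__(self, call_supplier, max_failures=3):
        self.call_supplier = call_supplier  # hypothetical callable; raises on failure
        self.max_failures = max_failures    # consecutive failures before sales stop
        self.buffer = deque()               # stand-in for the message queue
        self.retry = deque()                # failed orders awaiting retry
        self.failures = 0
        self.on_sale = True

    def submit(self, order):
        """Order intake: only enqueues, never calls the supplier directly."""
        if not self.on_sale:
            return False
        self.buffer.append(order)
        return True

    def drain(self, batch_size):
        """Forward at most batch_size buffered orders, at the supplier's pace."""
        confirmed = []
        for _ in range(min(batch_size, len(self.buffer))):
            order = self.buffer.popleft()
            try:
                self.call_supplier(order)
            except Exception:               # broad catch is acceptable only in a sketch
                self.failures += 1
                self.retry.append(order)
                if self.failures >= self.max_failures:
                    self.on_sale = False    # automatic disable-sale
            else:
                self.failures = 0
                confirmed.append(order)
        return confirmed
```

The value of `batch_size` is what shaves the peak: intake can spike freely while the supplier only ever sees the drain rate.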
Traffic‑Control Strategy
Fine‑grained rate limiting per page and per product prevents a single hot item from overwhelming the system.
SOA‑level interface throttling.
Custom product‑level limits using sliding windows (e.g., 10 × 100 ms windows per second).
Automatic hotspot detection similar to Redis hot‑key logic.
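The product-level sliding window can be sketched as follows, using the 10 × 100 ms layout from the example above. `SlidingWindowLimiter` and its parameters are hypothetical names, and the single-process dict is an assumption; a real deployment would likely keep counters in shared storage.

```python
import time


class SlidingWindowLimiter:
    """Per-product limiter: `limit` requests across the last `buckets` buckets."""

    def __init__(self, limit, buckets=10, bucket_ms=100, clock=time.monotonic):
        self.limit = limit
        self.buckets = buckets       # 10 buckets...
        self.bucket_ms = bucket_ms   # ...of 100 ms each cover one second
        self.clock = clock           # injectable so the logic can be tested
        self.counts = {}             # product_id -> {bucket_index: count}

    def allow(self, product_id):
        current = int(self.clock() * 1000) // self.bucket_ms
        window = self.counts.setdefault(product_id, {})
        oldest = current - self.buckets + 1
        # Discard buckets that have slid out of the window.
        for idx in [i for i in window if i < oldest]:
            del window[idx]
        if sum(window.values()) >= self.limit:
            return False             # this product is over its limit
        window[current] = window.get(current, 0) + 1
        return True
```

Because counters are keyed by product, one hot item exhausting its limit does not affect requests for other products.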
Data Consistency
Accurate stock deduction is critical. Traditional relational DB row‑level locks become a bottleneck.
Solution: asynchronous stock deduction workflow.
Initialize: sync flash‑sale inventory to Redis.
Deduct in Redis at purchase time, then publish a message to asynchronously update the DB.
Return stock: on cancellation, restore the stock in the DB first, then in Redis.
This workflow eliminates row‑level lock contention and supports tens of thousands of orders per minute.
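The workflow above can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the production design: `FlashSaleStock` is a hypothetical name, dicts replace Redis and the relational table, a deque replaces the message queue, and a lock stands in for the atomicity that a Redis-side check-and-decrement would provide.

```python
import threading
from collections import deque


class FlashSaleStock:
    def __init__(self, db_stock):
        self.db_stock = dict(db_stock)     # stand-in for the relational table
        self.cache_stock = dict(db_stock)  # initialize: sync inventory to the cache
        self.outbox = deque()              # stand-in for the message queue
        self.lock = threading.Lock()       # stand-in for an atomic Redis operation

    def deduct(self, sku, qty):
        """Purchase path: touches only the cache, then publishes a message."""
        with self.lock:
            if self.cache_stock.get(sku, 0) < qty:
                return False               # sold out: reject without touching the DB
            self.cache_stock[sku] -= qty
        self.outbox.append((sku, -qty))
        return True

    def apply_outbox(self):
        """Consumer path: apply queued deductions to the DB asynchronously."""
        while self.outbox:
            sku, delta = self.outbox.popleft()
            self.db_stock[sku] += delta

    def give_back(self, sku, qty):
        """Cancellation: restore the DB first, then the cache."""
        self.db_stock[sku] += qty
        with self.lock:
            self.cache_stock[sku] += qty
```

One plausible reason for the DB-then-cache order on returns is that the cache then never advertises stock the database has not yet recorded. The source states the order but not the rationale.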
High Availability & Sustainability
Continuous architectural health governance and dedicated large‑event safeguard plans are essential.
Health metrics cover:
System runtime stability.
Architectural complexity (service count, dependency depth).
Engineering quality.
For major events and holidays, pre‑emptive stress testing and disaster‑recovery plans ensure the system remains operational under extreme load.
Conclusion
The ticket reservation system addresses flash‑sale challenges through multi‑level caching, cache‑cover updates, asynchronous stock handling, supplier‑side safeguards, and fine‑grained traffic control, while maintaining continuous health monitoring and high‑availability planning to sustain performance under massive concurrent traffic.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
