Case Study: JD Mask Reservation System Architecture and Performance Optimization During COVID‑19
This article details how JD.com designed, scaled, and refined its mask reservation backend during the COVID‑19 pandemic. It covers the extreme traffic spikes, hot‑key issues, and MySQL bottlenecks the team encountered, and the circuit‑breaker mechanism introduced to protect the overall transaction system.
When the COVID‑19 pandemic erupted in early 2020, JD.com launched a mask reservation and flash‑sale system to help users obtain scarce protective equipment. The campaign used a reservation‑plus‑flash‑sale ("seckill") model: users first registered a reservation and later purchased the mask during a short flash‑sale window.
The sudden surge in demand caused unprecedented traffic on a single SKU, far exceeding the load seen in typical large‑scale sales events such as 618 or Double‑11. Two main factors contributed to the challenge: (1) user demand concentrated on a single product, creating a "super‑hot" SKU, and (2) the reservation system’s read and write TPS far surpassed historical peaks, with write TPS reaching 70 times the previous maximum.
To handle the load, JD quickly upgraded the reservation system architecture. The system consists of a "Reservation SOA" module that provides three core APIs—fetching reservation info, adding reservation eligibility, and validating eligibility. Data is stored in Redis clusters (one for SKU information, one for user reservations) and MySQL for persistent storage.
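The three APIs described above can be sketched as a small in-memory service. This is only an illustration of the interface shape, not JD's implementation: the class name, method names, and SKU/user identifiers are hypothetical, and plain Python containers stand in for the two Redis clusters and the MySQL table.

```python
from dataclasses import dataclass, field


@dataclass
class ReservationService:
    """In-memory stand-in for the Reservation SOA (all names are hypothetical)."""

    # Stand-in for the SKU-information Redis cluster: sku_id -> reservation info.
    sku_info: dict = field(default_factory=dict)
    # Stand-in for the user-reservation Redis cluster: (sku_id, user_id) pairs.
    user_reservations: set = field(default_factory=set)
    # Stand-in for MySQL: append-only list of persisted reservations.
    persisted: list = field(default_factory=list)

    def get_reservation_info(self, sku_id):
        """API 1: fetch reservation info for a SKU (None if no activity exists)."""
        return self.sku_info.get(sku_id)

    def add_eligibility(self, sku_id, user_id):
        """API 2: record a user's reservation; returns False for a duplicate."""
        if sku_id not in self.sku_info:
            raise KeyError(f"no reservation activity for SKU {sku_id}")
        key = (sku_id, user_id)
        if key in self.user_reservations:
            return False
        self.user_reservations.add(key)
        self.persisted.append(key)  # a real system would write to MySQL here
        return True

    def has_eligibility(self, sku_id, user_id):
        """API 3: validate eligibility before the flash sale lets the user buy."""
        return (sku_id, user_id) in self.user_reservations
```

In this sketch the write path (add_eligibility) touches both the cache stand-in and the persistent stand-in, which is the path that came under the heaviest pressure in the account above.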
During the first day of the event, the team discovered four critical problems:
Missing monitoring for the super‑hot SKU, leading to delayed alerts.
MySQL pagination queries for millions of users became increasingly slow.
No circuit breaker (rate limit) was in place, so the reservation system effectively became a flash‑sale service in its own right and overwhelmed Redis.
Hot‑key contention on the Redis counter that tracks the number of reservations per SKU.
To mitigate these issues, the team implemented several emergency measures:
Ended the ongoing reservation early and migrated existing users to a new reservation session, ensuring they could still participate in the flash‑sale.
Created a dedicated seckill system for the mask SKU to offload the main order flow.
Introduced a circuit‑breaker that stops new reservations once a predefined limit is reached, preventing further traffic spikes.
Optimized the Redis hot‑key by batching updates to the reservation count, reducing write OPS dramatically.
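The last two measures can be illustrated together. The sketch below is a minimal single-process model under stated assumptions: ReservationGate, BatchedCounter, and simulate are hypothetical names, the limit and batch size are arbitrary, and a plain dictionary stands in for the Redis counter. The article does not describe how JD actually batched the updates, so this shows only the general idea of accumulating increments locally and writing the sum once per batch.

```python
import threading


class ReservationGate:
    """Circuit breaker: rejects every reservation once the limit is reached."""

    def __init__(self, limit):
        self._limit = limit
        self._accepted = 0
        self._lock = threading.Lock()

    def try_accept(self):
        with self._lock:
            if self._accepted >= self._limit:
                return False  # breaker is tripped; no further writes downstream
            self._accepted += 1
            return True


class BatchedCounter:
    """Accumulates increments locally and flushes the sum in batches."""

    def __init__(self, flush_fn, batch_size):
        self._flush_fn = flush_fn  # called with the accumulated delta
        self._batch_size = batch_size
        self._pending = 0
        self._lock = threading.Lock()

    def incr(self):
        with self._lock:
            self._pending += 1
            if self._pending < self._batch_size:
                return
            delta, self._pending = self._pending, 0
        self._flush_fn(delta)  # one write to the hot key per full batch

    def flush(self):
        """Write out any remainder, e.g. on a timer or at shutdown."""
        with self._lock:
            delta, self._pending = self._pending, 0
        if delta:
            self._flush_fn(delta)


def simulate(requests, limit, batch_size):
    """Return (accepted requests, stored count, writes to the hot key)."""
    store = {"count": 0, "writes": 0}  # stand-in for the Redis counter

    def write_to_store(delta):
        store["count"] += delta
        store["writes"] += 1

    gate = ReservationGate(limit)
    counter = BatchedCounter(write_to_store, batch_size)
    accepted = 0
    for _ in range(requests):
        if gate.try_accept():
            counter.incr()
            accepted += 1
    counter.flush()
    return accepted, store["count"], store["writes"]
```

With 1,000 requests, a limit of 250, and a batch size of 100, simulate accepts 250 reservations and updates the counter with 3 writes instead of 250. The trade-off is that the stored count lags the true count by up to one batch per process, which is usually acceptable for a displayed reservation total.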
The SQL used for migrating users highlighted the MySQL performance bottleneck:
SELECT ...
FROM MEMBER_TABLE a
INNER JOIN (
    SELECT id AS id1
    FROM MEMBER_TABLE
    WHERE ...
    LIMIT #{page.beginIndex}, #{page.step}
) b ON a.id = b.id1

After applying batch updates and enabling rate‑limiting, the "add reservation eligibility" API TPS dropped to 2% of its original load, CPU usage on the Redis master fell from 100% to 13%, and response times improved by over 500×.
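On the pagination bottleneck: the migration query pages with LIMIT offset, step, and an offset-based query has to walk past every skipped row, so later pages get progressively slower as the offset grows. One common alternative, which the article does not say JD adopted, is keyset (seek) pagination, where each page starts from the last id already seen. The sketch below assumes an integer primary key named id and a hypothetical user_id column, and uses SQLite in memory purely as a stand-in for MySQL; the original WHERE clause is elided in the source and is not reproduced here.

```python
import sqlite3


def iter_members_keyset(conn, page_size):
    """Yield pages of (id, user_id) rows, seeking by primary key, not OFFSET."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, user_id FROM member_table "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the next page starts right after this id


def demo():
    """Build a small table and return the page sizes plus the last row read."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE member_table (id INTEGER PRIMARY KEY, user_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO member_table (id, user_id) VALUES (?, ?)",
        [(i, f"user-{i}") for i in range(1, 1001)],
    )
    pages = list(iter_members_keyset(conn, 300))
    conn.close()
    return [len(page) for page in pages], pages[-1][-1]
```

Each page is an index range scan that begins at last_id, so the cost per page stays roughly constant regardless of how far into the table the migration has progressed. This only applies when the rows can be ordered by an indexed, unique column.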
In summary, the case study demonstrates how rapid architectural adjustments, proactive monitoring, circuit‑breaker design, and Redis write‑reduction strategies can enable a reservation system to withstand extreme traffic spikes, protect downstream services, and ensure a stable user experience during crisis‑driven demand.
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.