How Bilibili Scaled Its Membership Store: Async Order Processing and Sharding Strategies
This article details how Bilibili's Membership Store tackled massive traffic spikes by optimizing call chains, introducing asynchronous order processing, and implementing a sharding strategy that split databases and tables, ultimately achieving over 4000 TPS and stable performance during large‑scale promotions.
Background
Bilibili launched its Membership Store in 2017, offering platform‑aligned products such as figures, comics, and JK uniforms. As the business grew, sales expanded from pre‑sale and stock items to full‑payment pre‑sale, blind boxes, and crowdfunding, with channels across Cat Ear (now offline), QQ mini‑programs, and comics. Seasonal promotions like the New Year Festival, the 626 anniversary, and the 919 anniversary generated traffic spikes of several hundred times the normal load, posing a serious challenge to the transaction system.
Performance Challenges
During major promotions, the order‑creation interface suffered from long latency (400 ms+) and limited QPS, leading to poor user experience. Analysis of the original call chain revealed multiple redundant and serial service calls, causing the order flow to be I/O‑bound with low CPU utilization.
Call‑Chain Optimization
The team refactored the order workflow using the chain-of-responsibility pattern, removing redundant calls and invoking independent services (product, shop, activity, user info) concurrently. Key optimizations included:
Concurrent calls to services without dependencies.
Eliminating duplicate calls and consolidating downstream interfaces.
Setting reasonable timeouts (e.g., 200 ms) and connection retries.
Removing external calls from transactional contexts (e.g., MQ, cache).
Asynchronous handling of weak‑dependency operations such as following the shop, caching the user's phone number, and rolling back coupons.
After these changes, average interface latency dropped from ~300 ms to ~200 ms, significantly improving the ordering experience.
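The concurrent-call and timeout points above can be illustrated with a minimal sketch. It assumes an asyncio-based client; `fetch` is a hypothetical stand-in for a downstream RPC, and the service names and simulated delays are illustrative only, not the team's actual code.

```python
import asyncio

TIMEOUT_SECONDS = 0.2  # the 200 ms budget mentioned in the list above


async def fetch(name: str, delay: float) -> dict:
    """Hypothetical stand-in for a downstream RPC; `delay` simulates latency."""
    await asyncio.sleep(delay)
    return {"service": name}


async def call_with_timeout(name: str, delay: float):
    """Return the service result, or None if the call misses the timeout."""
    try:
        return await asyncio.wait_for(fetch(name, delay), timeout=TIMEOUT_SECONDS)
    except asyncio.TimeoutError:
        return None


async def load_order_context(delays: dict) -> dict:
    """Call all independent services concurrently and collect the results."""
    names = list(delays)
    results = await asyncio.gather(*(call_with_timeout(n, delays[n]) for n in names))
    return dict(zip(names, results))


def load_order_context_sync(delays: dict) -> dict:
    """Convenience wrapper for callers that are not already in an event loop."""
    return asyncio.run(load_order_context(delays))
```

With serial calls the total latency is the sum of the individual calls; with concurrent calls it is bounded by the slowest call or the timeout, whichever comes first.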
Asynchronous Order Optimization
In high‑inventory flash‑sale scenarios (e.g., 5,000 units of a limited‑edition figure), throughput plateaued at around 600 TPS, with severe row‑level lock contention in the database and connection exhaustion. To flatten traffic peaks, the team introduced an asynchronous, queue‑based ordering flow.
Orders are validated, an order ID is generated, and the request is placed onto a Databus message queue. Consumers batch‑process up to 20 messages at a time, merging orders, freezing inventory, and applying coupons in parallel before persisting results to MySQL and Redis.
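A rough sketch of the batching step follows. It assumes messages are plain dictionaries and that "merging" means summing quantities per SKU so that each SKU needs only one inventory update per batch; the field names (`sku_id`, `quantity`) are hypothetical, and the Databus client itself is not shown.

```python
from collections import defaultdict

BATCH_SIZE = 20  # maximum messages per batch, as described above


def take_batches(messages, batch_size=BATCH_SIZE):
    """Yield consecutive slices of at most `batch_size` messages."""
    for start in range(0, len(messages), batch_size):
        yield messages[start:start + batch_size]


def merge_by_sku(batch):
    """Sum requested quantities per SKU (assumed meaning of 'merging orders')."""
    totals = defaultdict(int)
    for msg in batch:
        totals[msg["sku_id"]] += msg["quantity"]
    return dict(totals)
```

Under this assumption, a batch of 20 orders for the same item results in a single inventory update instead of 20 competing row locks.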
During the queue wait, the front‑end displays a “high demand, processing” message and polls the order‑status API for up to 30 seconds, with a hard timeout to avoid endless loops.
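The polling behavior can be sketched as below. The status strings, the one-second interval, and the `get_status` callback are assumptions for illustration; only the 30-second limit comes from the description above. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time


def poll_order_status(get_status, timeout=30.0, interval=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the order leaves PROCESSING or the hard deadline passes."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status != "PROCESSING":
            return status
        # Stop if another wait would run past the deadline.
        if clock() + interval > deadline:
            return "TIMEOUT"
        sleep(interval)
```

The deadline is computed once up front, so slow responses from the status API cannot extend the total wait indefinitely.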
Sharding (Database Partitioning) Strategy
By 2020, core tables reached tens of millions of rows, causing master‑slave latency, long DDL windows, and lock contention. The team adopted a sharding approach, selecting mid (merchant ID) and order_id as shard keys.
Four clusters (master‑slave) were deployed, each with four databases and 16 tables per database, totaling 256 tables. Routing rules:
Database index = mid % 16
Table index = (mid % 512) / 32
Mathematical formulation:
Intermediate = MID % (dbCount * tableCount)
Database = Intermediate % dbCount
Table = floor(Intermediate / dbCount)
Open‑source solutions evaluated included Alibaba TDDL/DRDS, Sharding‑Sphere, MyCAT, 360 Atlas, and Meituan Zebra. The team chose the CLIENT mode (e.g., Sharding‑Sphere) for its simplicity and lower overhead.
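The routing rules can be written out as a small sketch. The function names are illustrative, and the counts assume 16 databases (4 clusters x 4 databases) with 16 tables each. Note that the two rules quoted above agree on the database index but derive the table index from different bits of mid, so they are not interchangeable; the source does not say which one runs in production.

```python
DB_COUNT = 16     # 4 clusters x 4 databases
TABLE_COUNT = 16  # tables per database


def route_specific(mid: int):
    """Rule as quoted: database = mid % 16, table = (mid % 512) // 32."""
    return mid % 16, (mid % 512) // 32


def route_general(mid: int, db_count: int = DB_COUNT, table_count: int = TABLE_COUNT):
    """General formulation as quoted, using integer division for floor()."""
    intermediate = mid % (db_count * table_count)
    return intermediate % db_count, intermediate // db_count
```

For example, `route_specific(1000)` gives database 8, table 15, while `route_general(1000)` gives database 8, table 14. Both rules spread keys across all 256 (database, table) pairs.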
Migration Steps
Archive historical data; keep old data in the legacy database.
Gradually route read/write traffic to the new sharded system.
Use binlog listeners to replicate new writes back to the old system for verification.
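The source does not describe how traffic was shifted, so the following is only one possible sketch of a gradual cutover: a percentage switch keyed on mid, so that a given user is consistently served by the same system at a given rollout level. The function name and the modulo-100 bucketing are assumptions.

```python
def use_sharded_store(mid: int, rollout_percent: int) -> bool:
    """Return True if this mid should be served by the new sharded system."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    # Bucket users into 100 stable slots; slots below the threshold use the new system.
    return mid % 100 < rollout_percent
```

Raising `rollout_percent` in steps moves more users over, and setting it back to 0 returns all traffic to the legacy database.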
Results and Conclusion
After implementing call‑chain optimization, asynchronous ordering, and sharding, the system sustained over 4000 TPS in load tests and handled real‑world promotion spikes without incidents. The comprehensive refactor improved latency, reduced database contention, and ensured stable operation for future high‑traffic events.
This article has been distilled and summarized from source material, then republished for learning and reference.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
