Design and Implementation of a Scalable Reward System for Bilibili Live Platform
This article presents a scalable, message‑queue‑driven reward system for Bilibili Live that unifies diverse reward types and distribution scenarios through standardized APIs, layered fast/slow queues, idempotent processing, multi‑stage retries, and comprehensive monitoring to ensure low latency, no over‑issuance, and reliable delivery.
Background – The Bilibili live streaming platform has rapidly grown, and to attract more viewers and high‑quality streamers it launches various interactive activities (festivals, tasks, leaderboards, lotteries) that all share a common scenario: rewarding users. To support diverse reward types and distribution scenarios, a generic reward system was designed.
Requirement Analysis
The system must address two sides: (1) integration of reward‑distribution scenarios, and (2) integration of reward‑type configurations. Standardized interfaces are defined to simplify future extensions.
From the business side, the system must handle many distribution scenarios such as leaderboard rewards, task completion rewards, lottery rewards, and quiz results. From the reward‑type side, it must support both entitlement items (avatar frames, titles, room skins) and value items (batteries, gold seeds, packages), each with different configuration attributes and downstream throughput capabilities. Therefore, per‑reward throttling limits are required.
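Per‑reward throttling can be sketched as a token bucket sized to each downstream's capacity. The sketch below is illustrative only: the article does not describe the limiter's implementation, and all names (tokenBucket, Allow) are assumptions.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tokenBucket is a minimal per-reward-type rate limiter sketch.
// rate is the downstream's sustainable QPS; capacity is the allowed burst.
type tokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64 // tokens replenished per second
	last     time.Time
}

func newTokenBucket(rate, capacity float64) *tokenBucket {
	return &tokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: time.Now()}
}

// Allow reports whether one issuance may proceed now; callers that get
// false would leave the message in the queue for a later attempt.
func (b *tokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	// e.g. a slow downstream: near-zero refill, burst of 2
	limiter := newTokenBucket(0.001, 2)
	allowed := 0
	for i := 0; i < 5; i++ {
		if limiter.Allow() {
			allowed++
		}
	}
	fmt.Println(allowed) // the burst of 2 passes; the rest are throttled
}
```

Keeping one bucket per reward type lets a slow downstream (e.g. package delivery) throttle independently without delaying fast channels.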
Latency is critical for high‑priority business flows; the system must guarantee low latency, no over‑issuance, and no missed issuance.
Architecture Overview
The overall architecture adopts a message‑queue‑based peak‑shaving and asynchronous processing model. Key layers:
Access Layer – defines integration standards and provides APIs for reward issuance, reclamation, wearing, and canceling.
Configuration Layer – configures reward packages (leaderboard, task, lottery) and supports new reward‑type onboarding.
Service Layer – processes requests, splits packages, records issuance status, manages secondary queues, and dispatches to downstream reward channels.
Storage Layer – includes databus (MQ) for high‑throughput, MySQL for core mappings and logs, TaishanKV for extended configs, Redis for secondary and delayed‑retry queues, and HDFS for offline analytics.
Offline Layer – provides data retrieval, anomaly scanning, log aggregation, and metric collection.
Detailed Design
The issuance flow follows four serial modules: upstream business → primary queue → secondary queue → downstream delivery. The design addresses the three core requirements:
Low latency – separate fast and slow channels; high‑priority business can use the Fast channel.
No over‑issuance – full‑link idempotency using a composite key (source+msg_id) and unique DB constraints.
No missed issuance – layered retry mechanisms at each asynchronous step.
1. Fast/Slow Queues
High‑priority traffic goes through a high‑availability MQ (databus) whose topics are split into three priority levels (Fast, Medium, Slow). Low‑priority traffic uses Redis list queues instead, which are easier to scale but less durable; the durability gap is compensated by offline retry.
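The routing decision can be sketched as a simple priority‑to‑channel mapping. The channel names below are assumptions; the article only states that the three MQ priorities live on databus and that low‑priority traffic falls back to Redis lists.

```go
package main

import "fmt"

// Priority levels from the article's layered queue design.
type Priority int

const (
	Fast Priority = iota
	Medium
	Slow
)

// routeChannel maps a message priority to a queue. Topic and key names
// are illustrative: databus topics for the three MQ-backed priorities,
// a Redis list (lower durability, backed by offline retry) otherwise.
func routeChannel(p Priority) string {
	switch p {
	case Fast:
		return "databus:reward-fast"
	case Medium:
		return "databus:reward-medium"
	case Slow:
		return "databus:reward-slow"
	default:
		return "redis:list:reward-low"
	}
}

func main() {
	fmt.Println(routeChannel(Fast))
	fmt.Println(routeChannel(Priority(99)))
}
```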
2. Idempotency
Messages contain a composite unique key (source+msgId+awardTypeId+awardId+uid). The primary processor writes to sharded MySQL tables (uid%64) using this key to guarantee idempotent writes. Downstream services also expose idempotent APIs.
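The key construction and shard selection can be sketched as follows. The field order inside the key, the separator, and the table‑name prefix are assumptions; the article specifies only the key components and the uid%64 sharding.

```go
package main

import "fmt"

// idempotencyKey builds the composite unique key from the article
// (source + msgId + awardTypeId + awardId + uid). A UNIQUE index on
// this key makes repeated INSERTs of the same message fail cleanly,
// which is what guarantees idempotent writes.
func idempotencyKey(source int64, msgID string, awardTypeID, awardID, uid int64) string {
	return fmt.Sprintf("%d:%s:%d:%d:%d", source, msgID, awardTypeID, awardID, uid)
}

// shardTable picks the sharded MySQL table by uid % 64, as in the
// article; the "award_record_" prefix is hypothetical.
func shardTable(uid int64) string {
	return fmt.Sprintf("award_record_%02d", uid%64)
}

func main() {
	fmt.Println(idempotencyKey(1001, "msg-abc", 3, 42, 12345))
	fmt.Println(shardTable(12345)) // 12345 % 64 = 57 -> award_record_57
}
```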
3. Retry Mechanism
Secondary‑queue consumers forward to downstream services. Failures trigger retries stored in a tertiary Redis ZSET‑based delayed queue with exponential back‑off (2^0 s → 2^12 s, up to 13 attempts). After exhausting retries, the issue is escalated for manual intervention.
4. Offline Compensation
A scheduled job periodically scans the offline store for uncompleted records and re‑issues them, ensuring eventual delivery even in worst‑case scenarios.
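The compensation pass reduces to scanning for records that never reached a terminal state and re‑enqueueing them. The sketch below assumes a simple status field; the record shape and names are hypothetical.

```go
package main

import "fmt"

// record is a hypothetical shape for an issuance log entry scanned
// from the offline store.
type record struct {
	Key    string
	Status string // "pending" | "done"
}

// scanAndReissue re-enqueues every record still not done after the
// online retry window, returning how many were reissued. Because the
// whole pipeline is idempotent, reissuing an already-delivered record
// is harmless.
func scanAndReissue(records []record, reissue func(key string)) int {
	n := 0
	for _, r := range records {
		if r.Status != "done" {
			reissue(r.Key)
			n++
		}
	}
	return n
}

func main() {
	recs := []record{{"a", "done"}, {"b", "pending"}, {"c", "pending"}}
	n := scanAndReissue(recs, func(k string) { fmt.Println("reissue", k) })
	fmt.Println(n) // 2
}
```

Note how idempotency and compensation compose: the scan can be aggressive precisely because duplicate issuance is impossible.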
5. Business Integration
The system defines a concise, extensible message format for upstream services:
{
    "source": "(int64) business source",
    "msg_id": "(string) unique message ID",
    "uids": "([]int64) target user IDs",
    "package_id": "(string) reward package ID",
    "msg_time": "(int64) optional timestamp",
    "extra_data": "(string) JSON for extensions",
    "business_type": "(string) for statistics",
    "business_id": "(string) for statistics",
    "expire_time": "(int64) optional override expiration"
}

Example extra_data fields include the lottery count or live‑room‑specific parameters.
6. Reward Configuration
Reward packages are defined via a visual panel and consist of four parts: package ID, basic attributes (type, ID, quantity, validity), extended attributes (type‑specific parameters), and display configuration. The underlying Go structs are:
// AwardType – reward type definition
type AwardType struct {
	TypeId            int64                    `json:"type_id"`
	TypeName          string                   `json:"type_name"`
	ShowTypeName      string                   `json:"show_type_name"`
	TypeVal           string                   `json:"type_val"`
	AwardIdControl    string                   `json:"award_id_control"`
	IsNum             int64                    `json:"is_num"`
	IsTime            int64                    `json:"is_time"`
	TimeConfigData    []*AwardTypeTime         `json:"time_config_data"`
	ExtendPropControl []*ExtendPropControlItem `json:"extend_prop_control"`
}

type SelectItem struct {
	Id   string `json:"id"`
	Name string `json:"name"`
}

type ExtendPropControlItem struct {
	PropName       string        `json:"prop_name"`
	ColumnName     string        `json:"column_name"`
	PropControl    string        `json:"prop_control"`
	PropSelectData []*SelectItem `json:"prop_select_data"`
}

type AwardTypeTime struct {
	TimeType        int64         `json:"time_type"`
	TimeTypeDesc    string        `json:"time_type_desc"`
	TimeValControl  string        `json:"time_val_control"`
	TimeUnitControl string        `json:"time_unit_control"`
	TimeUnitData    []*SelectItem `json:"time_unit_data"`
}

The backend can expose this metadata so that front‑end panels render automatically without code changes.
7. New Reward‑Type Integration
Standardized issuance and reclamation interfaces allow rapid onboarding of new reward types. Implementations embed AwardTypeDefaultFunc (overriding only the methods they need) and provide the concrete downstream calls.
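The embedding pattern can be sketched as below. AwardTypeDefaultFunc is named in the article, but the interface shape, method signatures, and the AvatarFrameAward example are assumptions.

```go
package main

import "fmt"

// AwardHandler is a hypothetical interface for the standardized
// issuance/reclamation contract described in the article.
type AwardHandler interface {
	Issue(uid, awardID, num int64) error
	Reclaim(uid, awardID int64) error
}

// AwardTypeDefaultFunc supplies safe defaults, so a new reward type
// only overrides the behavior it actually needs.
type AwardTypeDefaultFunc struct{}

func (AwardTypeDefaultFunc) Issue(uid, awardID, num int64) error {
	return fmt.Errorf("issue not implemented for this reward type")
}

func (AwardTypeDefaultFunc) Reclaim(uid, awardID int64) error {
	return nil // many reward types have nothing to reclaim
}

// AvatarFrameAward is a hypothetical entitlement-type implementation:
// it embeds the defaults and overrides only Issue with its downstream call.
type AvatarFrameAward struct {
	AwardTypeDefaultFunc
}

func (AvatarFrameAward) Issue(uid, awardID, num int64) error {
	// the real implementation would call the avatar-frame service here
	fmt.Printf("grant frame %d to uid %d\n", awardID, uid)
	return nil
}

func main() {
	var h AwardHandler = AvatarFrameAward{}
	_ = h.Issue(12345, 7, 1)
	_ = h.Reclaim(12345, 7) // inherited default: no-op
}
```

Go has no inheritance, so "defaults plus overrides" is expressed through struct embedding: the embedded type's methods satisfy the interface until the outer type shadows them.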
8. Data Monitoring
Instrumentation records issuance metrics per reward type, enabling detection of abnormal spikes and facilitating throttling or isolation decisions.
Future Plans
Introduce configuration self‑validation to catch errors early.
Expand monitoring granularity (upstream sources, queue health, exception alerts).
Automate regression testing in staging environments.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.