
Design and Evolution of a Scalable Danmaku Personalized Recommendation System

This article describes how Bilibili transformed its danmaku service from a simple, limited‑recall pipeline into a ten‑fold larger, KV‑store‑backed recommendation architecture that unifies the engineering and AI layers, uses dynamic sharding and Redis locks, and ultimately boosts recall pool size, exposure, and experiment speed while reducing downgrade rates.

Bilibili Tech

Background

Bullet‑screen (danmaku) services have evolved through three stages: basic capability, negative governance, and positive recommendation. After stabilizing the first two stages, Bilibili needed a personalized recommendation layer to select high‑quality danmaku for display, requiring integration of user features and a large recall pool.

Stage 1: Minimal Recommendation on the Original Architecture

The initial solution added a simple recommendation flow: danmaku senders write to a database whose binlog is streamed to the recommendation system; the recommendation system computes a list of danmaku IDs for each video, and the engineering system fetches the content for display.
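The split of responsibilities in this read path can be sketched as follows. This is a minimal illustration, not Bilibili's actual code; the function and store names are assumptions.

```python
# Hypothetical sketch of the Stage 1 read path: the recommendation
# system supplies only danmaku IDs per video, and the engineering
# layer resolves them to full content for display.

def fetch_danmaku_for_display(video_id, recommended_ids, content_store):
    """Resolve a recommended ID list into displayable danmaku,
    skipping IDs whose content is missing (e.g. deleted danmaku)."""
    result = []
    for danmaku_id in recommended_ids:
        content = content_store.get((video_id, danmaku_id))
        if content is not None:
            result.append(content)
    return result

# Example: two of the three recommended IDs resolve to content.
store = {("v1", 101): "first!", ("v1", 103): "lol"}
print(fetch_danmaku_for_display("v1", [101, 102, 103], store))
```

Keeping the recommendation output to bare IDs keeps the AI pipeline's payload small, at the cost of a second lookup on the engineering side.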

Problems of Stage 1

• Limited recall pool: a 15‑minute video was capped at 6,000 danmaku, leading to sparse screens.
• Low quality: because the pool was small and evicted in time order, valuable historic danmaku were discarded.
• Uneven distribution: long videos suffered stretches of empty screen.

Stage 2: Ten‑fold Expansion of the Recall Pool

A dedicated engineering system for personalized recommendation was built, replacing the original danmaku pool. The new system uses a KV store (Taishan) to keep per‑video, per‑minute recall data, and each danmaku is also stored by its ID. Redis distributed locks guarantee consistency. This design supports millions of QPS without additional caching.

Storage Design

The KV store holds only the data needed for recommendation (no full‑danmaku storage), allowing direct reads with high concurrency. The previous three‑level cache (interface cache → second‑level cache → third‑level cache) was removed.
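A per‑video, per‑minute key layout like the one described above can be sketched as below. The key formats are assumptions for illustration, not Taishan's actual schema.

```python
# Illustrative key layout for the KV store: per-video, per-minute
# recall keys plus a per-ID key for each danmaku's recommendation
# data. Formats are assumed, not Bilibili's real schema.

def recall_key(video_id: str, minute: int) -> str:
    """Key for the recall list covering one minute of a video."""
    return f"recall:{video_id}:{minute}"

def danmaku_key(danmaku_id: int) -> str:
    """Key for a single danmaku's recommendation-side record."""
    return f"dm:{danmaku_id}"

print(recall_key("v123", 7))   # recall:v123:7
print(danmaku_key(42))         # dm:42
```

Because every read is a direct point lookup on a small value, the store can serve high concurrency without the layered caches of the old architecture.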

Computation Optimizations

Recall pools are refreshed in 10‑second granularity, supporting up to 1,000 danmaku per 10 seconds. Full‑pool back‑fill (hundreds of billions of danmaku) takes about two days; incremental updates use message queues and Redis locks to keep consistency.
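An incremental update guarded by a per‑bucket lock, as described above, can be sketched like this. The lock is simulated in‑process; in production it would be a Redis `SET NX` lock with an expiry. All names and the cap value's placement are illustrative.

```python
# Sketch of an incremental recall-pool update guarded by a
# per-bucket distributed lock (simulated here with a set).

held_locks = set()

def try_lock(key: str) -> bool:
    """Acquire the lock if free; a stand-in for Redis SET NX."""
    if key in held_locks:
        return False
    held_locks.add(key)
    return True

def unlock(key: str) -> None:
    held_locks.discard(key)

def append_to_bucket(pool, video_id, bucket_10s, danmaku_id, cap=1000):
    """Add a danmaku to its 10-second recall bucket, honoring the
    per-bucket cap, only while holding that bucket's lock."""
    lock_key = f"lock:{video_id}:{bucket_10s}"
    if not try_lock(lock_key):
        return False  # another writer holds the bucket; retry via the queue
    try:
        bucket = pool.setdefault((video_id, bucket_10s), [])
        if len(bucket) < cap:
            bucket.append(danmaku_id)
        return True
    finally:
        unlock(lock_key)

pool = {}
append_to_bucket(pool, "v1", 0, 7)
print(pool)  # {('v1', 0): [7]}
```

A failed lock acquisition simply requeues the update on the message queue, so writers never block each other for long.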

Shard Granularity

Dynamic shard sizes balance bandwidth and QPS: early versions used 6‑minute shards, later adjusted per scenario. Storage shards are 10 seconds to align with the recall strategy.
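The two granularities can be sketched as simple offset‑to‑shard mappings, assuming 10‑second storage shards and a configurable read shard size (6 minutes in early versions). Purely illustrative.

```python
# Sketch of mapping a playback offset (in seconds) to shard keys.
# Storage shards are fixed at 10 s; the read shard size is tunable
# per scenario (360 s matches the early 6-minute shards).

STORAGE_SHARD_SECONDS = 10

def storage_shard(offset_s: int) -> int:
    """10-second storage shard, aligned with the recall strategy."""
    return offset_s // STORAGE_SHARD_SECONDS

def read_shard(offset_s: int, shard_seconds: int = 360) -> int:
    """Coarser read shard that trades bandwidth against QPS."""
    return offset_s // shard_seconds

print(storage_shard(125))  # 12
print(read_shard(125))     # 0
```

Larger read shards mean fewer requests per playback session but bigger responses; decoupling the two sizes lets each be tuned independently.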

Dual‑System Relationship

The original system (TiDB‑backed) remains the source of truth for all danmaku, while the new KV‑based pool is a subset used only for personalized recommendation. In case of failure, the system degrades gracefully to the original pipeline.
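The degradation path can be sketched as a try‑then‑fallback around the two systems. Function names are assumptions for illustration.

```python
# Sketch of graceful degradation: serve from the personalized KV
# pool when possible, and fall back to the original TiDB-backed
# pipeline (the source of truth) on failure or an empty result.

def get_danmaku(video_id, kv_pool_fetch, legacy_fetch):
    """Prefer the personalized pool; degrade to the legacy pipeline."""
    try:
        result = kv_pool_fetch(video_id)
        if result:
            return result, "personalized"
    except Exception:
        pass  # log the error, then fall through to the legacy path
    return legacy_fetch(video_id), "legacy"

# Example: the KV pool raises, so the legacy pipeline serves the request.
def failing_kv(_):
    raise RuntimeError("KV timeout")

print(get_danmaku("v1", failing_kv, lambda v: ["fallback danmaku"]))
```

Because the legacy system holds the full danmaku set, the fallback loses personalization but never loses content.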

Stage 3: Deep Integration of Engineering and Recommendation

Key issues identified:

• Data misalignment between the engineering and AI pipelines.
• Frequent AI‑side degradation caused by heavy model loading and the lack of real‑time eviction.
• Insufficient experiment speed: a full back‑fill of billions of records takes days.

Solutions include merging material and index pools, pre‑eviction in coarse ranking, and enabling hour‑level model updates.

Detailed Design Highlights

• Unified material and index pools live in the same KV database under video‑minute keys, protected by Redis shard locks.
• Three back‑fill paths (incremental updates, index back‑fill, and material back‑fill) cut data write volume by roughly 50%.
• Model versioning fields allow safe rollout and rollback of scoring models.
• Hot‑video handling moves eviction logic into coarse ranking, reducing memory pressure in fine ranking.
• Experiments now cover 90% of view volume while recomputing only 15% of the data, enabling sub‑hour strategy iteration.
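The model‑versioning field mentioned above can be sketched as a stamp on each stored score: a score is reused only if it was produced by the current model, which makes rollout and rollback a matter of changing one version number. Field names here are assumptions.

```python
# Sketch of scoring with a model-version field: each stored score
# carries the version of the model that produced it, and stale
# scores are recomputed on read. Names are illustrative.

CURRENT_MODEL_VERSION = 3

def get_score(record, rescore):
    """Reuse a stored score only if its model version is current;
    otherwise recompute with the current model."""
    if record.get("model_version") == CURRENT_MODEL_VERSION:
        return record["score"]
    return rescore(record)

fresh = {"score": 0.9, "model_version": 3}
stale = {"score": 0.4, "model_version": 2}
print(get_score(fresh, lambda r: 0.7))  # 0.9
print(get_score(stale, lambda r: 0.7))  # 0.7
```

Rolling back simply means lowering `CURRENT_MODEL_VERSION`; scores from the old model become valid again without any back‑fill.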

Benefits and Outlook

The integrated system increased the recall pool by roughly tenfold (e.g., from 6,000 to ~90,000 danmaku for a 15‑minute video), raising exposure by ~30% and improving user experience. Fine‑ranking downgrade rate dropped from 3% to 0.1%, and full‑stack experiments can finish within 10 hours, with small‑scale iterations in under an hour. Future work will focus on faster strategy iteration, real‑time feature availability, and continued stability improvements.
