Design and Implementation of Bilibili's Automated Topic System with AI‑Driven Content Recall
By rebuilding its topic system at the end of 2021, Bilibili introduced an AI‑driven pipeline that automatically discovers, creates, ranks, and populates topics, including hot and cold‑start topics. The pipeline combines real‑time metrics with rule‑based and vector‑based recall to improve content relevance, user interaction, and operational efficiency across the platform.
At the end of 2021 Bilibili completely rebuilt and launched a new topic system to improve user participation, discussion, and content sharing, aiming to provide a better product experience and increase traffic within the app ecosystem.
The system treats a topic as a connector, and high‑quality topics can stimulate content consumption. To achieve this, the team abandoned the legacy system, which was limited by its architecture and product positioning, in favor of a new pipeline that can automatically discover, create, rank, and populate topics.
Problems addressed:
1. Hot topics tied to real‑time events (e.g., World Cup finals) require rapid creation; doing this manually is labor‑intensive and makes timeliness hard to guarantee.
2. During the cold‑start phase, much high‑quality user‑generated content is not tagged with a topic, so the topic fails to include it.
3. Existing high‑quality content for Q&A‑type topics (e.g., college admission advice) needs to be discovered and linked to appropriate topics.
Hot‑topic discovery and workflow integration:
The team splits hot topics into two categories: (a) Bilibili‑specific user‑interest topics and (b) globally trending events. Internal hot‑word detection monitors search query UV spikes, aggregates results, and pushes them via Kafka to backend services and enterprise‑WeChat notifications. Daily hot‑word clouds are generated and displayed in reports.
Hot‑search (internal and external) detection is linked to the topic‑creation backend, enabling automatic topic creation and ranking adjustments based on real‑time metrics.
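The spike detection on search query UV described above can be sketched as follows. This is a minimal illustration, not Bilibili's implementation: the z‑score test, the function name detect_hot_words, and the parameters min_uv and z_threshold are all assumptions, and the Kafka push and enterprise‑WeChat notification steps are left out.

```python
from statistics import mean, pstdev


def detect_hot_words(history, current, min_uv=100, z_threshold=3.0):
    """Return (word, z_score) pairs whose latest search UV spikes above baseline.

    history: dict mapping a query word to a list of past per-window UV counts
    current: dict mapping a query word to its UV count in the latest window
    All thresholds are hypothetical defaults chosen for illustration.
    """
    hot = []
    for word, uv in current.items():
        if uv < min_uv:
            continue  # ignore low-volume queries
        past = history.get(word, [])
        if len(past) < 2:
            continue  # this sketch skips words without enough history
        baseline = mean(past)
        spread = pstdev(past) or 1.0  # avoid division by zero on flat history
        z = (uv - baseline) / spread
        if z >= z_threshold:
            hot.append((word, round(z, 2)))
    # Strongest spikes first
    return sorted(hot, key=lambda pair: pair[1], reverse=True)


history = {
    "world cup": [100, 120, 110, 90],
    "cats": [500, 510, 490, 505],
}
current = {"world cup": 900, "cats": 512, "rare": 5}
hot_words = detect_hot_words(history, current)
```

A production detector would also need a rule for brand‑new queries that have no history at all, which this sketch simply skips.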
Automatic topic ranking:
Real‑time metrics such as square‑page UVCTR, page PV/UV, and button‑click conversion rates are collected via client telemetry, sent to the Polaris system, processed by Flink, and stored in ClickHouse. Draft topics are reviewed, approved, and moved to a candidate pool. A scheduled task re‑ranks candidates every ten minutes, promoting the top‑ranked topics to the homepage while demoting under‑performing ones.
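The periodic re‑ranking step can be illustrated with a small sketch. The source does not give the scoring formula, so the weighted sum of smoothed click‑through and button‑conversion rates below is an assumption, as are the names TopicStats, score, and rerank and every default value. Reading metrics from ClickHouse and scheduling the job every ten minutes are out of scope here.

```python
from dataclasses import dataclass


@dataclass
class TopicStats:
    topic_id: int
    impressions_uv: int  # users who saw the topic entry
    click_uv: int        # users who clicked through
    page_uv: int         # users who visited the topic page
    button_clicks: int   # clicks on the participation button


def score(t, w_ctr=0.7, w_conv=0.3, prior_uv=50, prior_rate=0.05):
    """Hypothetical score: weighted sum of two smoothed rates.

    The prior pulls rates toward prior_rate so that topics with very
    few impressions do not dominate the ranking by chance.
    """
    ctr = (t.click_uv + prior_uv * prior_rate) / (t.impressions_uv + prior_uv)
    conv = (t.button_clicks + prior_uv * prior_rate) / (t.page_uv + prior_uv)
    return w_ctr * ctr + w_conv * conv


def rerank(candidates, slots):
    """Split candidates into (promoted, demoted) by descending score."""
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:slots], ranked[slots:]


pool = [
    TopicStats(1, impressions_uv=1000, click_uv=200, page_uv=200, button_clicks=40),
    TopicStats(2, impressions_uv=1000, click_uv=50, page_uv=50, button_clicks=5),
    TopicStats(3, impressions_uv=10, click_uv=5, page_uv=5, button_clicks=1),
]
promoted, demoted = rerank(pool, slots=2)
```

Smoothing matters in this setting because a freshly approved topic has few impressions, and a raw ratio over a handful of users would swing the ranking on every run.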
Automatic content collection (recall):
A comprehensive recall pool is built on Elasticsearch (ES), using the ID of each "dynamic" (Bilibili's feed post) as the doc_id and covering all content types (articles, images, reposts, etc.). Two recall strategies are employed:
1. Custom rule‑based recall: Operators define inclusion/exclusion rules on title, tags, partitions, etc., and the system queries ES for matching content.
2. AI‑driven recall: A DSSM twin‑tower model trained on labeled positive (user‑bound) and negative (operationally flagged) samples predicts topic‑content relevance. Offline, the model scores candidate content for cold‑start topics; online, it provides real‑time relevance scores for implicit collection scenarios.
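For the rule‑based strategy in the list above, operator rules could be translated into an Elasticsearch bool query roughly as follows. The field names title, tags, and partition and the function build_rule_query are hypothetical; the source does not describe the index mapping. The sketch only builds the request body and does not call a cluster.

```python
def build_rule_query(include_titles, include_tags, partitions,
                     exclude_titles, size=100):
    """Build an Elasticsearch bool query body from operator rules.

    Inclusion rules become `should` clauses (at least one must match),
    partitions become a non-scoring `filter`, and exclusions become
    `must_not`. Field names are assumptions made for illustration.
    """
    should = [{"match_phrase": {"title": kw}} for kw in include_titles]
    if include_tags:
        should.append({"terms": {"tags": list(include_tags)}})

    bool_query = {
        "should": should,
        "minimum_should_match": 1 if should else 0,
        "must_not": [{"match_phrase": {"title": kw}} for kw in exclude_titles],
    }
    if partitions:
        bool_query["filter"] = [{"terms": {"partition": list(partitions)}}]

    return {"size": size, "query": {"bool": bool_query}}


rule_query = build_rule_query(
    include_titles=["World Cup"],
    include_tags=["football"],
    partitions=["sports"],
    exclude_titles=["lottery"],
)
```

Putting the partition rule in filter context keeps it from affecting relevance scores, so ordering is driven only by the inclusion clauses.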
Recall methods:
Keyword‑bag search extracts entities from topic names/descriptions, combines NER, IDF, regex, and LCS scores to build a keyword bag, then boosts ES queries to retrieve top‑N items. This works well for popular topics but may introduce noise for niche topics.
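A keyword bag of this kind might be assembled as in the sketch below. The per‑signal weights, the linear combination, and the names SIGNAL_WEIGHTS, build_keyword_bag, and bag_to_query are assumptions; the source names the signals (NER, IDF, regex, LCS) but not how they are combined. Signal scores are taken as precomputed, non‑negative inputs.

```python
# Hypothetical weights for combining the per-keyword signals.
SIGNAL_WEIGHTS = {"ner": 0.4, "idf": 0.3, "regex": 0.1, "lcs": 0.2}


def build_keyword_bag(signals, top_k=5):
    """Combine per-signal scores into one weight per keyword, keep the top_k.

    signals: dict mapping keyword -> dict of signal name -> score in [0, 1];
             a missing signal counts as 0.
    """
    bag = {
        kw: sum(w * scores.get(name, 0.0) for name, w in SIGNAL_WEIGHTS.items())
        for kw, scores in signals.items()
    }
    top = sorted(bag.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)


def bag_to_query(bag, top_n=50):
    """Turn the keyword bag into a boosted Elasticsearch query body."""
    should = [
        {"match": {"title": {"query": kw, "boost": round(weight, 3)}}}
        for kw, weight in bag.items()
    ]
    return {
        "size": top_n,
        "query": {"bool": {"should": should, "minimum_should_match": 1}},
    }


signals = {
    "World Cup": {"ner": 1.0, "idf": 0.8, "lcs": 0.9},
    "final": {"idf": 0.3},
    "the": {"idf": 0.01},
}
bag = build_keyword_bag(signals, top_k=2)
keyword_query = bag_to_query(bag)
```

Because every keyword is only a should clause, a niche topic with few distinctive keywords can still match loosely related items, which is consistent with the noise the text mentions.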
Vector‑based recall replaces keyword search with dense vector similarity, ensuring semantic alignment between topics and contents. Statistical filters further prune results, achieving high precision even for long‑tail topics.
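The idea behind vector‑based recall can be shown with a small, dependency‑free sketch. The vectors here stand in for embeddings that a model such as the twin‑tower network would produce. The cosine measure, the min_sim cutoff used as a stand‑in for the statistical filters, and the function names are assumptions; a real system would use an approximate nearest‑neighbor index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_recall(topic_vec, content_vecs, top_n=3, min_sim=0.5):
    """Return up to top_n (content_id, similarity) pairs above min_sim.

    content_vecs: dict mapping content_id -> embedding vector.
    min_sim is a hypothetical cutoff standing in for the pruning filters.
    """
    scored = ((cid, cosine(topic_vec, vec)) for cid, vec in content_vecs.items())
    kept = [(cid, sim) for cid, sim in scored if sim >= min_sim]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_n]


topic_vec = [1.0, 0.0]
content_vecs = {
    "a": [0.9, 0.1],  # nearly parallel to the topic vector
    "b": [0.0, 1.0],  # orthogonal, so it should be filtered out
    "c": [0.5, 0.5],  # partially aligned
}
recalled = vector_recall(topic_vec, content_vecs)
```

Unlike keyword matching, the similarity cutoff applies equally to popular and long‑tail topics, since it depends only on the distance between embeddings.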
Project outcomes:
Within one month of launch, over 300 topics were generated automatically; more than 80% of them met the hot‑topic criteria, and they occupied more than half of the hot‑topic list. Each topic received an average of more than five high‑interaction items via AI recall, substantially reducing cold‑start time. For utility (Q&A) topics, AI recall linked more than six relevant items per topic across 8,000+ topics.
Metrics show steady growth in both visitation and interaction across the topic ecosystem, confirming that automated hot‑topic discovery and ranking substantially improve content quality, timeliness, and operational efficiency.
Future plans:
Continue refining the hot‑topic discovery tool, improve the automatic ranking algorithms to raise CTR, and turn the hot‑topic detection and semantic understanding modules into platform services for broader reuse (e.g., push notifications, hot search, recommendation). This will provide a reliable, reusable capability for the Bilibili ecosystem.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.