How Bilibili Scaled Its Tag System for Massive Data and Real‑Time Accuracy

The article details Bilibili's comprehensive redesign of its tag system—including background challenges, architectural layers, technical upgrades like Iceberg integration and shard‑based ClickHouse writes, crowd selection methods, online service guarantees, performance metrics, and future plans—showcasing a data‑driven solution that boosts stability, speed, and business coverage.

dbaplus Community

Dec 24, 2024

Background and Motivation

Tag systems organize and classify information for content management, SEO, recommendation, and user behavior analysis. With the rise of AI/ML, tagging can also learn from data, improving accuracy and personalization. Bilibili launched its tag system in 2021 to reduce ad‑hoc queries, but by mid‑2022 it faced performance bottlenecks, a lack of standards, duplicated custom solutions, and unstructured downstream integration.

Identified Problems

Performance bottleneck: increasing query volume exhausted the existing computation model.

Missing standards: no unified tag construction guidelines forced manual data‑source onboarding.

Redundant implementations: multiple bespoke tag subsystems lacked universality.

Unstandardized downstream integration: downstream services could not reliably consume tag results.

Construction Goals

Accelerate and solidify the data pipeline, opening tag creation to all business units with self‑service while enforcing clear standards and approval processes.

Introduce a multi‑source compute engine to improve stability and performance.

Enable external connectivity to data platforms and internal applications for seamless data flow.

Integrate other data‑product platforms to build a site‑wide unified tag system.

Architecture Design

The system is organized into three vertical layers: tag production, crowd selection, and crowd application. Each layer follows a staged workflow to ensure reliability and scalability.

1) Tag Production

Tag production is the foundation and proceeds through three ordered phases:

Tag definition: business meaning, usage scenarios, attribute design, and generation logic are defined collaboratively with stakeholders.

Build phase: prototypes validate logic and benchmark performance on large data sets.

Production phase: schedule, update cycles, and quality monitoring are configured for stable online operation.

Offline tag production runs daily at midnight via Spark jobs that read source tables, generate tag‑user mappings, and store metadata for downstream crowd selection.
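
To make the shape of a tag‑user mapping concrete, here is a minimal sketch in plain Python rather than Spark. The row layout `(user_id, tag_name, tag_value)` and the sample tags are assumptions made for illustration, not Bilibili's actual schema.

```python
from collections import defaultdict

# Hypothetical source rows: (user_id, tag_name, tag_value).
rows = [
    (1001, "is_vip", 1),
    (1002, "is_vip", 0),
    (1003, "is_vip", 1),
    (1001, "fav_zone", "game"),
]

def build_tag_user_mapping(rows):
    """Group user IDs by (tag, value), the shape crowd selection reads."""
    mapping = defaultdict(set)
    for user_id, tag, value in rows:
        mapping[(tag, value)].add(user_id)
    return dict(mapping)

mapping = build_tag_user_mapping(rows)
```

Each `(tag, value)` key then corresponds to one set of users, which is the unit that later set operations combine.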

2) Crowd Selection

Multiple creation methods support diverse business needs:

Rule‑based creation: a flexible tag engine lets users combine tags to define audiences.

Excel import: a stable bulk‑import API validates and ingests large Excel files.

HTTP link import: services parse external URLs and sync data.

Hive table import: direct SQL queries on Hive tables generate audiences.

DMP sync: integration with a data‑management platform synchronizes crowd packages.

These methods feed into the crowd selection engine, which supports multi‑level set operations (union, intersection, difference) and produces bitmap representations for fast downstream consumption.
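
The set operations can be illustrated with a toy bitmap built on Python integers, where bit `i` is set when user `i` belongs to the crowd. This is only a sketch of the idea: the crowd names and IDs are invented, and a production system would normally rely on compressed bitmaps inside the storage engine rather than raw integers.

```python
def to_bitmap(user_ids):
    """Encode non-negative integer user IDs as a Python int bitmap."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap

def from_bitmap(bitmap):
    """Decode a bitmap back into a sorted list of user IDs."""
    ids, position = [], 0
    while bitmap:
        if bitmap & 1:
            ids.append(position)
        bitmap >>= 1
        position += 1
    return ids

def cardinality(bitmap):
    """Number of users in the crowd."""
    return bin(bitmap).count("1")

# Hypothetical crowds.
vip = to_bitmap([1, 2, 3, 5])
gamers = to_bitmap([2, 3, 8])
banned = to_bitmap([3])
manual_list = to_bitmap([9])

# Multi-level expression: ((vip AND gamers) MINUS banned) OR manual_list.
core = (vip & gamers) & ~banned
audience = core | manual_list
```

Intersection, difference, and union each map to a single bitwise operation, which is why bitmap results are cheap to combine and to hand to downstream consumers.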

3) Crowd Application

The tag system integrates with Bilibili’s analytics platform and AB‑testing framework, enabling:

Targeted push notifications and personalized content.

Audience segmentation for marketing and service optimization.

Real‑time user profiling and behavior analysis.

All applications share a unified data‑driven workflow that eliminates data silos.

Core Solutions

1) Tag Construction Optimization

The initial design stored all tag details in ClickHouse, which led to large bitmap queries and occasional OOM crashes. After introducing Apache Iceberg, tag details and continuous tags are stored in Iceberg, while ClickHouse retains only bitmap data. This reduces ClickHouse load, improves query latency, and cuts storage costs.

Additionally, a custom shard‑aware read/write strategy distributes tag data across ClickHouse shards, achieving load balancing, parallel processing, and query pruning. The shard‑based pipeline hashes user IDs to n*m partitions (n = shard count, m = concurrency factor) and writes bitmap data directly to the target shard.
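
The routing step might look roughly like the sketch below. The values of `N_SHARDS` and `CONCURRENCY`, the CRC32 hash, and the rule that partition `p` lives on shard `p mod n` are all assumptions for illustration; the article only states that user IDs are hashed into n*m partitions and written directly to the target shard.

```python
import zlib
from collections import defaultdict

N_SHARDS = 4      # n: shard count (illustrative value)
CONCURRENCY = 3   # m: concurrency factor (illustrative value)

def partition_of(user_id):
    """Hash a user ID into one of n*m partitions with a stable hash."""
    digest = zlib.crc32(str(user_id).encode("utf-8"))
    return digest % (N_SHARDS * CONCURRENCY)

def shard_of(partition):
    """Assumed mapping: partition p is stored on shard p mod n."""
    return partition % N_SHARDS

def plan_writes(user_ids):
    """Group user IDs by (shard, partition) so each group can be written in parallel."""
    plan = defaultdict(list)
    for uid in user_ids:
        p = partition_of(uid)
        plan[(shard_of(p), p)].append(uid)
    return dict(plan)

plan = plan_writes(range(1, 1001))
```

Because the same function decides where a user is written and where it is read, a lookup for one user only needs to touch `shard_of(partition_of(user_id))`, which is the pruning effect described above.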

Performance gains include a success rate increase from 85 % to 99.9 % for crowd calculations and a 50 % speedup for identical data volumes.

2) Crowd Set Operations

The new workflow replaces a single‑threaded real‑time engine with a DAG‑based task queue. Each crowd selection request is broken into small tasks, assigned to appropriate compute engines (Spark, Trino, etc.), and submitted to a scheduler. The DSL describes rule expressions; the system translates DSL to optimized SQL (Iceberg for continuous tags, ClickHouse bitmap operations for discrete tags) and generates the minimal DAG.
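
A small sketch of how such a request could be split into dependent tasks and ordered for scheduling, using `graphlib` from the Python standard library (3.9+). The task names, the sample rule, and the engine assignments are hypothetical; the article does not publish the real task model.

```python
from graphlib import TopologicalSorter

# Hypothetical rule: (age BETWEEN 18 AND 24) AND (zone IN ('game', 'anime')).
# 'age' is treated as a continuous tag (Iceberg), 'zone' as a discrete tag
# (ClickHouse bitmaps). Engine choices here are assumptions.
tasks = {
    "age_filter":  {"engine": "trino",      "source": "iceberg",    "deps": []},
    "zone_bitmap": {"engine": "clickhouse", "source": "clickhouse", "deps": []},
    "intersect":   {"engine": "clickhouse", "source": "clickhouse",
                    "deps": ["age_filter", "zone_bitmap"]},
    "export":      {"engine": "spark",      "source": "clickhouse",
                    "deps": ["intersect"]},
}

def parallel_stages(tasks):
    """Return lists of task names; tasks within one list can run concurrently."""
    sorter = TopologicalSorter({name: set(spec["deps"]) for name, spec in tasks.items()})
    sorter.prepare()  # raises graphlib.CycleError if the rule graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages

stages = parallel_stages(tasks)
```

Here the two leaf filters have no dependencies and form the first stage, so a scheduler could dispatch them to different engines at the same time before running the intersection.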

Example DSL functions include LESS, BETWEEN, IN, LIKE, and equality/inequality operators for both numeric and string tags. The parser grammar (ANTLR) supports logical AND/OR, parentheses, and comparison operators.

/* Parser (excerpt; a complete ANTLR 4 grammar file also starts with a `grammar <Name>;` header) */
condition
    : condition AND condition           #logicalAnd
    | condition OR condition            #logicalOr
    | LPAREN condition RPAREN           #paren
    | compare                           #logicalOp2
    | variableDeclarator                #commomOp
    ;

compare
    : variableDeclarator op variableDeclarator #logicalOp
    ;

// op is a parser rule (lowercase name) that accepts either comparison token
op : EQ | NE ;

variableDeclarator
    : TRUE              #logicalTrue
    | FALSE             #logicalFalse
    | INT               #commonInt
    | DOUBLE            #commonDouble
    | STRING            #commonString
    | IDENTIFIER        #variable
    ;

/* Lexer */
EQ : '==';
NE : '!=';
OR : '||';
AND: '&&';
NOT: '!';
LPAREN: '(';
RPAREN: ')';
TRUE  : 'true' ;
FALSE : 'false';
INT : [0-9]+;
// whole numbers are tokenized as INT, so DOUBLE only needs the decimal form
DOUBLE : [0-9]+ '.' [0-9]+;
STRING : '"' ('\\"'|.)*? '"' ;
IDENTIFIER: Letter LetterOrDigit*;
fragment LetterOrDigit : Letter | [0-9] ;
fragment Letter : [a-zA-Z$_] ;
// skip spaces, tabs, and line breaks between tokens
WS : [ \t\r\n]+ -> skip;

After optimization, average crowd selection time dropped to ~30 s (120 % efficiency gain).

3) Online Service

The online tag service is a stateless microservice handling real‑time user‑to‑crowd judgments. It guarantees:

Security: high‑concurrency availability and fine‑grained crowd permission control.

Horizontal scalability: independent nodes with no shared state, allowing seamless scaling.

Full lifecycle coverage: from crowd generation through validation, configuration, and activation to deactivation.

Stability reached 99.999 % with sub‑5 ms response latency.

4) Version Management and Traffic Control

Crowd versions are stored in Redis using a KKV schema (user‑id → crowd‑id → version). Up to five recent versions are retained per crowd. A periodic cleanup removes expired keys. Traffic control tables enable gray‑release of new crowd rules: each crowd can be assigned a rollout percentage, and replacements are managed without affecting unchanged crowds.
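
The retention and gray‑release behavior can be sketched with an in‑memory stand‑in for Redis. Everything structural here is an assumption: the class and method names are hypothetical, the nested dictionary only imitates the user‑id → crowd‑id → version layout, and the rule that users outside the rollout percentage keep seeing the previous version is one plausible reading of the traffic control described above.

```python
import zlib
from collections import defaultdict, deque

MAX_VERSIONS = 5  # the article states that up to five versions are kept per crowd

def bucket(user_id):
    """Map a user to a stable bucket in [0, 100) for percentage rollouts."""
    return zlib.crc32(str(user_id).encode("utf-8")) % 100

class CrowdStore:
    """In-memory stand-in for the Redis-backed store (hypothetical layout)."""

    def __init__(self):
        self.members = defaultdict(dict)  # user_id -> {crowd_id: {versions}}
        self.versions = defaultdict(lambda: deque(maxlen=MAX_VERSIONS))
        self.rollout = {}                 # crowd_id -> % of users on newest version

    def publish(self, crowd_id, version, user_ids, rollout_percent=100):
        retained = self.versions[crowd_id]
        if len(retained) == MAX_VERSIONS:
            self._expire(crowd_id, retained[0])  # oldest version falls out
        retained.append(version)
        for uid in user_ids:
            self.members[uid].setdefault(crowd_id, set()).add(version)
        self.rollout[crowd_id] = rollout_percent

    def _expire(self, crowd_id, version):
        """Remove membership records that point at an expired version."""
        for crowds in self.members.values():
            crowds.get(crowd_id, set()).discard(version)

    def serving_version(self, crowd_id, user_id):
        """Newest version for users inside the rollout, previous one otherwise."""
        retained = self.versions[crowd_id]
        if not retained:
            return None
        if len(retained) == 1 or bucket(user_id) < self.rollout[crowd_id]:
            return retained[-1]
        return retained[-2]

    def contains(self, crowd_id, user_id):
        version = self.serving_version(crowd_id, user_id)
        return version in self.members.get(user_id, {}).get(crowd_id, set())
```

Because rollout state is kept per crowd, changing the percentage for one crowd leaves the judgments for every other crowd untouched.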

5) Condition Evaluation

Requests use ANTLR‑generated parsers to evaluate logical expressions such as tag_1 == 1 && (tag_2 == 0 || tag_3 == 1), supporting multi‑crowd set operations.
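
For readers without an ANTLR toolchain at hand, the evaluation step can be approximated with a hand‑written recursive‑descent evaluator. This is a substitute technique, not the ANTLR‑generated parser the service uses; it covers only `==`, `!=`, `&&`, `||`, parentheses, integers, quoted strings, and `true`/`false`, and the `tags` dictionary is a hypothetical per‑user lookup.

```python
import re

TOKEN = re.compile(r'\s*(==|!=|&&|\|\||\(|\)|\d+|"[^"]*"|[A-Za-z_$][A-Za-z0-9_$]*)')
OPERATORS = {"==", "!=", "&&", "||", "(", ")"}

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Evaluator:
    def __init__(self, tokens, tags):
        self.tokens, self.pos, self.tags = tokens, 0, tags

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def parse_or(self):  # lowest precedence
        value = self.parse_and()
        while self.peek() == "||":
            self.take()
            rhs = self.parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and(self):  # binds tighter than ||
        value = self.parse_primary()
        while self.peek() == "&&":
            self.take()
            rhs = self.parse_primary()
            value = value and rhs
        return value

    def parse_primary(self):
        if self.peek() == "(":
            self.take("(")
            value = self.parse_or()
            self.take(")")
            return value
        left = self.operand()
        if self.peek() in ("==", "!="):
            op = self.take()
            right = self.operand()
            return left == right if op == "==" else left != right
        return bool(left)

    def operand(self):
        token = self.take()
        if token in OPERATORS:
            raise SyntaxError(f"unexpected {token!r}")
        if token == "true":
            return True
        if token == "false":
            return False
        if token.isdigit():
            return int(token)
        if token.startswith('"'):
            return token[1:-1]
        return self.tags.get(token)  # identifier: look up the user's tag value

def evaluate(expression, tags):
    evaluator = Evaluator(tokenize(expression), tags)
    result = evaluator.parse_or()
    if evaluator.peek() is not None:
        raise SyntaxError("trailing input")
    return bool(result)
```

Calling `evaluate("tag_1 == 1 && (tag_2 == 0 || tag_3 == 1)", {"tag_1": 1, "tag_2": 1, "tag_3": 1})` walks the expression with `&&` binding tighter than `||`, matching the order of the alternatives in the grammar.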

Results and Impact

Tag production success rate: 99.9 %.

Offline crowd selection latency: <10 s for crowds in the tens of millions, 10‑30 s for hundreds of millions, and 1‑2 min for 10‑50 million when the selection also relies on Iceberg.

Online service stability: 99.999 % with <5 ms latency.

Supported >30 business scenarios across push, activity, task, and risk‑control platforms.

Tag count >3 000, crowd size >100 k.

Future Plans

Introduce real‑time tag generation using Flink or Spark Streaming and close the data‑to‑business feedback loop.

Build a metric‑model library for reusable tag definitions and automate tag creation workflows.

Implement effect monitoring, end‑to‑end data‑linking, and lifecycle management to continuously improve tag efficacy.
