
Scalable Tag System Architecture and Optimization

The rebuilt tag system introduces a three‑layer architecture, standardized pipelines, Iceberg‑backed storage with custom ClickHouse sharding, a DSL for crowd selection, and a stateless online service. It achieves a 99.9% build success rate and sub‑5 ms online latency, supports thousands of tags across dozens of business scenarios, and has real‑time processing and automated lifecycle management on its roadmap.

Bilibili Tech

Background

A tag system is a technology for organizing and classifying information, widely used in content management, SEO, recommendation systems, and user‑behavior analysis. With advances in AI/ML, cloud computing, and big‑data technologies, such systems can now learn automatically, improving tag‑generation accuracy and supporting personalized recommendation. Bilibili launched its tag system in 2021 to solve ad‑hoc query problems, but by mid‑2022 it faced performance bottlenecks, a lack of standards, duplicated solutions, and integration issues.

Construction Goals

Accelerate and solidify data pipelines: open data source ingestion and tag creation to all business units, define clear standards and approval processes.

Introduce a multi‑source compute‑store engine to improve stability and performance.

Build a site‑wide universal tag system with external connectivity to data platforms and internal integration of other data products.

Architecture Design

Implementation Strategy

The reconstruction is carried out from six aspects: technical upgrade, standard establishment, platform‑level management, user participation, cross‑department collaboration, and effect evaluation.

Overall Framework

The system consists of three layers from bottom to top: Tag Production, Crowd Selection, and Crowd Application.

Tag Production

Tag production follows three stages: definition, pre‑construction, and production. Definition clarifies business meaning, attributes, and generation logic. Pre‑construction validates logic and performance. Production configures update cycles, scheduling, and quality monitoring.

Data sources trigger Spark offline jobs at midnight to build tags; metadata about source‑tag bindings is recorded for downstream crowd selection.
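As a sketch of the metadata bookkeeping this implies, a minimal source‑tag registry might look like the following. All table and tag names here are hypothetical, not from the source:

```python
from dataclasses import dataclass

@dataclass
class TagBinding:
    """Metadata linking a source table to the tag it produces (names hypothetical)."""
    tag_id: str
    source_table: str
    update_cycle: str  # e.g. "daily" for the midnight Spark build

class TagRegistry:
    """Records source-tag bindings so crowd selection knows where each tag lives."""
    def __init__(self) -> None:
        self._bindings = {}

    def register(self, binding: TagBinding) -> None:
        self._bindings[binding.tag_id] = binding

    def sources_for(self, tag_ids) -> set:
        # Downstream crowd selection uses this to locate the tables behind a rule.
        return {self._bindings[t].source_table for t in tag_ids if t in self._bindings}

registry = TagRegistry()
registry.register(TagBinding("is_active_7d", "dwd.user_activity", "daily"))
registry.register(TagBinding("fav_category", "dwd.user_interest", "daily"))
print(sorted(registry.sources_for(["is_active_7d", "fav_category"])))
# → ['dwd.user_activity', 'dwd.user_interest']
```

Recording the binding at build time means a crowd‑selection rule can be resolved to the exact upstream tables without re‑scanning anything.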

Crowd Selection

Multiple crowd creation methods are supported: rule‑based, Excel import, HTTP link import, Hive table queries, and DMP synchronization. Each method provides a stable data import interface, validation, and UI where applicable.

Crowd Application

The system integrates with the Polaris event‑tracking platform and AB testing platform, enabling personalized recommendation, targeted marketing, user segmentation, and service optimization.

Core Solutions

3.1 Tag Build Optimization

3.1.1 Iceberg Support

The original design stored all tag details in ClickHouse, causing oversized queries and OOM risks. By introducing Apache Iceberg, detailed tag data and continuous tags are stored in Iceberg, while ClickHouse stores only bitmap data, reducing load and storage cost.
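A toy illustration of this storage split, with Python sets standing in for the RoaringBitmaps and the Iceberg/ClickHouse tables used in practice (all column names hypothetical):

```python
# Detail rows with continuous values would live in Iceberg.
detail_rows = [
    {"user_id": 1, "watch_minutes": 120.5, "is_vip": True},
    {"user_id": 2, "watch_minutes": 3.0,   "is_vip": False},
    {"user_id": 3, "watch_minutes": 45.0,  "is_vip": True},
]

# A discrete tag is reduced to a bitmap of user ids (would live in ClickHouse).
vip_bitmap = {r["user_id"] for r in detail_rows if r["is_vip"]}

# A continuous tag is filtered at query time (via Trino over Iceberg),
# producing a bitmap only for the requested range.
heavy_watchers = {r["user_id"] for r in detail_rows if r["watch_minutes"] > 60}

print(sorted(vip_bitmap))                   # → [1, 3]
print(sorted(vip_bitmap & heavy_watchers))  # → [1]
```

ClickHouse only ever sees the compact bitmaps, so the large detail scans that caused OOMs move to the Iceberg side.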

3.1.2 Custom Shard‑Based Read/Write for ClickHouse

Data is sharded by user ID across ClickHouse shards, enabling parallel processing and load balancing. The build job hashes each userId into n*m partitions (n ClickHouse shards × m Spark write tasks per shard), writes bitmap data directly to the target shard's local table, and runs calculations on local tables so that no cross‑shard merge is needed. This improved the build success rate from 85% to 99.9% and increased build speed by 50%.
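A minimal sketch of the routing idea, assuming n shards and m write tasks per shard; the hash function and layout here are illustrative, not the production scheme:

```python
def shard_and_partition(user_id: int, n_shards: int, m_tasks: int):
    """Map a user to a write partition and its owning ClickHouse shard.

    Every user in a given partition lands on the same shard, so one Spark
    task can write its bitmap slice straight into that shard's local table
    and aggregations stay shard-local (no cross-shard merge).
    The hash is a stand-in (Knuth multiplicative), not the production one.
    """
    h = (user_id * 2654435761) & 0xFFFFFFFF
    partition = h % (n_shards * m_tasks)   # n*m write partitions in total
    shard = partition % n_shards           # m partitions funnel into each shard
    return shard, partition

# Every partition maps to exactly one shard, so writes never straddle shards.
shards = {shard_and_partition(uid, 4, 8)[0] for uid in range(100)}
print(sorted(shards))  # → [0, 1, 2, 3]
```

Because shard is a pure function of partition, each Spark write task knows its destination shard up front and can bypass the distributed table entirely.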

3.2 Crowd Set Operations

Rule‑based crowd selection is expressed in a DSL that the backend compiles into an optimized DAG of tasks. Continuous tags are queried via Trino over Iceberg, producing bitmaps that are merged with discrete‑tag bitmaps in ClickHouse. The DSL supports functions such as LESS, MOST, GREATER, BETWEEN, EQUAL, IN, and LIKE.
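A toy sketch of how such a compiled DAG might be evaluated, with Python sets standing in for bitmaps; the node encoding and tag names are invented for illustration:

```python
from functools import reduce

# Per-tag bitmaps: discrete tags come from ClickHouse, continuous-tag ranges
# arrive as bitmaps from Trino/Iceberg. Sets stand in for RoaringBitmaps.
tag_bitmaps = {
    "is_vip": {1, 2, 5},
    "age_18_30": {2, 3, 5, 8},   # e.g. the result of BETWEEN(age, 18, 30)
    "churned": {5, 9},
}

def evaluate(node):
    """Evaluate a DAG node: a tag name (leaf) or an (op, *children) tuple."""
    if isinstance(node, str):
        return tag_bitmaps[node]
    op, *children = node
    results = [evaluate(c) for c in children]
    if op == "AND":
        return reduce(set.__and__, results)
    if op == "OR":
        return reduce(set.__or__, results)
    if op == "NOT":                 # set difference: base minus exclusion
        base, exclusion = results
        return base - exclusion
    raise ValueError(f"unknown op {op!r}")

# is_vip && (age_18_30 minus churned)
crowd = evaluate(("AND", "is_vip", ("NOT", "age_18_30", "churned")))
print(sorted(crowd))  # → [2]
```

The real system gains its speed from doing these intersections on compressed bitmaps inside ClickHouse rather than on row sets.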

The grammar, in ANTLR form:

/* Parser */
condition
    : condition AND condition   #logicalAnd
    | condition OR condition    #logicalOr
    | LPAREN condition RPAREN   #paren
    | compare                   #logicalOp2
    | variableDeclarator        #commonOp
    ;

compare
    : variableDeclarator op variableDeclarator #logicalOp
    ;

variableDeclarator
    : TRUE        #logicalTrue
    | FALSE       #logicalFalse
    | INT         #commonInt
    | DOUBLE      #commonDouble
    | STRING      #commonString
    | IDENTIFIER  #variable
    ;

/* Lexer */
op : EQ | NE ;
EQ : '==' ;
NE : '!=' ;
OR : '||' ;
AND : '&&' ;
NOT : '!' ;
LPAREN : '(' ;
RPAREN : ')' ;
TRUE : 'true' ;
FALSE : 'false' ;
INT : [0-9]+ ;
DOUBLE : [0-9]+ ('.' [0-9]+)? ;
STRING : '"' ( '\\"' | . )*? '"' ;
IDENTIFIER : Letter LetterOrDigit* ;
fragment LetterOrDigit : Letter | [0-9] ;
fragment Letter : [a-zA-Z$_] ;
WS : [ \r\n\t]+ -> skip ;

3.3 Online Service

The online tag service is a stateless microservice that determines in real time whether a user belongs to a specific crowd. It provides high availability, precise permission control, horizontal scalability, and full lifecycle management (generation, validation, configuration, launch, and retirement).

Version management keeps up to five versions per crowd, storing data in Redis in a KKV structure (userId → crowdId → versionInfo) so that a single request can check membership in multiple crowds.

Traffic control supports gray release of new crowd definitions, gradual rollout, and fast rollback via a traffic‑control table.

Deployment Results

Tag and crowd build success rate reached 99.9%.

Selection latency: under 10 s for crowds of tens of millions of users (offline tags in ClickHouse); 10–30 s for hundreds of millions; 1–2 min for continuous tags covering 10–50 million users (Iceberg).

Online service stability of 99.999% with response times under 5 ms.

More than 30 business scenarios supported, including push, activity, task, and risk control.

More than 3,000 tags and over 100,000 crowds.

Future Plans

Real‑time tag and crowd processing using Flink or Spark Streaming.

Dynamic tag generation based on metrics, reusable metric models, and automated workflows.
Effect monitoring, data‑link closure, and lifecycle management for crowds.
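The online service's Redis KKV layout described above (userId → crowdId → versionInfo) can be emulated with nested dictionaries; the key and field names below are illustrative, not the production schema:

```python
# Outer key: user; inner keys: crowds; values: version info for that membership.
# One lookup of the outer key answers membership for many crowds at once.
kkv = {
    "user:42": {
        "crowd:push_campaign": {"version": 3, "member": True},
        "crowd:risk_watchlist": {"version": 1, "member": False},
    }
}

def check_crowds(user_id: str, crowd_ids) -> dict:
    """Resolve membership for several crowds with a single outer-key lookup."""
    per_user = kkv.get(user_id, {})
    return {c: per_user.get(c, {}).get("member", False) for c in crowd_ids}

print(check_crowds("user:42", ["crowd:push_campaign", "crowd:unknown"]))
# → {'crowd:push_campaign': True, 'crowd:unknown': False}
```

In Redis this would map naturally onto a hash per user, which is what keeps the service stateless and the multi‑crowd check a single round trip.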

Tags: Big Data, Data Pipeline, ClickHouse, Online Service, Spark, Iceberg, Tag System
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
