
Technical Evolution of Bilibili's PolarStar User Behavior Analysis Platform

Bilibili’s PolarStar platform evolved from Spark‑based batch jobs to a Flink‑driven real‑time pipeline and finally to a unified Iceberg‑plus‑ClickHouse model, cutting query latency to seconds and saving thousands of CPU cores and hundreds of gigabytes of Redis memory, while enabling complex near‑real‑time user‑behavior analysis and optimizations for data import, rebalancing, and compression.

Bilibili Tech

The article introduces the PolarStar (北极星) user‑behavior analysis platform built by Bilibili, describing its background, the need for data‑driven insight, and the evolution of the underlying data architecture from early prototypes to a mature big‑data solution.

Technical evolution is divided into three major phases:

2019 ~ 2020 – Partial model aggregation with Spark JAR jobs. Each analysis module was served by submitting a separate Spark JAR task, leading to long query latency, limited model flexibility, and resource‑inefficient YARN scheduling.

2020 ~ 2021 – Migration to a model‑less approach using Flink for real‑time cleaning and ClickHouse for storage. This enabled sub‑5‑second event queries and real‑time analysis but required high resource consumption (hundreds of cores, large Redis caches) and duplicated storage across Kafka, Hive, and ClickHouse.

2021 ~ present – Full model aggregation with Iceberg on top of ClickHouse. By introducing a unified flow‑aggregation model, bulk‑load pipelines, and dictionary services, the platform saved ~1400 CPU cores, reduced Redis memory by 400 GB, and cut query times to a few seconds while supporting multi‑day retention and complex analyses.

Event and retention analysis are achieved through pre‑aggregation along the user, event, and time dimensions, allowing near‑real‑time queries (~10 s latency) on data compressed from billions to hundreds of millions of rows per day.
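The pre‑aggregation idea can be sketched in a few lines of Python. This is an illustrative stand‑in for the actual Flink/Spark jobs (the `preaggregate` helper and tuple layout are hypothetical): collapsing raw hits into one row per (user, event, day) is what shrinks billions of events to hundreds of millions of rows.

```python
from collections import Counter
from datetime import datetime, timezone

def preaggregate(events):
    """Collapse raw events into per-(user, event, day) counts.

    `events` is an iterable of (user_id, event_name, unix_ts) tuples.
    The output keeps one aggregated row per (user, event, day) instead
    of one row per raw hit.
    """
    agg = Counter()
    for user_id, event_name, ts in events:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        agg[(user_id, event_name, day)] += 1
    return agg

raw = [
    (1, "play", 1_600_000_000),
    (1, "play", 1_600_000_500),  # same user/event/day -> merged into one row
    (2, "like", 1_600_000_000),
]
rollup = preaggregate(raw)  # 3 raw hits -> 2 aggregated rows
```

Event and retention queries then group over these pre‑aggregated rows rather than the raw event stream, which is where the latency reduction comes from.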

Funnel and path analysis leverage ClickHouse functions such as windowFunnel and bitmap operations to compute conversion funnels and Sankey‑style path visualizations with second‑level latency.
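ClickHouse's windowFunnel returns the deepest funnel step each user reached within a sliding time window. A rough pure‑Python analogue conveys the idea (the `window_funnel` helper below is hypothetical and simplified relative to the real function's matching modes):

```python
def window_funnel(window, steps, events):
    """Deepest funnel step (1-based) reached by one user; 0 if none.

    Rough analogue of ClickHouse windowFunnel(window)(ts, cond1, ...):
    `events` is a list of (ts, name) pairs for one user, `steps` the
    ordered step names; every matched step must occur within `window`
    seconds of the matched first step.
    """
    best = 0
    events = sorted(events)  # order by timestamp
    for i, (t0, name) in enumerate(events):
        if name != steps[0]:
            continue  # a chain can only start at the first step
        depth = 1
        for t, n in events[i + 1:]:
            if t - t0 > window:
                break  # outside the window anchored at the first step
            if depth < len(steps) and n == steps[depth]:
                depth += 1
        best = max(best, depth)
    return best

depth = window_funnel(100, ["view", "cart", "buy"],
                      [(0, "view"), (10, "cart"), (20, "buy")])
```

With a 100‑second window the user above completes all three steps; shrinking the window to 15 seconds would cut the chain off at step 2, which is exactly the knob conversion analyses tune.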

Tagging and user‑group selection use RoaringBitmap (RBM) and an attribute‑dictionary service (gRPC + Redis + a custom RocksDB‑based KV store) to encode user attributes, sustain high throughput (>5 × 10⁵ QPS), and enable cross‑business tag and A/B‑test audience generation.
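The bitmap mechanism can be illustrated with plain Python big‑ints standing in for RoaringBitmap (real deployments use roaring bitmaps precisely because naive dense bitmaps waste space on sparse id ranges; the helpers below are illustrative only):

```python
def to_bitmap(user_ids):
    """Pack integer user ids into one big-int: bit i is set iff user i
    carries the tag (a toy stand-in for RoaringBitmap)."""
    bm = 0
    for uid in user_ids:
        bm |= 1 << uid
    return bm

def bitmap_to_ids(bm):
    """Decode a bitmap back to a sorted list of user ids."""
    return [i for i in range(bm.bit_length()) if (bm >> i) & 1]

# One bitmap per tag value: the set of users carrying that tag.
tag_android = to_bitmap([1, 2, 5, 9])
tag_paid = to_bitmap([2, 3, 9])

# Audience selection across tags collapses to a single bitwise AND,
# which is why bitmap indexes make cross-tag selection cheap.
audience = tag_android & tag_paid
```

The dictionary service exists to make this encoding possible: it maps arbitrary user identifiers to the dense integer ids that bitmaps require.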

ClickHouse data import progressed from a simple JDBC sink (high server load, merge pressure) to a bulk‑load solution that generates ClickHouse data parts on Spark executors and uploads them via HDFS, and finally to a direct bulk‑load service (DataReceive) that streams parts over HTTP, roughly doubling import performance.

ClickHouse rebalancing addresses the need to redistribute petabyte‑scale data across a non‑elastic cluster. The authors define a balance‑degree metric based on the coefficient of variation, propose two algorithms (best‑fit bin‑packing with AVL trees and a greedy min‑max part migration), generate per‑table balance plans, and execute them with a safe workflow (pre‑check, fetch, detach, attach, drop) plus rate‑limiting.
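A minimal sketch of the metric and the greedy migration, assuming per‑shard part sizes in bytes (the `balance_degree` and `greedy_rebalance` names are illustrative, not the authors' code, and the real planner also handles per‑table constraints and rate limits):

```python
from statistics import mean, pstdev

def balance_degree(shard_sizes):
    """Coefficient of variation (stddev / mean) of per-shard bytes;
    lower means better balanced."""
    m = mean(shard_sizes)
    return pstdev(shard_sizes) / m if m else 0.0

def greedy_rebalance(shards, threshold=0.1):
    """Greedy min-max migration: repeatedly move one part from the
    largest shard to the smallest until the balance degree drops
    below `threshold`. `shards` maps shard name -> list of part sizes.
    Returns the migration plan as (src, dst, part_size) moves.
    """
    moves = []
    while True:
        sizes = {s: sum(parts) for s, parts in shards.items()}
        if balance_degree(list(sizes.values())) < threshold:
            break
        src = max(sizes, key=sizes.get)
        dst = min(sizes, key=sizes.get)
        gap = sizes[src] - sizes[dst]
        # Only parts strictly smaller than the gap improve balance;
        # among those, move the largest for fastest convergence.
        movable = [p for p in shards[src] if p < gap]
        if not movable:
            break
        part = max(movable)
        shards[src].remove(part)
        shards[dst].append(part)
        moves.append((src, dst, part))
    return moves

shards = {"a": [40, 30, 20, 10], "b": [10]}
plan = greedy_rebalance(shards)
```

The `p < gap` guard guarantees termination: each move strictly reduces the sum of squared shard sizes. The resulting plan is then executed part by part with the safe fetch/detach/attach workflow the article describes.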

Application‑level optimizations include query push‑down (using distributed_group_by_no_merge and cluster‑view tricks to move heavy calculations to shards), adding jump‑index support for Array/Map columns, and selecting ZSTD(1) compression to reduce storage by >30 %.
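The push‑down idea can be sketched as a two‑level top‑k in Python (illustrative only, not ClickHouse internals): each shard does the heavy grouping locally and ships only k small rows to the coordinator. The result is exact only when data are sharded by the group‑by key, which is the same precondition under which settings like distributed_group_by_no_merge are safe to use.

```python
def local_topk(rows, k):
    """Per-shard heavy work: count keys locally and return only the
    top-k (key, count) pairs, so the coordinator receives k rows
    instead of the shard's raw data."""
    counts = {}
    for key in rows:
        counts[key] = counts.get(key, 0) + 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def pushdown_topk(shards, k):
    """Coordinator: merge the small per-shard results and re-rank.
    Exact when each key lives on exactly one shard."""
    merged = {}
    for shard in shards:
        for key, cnt in local_topk(shard, k):
            merged[key] = merged.get(key, 0) + cnt
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Keys "a"/"b" live on shard 1, "c" on shard 2 (sharded by key).
top = pushdown_topk([["a", "a", "b"], ["c", "c", "c"]], 2)
```

The payoff is bandwidth and coordinator CPU: the initiator merges k rows per shard rather than re-aggregating the full intermediate state.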

The article concludes with future directions: extending the unified aggregation model to other business logs, and deploying Z‑order indexes to improve multi‑dimensional filtering in ClickHouse.

Tags: big data, Flink, ClickHouse, data warehouse, user behavior analysis, Iceberg
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
