Big Data 19 min read

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

This article details Bilibili’s migration from a Spark‑based offline ODS‑to‑DWD sharding process to a Flink real‑time incremental pipeline, explaining the background challenges, the design of multi‑level partitioning, small‑file optimizations, stability enhancements, and the measurable performance gains achieved.

Bilibili Tech

Apr 4, 2023

How Bilibili’s Flink‑Based Real‑Time Incremental Pipeline Cuts Costs and Boosts Latency

Background

In Bilibili’s data‑warehouse layered architecture, converting data from the ODS layer to the DWD layer requires cleaning, masking, and columnar compression. With daily growth of billions of rows and over 20 TB of incremental data, repeated ingestion caused severe resource consumption. To mitigate this, the "Polaris" (North Star) sharding mechanism was introduced, partitioning tables by department, event type (pv, show, click, etc.), and product source, thereby reducing downstream file reads.

Spark Offline Sharding Issues

The Spark hourly job triggers ETL to perform DWD sharding and data sync. Large source tables cause Spark cache misses, leading to full‑table reads and read‑amplification. As more departmental sharding tables are added, the same source data is read repeatedly, inflating resource usage, and the partition‑notification model results in DWD sync latency exceeding one hour during peak periods.

Flink Real‑Time Incremental Computation

The new solution uses Flink HDFS File Source to scan checkpoint‑generated visible files from Lancer, joins dimension tables, widens the data, and writes to a Flink Multi‑Hive Sink. Archer, Bilibili’s task scheduler, then dispatches the data to downstream search, recommendation, and advertising jobs. Because the main table still generates many small files, a Spark‑based small‑file merge step is added after the sink.

Multi‑Level Partition Small‑File Solution

Real‑time sharding created over 100× more files than the offline approach, causing Spark analysis jobs to run out of memory. The root cause was excessive bucket allocation for four‑level partitions. By applying a Flink Partitioner shuffle that tags rows with partition IDs and distributes them based on historical data size ratios, the daily file count dropped from ~140 M to ~1.5 M, a reduction of more than 100×.

Before optimization: 270 (four‑level partitions) × 1800 (parallelism) × 12 (checkpoint cuts per hour) × 24 = 139,968,000 files/day.

After optimization: 5,000 (shuffle slots) × 12 × 24 = 1,440,000 files/day.

Auto Shuffle Speculative Execution

Custom bucket‑tag rules : Users can define bucket tags via UDF based on row fields.

Row size calculation : Row size is computed from its raw byte length.

Rolling window + knapsack algorithm + unified dictionary sorting : A circular array records quotas; sub‑tasks receive bucket updates without global visibility, preventing single buckets from exceeding 8 GB. Tags are sorted by hashcode to co‑locate similar tags, reducing shuffle cost.

Weighted bag algorithm : Similar to Flink 1.12’s BinPack, but adds tag identification. Adjustments were made to avoid back‑pressure‑induced quota reductions.

Ratio model state maintenance : On recovery, the last saved ratio model is used to distribute subtasks, preventing file‑count spikes.

Cold‑start handling : Pre‑defined tag rules and ratios enable O(1) distribution before any traffic data is available.

Deployment Challenges and Optimizations

In a mixed‑node cluster with scarce resources, several stability and performance issues were addressed:

JobManager stability : Disabled non‑essential metrics to avoid OOM, and enabled HA with fail‑over to keep TaskManagers alive.

Subtask load balancing : Implemented a backlog‑based rebalance/re‑scale partitioner to evenly distribute data across subtasks of differing parallelism.

Subtask hotspot mitigation : Added Reader File Split load‑balancing where idle readers receive more splits, reducing skew.

Fast fail‑over : Regional checkpointing (execution.checkpointing.regional.enabled=true, etc.) allows the job to succeed even if some subtasks fail, speeding up recovery.

Dimension‑table join fail‑over : Used a JVM shutdown hook to preserve TM‑level caches when a slot fails, enabling region‑level fail‑over without full reload.

Yarn resource preemption : Adopted session submission to retain allocated containers during task drift, preventing resource starvation.

Data‑quality guarantees : Implemented two‑phase commit for File Source, exactly‑once split distribution, and exactly‑once RowData conversion. Added file‑lock mechanisms for atomic HDFS dimension‑table loads and integrated Archer commit policy for proactive downstream notification.

Results

After rolling out the Flink real‑time incremental pipeline, overall resource consumption dropped about 20% and peak usage fell roughly 46%. Latency improved from hour‑level to minute‑level; the average hourly partition archiving time increased by 20%, with 99% of lines syncing within 30 minutes and 50% within 17 minutes. Partition scalability was also enhanced, allowing further multi‑table, multi‑level sharding without additional resources.

Future Outlook

Looking ahead, the team plans to combine Flink with Apache Hudi to achieve a unified batch‑stream data‑lake, leveraging Hudi’s clustering capability to further reduce read volumes and accelerate query performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Optimization Real-time Big Data Flink Data Warehouse Incremental Processing

Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.