How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing
Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced Flink‑based real‑time incremental computation and multi‑level partition shuffling. The change removed read amplification, cut resource usage by ~20%, brought synchronization latency down to minutes, and improved scalability.
Background
In Bilibili's layered data‑warehouse architecture, moving data from the ODS layer to the DWD layer involves cleaning, masking, and columnar compression. Daily increments exceed 20 TB and billions of rows. The existing Spark offline sharding solution reads the entire ODS table once for each downstream split table, which causes read amplification, high resource usage, and latency.
Spark Offline Sharding Issues
Read‑amplification from repeated full‑table scans.
Resource consumption grows with the number of split tables.
Synchronization latency can exceed one hour during peak periods.
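The first two issues can be put in numbers with a back‑of‑the‑envelope model. This is only a sketch: the function names are hypothetical, and the split‑table counts in the usage note are illustrative, since the article does not state how many split tables exist. Only the 20 TB daily volume comes from the text.

```python
TB = 10**12  # decimal terabyte, in bytes


def batch_bytes_read(ods_bytes: int, split_tables: int) -> int:
    """Offline sharding: one full ODS scan per downstream split table."""
    return ods_bytes * split_tables


def incremental_bytes_read(ods_bytes: int) -> int:
    """Incremental ingestion: every source file is read exactly once."""
    return ods_bytes


def read_amplification(split_tables: int, ods_bytes: int = 20 * TB) -> float:
    """Ratio of bytes scanned by the batch approach to bytes actually ingested."""
    return batch_bytes_read(ods_bytes, split_tables) / incremental_bytes_read(ods_bytes)
```

Under this model, `read_amplification(10)` is 10: ten split tables mean the same 20 TB is scanned ten times, which is why resource consumption grows with the number of split tables.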
Flink Real‑Time Incremental Computation
The new solution replaces Spark batch sharding with a Flink job that uses an HDFS file source to consume only newly visible files generated by Lancer checkpoints. The job joins dimension tables, widens records, and writes to a Flink Multi‑Hive sink. A downstream Spark job then merges small files for the main table.
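The idea of consuming only newly visible files can be sketched in a few lines. This is not Bilibili's Flink source. It is a minimal local‑filesystem illustration, and the class name, the polling approach, and the convention that in‑progress files start with a dot are all assumptions made for the example.

```python
import os
import tempfile
from typing import List, Set


class IncrementalFileSource:
    """Remembers consumed files so that each visible file is returned once."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.seen: Set[str] = set()

    def poll(self) -> List[str]:
        """Return files that became visible since the previous poll."""
        visible = sorted(
            name for name in os.listdir(self.directory)
            if not name.startswith(".")  # assumption: hidden means still in progress
        )
        fresh = [name for name in visible if name not in self.seen]
        self.seen.update(fresh)
        return fresh


def demo() -> List[List[str]]:
    """Simulate two upstream checkpoints and three polls."""
    polls = []
    with tempfile.TemporaryDirectory() as d:
        source = IncrementalFileSource(d)
        for name in ("part-0", "part-1", ".part-2.inprogress"):
            open(os.path.join(d, name), "w").close()
        polls.append(source.poll())  # two finished files are visible
        # The upstream writer finishes its checkpoint and publishes the file.
        os.rename(os.path.join(d, ".part-2.inprogress"), os.path.join(d, "part-2"))
        polls.append(source.poll())  # only the newly published file
        polls.append(source.poll())  # nothing new, so nothing is re-read
    return polls
```

In a real job the set of consumed files would have to live in checkpointed state so that a restart does not re‑read or skip files. The sketch keeps it in memory for brevity.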
Benefits:
Read‑amplification eliminated: Flink partitions output without re‑reading source data.
Resource reduction: Each source file is ingested once; downstream tasks start as soon as files become visible.
Latency improvement: DWD synchronization completes within a single checkpoint interval, moving from hour‑level to minute‑level latency.
Multi‑Level Partition Small‑File Solution
Real‑time sharding generated over 100× more files than the offline approach, causing Spark OOM failures. Two techniques were applied.
4.1 Flink Partitioner Shuffle Optimization
With 270+ fourth‑level partitions, naive full‑parallelism would create ~140 million files per day (270 × 1800 × 12 × 24). By tagging each record with its fourth‑level partition and assigning tags to subtasks in proportion to historical data size, the file count dropped to ~1.5 million per day (5000 × 12 × 24 ≈ 1.44 million), a reduction of roughly 100× (about 97× from the figures given).
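The arithmetic and the proportional assignment can be checked with a short sketch. Several things here are assumptions: the article gives the factors without naming them, so 1800 is read as the job parallelism, 12 as checkpoints per hour, and 24 as hours per day. The allocation function is a hypothetical largest‑remainder scheme, not necessarily the rule Bilibili uses.

```python
from typing import Dict

CHECKPOINTS_PER_HOUR = 12  # assumed meaning of the factor 12 in the article
HOURS_PER_DAY = 24


def files_per_day(files_per_checkpoint: int) -> int:
    """Each (partition tag, subtask) pair writes one file per checkpoint."""
    return files_per_checkpoint * CHECKPOINTS_PER_HOUR * HOURS_PER_DAY


def allocate_subtasks(history_bytes: Dict[str, int], budget: int) -> Dict[str, int]:
    """Give every tag one subtask, then split the remaining budget in
    proportion to historical volume using largest-remainder rounding."""
    total = sum(history_bytes.values())
    if budget < len(history_bytes) or total <= 0:
        raise ValueError("need a positive volume and at least one subtask per tag")
    spare = budget - len(history_bytes)
    shares = {tag: spare * size / total for tag, size in history_bytes.items()}
    alloc = {tag: 1 + int(share) for tag, share in shares.items()}
    leftover = budget - sum(alloc.values())
    # Hand the subtasks lost to rounding to the largest fractional parts.
    by_remainder = sorted(shares, key=lambda t: shares[t] - int(shares[t]), reverse=True)
    for tag in by_remainder[:leftover]:
        alloc[tag] += 1
    return alloc


naive = files_per_day(270 * 1800)  # every subtask writes every partition
tuned = files_per_day(5000)        # 5000 (tag, subtask) pairs in total
reduction = naive / tuned
```

With these inputs `naive` is 139,968,000, `tuned` is 1,440,000, and `reduction` is about 97. As a small usage example, `allocate_subtasks({"a": 700, "b": 200, "c": 100}, budget=10)` gives the large tag six subtasks and the two small tags two each, so small partitions no longer fan out across the whole job.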
4.2 Auto‑Shuffle Speculative Execution
A dynamic shuffle scheme avoids hotspot subtasks through the following mechanisms:
Custom bucket‑tag rules (UDF‑configurable).
Row size computed from raw byte length.
Rolling‑window + knapsack‑style algorithm balances bucket allocation.
Unified dictionary sorting of tags to co‑locate identical tags.
Stateful model preserves distribution across restarts.
Cold‑start handling via predefined tag ratios for O(1) dispatch.
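The rolling‑window and knapsack‑style balancing steps above can be sketched as follows. The article does not publish the algorithm, so this example swaps in a standard greedy heuristic (largest item into the lightest bucket). The class and function names, the window length, and the tag sizes are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class RollingTagSizes:
    """Keeps per-tag byte counts for the most recent `window` intervals."""

    def __init__(self, window: int) -> None:
        self.intervals: Deque[Dict[str, int]] = deque(maxlen=window)

    def add_interval(self, sizes: Dict[str, int]) -> None:
        self.intervals.append(dict(sizes))  # the oldest interval drops out

    def totals(self) -> Dict[str, int]:
        out: Dict[str, int] = {}
        for interval in self.intervals:
            for tag, size in interval.items():
                out[tag] = out.get(tag, 0) + size
        return out


def balance(totals: Dict[str, int], num_buckets: int) -> List[List[str]]:
    """Greedy packing: place each tag, largest first, into the lightest bucket."""
    buckets: List[List[str]] = [[] for _ in range(num_buckets)]
    loads = [0] * num_buckets
    # Sort by size descending, then by tag name, so ties resolve deterministically.
    for tag in sorted(totals, key=lambda t: (-totals[t], t)):
        lightest = loads.index(min(loads))
        buckets[lightest].append(tag)
        loads[lightest] += totals[tag]
    return buckets


window = RollingTagSizes(window=2)
window.add_interval({"a": 100, "b": 10})  # falls out of the window below
window.add_interval({"a": 50, "b": 40, "c": 60})
window.add_interval({"a": 10, "b": 70, "c": 20, "d": 5})
assignment = balance(window.totals(), num_buckets=2)
```

Because only the last two intervals count, tag `a` is no longer treated as the heaviest even though it dominated earlier. The greedy pass is a heuristic and does not guarantee an optimal split; it is cheap enough to rerun every window.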
Deployment Challenges and Solutions
5.1 Stability Enhancements
JobManager metrics disabled: Prevents OOM caused by excessive metric collection.
JobManager HA: High‑availability keeps TaskManagers running while the JobManager restarts.
Backlog‑based load balancing: Rebalance/rescale partitioners distribute data according to downstream subtask capacity.
Reader file‑split load balancing: A monitor thread assigns idle readers file splits to avoid hotspot accumulation.
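Backlog‑based load balancing can be illustrated with a small sketch. It is an assumption about the general idea rather than Flink's partitioner code: each record goes to the downstream channel with the smallest backlog, and the function names and backlog figures are hypothetical. The model ignores draining, so backlogs only grow.

```python
from typing import List


def pick_channel(backlogs: List[int]) -> int:
    """Index of the channel with the smallest backlog; lowest index wins ties."""
    return min(range(len(backlogs)), key=lambda i: (backlogs[i], i))


def dispatch(record_count: int, backlogs: List[int]) -> List[int]:
    """Send records one at a time and return how many each channel received."""
    pending = list(backlogs)  # copy so the caller's list is left untouched
    sent = [0] * len(pending)
    for _ in range(record_count):
        channel = pick_channel(pending)
        pending[channel] += 1
        sent[channel] += 1
    return sent
```

Starting from backlogs of `[5, 0, 2]`, `dispatch(6, [5, 0, 2])` sends nothing to the already congested first channel, which is the behaviour a plain round‑robin rebalance lacks.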
5.2 Fast Fail‑Over
Regional checkpointing enabled (execution.checkpointing.regional.enabled=true and related parameters) allows the job to succeed even if some subtasks fail.
Dimension‑table join fail‑over using a JVM shutdown hook preserves TM‑level caches.
YARN session submission retains allocated containers during restarts, preventing resource pre‑emption.
5.3 Data‑Quality Guarantees
Two‑phase commit for file sinks ensures exactly‑once semantics.
Split‑distribution tracking guarantees no duplicate or missing splits.
Dimension‑table load degradation falls back to the previous partition on error.
File‑lock mechanisms provide atomicity for HDFS dimension loads.
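The two‑phase commit idea for file sinks can be sketched on a local filesystem. This is a simplified illustration, not Flink's sink API: the class, its method names, and the dot‑prefixed in‑progress naming are assumptions, and a real sink would persist the staged list in checkpoint state between the two phases.

```python
import os
import tempfile
from typing import List


class TwoPhaseFileSink:
    """Writes hidden in-progress files and publishes them only on commit."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.pending: List[str] = []

    def _tmp(self, name: str) -> str:
        return os.path.join(self.directory, f".{name}.inprogress")

    def write(self, name: str, data: str) -> None:
        """Write to a hidden file that downstream readers ignore."""
        with open(self._tmp(name), "w") as f:
            f.write(data)
        self.pending.append(name)

    def pre_commit(self) -> List[str]:
        """Phase 1: stage the finished files; nothing becomes visible yet."""
        staged, self.pending = self.pending, []
        return staged

    def commit(self, staged: List[str]) -> None:
        """Phase 2: publish by atomic rename; safe to replay after a failure."""
        for name in staged:
            if os.path.exists(self._tmp(name)):
                os.replace(self._tmp(name), os.path.join(self.directory, name))

    def visible(self) -> List[str]:
        return sorted(n for n in os.listdir(self.directory) if not n.startswith("."))


def demo() -> List[List[str]]:
    states = []
    with tempfile.TemporaryDirectory() as d:
        sink = TwoPhaseFileSink(d)
        sink.write("part-0", "row1\n")
        states.append(sink.visible())  # before commit
        staged = sink.pre_commit()
        sink.commit(staged)
        states.append(sink.visible())  # after commit
        sink.commit(staged)            # replayed commit after a restart
        states.append(sink.visible())
    return states
```

Readers never see partial data because a file appears only through the rename, and replaying the commit changes nothing because already published files are skipped. Together these two properties are what make the output effectively exactly‑once in this sketch.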
Results
Overall resource consumption fell by ≈20 %, and peak consumption by ≈46 %.
Partition archiving time improved by 20 %; 99 % of streams synchronized within 30 minutes, 50 % within 17 minutes.
Scalable multi‑table, multi‑level partitioning without additional resources.
Future Outlook
Integrate Flink with Apache Hudi to further reduce read volume via clustering, completing a unified stream‑batch data‑warehouse architecture.
