Optimizing Flink Real‑Time Computing at Bilibili: Connector Stability, SQL, Runtime, and Future Outlook
This article details Bilibili's comprehensive optimization of Flink real‑time computing, covering connector stability improvements, SQL interval‑join enhancements, runtime state and checkpoint refinements, a diagnostic tool, and future directions for high‑throughput streaming workloads.
Overview
Bilibili runs Flink on YARN to support three major scenarios: Saber data computation, Lancer data transmission, and Magento data lake. Saber hosts thousands of SQL and JAR jobs, Lancer streams app and server logs into Kafka and Hive, and Magento writes upstream data to Iceberg or Hudi tables, supporting AI, commercial products, and data warehousing workloads.
Flink‑Connector Stability Optimization
1. HDFS hotspot issue
Background: With many parallel writers streaming to HDFS, closing every in-progress file at checkpoint time creates a hotspot, and the file-close calls become a snapshot bottleneck.
Solutions:
Reduce file‑close frequency by keeping files open during snapshots and truncating on recovery.
Asynchronously close files, moving timed‑out closes to a background queue.
Parallelize snapshot processing across multiple buckets using multithreading.
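The asynchronous-close idea above can be sketched as follows. This is a minimal, framework-free Python sketch (the class and method names are illustrative, not Flink's): the snapshot thread hands files to a background queue instead of closing them inline, so a slow HDFS close no longer blocks the checkpoint.

```python
import queue
import threading

class AsyncCloser:
    """Background closer: the snapshot enqueues files instead of closing them inline."""
    def __init__(self):
        self._q = queue.Queue()
        self.closed_names = []                     # for observability in this sketch
        threading.Thread(target=self._drain, daemon=True).start()

    def submit(self, f):
        self._q.put(f)                             # snapshot thread returns immediately

    def _drain(self):
        while True:
            f = self._q.get()
            f.close()                              # slow close happens off the critical path
            self.closed_names.append(f.name)
            self._q.task_done()

    def wait(self):
        """Block until all pending closes have completed (used here for testing)."""
        self._q.join()
```

In the real connector the queue would also need a timeout and failure handling; the point of the sketch is only that checkpoint latency stops depending on per-file close latency.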
2. Partition‑ready judgment problem
Background: A partition can be committed to Hive too early, while some of its files are still open, causing Hive commit errors.
Solutions:
Commit a partition only when every bucket's file open/close counts match, i.e., no file in the partition is still open.
Introduce a PartitionRecover module that advances watermarks for idle streams and forces partition creation after timeout.
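The PartitionRecover idea can be sketched with two small functions, assuming event-time watermarks per subtask (the function names, and the use of plain timestamps, are illustrative): idle subtasks are excluded from the watermark minimum so one quiet stream cannot hold back partition commits, and a partition that has waited too long is force-committed.

```python
def effective_watermark(subtask_watermarks, last_active, now, idle_timeout):
    """Ignore subtasks that have seen no data for idle_timeout when taking the
    minimum watermark, so an idle stream does not stall partition readiness."""
    active = [wm for wm, ts in zip(subtask_watermarks, last_active)
              if now - ts < idle_timeout]
    return min(active) if active else max(subtask_watermarks)

def should_force_commit(partition_end, watermark, now, force_timeout):
    """Commit when the watermark passes the partition boundary, or force the
    partition after waiting longer than force_timeout."""
    return watermark >= partition_end or now - partition_end >= force_timeout
```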
3. General sink stability improvements
Introduce a pluggable sink architecture with partitioner, batch pool, failover/retry policy, connector management, and a ClickHouse implementation.
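One piece of that architecture, the failover/retry policy, can be sketched as follows (a minimal sketch with illustrative names; the real sink would also integrate with the batch pool and partitioner): a batch send is retried with exponential backoff and the error is re-raised only once the retry budget is exhausted.

```python
import time

def send_with_retry(batch, send_fn, retries=3, backoff_s=0.1):
    """Retry policy sketch: retry a failed batch send with exponential backoff,
    re-raising the last error once the budget is exhausted."""
    for attempt in range(retries + 1):
        try:
            return send_fn(batch)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))
```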
Flink SQL Optimization
1. Interval join enhancements
Background: AI model training needs high-volume joins between feed and click streams; the native interval join supports only a single join per query and performs poorly at this throughput.
Solutions:
Add a global_join type to support multiple joins per window.
Extend JoinRuleSet to detect dual‑window conditions and route to a new join class.
Cache stream data in an external KV store (e.g., Redis) to improve lookup speed.
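The caching approach can be sketched as follows, with a plain dict standing in for the external KV store (Redis in the text) and all names illustrative: feed records are written into the store keyed by join key, and each click looks up feed records whose timestamps fall inside the interval.

```python
from collections import defaultdict

class CachedIntervalJoin:
    """Sketch of a cache-backed interval join: feed records land in a KV store
    (a dict stands in for Redis here); clicks probe it within [lower, upper]."""
    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper
        self.kv = defaultdict(list)              # key -> [(event_time, record)]

    def on_feed(self, key, ts, record):
        self.kv[key].append((ts, record))        # real version: Redis write with TTL

    def on_click(self, key, ts, record):
        """Emit (feed, click) pairs whose feed time lies in [ts+lower, ts+upper]."""
        return [(f, record) for fts, f in self.kv[key]
                if ts + self.lower <= fts <= ts + self.upper]
```

A production version would also expire cached entries (Redis TTL) so the store does not grow without bound.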
2. Custom Kafka stream splitting
Background: Real‑time data needs to be routed to multiple Kafka topics based on project_id using a UDF.
Solution: Implement a UDF that parses project_id and directs records to topics a, b, or c.
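The routing logic of such a UDF reduces to a lookup from project_id to topic. A minimal sketch (the field name `project_id` and topics a/b/c follow the example above; the mapping and default topic are assumptions):

```python
def route_topic(record, mapping, default_topic):
    """Splitting-UDF sketch: pick the target Kafka topic from the record's
    project_id, falling back to a default for unknown projects."""
    return mapping.get(record.get("project_id"), default_topic)
```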
Flink Runtime Optimization
1. State backend selection
Evaluation of Redis vs. RocksDB for large state caches. Redis offers fast in‑memory access but lacks checkpoint integration, while RocksDB provides durable LSM‑tree storage with checkpoint‑friendly SST files.
Chosen solution: Use RocksDB as the primary state backend.
2. RocksDB compaction tuning
Background: RocksDB compaction is CPU-intensive and, at scale, competes with the job itself for resources.
Solutions:
FIFO compaction for low‑load latency‑join workloads.
Enable Bloom filter policy to speed up key lookups.
Distribute RocksDB files across disks using round‑robin allocation.
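The round-robin disk placement can be sketched in a few lines (disk paths are illustrative; the real setting is RocksDB's local directory configuration):

```python
import itertools

def make_disk_allocator(disks):
    """Round-robin allocator: successive RocksDB instances get successive disks,
    spreading compaction and flush I/O evenly across spindles."""
    it = itertools.cycle(disks)
    return lambda: next(it)
```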
3. Global parallelism simplification
Introduce a single global parallelism setting, auto‑detect leaf‑node parallelism, and rebalance based on Kafka partition counts.
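Under one interpretation of this scheme (a sketch; the function name and the cap rule are assumptions, not the platform's exact logic), source parallelism is derived from the Kafka partition count so no source subtask sits idle, while other operators take the single global setting:

```python
def derive_parallelism(global_parallelism, kafka_partitions):
    """Leaf (source) operators are capped at the topic's partition count,
    since extra source subtasks beyond that would receive no data."""
    source_p = min(global_parallelism, kafka_partitions)
    return source_p, global_parallelism
```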
4. Warship intelligent diagnosis tool
Collects metrics from YARN, Elasticsearch logs, Flink task metrics, and machine stats, applies heuristic scoring (e.g., checkpoint failure counts, timeout ratios), and generates alerts with resource‑adjustment recommendations. Also provides log‑keyword translation for easier debugging.
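The heuristic scoring can be sketched as a rule table where each firing rule adds a penalty and a score above a threshold raises an alert. The rule names, weights, and thresholds below are illustrative, not Warship's actual ones:

```python
def diagnose(metrics, thresholds):
    """Heuristic scoring sketch: each rule that fires adds a weighted penalty
    and a human-readable reason; the caller alerts above some score."""
    score, reasons = 0, []
    if metrics["ckpt_failures"] > thresholds["ckpt_failures"]:
        score += 40
        reasons.append("frequent checkpoint failures")
    if metrics["ckpt_timeout_ratio"] > thresholds["ckpt_timeout_ratio"]:
        score += 30
        reasons.append("checkpoint timeouts")
    if metrics["backpressure_ratio"] > thresholds["backpressure_ratio"]:
        score += 30
        reasons.append("sustained backpressure; consider more resources")
    return score, reasons
```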
5. Checkpoint metadata atomicity
Ensures metadata files are written atomically by using an in‑progress file that is renamed only after successful completion, allowing safe recovery from incomplete checkpoints.
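The write-then-rename pattern looks like this on a local filesystem (a sketch; the `.inprogress` suffix is illustrative, and Flink applies the same idea through its FileSystem abstraction on HDFS):

```python
import os

def write_metadata_atomically(path, data):
    """Write to an in-progress file, fsync, then rename. Because rename is atomic,
    readers either see the complete metadata file or no file at all."""
    tmp = path + ".inprogress"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())     # ensure bytes are durable before publishing
    os.rename(tmp, path)         # only now does the metadata become visible
```

On recovery, a leftover `.inprogress` file simply means that checkpoint never completed and can be discarded.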
Outlook
Implement "best‑effort" HDFS sink compaction for Hive tables.
Support three‑stream joins in Flink.
Accelerate Flink job recovery to reduce Kafka downtime.
Q&A
Answers cover Flink job scheduling platforms, UDF deployment, back‑pressure handling, and version management via the Saber platform.
Thank you for attending.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.