
Optimizing Flink Real‑Time Computing at Bilibili: Connector Stability, SQL, Runtime, and Future Outlook

This article details Bilibili's comprehensive optimization of Flink real‑time computing, covering connector stability improvements, SQL interval‑join enhancements, runtime state and checkpoint refinements, a diagnostic tool, and future directions for high‑throughput streaming workloads.

DataFunTalk

Overview

Bilibili runs Flink on YARN to support three major scenarios: Saber data computation, Lancer data transmission, and Magento data lake. Saber hosts thousands of SQL and JAR jobs, Lancer streams app and server logs into Kafka and Hive, and Magento writes upstream data to Iceberg or Hudi tables, supporting AI, commercial products, and data warehousing workloads.

Flink‑Connector Stability Optimization

1. HDFS hotspot issue

Background: Writing many parallel Flink streams to HDFS creates a hotspot where closing individual files becomes a bottleneck during checkpoints.

Solutions:

Reduce file‑close frequency by keeping files open during snapshots and truncating on recovery.

Asynchronously close files, moving timed‑out closes to a background queue.

Parallelize snapshot processing across multiple buckets using multithreading.
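The asynchronous-close idea above can be illustrated with a minimal Python sketch. The class and method names are hypothetical, not Bilibili's actual code: a close that does not finish within a timeout is handed to a background queue so the snapshot path is never blocked (closes are assumed idempotent in this sketch).

```python
import queue
import threading
import time

class AsyncFileCloser:
    """Sketch: defer slow HDFS file closes to a background queue so
    checkpoints are not blocked. Names are illustrative."""

    def __init__(self):
        self.pending = queue.Queue()
        self.closed = []
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            close_fn, name = self.pending.get()
            close_fn()                      # retried close, off the hot path
            self.closed.append(name)
            self.pending.task_done()

    def close_with_timeout(self, name, close_fn, timeout=0.05):
        done = threading.Event()

        def attempt():
            close_fn()
            done.set()

        t = threading.Thread(target=attempt, daemon=True)
        t.start()
        t.join(timeout)
        if done.is_set():
            self.closed.append(name)        # closed inline; snapshot proceeds
        else:
            # Timed out: defer to the background queue (close is idempotent).
            self.pending.put((close_fn, name))
```

The checkpoint thread only pays up to `timeout` per file; everything slower drains in the background.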

2. Partition‑ready judgment problem

Background: Early partition commit can occur when files remain unclosed, causing Hive commit errors.

Solutions:

Commit a partition only when every bucket's open and close file counts match.

Introduce a PartitionRecover module that advances watermarks for idle streams and forces partition creation after timeout.
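The two rules above can be sketched together in Python. The class below is a hypothetical illustration, not the actual PartitionRecover code: a partition is "ready" only when each bucket has closed every file it opened, and an idle-timeout fallback forces the partition out when a stream stops producing.

```python
import time

class PartitionCommitter:
    """Sketch of the partition-ready check plus the PartitionRecover-style
    idle fallback. Names and the timeout policy are illustrative."""

    def __init__(self, idle_timeout=60.0):
        self.opened = {}          # bucket -> files opened
        self.closed = {}          # bucket -> files closed
        self.idle_timeout = idle_timeout
        self.last_activity = time.monotonic()

    def open_file(self, bucket):
        self.opened[bucket] = self.opened.get(bucket, 0) + 1
        self.last_activity = time.monotonic()

    def close_file(self, bucket):
        self.closed[bucket] = self.closed.get(bucket, 0) + 1
        self.last_activity = time.monotonic()

    def ready(self):
        # Every bucket's open/close counts must match before Hive
        # is allowed to see the partition.
        return all(self.closed.get(b, 0) == n for b, n in self.opened.items())

    def should_force_commit(self, now=None):
        # Idle stream: advance anyway after the timeout, as the
        # recover module would by pushing the watermark forward.
        now = time.monotonic() if now is None else now
        return now - self.last_activity > self.idle_timeout
```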

3. General sink stability improvements

Introduce a pluggable sink architecture with partitioner, batch pool, failover/retry policy, connector management, and a ClickHouse implementation.
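The pluggable pieces fit together roughly as follows; this is a minimal Python sketch with invented names, where a callable stands in for the ClickHouse writer:

```python
class PluggableSink:
    """Sketch of the pluggable sink: a partitioner picks a shard, records
    accumulate in per-shard batch pools, and a retry policy governs
    failover. The downstream writer is an injected callable."""

    def __init__(self, write_fn, shards, batch_size=3, max_retries=2):
        self.write_fn = write_fn            # e.g. a ClickHouse batch insert
        self.shards = shards
        self.batch_size = batch_size
        self.max_retries = max_retries
        self.pools = {s: [] for s in shards}

    def partition(self, record):
        # Pluggable partitioner; hash-based routing as a default.
        return self.shards[hash(record) % len(self.shards)]

    def emit(self, record):
        shard = self.partition(record)
        self.pools[shard].append(record)
        if len(self.pools[shard]) >= self.batch_size:
            self.flush(shard)

    def flush(self, shard):
        batch, self.pools[shard] = self.pools[shard], []
        for attempt in range(self.max_retries + 1):
            try:
                self.write_fn(shard, batch)
                return
            except IOError:
                if attempt == self.max_retries:
                    raise                   # retries exhausted: fail over
```

Swapping the partitioner, batch size, or writer yields a new connector without touching the core loop, which is the point of the design.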

Flink SQL Optimization

1. Interval join enhancements

Background: AI model training requires joining high‑volume feed and click streams; the native interval join supports only a single two‑stream join and performs poorly at high throughput.

Solutions:

Add a global_join type to support multiple joins per window.

Extend JoinRuleSet to detect dual‑window conditions and route to a new join class.

Cache stream data in an external KV store (e.g., Redis) to improve lookup speed.
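The core of the multi-stream join can be sketched as follows. This is a simplified illustration, not the actual `global_join` operator: a plain dict stands in for the external KV store (Redis), windowing and TTLs are elided, and a joined row fires once every participating stream has contributed an event for a key.

```python
class GlobalIntervalJoin:
    """Sketch of a multi-stream interval join: events are cached in an
    external KV store (a dict stands in for Redis here) until every
    stream has arrived for a key. Real windowing/TTL logic is elided."""

    def __init__(self, streams):
        self.streams = streams
        self.kv = {}                         # key -> {stream: event}

    def on_event(self, stream, key, event):
        slot = self.kv.setdefault(key, {})
        slot[stream] = event
        if len(slot) == len(self.streams):   # all streams present: join fires
            joined = {s: slot[s] for s in self.streams}
            del self.kv[key]                 # evict, as a Redis TTL would
            return joined
        return None
```

With three streams registered, the sketch already covers the "multiple joins per window" case that the native two-stream interval join cannot express.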

2. Custom Kafka stream splitting

Background: Real‑time data needs to be routed to multiple Kafka topics based on project_id using a UDF.

Solution: Implement a UDF that parses project_id and directs records to topics a, b, or c.
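The routing logic of such a UDF is simple; a hedged Python sketch (the mapping table and record shape are illustrative assumptions):

```python
def route_by_project(record, topic_map, default_topic=None):
    """Sketch of the splitting UDF: read project_id from the record and
    return the Kafka topic it should be written to. topic_map is an
    illustrative mapping, e.g. {"p1": "topic_a", "p2": "topic_b"}."""
    project_id = record.get("project_id")
    # Unknown projects fall through to an optional default topic.
    return topic_map.get(project_id, default_topic)
```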

Flink Runtime Optimization

1. State backend selection

Redis and RocksDB were evaluated for caching large state. Redis offers fast in‑memory access but lacks checkpoint integration, while RocksDB provides durable LSM‑tree storage whose SST files integrate naturally with checkpoints.

Chosen solution: Use RocksDB as the primary state backend.

2. RocksDB compaction tuning

Problems with CPU‑intensive compaction at scale.

Solutions:

FIFO compaction for low‑load latency‑join workloads.

Enable Bloom filter policy to speed up key lookups.

Distribute RocksDB files across disks using round‑robin allocation.
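The round-robin disk spreading can be sketched in a few lines; the class name and path layout are illustrative assumptions:

```python
import itertools

class RoundRobinDiskAllocator:
    """Sketch: spread RocksDB instance directories across local disks
    round-robin so compaction I/O does not pile onto one volume."""

    def __init__(self, disks):
        self._cycle = itertools.cycle(disks)

    def next_dir(self, task_id):
        # Each task's RocksDB instance lands on the next disk in turn.
        return f"{next(self._cycle)}/rocksdb/{task_id}"
```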

3. Global parallelism simplification

Introduce a single global parallelism setting, auto‑detect leaf‑node parallelism, and rebalance based on Kafka partition counts.
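One plausible reading of this rule, as a hedged sketch (the function and the exact policy are assumptions, not Bilibili's implementation): source operators are capped at the Kafka partition count so no subtask sits idle, while all other operators take the single global setting.

```python
def derive_parallelism(global_parallelism, kafka_partitions):
    """Sketch of the global-parallelism rule: leaf (source) operators
    never exceed the Kafka partition count; everything else uses the
    one global setting the user configured."""
    source = min(global_parallelism, kafka_partitions)
    return {"source": source, "default": global_parallelism}
```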

4. Warship intelligent diagnosis tool

Collects metrics from YARN, Elasticsearch logs, Flink task metrics, and machine stats, applies heuristic scoring (e.g., checkpoint failure counts, timeout ratios), and generates alerts with resource‑adjustment recommendations. Also provides log‑keyword translation for easier debugging.
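The heuristic-scoring step might look like the following sketch. Metric names, thresholds, and recommendation strings are all illustrative stand-ins, not Warship's actual rules:

```python
def diagnose(metrics, thresholds=None):
    """Sketch of Warship-style heuristic scoring: each rule that trips
    adds to the score, and each contributes an alert with a suggested
    resource adjustment. All names and limits are illustrative."""
    t = thresholds or {"ckpt_failures": 3, "timeout_ratio": 0.2, "cpu_util": 0.9}
    alerts = []
    if metrics.get("ckpt_failures", 0) >= t["ckpt_failures"]:
        alerts.append("frequent checkpoint failures: check sink and back-pressure")
    if metrics.get("timeout_ratio", 0.0) >= t["timeout_ratio"]:
        alerts.append("checkpoint timeouts: consider a larger interval or more resources")
    if metrics.get("cpu_util", 0.0) >= t["cpu_util"]:
        alerts.append("high CPU: recommend raising container cores")
    return {"score": len(alerts), "alerts": alerts}
```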

5. Checkpoint metadata atomicity

Ensures metadata files are written atomically by using an in‑progress file that is renamed only after successful completion, allowing safe recovery from incomplete checkpoints.
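The write-then-rename pattern itself is standard and can be sketched directly (the function name is illustrative; on HDFS the equivalent is a rename of the in-progress file, shown here with local-filesystem calls):

```python
import os
import tempfile

def write_metadata_atomically(path, payload):
    """Sketch of the atomic-metadata idea: write to an in-progress file
    and rename it into place only after a successful flush, so recovery
    never observes a half-written checkpoint metadata file."""
    tmp = path + ".inprogress"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())                # durable before the rename
    os.replace(tmp, path)                   # atomic rename into place
```

A recovery routine then only ever trusts files at the final path; a leftover `.inprogress` file marks an incomplete checkpoint that can be safely discarded.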

Outlook

Implement "best‑effort" HDFS sink compaction for Hive tables.

Support three‑stream joins in Flink.

Accelerate Flink job recovery to reduce Kafka downtime.

Q&A

Answers cover Flink job scheduling platforms, UDF deployment, back‑pressure handling, and version management via the Saber platform.

Thank you for attending.

Tags: Optimization, Big Data, Flink, Streaming, Real-Time Computing, Checkpoint
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
