
How Baidu Netdisk Built a High‑Performance Real‑Time Engine with Flink

This article explains how Baidu Netdisk transitioned from Spark Streaming to Tiangong, a Flink‑based real‑time computing engine, detailing the evolution, the reasons for choosing Flink, the architecture, configuration examples, business use cases, technical challenges, and future platform plans.

Baidu Geek Talk

Overview

With digital transformation, enterprises increasingly demand real‑time data services. In large‑scale and complex scenarios, Flink plays a crucial role in the real‑time data pipeline. This article describes how Baidu Netdisk built a high‑performance, low‑latency, and stable real‑time computing engine using Flink.

Evolution of Baidu Netdisk Real‑Time Computing

In 2020, Netdisk relied on Spark Streaming and Spark Structured Streaming for data synchronization and real‑time cleaning. To address Spark Streaming's weak monitoring, high integration cost, and low timeliness, Netdisk introduced Flink in early 2023, quickly establishing metric monitoring, alerting, and task lifecycle management on the internal StreamCompute platform. However, Flink's high entry barrier prompted the development of Tiangong, a configurable, low‑code engine.

Why Flink?

Flink was chosen because it aligns with Baidu's internal real‑time roadmap and offers state management, stream‑batch integration, comprehensive monitoring, alerting, task management, and a rich ecosystem, making it a good fit for Netdisk's real‑time needs.

Tiangong Real‑Time Engine Architecture

The engine consists of three layers:

Runtime Layer: deploys tasks via StreamCompute, Kubernetes, YARN, or local mode.

Source & Sink Components: support databases, message queues, big‑data components, and custom sources/sinks.

Data Transformation Engine: provides stream‑batch integration, configurable cleaning, exactly‑once processing, fault tolerance, IoC container management, custom SQL topologies, and flexible monitoring.

Typical job configuration (JSON) looks like:

{
  "jobName": "example_job",
  "env": {"streamConfigName": "ck_10s_5fail"},
  "sources": [{"configType": "CONFIG", "sourceTableName": "source_table", "sourceConfig": "prod/source_cfg"}],
  "udfs": [],
  "views": [],
  "coreSql": [],
  "sinks": []
}

Business Scenario Practice

Tiangong is widely used in data teams, anti‑fraud, and user growth, handling millions of events per second on a cluster of 1,500 machines (5,800 CU). Key scenarios include real‑time data warehouse, complex aggregation, and CDC. For example, the real‑time commercial BI center ingests checkout, order, and experiment data at second granularity, enabling minute‑level experiment evaluation and faster revenue growth.

Configuration for a commercial feature pipeline:

{
  "jobName": "business_feature_compute",
  "env": {"streamConfigName": "300p_ck_30s_5fail", "tableConfig": {"stateTtlMs": 600000}},
  "sources": [{"configType": "CONFIG", "sourceTableName": "idc_log_source", "sourceConfig": "prod/business_strategy_idc_bp_source"}],
  "udfs": [{"name": "idc_log_filter_func", "className": "com.baidu.xxx.IdcLogFilterFunction"}],
  "views": [{"name": "idc_log_feature_view", "sql": "SELECT ... FROM ..."}],
  "sinks": [{"sinkConfigNames": ["prod/netdisk_strategy_idc_feature_mi_table_sink","prod/netdisk_strategy_feature_doris_sink"], "watermarkConfig": {"maxOutOfOrdernessMs": 5000, "idlenessMs": 10000}}]
}
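The `watermarkConfig` block above corresponds to Flink's bounded out‑of‑orderness watermarks. As a rough sketch, the equivalent Flink SQL DDL might look like the following (the column names and connector are illustrative, not taken from the original configuration):

```sql
-- Hypothetical DDL equivalent of maxOutOfOrdernessMs = 5000:
-- events may arrive up to 5 seconds late before the watermark passes them.
CREATE TABLE idc_log_source (
    user_id  BIGINT,
    event_ts TIMESTAMP(3),
    WATERMARK FOR event_ts AS event_ts - INTERVAL '5' SECOND
) WITH (
    'connector' = '...'   -- resolved by the engine from sourceConfig
);
```

The `idlenessMs` setting plays the role of Flink's `table.exec.source.idle-timeout` option, which lets watermarks advance even when some source partitions go quiet.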

Technical Challenges and Solutions

Large State Problem: storing cumulative data in Flink state avoids network I/O but leads to huge checkpoints. Solutions include using RocksDB as the state backend with incremental checkpoints, enabling the changelog state backend, and tuning RocksDB memory and write buffers.
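A minimal `flink-conf.yaml` sketch of the settings described above (the values are illustrative; the key names follow Flink 1.15‑era configuration):

```yaml
state.backend: rocksdb
state.backend.incremental: true             # ship only changed SST files per checkpoint
state.backend.changelog.enabled: true       # changelog backend: small, continuous uploads
state.backend.rocksdb.memory.managed: true  # cap RocksDB within Flink managed memory
state.backend.rocksdb.writebuffer.size: 64mb
```

Incremental checkpoints bound checkpoint size by the delta since the last snapshot, while the changelog backend smooths upload spikes, which together address the "huge checkpoint" symptom.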

TableStorage Write Performance: the TableStorage client's high RPC thread count hit container thread limits. Reducing the client's RPC threads and raising the container limit mitigated the issue, while finer‑grained task splitting and windowed pre‑aggregation dramatically reduced write volume.
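The windowed pre‑aggregation idea can be sketched in Flink SQL (table and column names are illustrative, not from the original pipeline):

```sql
-- Illustrative only: aggregate per key per minute so the sink receives
-- one row per (user, window) instead of one row per raw event.
INSERT INTO feature_sink
SELECT
    user_id,
    TUMBLE_START(event_ts, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS event_cnt
FROM idc_log_feature_view
GROUP BY user_id, TUMBLE(event_ts, INTERVAL '1' MINUTE);
```

Collapsing events into window results trades a bounded amount of latency (here, one minute) for a large reduction in sink write traffic.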

Flink‑Level Optimizations

Operator Parallelism Optimization: Flink SQL ties operator parallelism to source parallelism, wasting resources when transformations need more slots than sources do. Tiangong decouples them, allowing higher parallelism for cleaning operators without scaling the sources.
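In the DataStream API this decoupling is expressed per operator, which is roughly what Tiangong exposes through configuration. A sketch (the source, function class, and parallelism values are illustrative):

```java
// Sketch: set parallelism per operator, independent of the source.
SingleOutputStreamOperator<Row> cleaned = env
    .fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "idc_log_source")
    .setParallelism(30)    // e.g. one subtask per Kafka partition
    .map(new IdcLogCleanFunction())
    .setParallelism(300);  // scale the heavy cleaning stage on its own
```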

Partition Relation Optimization: replacing the default Rebalance connection pattern with Rescale cuts logical channels from 500 × 200 = 100,000 to roughly 500, lowering network buffer usage.
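The saving can be sanity‑checked with a back‑of‑the‑envelope channel count (a simplified model: rescale connects local groups only, so when one side is larger, each of its subtasks talks to a single peer):

```python
def rebalance_channels(upstream: int, downstream: int) -> int:
    """Rebalance is all-to-all: every upstream subtask keeps a channel
    to every downstream subtask."""
    return upstream * downstream


def rescale_channels(upstream: int, downstream: int) -> int:
    """Rescale wires local groups only; in the simplest model the channel
    count is roughly max(upstream, downstream)."""
    return max(upstream, downstream)


print(rebalance_channels(500, 200))  # 100000
print(rescale_channels(500, 200))    # 500
```

Each logical channel holds network buffers on both sides, so a 200× reduction in channels translates directly into lower network memory pressure.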

Resource Sharing Strategy: grouping operators of the same type into slot‑sharing groups minimizes cross‑network data transfer and improves resource utilization.
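In the DataStream API the equivalent knob is `slotSharingGroup` (the stream variables and group names here are illustrative):

```java
// Sketch: operators of the same type share one slot group, so slots are
// packed by workload type and shuffles between groups stay contained.
source.slotSharingGroup("source");
cleaned.slotSharingGroup("transform");
result.slotSharingGroup("sink");
```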

Operator Naming: custom execution plans allow meaningful operator names, making jobs easier to monitor and maintain.
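A sketch of what explicit naming looks like at the DataStream level, reusing the filter function from the configuration example above (the name and uid strings are illustrative):

```java
// Sketch: explicit names show up in the Flink web UI and metrics; stable
// uids keep operator state mappable across savepoint upgrades.
cleaned
    .map(new IdcLogFilterFunction())
    .name("idc_log_filter")
    .uid("idc_log_filter_v1");
```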

Future Outlook

Currently, Tiangong jobs are submitted via code repositories, requiring code‑base permissions and leading to security and isolation concerns. The roadmap includes building a low‑code real‑time platform for easy job creation and a real‑time DTS platform that replaces the fragile binlog‑AFS‑UDW pipeline.

Architecture Diagrams

Figure: Tiangong Architecture
Figure: Flink Base Construction
Figure: Operator Parallelism Optimization
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Big Data · Flink · Stream Processing · Real‑Time Computing · Baidu Netdisk · Tiangong
Written by Baidu Geek Talk

Follow us to discover more Baidu tech insights.