
BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, ByteDance’s open‑source data integration engine, unifies batch, streaming, and incremental data synchronization across heterogeneous sources. This article traces its evolution from early Flink‑based prototypes to a mature, plugin‑driven architecture with multi‑engine support, low‑cost co‑development, and CDC lakehouse capabilities.

DataFunSummit

1. Introduction

BitSail is ByteDance’s open‑source data integration engine. It synchronizes data among heterogeneous sources and covers offline, real‑time, full‑load, and incremental scenarios. It currently serves many internal and external customers.

On October 26, ByteDance announced the open‑source release of BitSail on GitHub, with the aim of lowering data‑construction costs and helping data create value efficiently. This article has four parts: the internal data‑integration background, BitSail’s technical evolution, a capability analysis, and the future outlook.

2. Internal Data‑Integration Background

ByteDance emphasizes a data‑driven approach. Data integration is the foundation of its data middle platform: it handles the transmission, processing, and transformation of data from heterogeneous sources.

BitSail originated from the internally developed Data Transmission Service (DTS) built on Apache Flink, offering batch, streaming, and incremental sync modes, distributed horizontal scaling, and a plug‑in architecture for flexible source integration.

3. Evolution of BitSail

3.1 Three‑Stage Evolution of the Global Data‑Integration Engine

Initial stage (pre‑2018): No unified framework, scattered engines such as MapReduce and Spark, high development and O&M cost.

Growth stage (2018‑2022): In 2018‑2019, as the Flink ecosystem matured, Flink was adopted for heterogeneous source transmission and batch scenarios were unified on it. In 2020‑2021, batch and stream processing were unified on Flink’s unified API. In 2021‑2022, the Hudi lakehouse was integrated to solve real‑time CDC synchronization.

Mature stage (2022‑present): Stable architecture validated across ByteDance business lines, open‑sourced to reduce data‑construction cost and enable broader adoption.

3.2 Technical Architecture Evolution

3.2.1 Flink‑Based Heterogeneous Source Transfer

The first version was implemented on the Flink 1.5 DataSet API and supported batch only. Its core concepts were BaseInput, which pulls data from the source, and BaseOutput, which writes to external systems, alongside framework services such as a type system, automatic parallelism, flow control, and dirty‑data detection.
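To make the dirty‑data detection idea concrete, here is a minimal sketch in Python. It is not BitSail’s code: the input side (BaseInput in the text) is modeled as a plain iterable and the output side (BaseOutput) as a plain callable, and the names `run_batch`, `convert`, and `max_dirty_ratio` are assumptions made for illustration.

```python
from typing import Any, Callable, Iterable, List, Tuple


def run_batch(
    rows: Iterable[Any],
    convert: Callable[[Any], Any],
    write: Callable[[Any], None],
    max_dirty_ratio: float = 0.1,
) -> List[Tuple[Any, str]]:
    """Copy rows to the sink, diverting rows that fail type conversion.

    Hypothetical sketch: a row is "dirty" when convert() raises TypeError
    or ValueError. The job fails if the dirty fraction exceeds the threshold.
    """
    dirty: List[Tuple[Any, str]] = []
    total = 0
    for raw in rows:
        total += 1
        try:
            converted = convert(raw)
        except (TypeError, ValueError) as exc:
            # Keep the offending row and the reason instead of failing the job.
            dirty.append((raw, str(exc)))
            continue
        write(converted)
    if total and len(dirty) / total > max_dirty_ratio:
        raise RuntimeError(f"too many dirty records: {len(dirty)}/{total}")
    return dirty
```

For example, `run_batch(["1", "2", "x", "4"], int, out.append, max_dirty_ratio=0.5)` writes three integers to `out` and returns one dirty row for `"x"`, while a lower threshold would abort the batch.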

Progress monitoring was a pain point, so the architecture was refactored to expose split‑level metrics. The Source layer reports total and completed splits, and the Operator layer computes its progress by comparing the records it has processed with the records its upstream has emitted. A cap (the “gradient limit”) ensures that downstream progress never exceeds upstream progress.
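One plausible reading of this progress calculation is sketched below. The formulas and function names are assumptions, since the text does not give the exact computation: source progress is the fraction of completed splits, and operator progress scales the upstream progress by the share of upstream output already processed, clamped so it can never exceed the upstream value.

```python
def source_progress(completed_splits: int, total_splits: int) -> float:
    """Fraction of splits the source has finished (0.0 when nothing is known)."""
    if total_splits <= 0:
        return 0.0
    return min(1.0, completed_splits / total_splits)


def operator_progress(
    processed: int, upstream_emitted: int, upstream_progress: float
) -> float:
    """Hypothetical operator progress, capped by the upstream progress.

    The ratio is clamped to 1.0 because metrics may arrive out of order,
    which could otherwise make processed appear larger than emitted.
    """
    if upstream_emitted <= 0:
        return 0.0
    ratio = min(1.0, processed / upstream_emitted)
    return upstream_progress * ratio
```

With 2 of 10 splits done and 100 of 200 emitted records processed, the source reports 0.2 and the operator reports 0.1. If a lagging metric shows 250 processed against 200 emitted, the operator still reports at most 0.2.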

3.2.2 Batch‑Stream Unified Architecture

Flink was upgraded from 1.5 to 1.9, and the engine migrated from the DataSet API to the DataStream API to support unified batch‑stream processing. This stage added real‑time sources (e.g., message queues), exactly‑once guarantees, event‑time handling, automatic DDL, and engine‑level features such as speculative execution and region failover. The runtime also moved to cloud‑native deployment.

In a typical real‑time pipeline (MQ → Hive), the original shuffle‑based design introduced a commit node with a parallelism of one, so a single task failover forced a global restart. The refactored pipelined design moved metadata commits to the JobManager via an Aggregate Manager, which removed the single‑point bottleneck and improved stability at large data volumes.
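The coordinator‑side commit can be illustrated with a small sketch. The class below is a hypothetical stand‑in for the Aggregate Manager, not its real interface: each writer task reports the files it produced for a checkpoint, and metadata is committed once, only after every task has reported.

```python
from typing import Dict, List


class CommitCoordinator:
    """Hypothetical stand-in for an aggregate manager on the job coordinator."""

    def __init__(self, num_tasks: int) -> None:
        self.num_tasks = num_tasks
        # checkpoint id -> task id -> files reported by that task
        self.pending: Dict[int, Dict[int, List[str]]] = {}
        # checkpoint id -> files made visible by the single commit
        self.committed: Dict[int, List[str]] = {}

    def report(self, checkpoint_id: int, task_id: int, files: List[str]) -> bool:
        """Record one task's files; return True once the checkpoint is committed.

        A task that restarts and reports again overwrites its earlier report,
        so other tasks do not need to restart with it.
        """
        if checkpoint_id in self.committed:
            return True  # late or duplicate report, commit already happened
        per_task = self.pending.setdefault(checkpoint_id, {})
        per_task[task_id] = list(files)
        if len(per_task) < self.num_tasks:
            return False
        merged = [f for task in sorted(per_task) for f in per_task[task]]
        self.committed[checkpoint_id] = merged
        del self.pending[checkpoint_id]
        return True
```

Because the commit happens outside the data path, there is no downstream operator with a parallelism of one that every record must be shuffled to.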

3.2.3 Flink‑Based Lakehouse Architecture

Introduced a lakehouse architecture to achieve near‑real‑time CDC sync. The original three‑module pipeline (batch pull, real‑time changelog, offline merge) suffered from T+1 latency, high storage overhead, and costly global shuffles during merges.

Upgrades include moving to Flink 1.11, integrating Hudi for efficient upserts and indexing, and optimizing Hudi’s write path: replacing Flink State with a lightweight Hash Index, decoupling compaction into an offline scheduled task, and merging Task and Hudi caches to shorten checkpoint duration.
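The idea behind replacing a state‑backed index with a hash index can be sketched as follows. This is an illustration under assumptions, not Hudi’s or BitSail’s implementation: the bucket for a record is computed directly from its key with a stable hash, so locating the target file group requires no lookup in stored state. The names `bucket_for` and `BucketedTable` are hypothetical.

```python
import zlib
from typing import Dict, List, Optional


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a stable hash (CRC32 here)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


class BucketedTable:
    """Toy table where each bucket stands in for one file group."""

    def __init__(self, num_buckets: int) -> None:
        self.buckets: List[Dict[str, dict]] = [{} for _ in range(num_buckets)]

    def upsert(self, key: str, row: dict) -> int:
        """Insert or overwrite the row for key; return the bucket used."""
        bucket = bucket_for(key, len(self.buckets))
        self.buckets[bucket][key] = row
        return bucket

    def get(self, key: str) -> Optional[dict]:
        return self.buckets[bucket_for(key, len(self.buckets))].get(key)
```

An update to an existing key always lands in the same bucket as the original insert, which is what makes the upsert cheap. The trade‑off in this toy version is that the bucket count is fixed up front.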

After these optimizations, the system supports millions of QPS, with end‑to‑end checkpoint latency under one minute and a 99% checkpoint success rate.

3.3 Practical Experience from Architecture Evolution

Table‑type selection: For CDC workloads with heavy random updates, Merge‑On‑Read (MOR) tables are preferred over Copy‑On‑Write (COW) to avoid write amplification.
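A back‑of‑the‑envelope model shows why random updates favor MOR. The model and its numbers are illustrative assumptions, not measurements from the article: a COW write rewrites every base file that contains an updated record, while a MOR write appends only the changed records to log files. The later compaction cost of MOR is deliberately left out.

```python
def cow_bytes_written(touched_files: int, file_size_bytes: int) -> int:
    """Copy-On-Write: every touched base file is rewritten in full."""
    return touched_files * file_size_bytes


def mor_bytes_written(updated_records: int, record_size_bytes: int) -> int:
    """Merge-On-Read: only the changed records are appended to logs."""
    return updated_records * record_size_bytes


def write_amplification(
    bytes_written: int, updated_records: int, record_size_bytes: int
) -> float:
    """Bytes physically written per byte of logical change."""
    return bytes_written / (updated_records * record_size_bytes)


# Illustrative case: 1,000 updates of 1 KiB each, spread over 1,000
# different 128 MiB base files (one update per file).
KIB, MIB = 1024, 1024 * 1024
cow = cow_bytes_written(1_000, 128 * MIB)
mor = mor_bytes_written(1_000, 1 * KIB)
cow_amp = write_amplification(cow, 1_000, 1 * KIB)
mor_amp = write_amplification(mor, 1_000, 1 * KIB)
```

In this worst case the COW write amplification equals the ratio of file size to record size, while the MOR write path stays at 1. The gap narrows when updates cluster in a few files, which is why the preference applies to workloads with heavy random updates.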

Hudi write‑path pain points and solutions: Replaced Flink State with Hash Index, isolated compaction as an offline job, and merged task and Hudi caches, resulting in stable performance at million‑level QPS.

4. Capability Analysis

BitSail offers two major capabilities: low‑cost co‑development and architectural compatibility.

4.1 Low‑Cost Co‑Development

By splitting the monolithic JAR into independent modules (engine, source, framework) and adopting a plug‑in design, developers can contribute new connectors or features without deep engine knowledge.

Abstracted read/write interfaces decouple user code from Flink APIs, allowing new connectors to be built on top of a stable abstraction layer.
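A minimal sketch of such an abstraction layer is shown below. All names (`Reader`, `Writer`, `register`, `run_job`) are hypothetical and do not reflect BitSail’s actual interfaces. The point is that a connector author implements only the abstract read or write contract, and the engine‑side runner sees only that contract.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterator, List, Type


class Reader(ABC):
    @abstractmethod
    def read(self) -> Iterator[dict]:
        """Yield rows from the source."""


class Writer(ABC):
    @abstractmethod
    def write(self, row: dict) -> None:
        """Persist one row to the sink."""


READERS: Dict[str, Type[Reader]] = {}
WRITERS: Dict[str, Type[Writer]] = {}


def register(registry: Dict[str, type], name: str) -> Callable[[type], type]:
    """Class decorator that adds a connector to a registry under a name."""
    def wrap(cls: type) -> type:
        registry[name] = cls
        return cls
    return wrap


@register(READERS, "memory")
class MemoryReader(Reader):
    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def read(self) -> Iterator[dict]:
        return iter(self.rows)


@register(WRITERS, "collect")
class CollectWriter(Writer):
    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, row: dict) -> None:
        self.rows.append(row)


def run_job(reader: Reader, writer: Writer) -> int:
    """Engine-side loop: depends only on the abstract interfaces."""
    count = 0
    for row in reader.read():
        writer.write(row)
        count += 1
    return count
```

Swapping the loop in `run_job` for a different execution engine would not require any change to `MemoryReader` or `CollectWriter`, which is the property the abstraction is meant to provide.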

4.2 Compatibility Capability

To handle diverse big‑data stacks, BitSail provides a multi‑engine architecture (with Spark and Local Engine support planned) and isolates dependencies through provided‑scope dependencies, Maven profiles, and dynamic loading of source components.
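Dynamic loading can be illustrated with a short sketch. BitSail itself is a Java project, so this Python version only demonstrates the principle, and the table `CONNECTORS` and function `load_connector` are hypothetical. Standard‑library classes stand in for connector classes so that the example runs on its own. A connector’s module is imported only when a job names it, so the dependencies of unused connectors are never loaded.

```python
import importlib
from typing import Any, Dict

# Connector name -> "module:attribute". Standard-library classes are used
# here purely as stand-ins for real connector implementations.
CONNECTORS: Dict[str, str] = {
    "json": "json:JSONDecoder",
    "csv": "csv:DictReader",
}


def load_connector(name: str) -> Any:
    """Import the module for one connector on demand and return its class."""
    try:
        target = CONNECTORS[name]
    except KeyError:
        raise ValueError(f"unknown connector: {name}") from None
    module_name, _, attribute = target.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)
```

A job configured with `"json"` triggers only the import of that one module, so a conflict in another connector’s dependencies cannot affect it.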

5. Future Outlook

BitSail plans to deepen its work in three areas: multi‑engine support with intelligent engine selection, generic capability construction (new interfaces, multi‑language connectors), and streaming data‑lake solutions that sustain tens of millions of QPS for CDC ingestion.

Overall, BitSail delivers a mature, extensible, and high‑performance data‑integration platform that reduces cost, improves stability, and enables unified batch, streaming, and incremental data pipelines.

