How ByteDance’s BitSail is Revolutionizing Data Integration at Scale
BitSail, ByteDance’s open‑source data integration engine built on Flink, has evolved through three major versions to support batch, streaming and CDC modes, handling over 200,000 daily tasks across 20+ data sources, and aims to meet real‑time, cloud‑native integration demands.
Background and Open‑Source Release
Since the establishment of the Open‑Source Committee at ByteDance, the company has released several projects, including Shuffle, Cloud Shuffle Service, Volo, and the data integration engine BitSail, which was open‑sourced on 26 Oct 2022 under the Apache 2.0 license.
BitSail originated from the internal Data Transmission Service (DTS) built on Apache Flink and has become a core component of the DataLeap suite.
Evolution of the Engine
BitSail’s development can be divided into three stages:
V1.0 (2018‑2019): Batch‑only integration using Flink Batch, supporting about 20 heterogeneous data sources.
V2.0 (2020‑2021): Added real‑time MQ‑to‑Hive/HDFS pipelines, achieving unified stream‑batch processing.
V3.0 (2022‑present): Introduced CDC, real‑time lake integration, and cloud‑native optimisations for Kubernetes.
A key technical challenge was Flink checkpoint reliability for high‑concurrency jobs, which run at a parallelism of up to 5,000. Adopting Flink Regional Checkpoint raised the checkpoint success rate from 60% to 90%.
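To see why checkpoint success is so sensitive to parallelism, consider a back‑of‑the‑envelope model. It is an illustrative assumption, not ByteDance's measurement: the job is treated as 5,000 independent regions, each failing a checkpoint attempt with the same small probability, and the regional variant is assumed to let a checkpoint complete even when a bounded number of regions fail (for example, by falling back to those regions' previous snapshots). The function names and the `tolerated` parameter are hypothetical.

```python
from math import comb


def per_region_failure_rate(global_success: float, regions: int) -> float:
    """Failure probability per region implied by an all-or-nothing success rate."""
    return 1.0 - global_success ** (1.0 / regions)


def checkpoint_success(regions: int, fail_prob: float, tolerated: int = 0) -> float:
    """Probability that at most `tolerated` regions fail (binomial CDF).

    tolerated=0 models a global checkpoint: any single failure aborts it.
    tolerated>0 models an assumed regional scheme that absorbs a few failures.
    """
    ok = 1.0 - fail_prob
    return sum(
        comb(regions, k) * fail_prob**k * ok ** (regions - k)
        for k in range(tolerated + 1)
    )


REGIONS = 5000  # parallelism quoted in the text, treated as independent regions
q = per_region_failure_rate(0.60, REGIONS)

global_rate = checkpoint_success(REGIONS, q, tolerated=0)
regional_rate = checkpoint_success(REGIONS, q, tolerated=1)

print(f"per-region failure probability: {q:.2e}")
print(f"global checkpoint success:      {global_rate:.3f}")
print(f"success tolerating one region:  {regional_rate:.3f}")
```

Under these assumptions, a per‑region failure rate of roughly one in ten thousand is enough to sink 40% of global checkpoints, and tolerating a single failed region recovers most of that loss. The real mechanism and figures inside ByteDance may differ.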
Current Architecture
The engine now supports three synchronization modes—batch, streaming, and incremental (CDC)—covering offline, real‑time, full‑load and incremental scenarios.
Batch mode uses Flink Batch to move data between over 20 source types.
Streaming mode streams data from MQ to Hive or HDFS with high stability and low latency.
CDC mode captures database binlog changes and synchronises them to downstream systems. It currently supports five source types, yet already handles a very large number of tasks.
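The difference between full‑load and incremental synchronisation can be sketched with a toy change‑event applier. This is a minimal illustration, not BitSail's API: the `ChangeEvent` type, the operation names, and the in‑memory dictionary standing in for a downstream table are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical binlog-style change record keyed by primary key."""
    op: str                      # "insert", "update", or "delete"
    key: int
    row: Optional[dict] = None   # new row image; None for deletes


def full_load(source_rows: Iterable[dict]) -> Dict[int, dict]:
    """Batch-style sync: copy every source row into a fresh target."""
    return {row["id"]: dict(row) for row in source_rows}


def apply_changes(target: Dict[int, dict], events: Iterable[ChangeEvent]) -> None:
    """CDC-style sync: replay ordered change events onto an existing target."""
    for event in events:
        if event.op in ("insert", "update"):
            if event.row is None:
                raise ValueError(f"{event.op} event for key {event.key} has no row")
            target[event.key] = dict(event.row)   # upsert keeps replays idempotent
        elif event.op == "delete":
            target.pop(event.key, None)           # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {event.op}")


# Initial snapshot, then a stream of incremental changes.
target = full_load([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
apply_changes(target, [
    ChangeEvent("update", 1, {"id": 1, "name": "a2"}),
    ChangeEvent("delete", 2),
    ChangeEvent("insert", 3, {"id": 3, "name": "c"}),
])
print(target)
```

Treating inserts and updates as upserts is a design choice of this sketch. It makes replaying the same events after a restart harmless, which matters when a pipeline recovers from a checkpoint.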
Supported connectors include MySQL, Oracle, MongoDB, Kafka, RocketMQ, HDFS, Hive, and ClickHouse, with more than 20 source types in total.
Each day the engine runs over 200,000 tasks and moves more than a hundred trillion rows. Single batch jobs reach billions of rows, while streaming jobs sustain tens of thousands of QPS within a 10‑minute latency SLA.
Comparison with Other Projects
Compared with open‑source alternatives such as Apache SeaTunnel, Apache InLong, Airbyte, DataX, and Sqoop, BitSail distinguishes itself in three ways. It has been battle‑tested at ByteDance's traffic scale. It offers a plugin‑based, Flink‑agnostic runtime. And it provides out‑of‑the‑box support for more than 20 connectors, together with automatic type conversion, dirty‑data handling, flow control, and streaming archiving.
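Automatic type conversion and dirty‑data handling usually go together: a record that cannot be converted to the target schema is set aside rather than failing the whole job, up to some limit. A minimal sketch of that idea follows. The schema format, the `DirtyCollector` class, and the `max_dirty` threshold are assumptions made for illustration and are not taken from BitSail's code.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical target schema: column name -> converter function.
Schema = Dict[str, Callable[[Any], Any]]


class DirtyCollector:
    """Collects unconvertible records and aborts once a limit is exceeded."""

    def __init__(self, max_dirty: int) -> None:
        self.max_dirty = max_dirty
        self.records: List[Tuple[dict, str]] = []

    def collect(self, record: dict, reason: str) -> None:
        self.records.append((record, reason))
        if len(self.records) > self.max_dirty:
            raise RuntimeError(f"more than {self.max_dirty} dirty records, aborting")


def convert(records: Iterable[dict], schema: Schema, dirty: DirtyCollector) -> List[dict]:
    """Convert each record to the schema; route failures to the collector."""
    clean = []
    for record in records:
        try:
            clean.append({col: cast(record[col]) for col, cast in schema.items()})
        except (KeyError, TypeError, ValueError) as exc:
            dirty.collect(record, f"{type(exc).__name__}: {exc}")
    return clean


schema: Schema = {"id": int, "price": float, "name": str}
rows = [
    {"id": "1", "price": "9.5", "name": "pen"},
    {"id": "x", "price": "1.0", "name": "bad id"},   # int("x") raises ValueError
    {"id": "3", "name": "missing price"},            # KeyError on "price"
]
dirty = DirtyCollector(max_dirty=10)
clean = convert(rows, schema, dirty)
print(clean)
print([reason for _, reason in dirty.records])
```

Setting `max_dirty` low makes the job fail fast when a source is badly malformed, while a higher value lets a long‑running job survive occasional bad rows and leaves them available for later inspection.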
Future Directions
Short‑term plans focus on strengthening core capabilities—expanding connector coverage, improving observability, and simplifying runtime deployment. In the medium‑to‑long term, the team aims to develop a lighter‑weight native runtime, enhance lake‑integration, and support elastic scaling in cloud‑native environments.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.
