
How ByteDance’s BitSail is Revolutionizing Data Integration at Scale

BitSail, ByteDance’s open‑source data integration engine built on Flink, has evolved through three major versions to support batch, streaming and CDC modes, handling over 200,000 daily tasks across 20+ data sources, and aims to meet real‑time, cloud‑native integration demands.

ByteDance Data Platform

Background and Open‑Source Release

Since the establishment of the Open‑Source Committee at ByteDance, the company has released several projects, including Shuffle, Cloud Shuffle Service, Volo, and the data integration engine BitSail, which was open‑sourced on 26 Oct 2022 under the Apache 2.0 license.

BitSail originated from the internal Data Transmission Service (DTS) built on Apache Flink and has become a core component of the DataLeap suite.

Evolution of the Engine

BitSail’s development can be divided into three stages:

V1.0 (2018-2019): Batch-only integration using Flink Batch, supporting ~20 heterogeneous data sources.

V2.0 (2020-2021): Added real-time MQ-to-Hive/HDFS pipelines, achieving unified stream-batch processing.

V3.0 (2022-present): Introduced CDC, real-time lake integration, and cloud-native optimisations for Kubernetes.

A key technical challenge was Flink checkpoint reliability for high-concurrency jobs (parallelism of up to 5,000). Adopting Flink Regional Checkpoint raised the checkpoint success rate from 60% to 90%.
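To see why checkpoint reliability becomes harder at high parallelism, consider a deliberately simplified model. It is an assumption made for illustration, not a description of ByteDance's implementation: a global checkpoint completes only if every subtask snapshots successfully, and each subtask succeeds independently with probability p. The second function sketches a region-based scheme under the further assumption that a checkpoint can still complete when a small number of regions fail. The names, the per-task probability, and the region size are all hypothetical, and the results are not the measured 60% and 90% figures.

```python
from math import comb


def global_success(p_task: float, n_tasks: int) -> float:
    """Probability that all n_tasks independent subtasks snapshot successfully."""
    return p_task ** n_tasks


def regional_success(p_task: float, n_tasks: int, region_size: int, tolerated: int) -> float:
    """Probability that at most `tolerated` regions fail (binomial model).

    Assumes n_tasks splits evenly into regions of `region_size` subtasks and
    that a region fails if any one of its subtasks fails.
    """
    if n_tasks % region_size != 0:
        raise ValueError("n_tasks must be a multiple of region_size")
    n_regions = n_tasks // region_size
    p_region = p_task ** region_size
    q_region = 1.0 - p_region
    return sum(
        comb(n_regions, k) * q_region ** k * p_region ** (n_regions - k)
        for k in range(tolerated + 1)
    )


if __name__ == "__main__":
    p = 0.9999  # hypothetical per-subtask success probability
    print(f"global:   {global_success(p, 5000):.3f}")
    print(f"regional: {regional_success(p, 5000, region_size=50, tolerated=1):.3f}")
```

Even a per-subtask success rate very close to 1 compounds badly across thousands of subtasks. That is the intuition behind limiting the blast radius of a single failure.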

Current Architecture

The engine now supports three synchronization modes—batch, streaming, and incremental (CDC)—covering offline, real‑time, full‑load and incremental scenarios.

Batch mode uses Flink Batch to move data between over 20 source types.

Streaming mode moves data from MQ to Hive or HDFS with high stability and low latency.

CDC mode captures database binlog changes and synchronises them to downstream systems. It currently supports only five source types, yet it already handles a massive volume of tasks.
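The idea behind CDC mode can be sketched in a few lines: a stream of change events is replayed in order against a downstream copy keyed by primary key. This is a generic illustration of change data capture, not BitSail's code; the `ChangeEvent` type and `apply_changes` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass
class ChangeEvent:
    op: str                               # "insert", "update", or "delete"
    key: int                              # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row image; None for deletes


def apply_changes(target: Dict[int, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[int, Dict[str, Any]]:
    """Replay change events in order against an in-memory downstream table."""
    for ev in events:
        if ev.op in ("insert", "update"):
            target[ev.key] = dict(ev.row or {})  # upsert the new row image
        elif ev.op == "delete":
            target.pop(ev.key, None)             # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown op: {ev.op}")
    return target


events = [
    ChangeEvent("insert", 1, {"name": "a"}),
    ChangeEvent("insert", 2, {"name": "b"}),
    ChangeEvent("update", 1, {"name": "a2"}),
    ChangeEvent("delete", 2),
]
replica = apply_changes({}, events)
print(replica)  # {1: {'name': 'a2'}}
```

Treating inserts and updates as upserts and ignoring deletes for missing keys keeps the replay idempotent, which helps when events are redelivered after a failure.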

Supported connectors include MySQL, Oracle, MongoDB, Kafka, RocketMQ, HDFS, Hive, ClickHouse, among others, totaling more than 20 source types.

The engine runs more than 200,000 tasks per day and moves more than a hundred trillion rows. Single batch jobs reach billions of rows, while streaming jobs sustain tens of thousands of QPS with an SLA latency of 10 minutes.
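For a sense of scale, the daily totals can be converted into per-second averages. The snippet below is back-of-the-envelope arithmetic only: it reads "a hundred trillion" as 10^14 and treats both figures as lower bounds. Real load is not spread evenly across the day, so peaks will be higher than these averages.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_tasks = 200_000        # "over 200,000 tasks", taken as a lower bound
daily_rows = 100 * 10**12    # "a hundred trillion rows", read as 1e14

tasks_per_second = daily_tasks / SECONDS_PER_DAY
rows_per_second = daily_rows / SECONDS_PER_DAY
rows_per_task = daily_rows / daily_tasks

print(f"{tasks_per_second:.2f} task launches per second on average")
print(f"{rows_per_second:.3e} rows per second on average")
print(f"{rows_per_task:.1e} rows per task on average")
```

On these assumptions the platform launches a little over two tasks every second and moves on the order of a billion rows per second, averaged over the day.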

Figure: ByteDance data integration evolution timeline
Figure: ByteDance data integration engine architecture
Figure: ByteDance data integration status

Comparison with Other Projects

Compared with open-source alternatives such as Apache SeaTunnel, Apache InLong, Airbyte, DataX, and Sqoop, BitSail distinguishes itself in three ways. It has been battle-tested at ByteDance's massive traffic scale. It offers a plugin-based, Flink-agnostic runtime. And it provides out-of-the-box support for more than 20 connectors, along with automatic type conversion, dirty-data handling, flow control, and streaming archiving.
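Two of the capabilities listed above, automatic type conversion and dirty-data handling, usually work together: each row is cast to the target schema, rows that fail are set aside, and the job aborts only when too many rows are bad. The following is a minimal sketch of that pattern. It is not BitSail's API; `convert_rows`, the schema format, and the `max_dirty_ratio` parameter are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Schema = Dict[str, Callable[[Any], Any]]  # column name -> cast function


def convert_rows(rows: Iterable[Row], schema: Schema,
                 max_dirty_ratio: float) -> Tuple[List[Row], List[Row]]:
    """Cast each row to the schema; collect rows that cannot be converted.

    Raises RuntimeError if the share of dirty rows exceeds max_dirty_ratio.
    """
    clean: List[Row] = []
    dirty: List[Row] = []
    for row in rows:
        try:
            clean.append({col: cast(row[col]) for col, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            dirty.append(row)  # keep the original row for later inspection
    total = len(clean) + len(dirty)
    if total and len(dirty) / total > max_dirty_ratio:
        raise RuntimeError(f"{len(dirty)}/{total} rows are dirty, above the threshold")
    return clean, dirty


schema = {"id": int, "price": float, "name": str}
rows = [
    {"id": "1", "price": "9.5", "name": "pen"},
    {"id": "x", "price": "1.0", "name": "bad id"},  # "x" cannot be cast to int
    {"id": "3", "price": "2", "name": "ink"},
    {"id": "4", "price": "7.25", "name": "pad"},
]
clean, dirty = convert_rows(rows, schema, max_dirty_ratio=0.3)
print(len(clean), "clean rows,", len(dirty), "dirty row")
```

Collecting dirty rows instead of failing on the first one lets a large job finish despite a few malformed records, while the ratio threshold still stops a job whose input is systematically broken.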

Future Directions

Short‑term plans focus on strengthening core capabilities—expanding connector coverage, improving observability, and simplifying runtime deployment. In the medium‑to‑long term, the team aims to develop a lighter‑weight native runtime, enhance lake‑integration, and support elastic scaling in cloud‑native environments.
