
BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities

BitSail, an open-source data integration engine from ByteDance, provides a unified solution for batch, streaming, full-load, and incremental data synchronization across heterogeneous sources. This article covers its background, technical evolution, architecture, low-cost co-building features, compatibility strategies, and future roadmap.

DataFunTalk

Nov 6, 2022


1. Introduction

BitSail is ByteDance's open-source data integration engine. It supports synchronization among heterogeneous data sources in offline, real-time, full-load, and incremental scenarios, and its performance and stability have been validated by many internal ByteDance workloads and Volcano Engine customers.

On October 26, 2022, ByteDance announced BitSail's official open-source release on GitHub, with the goal of lowering the cost of building data infrastructure and helping teams extract value from data more efficiently.

2. ByteDance Internal Data Integration Background

ByteDance's data platform is built around data-driven decision making. Data integration is the foundation of its data middle platform, handling the transmission, processing, and transformation of data across heterogeneous sources.

BitSail originated from the internally developed Data Transmission Service (DTS) built on Apache Flink, evolving into a framework that supports batch, streaming, and incremental modes, distributed horizontal scaling, and a plug‑in architecture for flexible source integration.

3. BitSail Evolution

3.1 Three‑Stage Evolution

Initial stage (pre-2018): No unified framework; teams used MapReduce and Spark separately and wired sources together point to point in a mesh-like topology, leading to high development and operation costs.

Growth stage (2018‑2022): Adoption of Flink, batch‑stream integration, and later Hudi for CDC and lake‑warehouse integration.

Mature stage (2022‑present): Stable architecture validated in production, now open‑sourced to benefit external developers.

3.2 Technical Architecture Evolution

3.2.1 Flink‑Based Heterogeneous Source Transfer

The core abstracts input as BaseInput and output as BaseOutput, and provides shared services such as a type system, automatic parallelism, flow control, and dirty-data detection.
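As a rough illustration of this abstraction (not BitSail's actual API), the sketch below models a reader and writer pair with a simple dirty-record collector. Only the names BaseInput and BaseOutput come from the text above; the methods, the ListInput and ListOutput classes, the run_job helper, and the rule that a conversion error marks a record as dirty are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Iterable, Iterator, List, Tuple


class BaseInput(ABC):
    """Hypothetical reader side: yields raw records from a source."""

    @abstractmethod
    def read(self) -> Iterator[Any]:
        ...


class BaseOutput(ABC):
    """Hypothetical writer side: accepts converted records."""

    @abstractmethod
    def write(self, record: Any) -> None:
        ...


class ListInput(BaseInput):
    """In-memory source used only for this example."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows = list(rows)

    def read(self) -> Iterator[Any]:
        return iter(self.rows)


class ListOutput(BaseOutput):
    """In-memory sink used only for this example."""

    def __init__(self) -> None:
        self.rows: List[Any] = []

    def write(self, record: Any) -> None:
        self.rows.append(record)


def run_job(
    source: BaseInput, sink: BaseOutput, convert: Callable[[Any], Any]
) -> List[Tuple[Any, str]]:
    """Copy records from source to sink and return the dirty records.

    Assumption: a record is dirty when `convert` raises ValueError or TypeError.
    """
    dirty: List[Tuple[Any, str]] = []
    for raw in source.read():
        try:
            sink.write(convert(raw))
        except (ValueError, TypeError) as exc:
            dirty.append((raw, str(exc)))
    return dirty


source = ListInput(["1", "2", "oops", "4"])
sink = ListOutput()
dirty_records = run_job(source, sink, int)
print(sink.rows)           # clean records that reached the sink
print(len(dirty_records))  # number of records rejected as dirty
```

Keeping the conversion and dirty-record handling in the shared run_job function, rather than in each reader or writer, mirrors the idea that the framework supplies common services while connectors only move data.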

Improvements include exposing split‑level progress via metrics and separating source and operator layers to provide accurate task progress.
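One way to picture split-level progress is as the fraction of input splits that have finished. The sketch below is a minimal model of that idea; the SplitProgress class and its method names are hypothetical and do not correspond to BitSail or Flink metric APIs.

```python
from typing import Dict, Iterable


class SplitProgress:
    """Tracks which input splits are finished (hypothetical helper)."""

    def __init__(self, split_ids: Iterable[str]) -> None:
        self._done: Dict[str, bool] = {split_id: False for split_id in split_ids}

    def mark_done(self, split_id: str) -> None:
        if split_id not in self._done:
            raise KeyError(f"unknown split: {split_id}")
        self._done[split_id] = True

    def fraction_done(self) -> float:
        # A job with no splits has nothing left to do.
        if not self._done:
            return 1.0
        return sum(self._done.values()) / len(self._done)


progress = SplitProgress(["split-0", "split-1", "split-2", "split-3"])
progress.mark_done("split-0")
print(progress.fraction_done())
```

In a real job, a value like fraction_done would be published as a metric so that progress reflects how much of the source has been read rather than how busy the operators are.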

3.2.2 Batch‑Stream Unified Architecture

In this stage the team upgraded Flink from 1.5 to 1.9 and unified the APIs on DataStream. It also added real-time sources, exactly-once guarantees, event-time handling, automatic DDL, speculative execution, and region failover, along with support for cloud-native deployment.

3.2.3 Lake‑Warehouse Unified Architecture

The team integrated Hudi to achieve near-real-time CDC synchronization, adopted its Copy-On-Write (COW) and Merge-On-Read (MOR) table types, and optimized compaction to reduce latency and improve throughput.
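The difference between the two table types can be shown with a toy in-memory model. This is only a sketch of the general idea (COW rewrites the base data on every update, while MOR appends updates to a log that is merged at read time and folded into the base by compaction); the classes below are hypothetical and are not Hudi's API or file layout.

```python
from typing import Dict, List, Tuple


class CopyOnWriteTable:
    """Toy COW table: every upsert rewrites the base snapshot."""

    def __init__(self) -> None:
        self.base: Dict[str, int] = {}
        self.rewrites = 0

    def upsert(self, key: str, value: int) -> None:
        new_base = dict(self.base)  # copy the existing data
        new_base[key] = value       # apply the change to the copy
        self.base = new_base
        self.rewrites += 1

    def read(self) -> Dict[str, int]:
        return dict(self.base)


class MergeOnReadTable:
    """Toy MOR table: upserts go to a log and are merged when reading."""

    def __init__(self) -> None:
        self.base: Dict[str, int] = {}
        self.log: List[Tuple[str, int]] = []

    def upsert(self, key: str, value: int) -> None:
        self.log.append((key, value))  # cheap append, no rewrite

    def read(self) -> Dict[str, int]:
        merged = dict(self.base)
        for key, value in self.log:    # later log entries win
            merged[key] = value
        return merged

    def compact(self) -> None:
        # Fold the log into the base so later reads have less to merge.
        self.base = self.read()
        self.log = []


cow = CopyOnWriteTable()
mor = MergeOnReadTable()
for key, value in [("user_1", 10), ("user_2", 20), ("user_1", 11)]:
    cow.upsert(key, value)
    mor.upsert(key, value)

print(cow.read() == mor.read())   # both table types return the same data
print(cow.rewrites, len(mor.log)) # COW paid for rewrites, MOR for log entries
```

The model shows why MOR suits write-heavy CDC streams: writes stay cheap, and the merge cost is deferred to reads and to compaction, which can run separately.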

3.3 Practical Experience

Key lessons include selecting the table type that fits the workload (MOR for CDC), optimizing the Hudi write path by replacing the Flink-state-based index with a hash index, decoupling compaction into offline jobs, and merging the task cache and the Hudi cache to shorten checkpoint times. Together, these changes delivered QPS in the millions and sub-minute checkpoint latency.
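The appeal of a hash index is that a record's location can be computed from its key, so no per-key mapping has to be kept in state. The sketch below illustrates that with a fixed number of buckets; the function names, the use of CRC32, and the bucket count are assumptions for the example and are not taken from Hudi or BitSail.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket using a stable hash (CRC32 here)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def group_by_bucket(keys: Iterable[str], num_buckets: int) -> Dict[int, List[str]]:
    """Group keys by bucket without storing any key-to-location state."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        buckets[bucket_for(key, num_buckets)].append(key)
    return dict(buckets)


keys = ["order_1", "order_2", "order_3", "order_1"]
assignment = group_by_bucket(keys, num_buckets=4)
print(assignment)
```

Because the same key always hashes to the same bucket, an update to order_1 lands next to the earlier version of order_1 without any lookup, which is what removes the dependency on a large keyed state.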

4. BitSail Capabilities

4.1 Low‑Cost Co‑building

Modularization separates framework, engine, and source layers, and an abstracted plug‑in interface reduces connector development effort.
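A plug-in interface of this kind is often paired with a registry, so the framework can create a connector from a job configuration without knowing its class. The following sketch shows the pattern in miniature; the registry, the decorator, the RangeReader connector, and the configuration layout are all hypothetical and are not BitSail's real interfaces.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: connector name -> factory taking an options dict.
READER_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Any]] = {}


def register_reader(name: str):
    """Class decorator that registers a reader under a connector name."""

    def decorator(factory):
        if name in READER_REGISTRY:
            raise ValueError(f"reader {name!r} is already registered")
        READER_REGISTRY[name] = factory
        return factory

    return decorator


@register_reader("range")
class RangeReader:
    """Example connector: emits the integers 0..count-1."""

    def __init__(self, options: Dict[str, Any]) -> None:
        self.count = int(options.get("count", 3))

    def read(self) -> List[int]:
        return list(range(self.count))


def create_reader(job_conf: Dict[str, Any]) -> Any:
    """Framework side: build a reader purely from configuration."""
    reader_conf = job_conf["reader"]
    name = reader_conf["name"]
    if name not in READER_REGISTRY:
        raise KeyError(f"no reader registered under {name!r}")
    return READER_REGISTRY[name](reader_conf.get("options", {}))


reader = create_reader({"reader": {"name": "range", "options": {"count": 2}}})
print(reader.read())
```

With this split, adding a connector means writing one class and registering it; the framework code in create_reader does not change.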

4.2 Compatibility

BitSail supports multiple engines (Flink, Spark, and a local engine). Dependency isolation, achieved through provided-scope dependencies and dynamic component loading, enables flexible deployment across diverse environments.
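Dynamic component loading means a component is resolved by name at run time, so its dependencies are only needed when that component is actually used. On the JVM this is typically done with class loaders; the sketch below shows the same idea using Python's importlib. The load_component helper and the "module:attribute" naming scheme are assumptions for the example, and a standard-library class stands in for a connector.

```python
import importlib
from typing import Any


def load_component(spec: str) -> Any:
    """Load an attribute from a module given a 'module:attribute' string.

    The module is imported only when this function is called, so unused
    components never pull in their dependencies.
    """
    module_name, _, attr_name = spec.partition(":")
    if not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# A standard-library class stands in for a connector implementation.
decoder_cls = load_component("json:JSONDecoder")
print(decoder_cls().decode('{"engine": "flink"}'))
```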

5. Future Outlook

Plans include expanding multi‑engine support with intelligent engine selection, promoting generic interfaces to hide engine details, exploring multi‑language connectors, and delivering a unified CDC‑to‑lake solution that sustains tens of millions of QPS.

6. Activity Preview

ByteDance Data Platform will host a live BitSail session on November 9 at 19:30, featuring experts who will dive into technical practices, open‑source roadmap, and hands‑on guidance.



