BitSail: ByteDance’s Open‑Source Unified Data Integration Engine – Architecture, Evolution, and Capabilities
BitSail, an open‑source data integration engine from ByteDance, provides a unified solution for batch, streaming, full‑load, and incremental data synchronization across heterogeneous sources. This article covers its background, technical evolution, architecture, low‑cost co‑building features, compatibility strategies, and future roadmap.
1. Introduction
BitSail is ByteDance’s open‑source data integration engine. It supports synchronization among heterogeneous data sources in offline, real‑time, full‑load, and incremental scenarios, and its performance and stability have been validated by many internal business lines and Volcano Engine customers.
On October 26, ByteDance announced that BitSail had been officially open‑sourced on GitHub, with the goal of lowering the cost of building data infrastructure and helping users create value from data more efficiently.
2. ByteDance Internal Data Integration Background
ByteDance’s data platform is built on data‑driven principles. Data integration is the foundation of the data middle platform: it handles the transmission, processing, and transformation of data from heterogeneous sources.
BitSail originated from the internally developed Data Transmission Service (DTS), which is built on Apache Flink. It evolved into a framework that supports batch, streaming, and incremental modes, distributed horizontal scaling, and a plug‑in architecture for flexible source integration.
3. BitSail Evolution
3.1 Three‑Stage Evolution
Initial stage (pre‑2018): No unified framework; scattered use of MapReduce, Spark, and mesh‑like source connections, leading to high development and operation costs.
Growth stage (2018‑2022): Adoption of Flink, batch‑stream integration, and later Hudi for CDC and lake‑warehouse integration.
Mature stage (2022‑present): Stable architecture validated in production, now open‑sourced to benefit external developers.
3.2 Technical Architecture Evolution
3.2.1 Flink‑Based Heterogeneous Source Transfer
The core abstracts input as BaseInput and output as BaseOutput, providing services such as type system, auto parallelism, flow control, and dirty data detection.
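The input/output abstraction and dirty data detection described above can be sketched as follows. This is a minimal illustration, not BitSail's actual API: only the names BaseInput and BaseOutput come from the text, while the method names, `ListInput`, `IntOutput`, `run_job`, and the `max_dirty` threshold are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable, Iterator, List


class BaseInput(ABC):
    """Reader side of a job (illustrative interface, not BitSail's API)."""

    @abstractmethod
    def read(self) -> Iterator[Any]:
        """Yield records one at a time."""


class BaseOutput(ABC):
    """Writer side of a job (illustrative interface, not BitSail's API)."""

    @abstractmethod
    def write(self, record: Any) -> None:
        """Persist one record; raise ValueError or TypeError if it is malformed."""


class ListInput(BaseInput):
    """Hypothetical source that reads from an in-memory list."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self.rows = list(rows)

    def read(self) -> Iterator[Any]:
        return iter(self.rows)


class IntOutput(BaseOutput):
    """Hypothetical sink that only accepts values convertible to int."""

    def __init__(self) -> None:
        self.written: List[int] = []

    def write(self, record: Any) -> None:
        self.written.append(int(record))


class DirtyDataLimitExceeded(RuntimeError):
    """Raised when more records fail than the job tolerates."""


def run_job(source: BaseInput, sink: BaseOutput, max_dirty: int) -> List[Any]:
    """Copy records from source to sink and return the rejected (dirty) ones."""
    dirty: List[Any] = []
    for record in source.read():
        try:
            sink.write(record)
        except (ValueError, TypeError):
            dirty.append(record)
            if len(dirty) > max_dirty:
                raise DirtyDataLimitExceeded(f"{len(dirty)} dirty records")
    return dirty
```

Because the framework loop only talks to the two abstract classes, a new source or sink needs to implement a single method, and cross-cutting services such as the dirty-record threshold live in one place.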
Improvements include exposing split‑level progress via metrics and separating source and operator layers to provide accurate task progress.
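One way to picture split‑level progress reporting is a tracker that aggregates per‑split record counts into a job‑level ratio. The sketch below is a simplified assumption about how such a metric could be computed; `SplitProgressTracker` and its methods are hypothetical names, and it assumes the expected record count of each split is known in advance.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SplitProgressTracker:
    """Hypothetical tracker: split id -> expected and already-read record counts."""

    expected: Dict[str, int]
    read: Dict[str, int] = field(default_factory=dict)

    def report(self, split_id: str, records_read: int) -> None:
        """Record how many records a split has read so far (capped at its size)."""
        if split_id not in self.expected:
            raise KeyError(f"unknown split: {split_id}")
        self.read[split_id] = min(records_read, self.expected[split_id])

    def split_progress(self, split_id: str) -> float:
        """Fraction of one split that has been read."""
        total = self.expected[split_id]
        return 1.0 if total == 0 else self.read.get(split_id, 0) / total

    def job_progress(self) -> float:
        """Fraction of all records, across splits, that have been read."""
        total = sum(self.expected.values())
        return 1.0 if total == 0 else sum(self.read.values()) / total
```

Weighting by records rather than by finished splits keeps the job‑level number meaningful when splits differ widely in size.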
3.2.2 Batch‑Stream Unified Architecture
This stage upgraded Flink from 1.5 to 1.9, unified the APIs on DataStream, and added real‑time sources, exactly‑once guarantees, event‑time handling, automatic DDL, speculative execution, and region failover. It also added support for cloud‑native deployment.
3.2.3 Lake‑Warehouse Unified Architecture
This stage integrated Hudi to achieve near‑real‑time CDC synchronization, introduced the Copy‑On‑Write (COW) and Merge‑On‑Read (MOR) table types, and optimized compaction to reduce latency and improve throughput.
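The trade‑off between the two table types can be illustrated with a toy in‑memory model. This is a conceptual sketch only and does not use Hudi's API or file layout: `CopyOnWriteTable`, `MergeOnReadTable`, and their fields are hypothetical, with a dictionary standing in for base files and a list standing in for log files.

```python
from typing import Any, Dict, List, Tuple


class CopyOnWriteTable:
    """Toy COW table: every upsert rewrites the base data."""

    def __init__(self) -> None:
        self.base: Dict[str, Any] = {}
        self.rewrites = 0

    def upsert(self, key: str, value: Any) -> None:
        new_base = dict(self.base)  # stands in for rewriting a base file
        new_base[key] = value
        self.base = new_base
        self.rewrites += 1

    def read(self) -> Dict[str, Any]:
        return dict(self.base)


class MergeOnReadTable:
    """Toy MOR table: upserts append to a log that is merged at read time."""

    def __init__(self) -> None:
        self.base: Dict[str, Any] = {}
        self.log: List[Tuple[str, Any]] = []

    def upsert(self, key: str, value: Any) -> None:
        self.log.append((key, value))  # cheap append, no rewrite

    def read(self) -> Dict[str, Any]:
        merged = dict(self.base)
        for key, value in self.log:  # later log entries win
            merged[key] = value
        return merged

    def compact(self) -> None:
        """Fold the log into the base so later reads have nothing to merge."""
        self.base = self.read()
        self.log = []
```

In this model the MOR write path is a plain append, which suits high‑frequency CDC updates, while the merge cost moves to reads until compaction runs.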
3.3 Practical Experience
Key lessons include selecting the appropriate table type (MOR for CDC), optimizing the Hudi write path by replacing Flink state with a hash index, decoupling compaction into offline jobs, and merging the task cache and the Hudi cache to shorten checkpoint times. Together, these changes achieved million‑level QPS and sub‑minute checkpoint latency.
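The idea behind replacing a state‑backed index with a hash index is that a record's location can be computed from its key instead of being looked up. The sketch below shows that idea in isolation. It is an assumption‑laden illustration, not Hudi's or BitSail's implementation: `bucket_for` and `route` are hypothetical helpers, and MD5 is used only as a convenient stable hash.

```python
import hashlib
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket deterministically, with no stored index."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    digest = hashlib.md5(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def route(
    records: List[Tuple[str, Any]], num_buckets: int
) -> Dict[int, List[Tuple[str, Any]]]:
    """Group (key, payload) records by the bucket their key hashes to."""
    buckets: Dict[int, List[Tuple[str, Any]]] = defaultdict(list)
    for key, payload in records:
        buckets[bucket_for(key, num_buckets)].append((key, payload))
    return dict(buckets)
```

Since the mapping is a pure function of the key, updates to the same key always land in the same bucket, and the writer holds no per‑key index state that would need to be checkpointed.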
4. BitSail Capabilities
4.1 Low‑Cost Co‑building
Modularization separates framework, engine, and source layers, and an abstracted plug‑in interface reduces connector development effort.
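A plug‑in interface of this kind is often built around a registry that maps connector names to implementations. The following sketch assumes such a design purely for illustration: `register_reader`, `create_reader`, and `MemoryReader` are hypothetical names and do not reflect BitSail's connector API.

```python
from typing import Any, Callable, Dict, List, Type

# Hypothetical registry: connector name -> reader class.
_READERS: Dict[str, Type] = {}


def register_reader(name: str) -> Callable[[Type], Type]:
    """Class decorator that makes a reader available under a short name."""

    def decorator(cls: Type) -> Type:
        if name in _READERS:
            raise ValueError(f"reader {name!r} is already registered")
        _READERS[name] = cls
        return cls

    return decorator


@register_reader("memory")
class MemoryReader:
    """Example connector: reads rows from an in-memory list."""

    def __init__(self, rows: List[Any]) -> None:
        self.rows = rows

    def read(self) -> List[Any]:
        return list(self.rows)


def create_reader(name: str, **options: Any) -> Any:
    """Instantiate a registered reader from its name and job options."""
    try:
        cls = _READERS[name]
    except KeyError:
        raise KeyError(f"no reader registered under {name!r}") from None
    return cls(**options)
```

With this shape, a connector author writes one class and one decorator line; the framework code that builds jobs from configuration does not change.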
4.2 Compatibility
Multi‑engine support (Flink, Spark, Local Engine) and dependency isolation via provided dependencies and dynamic component loading enable flexible deployment across diverse environments.
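Dynamic component loading generally means resolving an implementation by name at runtime, so that only the components a job needs are imported. A minimal sketch of that mechanism follows. The mapping and function names are hypothetical, and standard library classes stand in for real engine adapters so the example stays self‑contained.

```python
import importlib
from typing import Any, Dict

# Hypothetical mapping from engine name to the dotted path of its adapter.
# Standard library classes are used as stand-ins for real engine adapters.
ENGINE_COMPONENTS: Dict[str, str] = {
    "flink": "collections.OrderedDict",  # stand-in, not a Flink class
    "local": "collections.Counter",      # stand-in, not a local-engine class
}


def load_component(dotted_path: str) -> Any:
    """Import a module lazily and return the attribute named by the path."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.attr', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def load_engine(name: str) -> Any:
    """Resolve the adapter for an engine only when a job asks for it."""
    return load_component(ENGINE_COMPONENTS[name])
```

Nothing is imported until `load_engine` is called, so an environment that lacks one engine's libraries can still run jobs on another.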
5. Future Outlook
Plans include expanding multi‑engine support with intelligent engine selection, promoting generic interfaces to hide engine details, exploring multi‑language connectors, and delivering a unified CDC‑to‑lake solution that sustains tens of millions of QPS.
6. Activity Preview
ByteDance Data Platform will host a live BitSail session on November 9 at 19:30, featuring experts who will dive into technical practices, open‑source roadmap, and hands‑on guidance.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
