BitSail Open‑Source Data Integration Engine: Architecture, New Features, CDC Solutions and Future Outlook
This article introduces ByteDance's open‑source data integration engine BitSail, covering its background, layered architecture, recent feature enhancements, automated testing framework, CDC‑based full‑library synchronization solutions, and future development plans for connectors and real‑time data consistency.
With the rapid growth of big data, data integration—moving data from system A to system B—has become a foundational step for building reliable data pipelines. BitSail ("Bit Sail") is ByteDance's open‑source, distributed, high‑performance data integration engine designed to handle offline, real‑time, full‑load and incremental synchronization across heterogeneous sources.
Background and Evolution
Before 2018, ByteDance lacked a unified integration framework, requiring custom sync channels for each source.
2019: Unified batch integration based on Flink.
2020: Added streaming support, achieving a batch‑stream unified architecture.
2021: Integrated with Hudi for near‑real‑time data lake ingestion.
2022: Open‑sourced BitSail after extensive internal validation.
Three‑Layer Architecture
Connector Layer: Handles source and sink adapters for relational databases, message queues, Hive, ClickHouse, etc.
Framework Layer: Provides type conversion, dirty-data handling, flow control, auto-parallelism, and monitoring.
Engine Layer: Executes distributed task scheduling and data transfer.
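The division of responsibilities between the three layers can be illustrated with a small sketch. This is not BitSail's actual API (BitSail itself is written in Java); the names `ListSource`, `ListSink`, `Framework` and `run_local` are hypothetical, and the single-process loop only stands in for a distributed engine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Optional, Protocol

Row = Dict[str, Any]


class Source(Protocol):
    def read(self) -> Iterable[Row]: ...


class Sink(Protocol):
    def write(self, row: Row) -> None: ...


# Connector layer: adapters that only know how to read from or write to a system.
@dataclass
class ListSource:
    rows: List[Row]

    def read(self) -> Iterable[Row]:
        return iter(self.rows)


@dataclass
class ListSink:
    written: List[Row] = field(default_factory=list)

    def write(self, row: Row) -> None:
        self.written.append(row)


# Framework layer: type conversion and dirty-data handling shared by all connectors.
@dataclass
class Framework:
    schema: Dict[str, Callable[[Any], Any]]  # column name -> conversion function
    dirty: List[Row] = field(default_factory=list)

    def convert(self, row: Row) -> Optional[Row]:
        try:
            return {name: cast(row[name]) for name, cast in self.schema.items()}
        except (KeyError, TypeError, ValueError):
            # Rows that fail conversion are set aside instead of failing the job.
            self.dirty.append(row)
            return None


# Engine layer: drives the transfer. A real engine would schedule this in parallel.
def run_local(source: Source, framework: Framework, sink: Sink) -> int:
    count = 0
    for row in source.read():
        converted = framework.convert(row)
        if converted is not None:
            sink.write(converted)
            count += 1
    return count


source = ListSource([{"id": "1", "amount": "2.5"}, {"id": "x", "amount": "3"}])
framework = Framework(schema={"id": int, "amount": float})
sink = ListSink()
transferred = run_local(source, framework, sink)
```

Because the connectors never see the conversion logic and the engine never sees connector internals, swapping any one of the three pieces leaves the other two unchanged, which is the property the layered design is aiming for.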
Recent Feature Highlights
Data synchronization architecture with clear separation of source, sink and engine.
Modular code structure mirroring the three layers, enabling plug‑and‑play connectors.
Multi‑engine support to reduce reliance on Flink and lower operational costs.
Evolution from ETL → ELT → EtLT, introducing a Transform module for lightweight field‑level operations and dimension‑table joins.
Automated testing engine that builds M×N connector combinations, runs distributed tests, and integrates with CI/CD.
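The M×N idea behind the automated testing engine can be sketched as a Cartesian product of source and sink connectors. The connector names and the `run_case` callback below are placeholders chosen for illustration; how BitSail actually launches and distributes each case is not described in the talk.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (source connector, sink connector)


def build_test_matrix(sources: List[str], sinks: List[str]) -> List[Case]:
    """Pair every source with every sink, giving M x N test cases."""
    return list(product(sources, sinks))


def run_matrix(cases: List[Case], run_case: Callable[[str, str], bool]) -> Dict[Case, bool]:
    """Run each case and record whether it passed."""
    return {case: run_case(*case) for case in cases}


cases = build_test_matrix(["mysql", "kafka"], ["hive", "clickhouse", "print"])

# Stand-in for a real sync job: pretend every job writing to "hive" fails.
results = run_matrix(cases, lambda source, sink: sink != "hive")
failed = sorted(case for case, passed in results.items() if not passed)
```

In a CI/CD setup, the list of failed pairs would typically decide whether the build is marked as broken.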
CDC (Change Data Capture) Solution
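Captures changes from the database's change log (for example, the MySQL binlog) for sources such as MySQL, SQL Server and PostgreSQL. Compared with periodically querying the source tables, this gives lower latency and puts less load on production systems.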
Use cases include syncing to data warehouses (Hive, ClickHouse), MPP databases (Doris, StarRocks) for near‑real‑time analytics, and Elasticsearch for online search.
Full‑library sync workflow: (1) CDC Batch for initial bulk load, (2) real‑time incremental capture via Debezium → Kafka, (3) partition‑level data merging, (4) sink operators write change logs downstream.
Benefits: reduced latency (seconds‑level), lower operational complexity (automatic table creation, single‑task orchestration), simplified handling of sharding, and strong consistency through binlog positions and custom ordering.
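The merge step of the workflow above, where an initial bulk load is combined with later change events, can be sketched as follows. This is a simplified illustration under stated assumptions: a single integer `position` stands in for a binlog file and offset, the whole table fits in a dictionary keyed by primary key, and `ChangeEvent` and `merge_changes` are hypothetical names rather than BitSail classes.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional

Row = Dict[str, Any]


@dataclass(frozen=True)
class ChangeEvent:
    position: int               # stand-in for a binlog file/offset pair
    op: str                     # "insert", "update" or "delete"
    key: Any                    # primary key of the affected row
    row: Optional[Row] = None   # new row image; None for deletes


def merge_changes(
    snapshot: Dict[Any, Row],
    snapshot_position: int,
    events: Iterable[ChangeEvent],
) -> Dict[Any, Row]:
    """Apply change events to a snapshot in log order and return the merged table."""
    merged = dict(snapshot)
    for event in sorted(events, key=lambda e: e.position):
        if event.position <= snapshot_position:
            # Already reflected in the bulk load, so skip to avoid replaying stale data.
            continue
        if event.op == "delete":
            merged.pop(event.key, None)
        elif event.op in ("insert", "update"):
            merged[event.key] = event.row
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return merged


snapshot = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
events = [
    ChangeEvent(12, "delete", 2),
    ChangeEvent(11, "update", 1, {"id": 1, "name": "a2"}),
    ChangeEvent(9, "update", 1, {"id": 1, "name": "stale"}),
    ChangeEvent(13, "insert", 3, {"id": 3, "name": "c"}),
]
merged = merge_changes(snapshot, snapshot_position=10, events=events)
```

Sorting by log position before applying is what makes the result independent of the order in which events arrive from the message queue, and the position check is what lets the bulk load and the incremental stream overlap safely.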
Future Outlook
Expand connector ecosystem and provide lighter distributed compute engines.
Enhance CDC capabilities with automatic DDL sync and end‑to‑end data consistency verification.
Q&A Highlights
A custom engine, currently at the POC stage, shows promising performance gains over Flink-based sync.
Flink is a heavyweight dependency, which leads to wasted resources in pure data-integration scenarios.
Both record‑count and byte‑based flow control are supported.
CDC full‑library sync on Volcano Engine offers wizard‑style configuration and scales with data volume.
Exactly‑once semantics and state handling ensure data accuracy in real‑time pipelines.
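Flow control by both record count and byte volume, as mentioned in the Q&A, is commonly built from two token buckets that must both admit a record. The sketch below uses that general technique; it is an assumption about how such a limiter could look, not a description of BitSail's implementation, and `TokenBucket` and `FlowController` are hypothetical names. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def available(self, amount: float) -> bool:
        # Top up according to the time elapsed since the last check.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        return self.tokens >= amount

    def consume(self, amount: float) -> None:
        self.tokens -= amount


class FlowController:
    """Admits a record only if both the record budget and the byte budget allow it."""

    def __init__(self, records_per_sec: float, bytes_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.records = TokenBucket(records_per_sec, records_per_sec, clock)
        self.bytes = TokenBucket(bytes_per_sec, bytes_per_sec, clock)

    def try_acquire(self, record_size: int) -> bool:
        if self.records.available(1) and self.bytes.available(record_size):
            self.records.consume(1)
            self.bytes.consume(record_size)
            return True
        return False


now = [0.0]  # fake clock, advanced by hand
limiter = FlowController(records_per_sec=2, bytes_per_sec=100, clock=lambda: now[0])
first = limiter.try_acquire(60)       # fits both budgets
too_big = limiter.try_acquire(60)     # only 40 bytes left in this second
second = limiter.try_acquire(30)      # fits the remaining byte budget
over_count = limiter.try_acquire(1)   # record budget of 2 is used up
now[0] = 1.0
after_refill = limiter.try_acquire(60)
```

One limitation of this sketch is that a record larger than the byte capacity can never be admitted; a production limiter would need a policy for oversized records, such as letting them through and going into debt.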
Overall, BitSail provides a comprehensive, open‑source solution for large‑scale data integration, combining flexible architecture, robust CDC capabilities, and automated testing to meet modern big‑data demands.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.