BitSail Open‑Source Data Integration Engine: Architecture, New Features, CDC Solutions and Future Outlook

This article introduces ByteDance's open‑source data integration engine BitSail, covering its background, layered architecture, recent feature enhancements, automated testing framework, CDC‑based full‑library synchronization solutions, and future development plans for connectors and real‑time data consistency.

DataFunSummit

Feb 20, 2024

With the rapid growth of big data, data integration (moving data from system A to system B) has become a foundational step in building reliable data pipelines. BitSail ("Bit Sail") is ByteDance's open‑source, distributed, high‑performance data integration engine, designed to handle offline, real‑time, full‑load and incremental synchronization across heterogeneous sources.

Background and Evolution

Before 2018: ByteDance had no unified integration framework, so each source required its own custom sync channel.

2019: Unified batch integration based on Flink.

2020: Added streaming support, achieving a batch‑stream unified architecture.

2021: Integrated with Hudi for near‑real‑time data lake ingestion.

2022: Open‑sourced BitSail after extensive internal validation.

Three‑Layer Architecture

Connector Layer: Handles source and sink adapters for relational databases, message queues, Hive, ClickHouse, etc.

Framework Layer: Provides type conversion, dirty‑data handling, flow control, auto‑parallelism, and monitoring.

Engine Layer: Executes distributed task scheduling and data transfer.
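
To make the separation of the three layers concrete, here is a minimal single-process sketch in Python. It is not BitSail's API: the names `ListSource`, `ListSink`, `Framework` and `run_job` are hypothetical, and the "engine" is a plain loop standing in for distributed scheduling. It only illustrates how connectors can stay unaware of type conversion and dirty-data handling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional, Tuple


class ListSource:
    """Connector layer (hypothetical): reads raw records from an in-memory list."""

    def __init__(self, rows: List[dict]) -> None:
        self.rows = rows

    def read(self) -> Iterator[dict]:
        return iter(self.rows)


class ListSink:
    """Connector layer (hypothetical): collects written records in memory."""

    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, record: dict) -> None:
        self.rows.append(record)


@dataclass
class Framework:
    """Framework layer (hypothetical): type conversion and dirty-data collection."""

    converters: Dict[str, Callable]  # column name -> conversion function
    dirty: List[Tuple[dict, str]] = field(default_factory=list)

    def convert(self, record: dict) -> Optional[dict]:
        try:
            return {col: conv(record[col]) for col, conv in self.converters.items()}
        except (KeyError, TypeError, ValueError) as exc:
            # A record that cannot be converted is set aside, not fatal.
            self.dirty.append((record, repr(exc)))
            return None


def run_job(source: ListSource, sink: ListSink, framework: Framework) -> int:
    """Engine layer stand-in: a single loop instead of a distributed runtime."""
    written = 0
    for record in source.read():
        converted = framework.convert(record)
        if converted is not None:
            sink.write(converted)
            written += 1
    return written
```

Because the connectors only expose `read` and `write`, swapping the loop in `run_job` for a different runtime would not require changing them, which is the point of keeping the engine as its own layer.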

Recent Feature Highlights

Data synchronization architecture with clear separation of source, sink and engine.

Modular code structure mirroring the three layers, enabling plug‑and‑play connectors.

Multi‑engine support to reduce reliance on Flink and lower operational costs.

Evolution from ETL → ELT → EtLT, introducing a Transform module for lightweight field‑level operations and dimension‑table joins.

Automated testing engine that builds M×N connector combinations, runs distributed tests, and integrates with CI/CD.

CDC (Change Data Capture) Solution

Captures changes from the database's transaction log (the binlog in MySQL, with equivalent logs in SQL Server, PostgreSQL, etc.), delivering changes with lower latency and less impact on production systems than query‑based extraction.

Use cases include syncing to data warehouses (Hive, ClickHouse), MPP databases (Doris, StarRocks) for near‑real‑time analytics, and Elasticsearch for online search.

Full‑library (whole‑database) sync workflow: (1) CDC Batch performs the initial bulk load, (2) real‑time incremental changes are captured via Debezium → Kafka, (3) data is merged at partition level, (4) sink operators write the change logs downstream.

Benefits: reduced latency (seconds‑level), lower operational complexity (automatic table creation, single‑task orchestration), simplified handling of sharding, and strong consistency through binlog positions and custom ordering.
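
The merge step, and the role binlog positions play in consistency, can be illustrated with a small sketch. This is not BitSail's implementation. It assumes a snapshot keyed by primary key, change events carrying a `(binlog file, offset)` position, zero-padded file names so that tuples compare in log order, and operation codes `c`/`u`/`d` in the style of Debezium events. `ChangeEvent` and `merge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

Position = Tuple[str, int]  # (binlog file name, byte offset); assumed comparable


@dataclass(frozen=True)
class ChangeEvent:
    op: str               # "c" = create, "u" = update, "d" = delete
    key: int              # primary key of the affected row
    row: Optional[dict]   # row image after the change; None for deletes
    position: Position    # where the change sits in the log


def merge(
    snapshot: Dict[int, dict],
    events: Iterable[ChangeEvent],
    snapshot_position: Position,
) -> Dict[int, dict]:
    """Apply change events to a bulk-load snapshot in log order."""
    merged = dict(snapshot)
    for event in sorted(events, key=lambda e: e.position):
        if event.position <= snapshot_position:
            # Already reflected in the snapshot; applying it again could
            # overwrite newer data with an older row image.
            continue
        if event.op == "d":
            merged.pop(event.key, None)
        else:
            merged[event.key] = event.row
    return merged
```

Sorting by position before applying means events that arrive out of order (for example from different Kafka partitions) still produce the same final state, and the position check makes replaying old events harmless.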

Future Outlook

Expand connector ecosystem and provide lighter distributed compute engines.

Enhance CDC capabilities with automatic DDL sync and end‑to‑end data consistency verification.

Q&A Highlights

A custom engine, currently at the proof‑of‑concept (POC) stage, shows promising performance gains over Flink‑based sync.

Flink is a heavyweight dependency, which leads to wasted resources in pure data‑integration scenarios.

Both record‑count and byte‑based flow control are supported.

CDC full‑library sync on Volcano Engine offers wizard‑style configuration and scales with data volume.

Exactly‑once semantics and state handling ensure data accuracy in real‑time pipelines.
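
As an illustration of combining record-count and byte-based flow control, the sketch below uses two token buckets and waits for whichever limit is tighter. The token-bucket technique is a stand-in chosen for this example; the source does not say how BitSail implements its limits. `TokenBucket`, `FlowController` and their parameters are hypothetical names, and the clock and sleep functions are injectable so the behavior can be checked without real delays.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `capacity` tokens."""

    def __init__(self, rate: float, capacity: float, clock: Callable[[], float]) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def _refill(self) -> None:
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def wait_time(self, amount: float) -> float:
        """Seconds until `amount` tokens are available (0.0 if available now)."""
        self._refill()
        return max(0.0, (amount - self.tokens) / self.rate)

    def consume(self, amount: float) -> None:
        self._refill()
        # May go negative for oversized requests; later callers then wait longer.
        self.tokens -= amount


class FlowController:
    """Limits throughput by records per second and by bytes per second."""

    def __init__(
        self,
        records_per_sec: float,
        bytes_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.records = TokenBucket(records_per_sec, records_per_sec, clock)
        self.bytes = TokenBucket(bytes_per_sec, bytes_per_sec, clock)
        self.sleep = sleep

    def acquire(self, record_bytes: int) -> float:
        """Block until one record of `record_bytes` may pass; return the delay."""
        delay = max(self.records.wait_time(1), self.bytes.wait_time(record_bytes))
        if delay > 0:
            self.sleep(delay)
        self.records.consume(1)
        self.bytes.consume(record_bytes)
        return delay
```

A reader loop would call `acquire(len(payload))` before emitting each record. Many small records hit the record limit first, while a few large ones hit the byte limit first, which is why supporting both kinds of limit is useful.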

Overall, BitSail provides a comprehensive, open‑source solution for large‑scale data integration, combining flexible architecture, robust CDC capabilities, and automated testing to meet modern big‑data demands.
