Design and Implementation of a Binlog‑Based Real‑Time Data Foundation Layer for Ctrip Finance

This article describes how Ctrip Finance built a unified financial data center by collecting MySQL binlog streams with Canal, transporting them via Kafka, persisting to HDFS with Spark‑Streaming, and merging into Hive tables, while addressing performance, idempotency, delete handling, and data‑quality checks.

Ctrip Technology

Apr 1, 2021

Background – In September 2017, Ctrip Finance needed to build a unified financial data center that synchronized thousands of MySQL tables across multiple data centers to offline and online warehouses. The existing offline sync tool, DataX, could not meet the cross‑region, low‑latency, high‑accuracy requirements.

Solution Overview – A binlog‑driven data foundation layer was designed, consisting of a web UI for configuration, Canal for binlog capture, Kafka for multi‑region transport, Spark‑Streaming for persisting to HDFS, and a merge process to generate MySQL‑Hive snapshots.

1. Binlog Collection – Canal (Alibaba open‑source) captures the MySQL binlog at the instance level, converts the raw binlog to a simplified format, and pushes the messages to Kafka. High availability is achieved via ZooKeeper ephemeral nodes. One Kafka topic is created per MySQL instance, and schemaName+tableName is used as the partition key, so all events for a given table land in the same partition and keep their order.

Producer parameters used:

max.in.flight.requests.per.connection=1
retries=0
acks=all

Topic configuration example:

topic partition: 3 replicas
min.insync.replicas=2
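The effect of keying messages by schemaName+tableName can be sketched as follows. This is only an illustration: the `partition_for` helper, the `.` separator, and the use of CRC32 are assumptions made for the example (Kafka's default partitioner uses a murmur2 hash of the key), but the property shown is the same: a given table always maps to the same partition.

```python
import zlib


def partition_for(schema_name: str, table_name: str, num_partitions: int) -> int:
    """Map a table to a fixed partition so its events stay in order.

    Hypothetical helper: CRC32 stands in for Kafka's real key hashing.
    """
    key = f"{schema_name}.{table_name}".encode("utf-8")
    return zlib.crc32(key) % num_partitions


if __name__ == "__main__":
    # The same table is always routed to the same partition.
    first = partition_for("loan", "orders", 3)
    second = partition_for("loan", "orders", 3)
    print(first == second)
```

Because ordering in Kafka is only guaranteed within a partition, routing by table is what lets the downstream merge rely on the order of events per table.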

The simplified binlog message format includes fields such as binlogOffset, executeTime, eventType, schemaName, tableName, source, version, and content.
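As a rough sketch, the message could be modelled like this. The field names come from the list above; the field types, the epoch-millisecond timestamp, the event type strings, and the JSON encoding are assumptions made for illustration, not the format actually used by the pipeline.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class SimpleBinlog:
    binlogOffset: int         # position of the event in the binlog (type assumed)
    executeTime: int          # epoch milliseconds (assumed unit)
    eventType: str            # e.g. "INSERT", "UPDATE", "DELETE" (assumed values)
    schemaName: str
    tableName: str
    source: str               # origin of the message (assumed meaning)
    version: int              # message format version (assumed meaning)
    content: Dict[str, Any]   # column name -> value for the affected row

    def to_json(self) -> str:
        """Serialize the message; JSON is an assumed wire format."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    msg = SimpleBinlog(1024, 1617235200000, "INSERT", "loan", "orders",
                       "BINLOG", 1, {"id": 1, "amount": "99.50"})
    print(msg.to_json())
```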

2. Historical Data Replay – To back‑fill existing data or recover from failures, a mock service generates simple binlog messages from batch MySQL queries and sends them to Kafka, ensuring timestamps are earlier than real‑time data and partitioning is based on executeTime.
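A minimal sketch of such a mock generator is shown below. The function name `mock_binlog`, the `"MOCK"` source tag, the `"INSERT"` event type, and the zero `binlogOffset` are all hypothetical choices for this example; the only behaviour taken from the text is that replayed rows are wrapped as simple binlog messages with a timestamp earlier than the real-time stream.

```python
from typing import Any, Dict, Iterable, Iterator


def mock_binlog(rows: Iterable[Dict[str, Any]], schema_name: str,
                table_name: str, replay_time_ms: int) -> Iterator[Dict[str, Any]]:
    """Wrap rows from a batch MySQL query as simple binlog messages.

    The caller must pick replay_time_ms earlier than the first
    real-time event for the table.
    """
    for row in rows:
        yield {
            "binlogOffset": 0,           # assumed placeholder below any real offset
            "executeTime": replay_time_ms,
            "eventType": "INSERT",       # assumption: replayed rows look like inserts
            "schemaName": schema_name,
            "tableName": table_name,
            "source": "MOCK",            # hypothetical tag marking replayed data
            "version": 1,
            "content": dict(row),
        }


if __name__ == "__main__":
    rows = [{"id": 1, "status": "PAID"}, {"id": 2, "status": "NEW"}]
    for message in mock_binlog(rows, "loan", "orders", 1500000000000):
        print(message)
```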

3. Write to HDFS – Spark‑Streaming consumes Kafka messages in 5‑minute micro‑batches, writes them to HDFS, and commits offsets only after successful persistence (at‑least‑once guarantee). Data skew is mitigated by splitting large tables into multiple HDFS files using a random suffix.
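The write-then-commit ordering and the random file suffix can be sketched in plain Python, without Spark or Kafka. The helpers `process_batch` and `output_file`, and the `table/part-N` naming, are hypothetical; `write` and `commit` stand in for the HDFS write and the Kafka offset commit.

```python
import random
from typing import Callable, List, Optional, Tuple


def process_batch(messages: List[Tuple[int, str]],
                  write: Callable[[List[str]], None],
                  commit: Callable[[int], None]) -> None:
    """Persist one micro-batch, then commit the next offset to read.

    If write() raises, commit() is never called, so the batch is read
    again after a restart: duplicates are possible, loss is not.
    """
    if not messages:
        return
    write([payload for _, payload in messages])
    commit(messages[-1][0] + 1)


def output_file(table: str, num_splits: int,
                rng: Optional[random.Random] = None) -> str:
    """Spread a large table over several files with a random suffix."""
    if rng is None:
        rng = random.Random()
    return f"{table}/part-{rng.randrange(num_splits)}"


if __name__ == "__main__":
    written: List[str] = []
    committed: List[int] = []
    process_batch([(5, "a"), (6, "b")], written.extend, committed.append)
    print(written, committed)
    print(output_file("orders", 4))
```

Since a batch can be written twice after a failure, the downstream merge has to be idempotent, which is why it keeps only the latest record per primary key.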

4. Merge and Snapshot Generation – A daily merge job loads the previous day’s simple binlog partition, checks schema changes, extracts incremental data, and merges it with the existing Hive snapshot using row_number over binlogOffset to keep the latest record per primary key. Delete handling distinguishes normal deletes from archive‑driven bulk deletes via thresholds on volume and age.
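The core of the merge can be sketched in plain Python as the equivalent of keeping `row_number() = 1` per primary key when ordered by binlogOffset descending. The function name `merge_snapshot` and the `id` primary key are assumptions for this example, and every DELETE is treated as a normal delete here; the threshold logic for archive-driven bulk deletes is not modelled.

```python
from typing import Any, Dict, List


def merge_snapshot(snapshot: List[Dict[str, Any]],
                   increments: List[Dict[str, Any]],
                   pk: str = "id") -> List[Dict[str, Any]]:
    """Apply one day's simple binlog events to the previous snapshot."""
    # Keep only the event with the highest binlogOffset per primary key.
    latest: Dict[Any, Dict[str, Any]] = {}
    for event in increments:
        key = event["content"][pk]
        if key not in latest or event["binlogOffset"] > latest[key]["binlogOffset"]:
            latest[key] = event

    merged = {row[pk]: row for row in snapshot}
    for key, event in latest.items():
        if event["eventType"] == "DELETE":
            merged.pop(key, None)          # row no longer exists in MySQL
        else:
            merged[key] = event["content"]  # insert or overwrite with latest image
    return sorted(merged.values(), key=lambda row: row[pk])


if __name__ == "__main__":
    snapshot = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
    increments = [
        {"binlogOffset": 10, "eventType": "UPDATE", "content": {"id": 1, "v": "c"}},
        {"binlogOffset": 5, "eventType": "UPDATE", "content": {"id": 1, "v": "x"}},
        {"binlogOffset": 7, "eventType": "DELETE", "content": {"id": 2, "v": "b"}},
        {"binlogOffset": 8, "eventType": "INSERT", "content": {"id": 3, "v": "n"}},
    ]
    print(merge_snapshot(snapshot, increments))
```

Running the merge twice on the same increments gives the same snapshot, so duplicate messages from the at-least-once write path do not change the result.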

5. Data Quality Check – After merge, Hive tables are compared with MySQL tables on key fields (e.g., createTime) for the last 7 days. Discrepancies caused by binlog timing or table archiving are detected and addressed.
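One simple way to run such a check is to compare per-day row counts on createTime for the last 7 days. Comparing counts, rather than individual rows or other key fields, is an assumption made for this sketch, as are the helper names `daily_counts` and `find_mismatches`.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def daily_counts(create_dates: Iterable[date], today: date, days: int = 7) -> Counter:
    """Count rows per creation date within the last `days` days (today excluded)."""
    start = today - timedelta(days=days)
    return Counter(d for d in create_dates if start <= d < today)


def find_mismatches(mysql_dates: Iterable[date], hive_dates: Iterable[date],
                    today: date, days: int = 7) -> Dict[date, Tuple[int, int]]:
    """Return {day: (mysql_count, hive_count)} for days where the counts differ."""
    mysql = daily_counts(mysql_dates, today, days)
    hive = daily_counts(hive_dates, today, days)
    return {d: (mysql[d], hive[d])
            for d in sorted(set(mysql) | set(hive)) if mysql[d] != hive[d]}


if __name__ == "__main__":
    today = date(2021, 4, 1)
    mysql_rows = [date(2021, 3, 30), date(2021, 3, 30), date(2021, 3, 31)]
    hive_rows = [date(2021, 3, 30), date(2021, 3, 31)]
    print(find_mismatches(mysql_rows, hive_rows, today))
```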

6. Additional Governance – The pipeline can incorporate plaintext detection, field standardization, and metadata management to support data governance and lineage.

Conclusion & Outlook – The solution successfully built an ODS layer covering thousands of tables with T+1 latency, a real‑time warehouse on Kudu, and an online cache serving up to 1 million requests per minute. Future work includes one‑click configuration, intelligent operations, and richer metadata services.
