
How Ctrip Achieved Minute‑Level Real‑Time Analytics with Flink CDC & Apache Paimon

Ctrip transformed its traditional T+1 offline warehouse into a near-real-time lakehouse by integrating Flink CDC with Apache Paimon. The team designed a two-stage CDC ingestion pipeline, optimized its performance, implemented dynamic updates, and rolled the solution out across multiple business scenarios, achieving minute-level latency, lower costs, and faster data-driven decisions.

Ctrip Technology

Introduction

Ctrip’s legacy offline data warehouse operated on a T+1 (day-level) or T+1H (hour-level) schedule, which could no longer satisfy the growing demand for near-real-time analytics as the business expanded globally. The core pain points were high development and operations costs, a fragmented Lambda architecture, and insufficient timeliness.

Architecture Design

The solution centers on Flink as the stream processing engine and Apache Paimon as the lakehouse storage layer. Change data capture (CDC) from the MySQL databases is handled by Flink CDC, which writes incremental records into an ODS-level Paimon table. Downstream consumers can choose among several paths:

Real-time/near-real-time ETL: Flink jobs continuously consume Paimon tables and perform stream processing.

High-speed OLAP queries: Results are materialized or accessed directly via StarRocks.

Ad-hoc and batch analysis: Trino or Apache Spark query the lakehouse for flexible, large-scale analytics.
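To make the write path concrete, here is a minimal sketch, using Flink's Table API for Java, of registering a Paimon catalog and creating an ODS-level table with a primary key so that CDC records can be upserted. The warehouse path, database, table, and column names are illustrative assumptions, not Ctrip's actual schema.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PaimonOdsSetup {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Paimon catalog backed by the lakehouse storage (HDFS path is a placeholder).
        tEnv.executeSql(
                "CREATE CATALOG paimon WITH (" +
                "  'type' = 'paimon'," +
                "  'warehouse' = 'hdfs:///lakehouse/paimon'" +
                ")");
        tEnv.executeSql("USE CATALOG paimon");
        tEnv.executeSql("CREATE DATABASE IF NOT EXISTS ods");

        // ODS-level table that receives the CDC changelog; the primary key enables upserts.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS ods.orders (" +
                "  order_id    BIGINT," +
                "  status      STRING," +
                "  amount      DECIMAL(10, 2)," +
                "  update_time TIMESTAMP(3)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'bucket' = '4'" +   // fixed bucket count; dynamic buckets are also possible
                ")");
    }
}
```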

System Architecture Diagram

Two‑Stage CDC Ingestion

Because Ctrip’s MySQL instances run in a master–slave–DR-slave topology, the CDC pipeline must respect three constraints: reads are allowed only from slave nodes, each MySQL instance allows a single binlog-reading thread, and each account is limited to 40 concurrent connections. To avoid exhausting the single binlog reader, a “shared source, independent sink” design was adopted.

The CDC workflow is split into two independent phases:

Source phase: A shared Flink job, managed centrally by the platform, reads the binlog from the designated MySQL instance and publishes incremental records to Kafka. This job is transparent to end users and guarantees compliant, efficient reuse of database resources.

Sink phase: User-managed Flink jobs consume the Kafka topic and write to the target Paimon tables. Users can start, stop, and configure these jobs independently.
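As a rough sketch of what the source phase might look like (assuming Flink CDC 2.x and the Flink Kafka sink connector), the job below reads the binlog from a designated slave and republishes each change record to Kafka as Debezium-style JSON. Hostnames, credentials, database names, and the topic are placeholders; the real shared job is centrally managed and carries far more configuration.

```java
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedBinlogSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // One binlog reader per MySQL instance, pointed at the designated slave node.
        MySqlSource<String> source = MySqlSource.<String>builder()
                .hostname("mysql-slave.example.internal")
                .port(3306)
                .databaseList("orders_db")
                .tableList("orders_db.*")                  // tables currently registered
                .username("cdc_reader")
                .password("******")
                .startupOptions(StartupOptions.latest())   // incremental changes only
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        // Republish the raw change records to Kafka for the user-managed sink jobs.
        KafkaSink<String> kafka = KafkaSink.<String>builder()
                .setBootstrapServers("kafka.example.internal:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("cdc.orders_db")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "mysql-binlog")
                .sinkTo(kafka);
        env.execute("shared-cdc-source");
    }
}
```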

The sink supports three synchronization modes:

Full + incremental: On first launch, a snapshot sync is performed; the job then switches to consuming incremental data from Kafka.

Pure incremental: The job consumes only incremental changes from Kafka (a minimal sink sketch in this mode follows the list).

Backfill mode: For error recovery, a filtered full + incremental sync is executed.
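A minimal sketch of the sink phase in pure-incremental mode, again in Flink SQL via the Table API: a user-managed job reads the Debezium-style changelog from the Kafka topic and continuously upserts it into the Paimon ODS table from the earlier sketch. The topic, format, and connection settings are assumptions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaToPaimonSink {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql(
                "CREATE CATALOG paimon WITH (" +
                "  'type' = 'paimon'," +
                "  'warehouse' = 'hdfs:///lakehouse/paimon'" +
                ")");

        // Changelog topic produced by the shared source job.
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE kafka_orders (" +
                "  order_id    BIGINT," +
                "  status      STRING," +
                "  amount      DECIMAL(10, 2)," +
                "  update_time TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'cdc.orders_db'," +
                "  'properties.bootstrap.servers' = 'kafka.example.internal:9092'," +
                "  'properties.group.id' = 'paimon-sink-orders'," +
                "  'scan.startup.mode' = 'latest-offset'," +   // pure incremental
                "  'format' = 'debezium-json'" +
                ")");

        // Continuous upsert into Paimon; the primary key on order_id drives the merge.
        tEnv.executeSql(
                "INSERT INTO paimon.ods.orders " +
                "SELECT order_id, status, amount, update_time FROM kafka_orders")
            .await();
    }
}
```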

CDC Synchronization Flowchart

Production Optimizations & Challenges

1) Sync‑link performance

Initial deployments suffered from high latency because both deserialization and Kafka writes ran in a single thread. Decoupling binlog parsing from Kafka publishing, with the primary key used as the Kafka message key for partitioning, increased throughput nearly tenfold.
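One way to express the "primary key as Kafka key" part of this change, sketched under the assumption that the shared source republishes JSON change records as strings: keying each record by its primary key routes all changes for the same row to the same partition, so per-row ordering is preserved even though parsing and publishing run in parallel. The key-extraction helper below is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;

public class KeyedCdcSerializer {

    // Hypothetical helper: a real version would parse the Debezium JSON and return
    // the primary-key value of the change record.
    static String extractPrimaryKey(String jsonRecord) {
        return jsonRecord;
    }

    static KafkaRecordSerializationSchema<String> keyedByPrimaryKey(String topic) {
        // Records with the same key land in the same partition, preserving per-row order.
        SerializationSchema<String> keySchema =
                record -> extractPrimaryKey(record).getBytes(StandardCharsets.UTF_8);

        return KafkaRecordSerializationSchema.builder()
                .setTopic(topic)
                .setKeySerializationSchema(keySchema)
                .setValueSerializationSchema(new SimpleStringSchema())
                .build();
    }
}
```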

Flink Source Optimization Before/After

2) Stability & backfill

To guarantee data completeness after job failures, two backfill strategies were evaluated:

Replay archived binlog files (full logical restoration, but requires DBA support and raises security concerns).

Timestamp‑based backfill using a DataLastChange_time column (simpler, but cannot restore physical deletions).

Ctrip adopted the timestamp approach: users specify a start time, and the job automatically switches to incremental consumption once the historical range has been processed.
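A minimal sketch of what such a backfill can look like, assuming Flink's JDBC connector and the Paimon ODS table from the earlier sketches: a bounded job re-reads every row whose DataLastChange_time falls after the user-specified start time and upserts it into Paimon, after which the regular incremental job takes over. Connection details and the start timestamp are placeholders; as noted above, rows physically deleted during the window cannot be restored this way.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TimestampBackfill {
    public static void main(String[] args) throws Exception {
        // Bounded (batch) re-read of the affected time window.
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        tEnv.executeSql(
                "CREATE CATALOG paimon WITH (" +
                "  'type' = 'paimon'," +
                "  'warehouse' = 'hdfs:///lakehouse/paimon'" +
                ")");

        // Source table exposed through the JDBC connector (details are placeholders).
        tEnv.executeSql(
                "CREATE TEMPORARY TABLE mysql_orders (" +
                "  order_id            BIGINT," +
                "  status              STRING," +
                "  amount              DECIMAL(10, 2)," +
                "  DataLastChange_time TIMESTAMP(3)" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://mysql-slave.example.internal:3306/orders_db'," +
                "  'table-name' = 'orders'," +
                "  'username' = 'cdc_reader'," +
                "  'password' = '******'" +
                ")");

        // Re-upsert every row touched since the user-specified start time.
        tEnv.executeSql(
                "INSERT INTO paimon.ods.orders " +
                "SELECT order_id, status, amount, DataLastChange_time " +
                "FROM mysql_orders " +
                "WHERE DataLastChange_time >= TIMESTAMP '2024-06-01 00:00:00'")
            .await();
    }
}
```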

Backfill Architecture Diagram

3) Platform usability

Adding new tables previously required restarting the shared source job, causing downstream jitter. A hot‑update mechanism for the table‑name parameter was built, enabling the source to start listening to new tables without a restart.
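The mechanism itself is internal to Ctrip's platform, but its general shape can be sketched: the source subscribes broadly, and a small poller refreshes the set of tables whose records are actually forwarded, so adding a table only requires a configuration change rather than a job restart. The class below is a generic illustration under those assumptions, not the actual implementation; the configuration lookup is stubbed out.

```java
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HotTableFilter {

    private final Set<String> allowedTables = new CopyOnWriteArraySet<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public HotTableFilter(Set<String> initialTables) {
        allowedTables.addAll(initialTables);
        // Periodically reload the table list so new tables are picked up without a restart.
        scheduler.scheduleAtFixedRate(this::reload, 30, 30, TimeUnit.SECONDS);
    }

    private void reload() {
        Set<String> latest = fetchTableListFromConfig();
        allowedTables.addAll(latest);     // start forwarding newly added tables
        allowedTables.retainAll(latest);  // stop forwarding tables that were removed
    }

    // Stub: a real version would call the platform's configuration service.
    private Set<String> fetchTableListFromConfig() {
        return Set.copyOf(allowedTables);
    }

    /** Called for every change record with its "db.table" identifier. */
    public boolean accept(String qualifiedTableName) {
        return allowedTables.contains(qualifiedTableName);
    }
}
```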

Flink Source Hot-Update Mechanism

4) Engine‑side optimizations

Paimon bit-type conversion: Fixed bugs where float conversion and bit-to-boolean mapping were incorrect.

Schema cache: Implemented a one-minute in-memory cache for Paimon schemas to avoid costly HDFS lookups during job startup (see the sketch after this list).

Hybrid Source fast switch: Enabled a configuration to start from the last Kafka source after a backfill, ensuring seamless checkpoint compatibility.

Bucket configuration: Provided dynamic bucket step-size control to balance write performance and storage overhead.
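The schema cache can be pictured as an ordinary expiring cache placed in front of the expensive lookup (an HDFS read in Paimon's case). The sketch below is a generic pattern under that assumption, not Paimon's or Ctrip's internal code: entries are reused for sixty seconds and refreshed on the first access after expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ExpiringSchemaCache<K, V> {
    private static final long TTL_MILLIS = 60_000L;   // one-minute freshness window

    private static final class Cached<T> {
        final T value;
        final long loadedAt;
        Cached(T value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final Map<K, Cached<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // expensive lookup, e.g. reading schema files from HDFS

    public ExpiringSchemaCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        final long now = System.currentTimeMillis();
        return cache.compute(key, (k, old) ->
                old != null && now - old.loadedAt < TTL_MILLIS
                        ? old                                  // still fresh, reuse it
                        : new Cached<>(loader.apply(k), now)   // expired or missing, reload
        ).value;
    }
}
```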

Paimon Schema Cache Diagram

Incremental Computation

Beyond Flink, the lakehouse supports incremental reads via Spark and Trino. Trino can perform real-time queries, write-backs, and compaction on Paimon tables, delivering low-latency, high-concurrency analytics. Making this work reliably required several consistency fixes (a small ad-hoc query sketch follows the list):

Data distribution consistency: Fixed mismatches between Trino's fixed-bucket strategy and Paimon's writer.

BucketFunction thread safety: Ensured the bucket function is safe for concurrent use.

Catalog schema consistency: Centralized schema fetching to avoid divergent views across the coordinator and workers.
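For ad-hoc access, analysts can reach the same Paimon tables through Trino's JDBC driver. The snippet below is a minimal sketch assuming a Trino catalog named paimon, a schema named ods, and the trino-jdbc driver on the classpath; host, user, and table names are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TrinoAdhocQuery {
    public static void main(String[] args) throws Exception {
        // Trino JDBC URL: catalog "paimon", schema "ods" (both are assumptions).
        String url = "jdbc:trino://trino.example.internal:8080/paimon/ods";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, count(*) AS cnt FROM orders GROUP BY status")) {
            while (rs.next()) {
                System.out.printf("%s -> %d%n", rs.getString("status"), rs.getLong("cnt"));
            }
        }
    }
}
```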

Application Practices

Business A – Cross‑Timezone Performance Data

The new pipeline reduced ODS‑to‑ADM latency to under five minutes, enabling minute‑level performance dashboards across time zones.

Business A Architecture

Minute‑level ingestion (≤5 min, configurable to 1 min).

Reduced pipeline complexity to a single real‑time job.

Lower storage cost thanks to LSM‑based snapshot management.

Business B – Real‑Time Marketing Dashboard

By leveraging Paimon’s partial‑update capability, the solution avoided costly joins and large checkpoint states, delivering a marketing dashboard with minute‑level freshness.
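Partial update is a Paimon merge engine: several streams write different columns of the same primary key, and Paimon assembles the wide row at merge time, so no stateful multi-stream join is needed. The DDL below is a minimal sketch of such a table; the column layout, which stream writes which column, and the warehouse path are illustrative assumptions rather than Ctrip's actual schema.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PartialUpdateTableSetup {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        tEnv.executeSql(
                "CREATE CATALOG paimon WITH (" +
                "  'type' = 'paimon'," +
                "  'warehouse' = 'hdfs:///lakehouse/paimon'" +
                ")");
        tEnv.executeSql("USE CATALOG paimon");

        // Each upstream stream writes only its own columns for a given user_id;
        // Paimon merges the pieces into one wide row, so no stateful join is needed.
        tEnv.executeSql(
                "CREATE TABLE IF NOT EXISTS marketing_wide (" +
                "  user_id       BIGINT," +
                "  order_cnt     BIGINT," +          // written by the order stream
                "  coupon_amount DECIMAL(10, 2)," +  // written by the coupon stream
                "  last_visit    TIMESTAMP(3)," +    // written by the traffic stream
                "  PRIMARY KEY (user_id) NOT ENFORCED" +
                ") WITH (" +
                "  'merge-engine' = 'partial-update'," +
                "  'changelog-producer' = 'lookup'" +
                ")");
    }
}
```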

Business B Marketing Dashboard

Eliminated expensive multi‑stream joins.

Prevented checkpoint bloat and job instability.

Aligned data freshness with overseas staff working hours.

Business C – Minute‑Level Order Attribution

The pipeline integrated seven MySQL tables, performed multi‑stream joins, and delivered order attribution results within eight minutes, an eight‑fold latency reduction.

Business C Attribution Architecture

Results Summary

Data freshness achieved at minute‑level (5‑30 min).

End‑to‑end latency improved from days to minutes.

Native upsert support enabled real‑time order status and user profile updates.

Incremental computation reduced full‑batch resource consumption and cut costs.

Future Plans

Establish a minute‑level SLA framework with multi‑layer monitoring and alerts.

Enhance Paimon table governance (metadata lineage, schema change tracking, automated compaction, data expiration).

Scale the near‑real‑time lakehouse to more core warehousing scenarios, propagating best practices across the organization.

data engineering · Flink · Real-time analytics · Paimon · CDC
Written by Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.