How Ctrip Business Travel Built a Near‑Real‑Time Lakehouse with Flink CDC & Paimon
This article details Ctrip Business Travel’s implementation of a near‑real‑time data warehouse using Flink CDC and the Paimon lakehouse engine, covering order wide‑table construction, ticket refund alerts, ad attribution, batch‑stream integration, and practical lessons on Partial Update, Aggregation, and Tag‑based incremental processing.
Background
Ctrip Business Travel provides an end‑to‑end travel management platform for enterprise customers, covering flights, hotels, car rentals, and trains. Rapid business growth and richer product forms have increased data volume and complexity, making the previous T+1 offline warehouse insufficient for near‑real‑time analytics. Traditional Kafka‑Flink real‑time warehouses could only support limited scenarios and lacked a direct analysis layer.
Near‑Real‑Time Data Warehouse Construction
Order Wide‑Table Development
The order detail wide table is a core application table for the order management module, originally built with hourly offline jobs. To meet higher freshness requirements, the team evaluated Flink+Kafka but faced instability, high latency, and complex multi‑stream joins. Introducing Paimon addressed these issues: its unified lakehouse storage supports Upsert updates, Partial Update, and dynamic writes, greatly improving stability and development efficiency.
Partial Update Based Order Product Wide Table
Product information tables for train tickets, flight tickets, and generic products share the same primary key (col1, col2). A Paimon wide table with the merge‑engine set to partial‑update and a sequence group to control update order aggregates these streams into a single order product wide table.
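As a sketch of this pattern (table and column names here are illustrative, not the team's actual schema), the wide table can be declared in Flink SQL with the partial‑update merge engine, where each source stream writes only its own columns and a per‑stream sequence field guards against out‑of‑order updates:

```sql
-- Hypothetical sketch of a Paimon partial-update wide table.
-- Three source streams share the primary key (col1, col2); each
-- stream fills in only its own column group.
CREATE TABLE order_product_wide (
  col1 STRING,
  col2 STRING,
  train_info  STRING,   -- written by the train-ticket stream
  train_seq   BIGINT,   -- sequence field for the train column group
  flight_info STRING,   -- written by the flight-ticket stream
  flight_seq  BIGINT,   -- sequence field for the flight column group
  PRIMARY KEY (col1, col2) NOT ENFORCED
) WITH (
  'merge-engine' = 'partial-update',
  -- A column group only updates when its own sequence field advances,
  -- so late-arriving events cannot overwrite newer values.
  'fields.train_seq.sequence-group'  = 'train_info',
  'fields.flight_seq.sequence-group' = 'flight_info'
);
```

Each upstream job then simply INSERTs its subset of columns into this table; Paimon merges the partial rows by primary key at read or compaction time.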
Aggregation Based Intermediate Wide Table
When source tables have different primary keys, the team used Paimon's Aggregation engine with the nested_update function to merge streams into a wide table keyed by col1. This approach replaces traditional Group By and enables flexible multi‑dimensional aggregation.
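A minimal sketch of the nested_update pattern, with hypothetical table and column names: detail rows keyed by a child key are collected into a nested array column of a wide table keyed by col1.

```sql
-- Hypothetical sketch of a Paimon aggregation wide table that
-- collects child rows (with a different primary key) into a
-- nested column of the parent table.
CREATE TABLE order_agg_wide (
  col1 STRING,
  sub_orders ARRAY<ROW<sub_id STRING, amount DECIMAL(10, 2)>>,
  PRIMARY KEY (col1) NOT ENFORCED
) WITH (
  'merge-engine' = 'aggregation',
  'fields.sub_orders.aggregate-function' = 'nested_update',
  -- Rows sharing the same nested key replace each other inside
  -- the array instead of accumulating duplicates.
  'fields.sub_orders.nested-key' = 'sub_id'
);
```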
Lookup Join for Dimension Data
Instead of external stores like HBase or Redis, Paimon tables serve as dimension tables for Lookup Join, simplifying the real‑time pipeline and maintaining acceptable performance when dimension data volume is modest.
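In Flink SQL this takes the standard processing‑time temporal join form; the sketch below uses hypothetical table names, and assumes the fact stream declares a processing‑time attribute (e.g. `proc_time AS PROCTIME()`):

```sql
-- Hypothetical sketch: a Paimon primary-key table (corp_dim) used
-- directly as the dimension side of a Flink lookup join, replacing
-- an external store such as HBase or Redis.
SELECT
  o.order_id,
  o.corp_id,
  d.dept_name
FROM orders AS o
JOIN corp_dim FOR SYSTEM_TIME AS OF o.proc_time AS d
  ON o.corp_id = d.corp_id;
```

Because the dimension table lives in the same lake storage as the fact data, there is no extra sync pipeline to an external key‑value store; this stays performant while the dimension table remains modest in size.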
Ticket Refund Reminder Optimization
The original T‑1 offline job pre‑computed next‑day reminders, causing data delays of over two days. By redesigning the pipeline with Flink CDC and Paimon for near‑real‑time processing, and combining it with a Spark batch job for final filtering, the system achieves minute‑level freshness while ensuring stability.
Ad Order Attribution Real‑Time Reporting
Advertising attribution requires matching orders with click logs from the previous three days. The solution uses Paimon to store seven MySQL source tables, applying Partial Update and Aggregation to avoid costly multi‑stream joins, and employs partition‑level expiration to keep click logs for exactly three days. The final results are written to a Paimon table and reported via an internal SOA service, keeping end‑to‑end latency under eight minutes.
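The three‑day retention can be expressed with Paimon's partition expiration options on the click‑log table; the sketch below uses hypothetical names and assumes a day‑granularity partition column `dt`:

```sql
-- Hypothetical sketch: click-log table whose partitions expire
-- automatically after three days, bounding the attribution window.
CREATE TABLE click_log (
  click_id STRING,
  ad_id    STRING,
  click_ts TIMESTAMP(3),
  dt       STRING
) PARTITIONED BY (dt) WITH (
  'partition.expiration-time'           = '3 d',
  'partition.expiration-check-interval' = '1 h',
  -- how to parse the partition value into a timestamp
  'partition.timestamp-formatter'       = 'yyyyMMdd'
);
```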
Batch‑Stream Integration Practice
Because Flink’s batch capabilities cannot fully replace Spark, the team adopts a Lambda architecture: Flink CDC writes data to Paimon in real time, Spark reads the same Paimon ODS tables for batch processing, and Flink handles streaming, achieving a practical batch‑stream integration at the storage layer.
Tag‑Based Incremental Computation
Paimon’s Tag feature creates snapshots of tables that can be queried for incremental changes. By creating daily Tags on ODS tables and querying differences between Tags, the team achieves efficient incremental ETL, reducing processing time by 4‑5× compared to full‑load jobs.
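A hedged sketch of the workflow (table and tag names are illustrative): a daily tag is created via the `sys.create_tag` procedure, and the incremental read between two tags is expressed with the `incremental-between` scan option.

```sql
-- Hypothetical sketch: create today's tag on an ODS table,
-- pinning the current snapshot under a readable name.
CALL sys.create_tag('dp_lakehouse.ods_orders', '2024-06-02');

-- Incremental ETL then reads only the changes between
-- yesterday's tag and today's tag, instead of a full scan.
SELECT *
FROM dp_lakehouse.ods_orders
  /*+ OPTIONS('incremental-between' = '2024-06-01,2024-06-02') */;
```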
Future Plans
The current implementation still follows a Lambda architecture with separate compute and storage layers. Future work will explore deeper integration of Flink and Paimon to achieve true stream‑batch convergence and further simplify the data pipeline.
Configuration Example for Spark to Access Paimon
/opt/app/spark-3.2.0/bin/spark-sql \
--conf 'spark.sql.catalog.paimon_catalog=org.apache.paimon.spark.SparkGenericCatalog' \
--conf 'spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions' \
--conf 'spark.sql.storeAssignmentPolicy=ansi' \
-e "select * from paimon_catalog.dp_lakehouse.ods_xxx limit 10;"