Building an Offline‑Online Data Warehouse at Ctrip: Architecture, Goals, and Practices

This article presents Ctrip's practical experience of constructing an offline‑online data warehouse, detailing business pain points, objectives, system architecture, component design, data quality measures, and future directions to achieve scalable, real‑time data processing and management.

Ctrip Technology

Jul 20, 2023

The author, Chengrui, a backend development expert at Ctrip, introduces the motivation behind building an offline‑online data warehouse for the travel team, focusing on real‑time data processing, AI platform foundations, and data products.

Business Pain Points include fragmented real‑time development, poor data reuse, disjointed offline‑online pipelines, long production‑service cycles, chaotic tables and tasks, missing lineage and monitoring, and lack of quality control tools.

Business Goals aim to improve efficiency, quality, and management by standardizing data development, enabling minute‑level data deployment for BI users, and providing visual management of lineage, tables, and quality coverage.

System Architecture consists of five modules arranged as a pipeline: raw data → data development → data service → data quality → data management. Together they deliver real‑time processing with latency on the order of seconds and deployment within minutes. Data flows through standardized ETL, traffic forwarding, stream‑batch fusion, and API exposure, with built‑in quality and operational safeguards.

Project Construction covers four main components:

Data Development: includes traffic forwarding tools to unify multiple sources, reduce redundant processing, and standardize data before downstream ingestion.
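As a rough illustration of the standardization step, the sketch below maps records from two hypothetical upstream sources onto one common shape before they are forwarded. The source names (`app_click`, `web_pv`), the raw field names, and the `normalize` helper are assumptions made for this example; the article does not describe Ctrip's actual schema or tooling at this level.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class UnifiedEvent:
    """Common record shape that downstream jobs consume (hypothetical)."""
    source: str
    user_id: str
    event_type: str
    ts_ms: int


# One adapter per upstream source; each maps raw fields to the unified shape.
# Source names and raw field names are illustrative assumptions.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], UnifiedEvent]] = {
    # This source already reports milliseconds.
    "app_click": lambda r: UnifiedEvent(
        "app_click", str(r["uid"]), "click", int(r["time"])
    ),
    # This source reports seconds, so convert to milliseconds.
    "web_pv": lambda r: UnifiedEvent(
        "web_pv", str(r["user"]), "page_view", int(r["timestamp"]) * 1000
    ),
}


def normalize(source: str, raw: Dict[str, Any]) -> UnifiedEvent:
    """Convert a raw record to UnifiedEvent, rejecting unknown sources."""
    adapter = ADAPTERS.get(source)
    if adapter is None:
        raise ValueError(f"unknown source: {source}")
    return adapter(raw)
```

Keeping one adapter per source means a new source only adds an entry to the table, and downstream jobs never need to know about source-specific field names or time units.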

Data Service: provides synchronization, storage, query, and service capabilities, achieving minute‑level deployment and reducing development effort by 90% while ensuring data quality checks (DQC), resource isolation, and end‑to‑end lineage.

Data Quality: addresses content correctness, timeliness, stability, and task reliability through DQC, alerts, and consistency checks using Hudi for offline‑online alignment.
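The offline‑online consistency check can be pictured as comparing the same metric computed on both sides and flagging the keys that disagree. The sketch below is a minimal, assumed version of that idea: it works on plain dictionaries rather than Hudi tables, and the function name `find_mismatches`, the per-date keys, and the 1% tolerance are all hypothetical choices for illustration.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[float], Optional[float]]


def find_mismatches(
    offline: Dict[str, float],
    online: Dict[str, float],
    rel_tol: float = 0.01,
) -> List[Mismatch]:
    """Return (key, offline_value, online_value) for keys that disagree.

    A key disagrees when it is missing on one side (reported as None there)
    or when the relative difference between the two values exceeds rel_tol.
    """
    mismatches: List[Mismatch] = []
    for key in sorted(set(offline) | set(online)):
        a = offline.get(key)
        b = online.get(key)
        if a is None or b is None:
            mismatches.append((key, a, b))
            continue
        denom = max(abs(a), abs(b))
        # denom == 0 means both values are zero, which counts as agreement.
        if denom > 0 and abs(a - b) / denom > rel_tol:
            mismatches.append((key, a, b))
    return mismatches
```

A scheduled job could run a check of this kind per partition and raise an alert when the returned list is non-empty; the tolerance would need tuning per metric.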

Data Management: offers a visual platform for lineage, basic info, DQC configuration, task status, and monitoring, integrating all previous modules.
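Lineage of this kind is commonly modeled as a directed graph from upstream tables to the tables that read from them. The sketch below, written under that assumption, answers the question "which tables are affected if this one changes?" with a breadth-first traversal. The table names and both function names are hypothetical and are not taken from Ctrip's platform.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def build_lineage(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each upstream table to the set of tables that read from it directly."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for upstream, downstream in edges:
        graph[upstream].add(downstream)
    return graph


def downstream_of(graph: Dict[str, Set[str]], table: str) -> List[str]:
    """Return every table reachable from `table`, sorted by name."""
    seen: Set[str] = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, ()):
            # Skip already-visited tables so cycles cannot loop forever.
            if child not in seen and child != table:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

The same graph, traversed in the opposite direction, would give the upstream sources to inspect when a downstream table fails a quality check.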

Three solution patterns are discussed: simple offline‑online fusion, SQL‑based processing with Flink and Kafka, and a hybrid approach leveraging Flink UDFs and binlog‑driven updates, each with its own advantages and trade‑offs.
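To make the binlog‑driven idea concrete, the sketch below starts from an offline snapshot keyed by primary key and applies an ordered list of change events to produce the current view. It is only an assumed simplification: the event shape (`op`, `key`, `row`), the `apply_changes` name, and the use of in-memory dictionaries stand in for whatever binlog format, Flink UDFs, and storage the original system uses.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_changes(
    snapshot: Dict[str, Row],
    changes: Iterable[Dict[str, Any]],
) -> Dict[str, Row]:
    """Return the offline snapshot with change events applied in order.

    Each change is {"op": "upsert", "key": ..., "row": {...}} or
    {"op": "delete", "key": ...}. This event shape is hypothetical.
    """
    # Copy every row so the caller's snapshot is left untouched.
    view = {key: dict(row) for key, row in snapshot.items()}
    for change in changes:
        op, key = change["op"], change["key"]
        if op == "upsert":
            # Merge so columns absent from the event keep their snapshot value.
            view[key] = {**view.get(key, {}), **change["row"]}
        elif op == "delete":
            view.pop(key, None)
        else:
            raise ValueError(f"unsupported op: {op}")
    return view
```

For example, applying an update to one order, a delete of another, and an insert of a third:

```python
snapshot = {"o1": {"status": "paid", "amount": 100}, "o2": {"status": "paid", "amount": 50}}
changes = [
    {"op": "upsert", "key": "o1", "row": {"status": "refunded"}},
    {"op": "delete", "key": "o2"},
    {"op": "upsert", "key": "o3", "row": {"status": "created", "amount": 80}},
]
view = apply_changes(snapshot, changes)
```

Events must be applied in commit order for the result to be correct, which is why this pattern typically relies on ordered delivery per key.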

Outlook highlights ongoing work to enhance reliability, stability, and usability, including full data‑governance, automated recovery tools, intelligent operation components, and integrated service analysis.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
