Inside Ctrip Flight Ticket Data Warehouse: Evolution, Architecture, and Real‑Time Challenges

This article details the evolution of Ctrip's flight ticket data warehouse, describing its historical tech stack, current architecture—including Hive, Presto, ClickHouse, CrateDB, and Flink—data synchronization methods, layer design, quality monitoring, and a real‑time price‑monitoring use case.

dbaplus Community

Mar 19, 2020

1. Data Warehouse Technology Evolution

Ctrip's flight ticket data warehouse started in 2008 on SQLServer, with Informatica and Kettle handling ETL for modest data volumes and SAP BO used for reporting. As business complexity grew and Kafka was introduced for log storage, the original stack could no longer scale, prompting a shift to Hadoop, Hive, and the Zeus scheduler in 2014.

In 2016, ElasticSearch was added for real‑time log indexing, and Presto was deployed to speed up ad‑hoc queries over Hive. By 2018, ClickHouse and CrateDB were introduced to back the visualization platforms, bringing the P90 latency of four‑dimensional aggregation queries down to 4 seconds. Real‑time processing moved from Esper to Storm and then Spark Streaming before settling on Flink, as illustrated in Figure 1.

2. Current Technology Stack

Production data is categorized into three groups:

Business data stored in MySQL/SQLServer.

Reference data also in MySQL/SQLServer, often cached via Redis.

Log data (append‑only) streamed through Kafka.

Data synchronization targets differ by latency:

Real‑time: ElasticSearch, CrateDB, HBase.

Near‑real‑time (≈T+1 hour) or daily (T+1 day): Hive.

Synchronization mechanisms include:

DB → Hive using Taobao's open‑source DataX (extended by the DP team) with CDC via Canal for MySQL binlogs.

Kafka → Hive originally via Camus; due to performance and monitoring issues, a custom Spark‑SQL‑Kafka tool called hamal was built to decode, decompress, and transform payloads into JSON strings, infer schemas, and generate Hive tables and sync scripts.
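The schema‑inference step described above can be sketched roughly as follows. This is a minimal illustration and not hamal's actual code: the function names (infer_hive_type, build_create_table), the JSON‑to‑Hive type mapping, the partition column name, and the sample payload are all assumptions made for the example.

```python
import json


def infer_hive_type(value):
    """Map a decoded JSON value to a Hive column type (assumed mapping)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        # The first element stands in for the whole array; empty arrays
        # fall back to array<string>.
        inner = infer_hive_type(value[0]) if value else "string"
        return f"array<{inner}>"
    if isinstance(value, dict) and value:
        fields = ",".join(f"{k}:{infer_hive_type(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    # Strings, nulls, and empty objects are stored as plain strings.
    return "string"


def build_create_table(table, payload_json, partition_col="d"):
    """Generate a Hive CREATE TABLE statement from one sample JSON payload."""
    record = json.loads(payload_json)
    cols = ",\n  ".join(
        f"`{name}` {infer_hive_type(value)}" for name, value in record.items()
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY (`{partition_col}` string)"
    )


if __name__ == "__main__":
    # Hypothetical payload, already decoded and decompressed to a JSON string.
    sample = '{"uid": 42, "price": 1280.5, "route": {"dep": "SHA", "arr": "PEK"}}'
    print(build_create_table("ods_flight.query_log", sample))
```

A production tool would sample many messages and merge the inferred types rather than trust a single payload; the single‑sample version is kept here only to show the idea.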

Figure 2 shows the overall tech stack.

3. Real‑Time vs. Offline

The current warehouse is primarily offline: flight tickets are not fast‑moving consumer goods, and real‑time processing demands more stable resources while being harder to justify on ROI. Growing business needs, however, are driving pilot projects for real‑time warehouses in 2020.

4. Common Challenges in Data Warehouse Construction

4.1 Data Synchronization

To achieve comprehensive topic coverage, all production tables and Kafka topics are synced to Hive. Automation is required to generate table creation scripts, sync scripts, and handle schema changes automatically.
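Automatic handling of schema changes can be pictured as a diff between the source schema and the Hive schema, followed by generated DDL. The sketch below is an assumption about how such a step might look, not Ctrip's implementation: diff_schema and alter_statements are hypothetical helpers, and schemas are represented as plain column‑to‑type dictionaries.

```python
def diff_schema(source_cols, hive_cols):
    """Compare two {column: type} mappings.

    Returns (added, changed): columns present only at the source, and
    columns whose type differs, as {column: (hive_type, source_type)}.
    """
    added = {c: t for c, t in source_cols.items() if c not in hive_cols}
    changed = {
        c: (hive_cols[c], t)
        for c, t in source_cols.items()
        if c in hive_cols and hive_cols[c] != t
    }
    return added, changed


def alter_statements(table, added):
    """Build the Hive DDL needed to append newly discovered columns."""
    if not added:
        return []
    cols = ", ".join(f"`{c}` {t}" for c, t in added.items())
    return [f"ALTER TABLE {table} ADD COLUMNS ({cols})"]


if __name__ == "__main__":
    source = {"order_id": "bigint", "amount": "double", "currency": "string"}
    hive = {"order_id": "bigint", "amount": "string"}
    added, changed = diff_schema(source, hive)
    print(alter_statements("ods.orders", added))
    print(changed)  # type changes usually need manual review
```

New columns can be applied automatically, whereas type changes are only reported here, since rewriting existing data is a decision the sketch deliberately leaves to a person.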

Two primary sync scenarios are covered:

DB → Hive : Schema metadata and table statistics determine the partitioning strategy: historical slice tables for mutable business data, incremental partitions for immutable log data, and daily full loads for rarely changing reference data. Schema changes are detected through schema service comparisons or DB publish logs.

Kafka → Hive : The legacy Camus tool suffers from MapReduce resource contention, scattered consumption records, and poor lineage visibility. hamal addresses these by consuming Kafka partitions, converting payloads to JSON, inferring Hive schemas, and writing lineage information to a ConsumerRecord table.
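For the DB → Hive scenario, the choice among the three partitioning strategies can be expressed as a small decision function. The version below is only a sketch under stated assumptions: the TableStats fields, the SyncMode names, and both thresholds are invented for illustration and do not come from the original system.

```python
from dataclasses import dataclass
from enum import Enum


class SyncMode(Enum):
    HISTORY_SLICE = "history_slice"  # mutable business data
    INCREMENTAL = "incremental"      # immutable, append-only data
    FULL_DAILY = "full_daily"        # small, rarely changing reference data


@dataclass
class TableStats:
    row_count: int
    has_updates: bool          # rows are updated or deleted after insert
    daily_change_ratio: float  # fraction of rows changed per day


def choose_sync_mode(stats, small_table_rows=1_000_000, low_change=0.01):
    """Pick a sync strategy from table statistics (thresholds are assumptions)."""
    if not stats.has_updates:
        # Append-only tables only need the new rows of each period.
        return SyncMode.INCREMENTAL
    if stats.row_count <= small_table_rows and stats.daily_change_ratio <= low_change:
        # Small, stable tables are cheap to reload in full every day.
        return SyncMode.FULL_DAILY
    # Everything else keeps history so that past states can be reconstructed.
    return SyncMode.HISTORY_SLICE


if __name__ == "__main__":
    print(choose_sync_mode(TableStats(50_000_000, True, 0.2)))
    print(choose_sync_mode(TableStats(3_000, True, 0.001)))
    print(choose_sync_mode(TableStats(900_000_000, False, 0.0)))
```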

Figures 3‑5 illustrate the JSON conversion, the hamal design, and the sync flow.

4.2 Data Warehouse Layering

The warehouse follows a four‑layer model: ODS (production mirror), EDW (intermediate), CDM (common data), and ADM (application). Data is cleansed, enriched, and aggregated into wide tables for downstream ad‑hoc queries and reporting.
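One way to keep such a layered model honest is to check that each table reads only from its own layer or a lower one. The article does not say Ctrip enforces this rule, so the check below is purely an assumed convention, as is the idea that the layer can be read from a table‑name prefix such as ods_ or cdm_.

```python
# Layers ordered from closest-to-source to closest-to-application.
LAYER_ORDER = ["ods", "edw", "cdm", "adm"]


def layer_of(table):
    """Return the layer index encoded in the table-name prefix (assumed naming)."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in LAYER_ORDER:
        raise ValueError(f"unknown layer prefix in {table!r}")
    return LAYER_ORDER.index(prefix)


def invalid_dependencies(table, upstream_tables):
    """List upstream tables that sit in a higher layer than `table`."""
    level = layer_of(table)
    return [u for u in upstream_tables if layer_of(u) > level]


if __name__ == "__main__":
    # A CDM wide table built from ODS and EDW inputs passes the check.
    print(invalid_dependencies("cdm_flight_order_wide", ["ods_order", "edw_order_detail"]))
    # An EDW table reading from ADM is reported.
    print(invalid_dependencies("edw_order_detail", ["adm_daily_report", "ods_order"]))
```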

Figure 6 depicts the layer design.

4.3 Data Parsing Framework

A parsing framework was built to expose high‑level APIs for business developers, allowing them to transform ODS tables with embedded report fields into normalized structures without deep knowledge of underlying big‑data components.
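The article does not show the framework's API, so the following is only a guess at what a high‑level, declarative interface of this kind could look like. It assumes the embedded field is a JSON string and that a developer supplies a mapping from output column names to dotted paths; extract_path and parse_row are hypothetical names.

```python
import json


def extract_path(obj, path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def parse_row(raw_row, payload_col, mapping):
    """Flatten one ODS row: keep plain columns, expand the embedded JSON column.

    `mapping` is {output_column: dotted_path_inside_payload}.
    """
    payload = json.loads(raw_row[payload_col])
    flat = {k: v for k, v in raw_row.items() if k != payload_col}
    for out_col, path in mapping.items():
        flat[out_col] = extract_path(payload, path)
    return flat


if __name__ == "__main__":
    row = {
        "order_id": 1,
        "payload": '{"flight": {"no": "MU5101", "cabin": "Y"}, "price": 980}',
    }
    mapping = {"flight_no": "flight.no", "cabin": "flight.cabin", "price": "price"}
    print(parse_row(row, "payload", mapping))
```

The point of such a design is that the business developer writes only the mapping, while the framework decides how the transformation is executed on the underlying engine.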

Figure 7 shows the framework architecture.

4.4 Operations Tools

A continuously evolving toolbox supports common repetitive tasks such as entity search, batch report recipient updates, dimension table imports, on‑call logging, script templating, and serialization utilities, significantly boosting data engineer productivity.

5. Data Quality System

A comprehensive quality monitoring system leverages metadata to detect anomalies across millions of entities (tables, topics, indexes). It extracts features from execution logs (Spark, MapReduce) to identify abnormal patterns, correlates them with historical data, and triggers alerts with lineage‑based impact assessment.
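Two pieces of this workflow lend themselves to a short sketch: comparing a log‑derived feature with its history, and walking lineage to find what an anomaly affects. The code below is illustrative only. The z‑score rule, its threshold, and the adjacency‑list lineage format are assumptions, not the detection method the article's system actually uses.

```python
from collections import deque
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it lies more than z_threshold standard deviations
    from the mean of `history` (e.g. rows written by past runs of a job)."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold


def downstream_impact(lineage, entity):
    """Breadth-first walk over lineage edges, listing every entity fed by `entity`."""
    seen, queue, order = {entity}, deque([entity]), []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    rows_written = [1000, 1020, 980, 1010, 990]  # hypothetical feature history
    lineage = {
        "ods_order": ["edw_order"],
        "edw_order": ["cdm_order_wide", "adm_report"],
        "cdm_order_wide": ["adm_report"],
    }
    if is_anomalous(rows_written, 10):
        print("alert, affected:", downstream_impact(lineage, "ods_order"))
```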

Figures 8‑10 illustrate log features, sample logs, and the real‑time monitoring workflow.

6. Application Case: Flight Price Monitoring

Ctrip monitors flight price anomalies by ingesting query and order logs from Kafka, generating feature sets for price‑related anomalies, and automatically disabling suspicious listings when thresholds are exceeded. This system has already identified dozens of erroneous pricing events.
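The threshold logic can be illustrated with a deliberately simple rule: compare each listing's price with the median of recently observed prices for the same route and cabin, and flag it when it falls far below that baseline. The rule, the 0.3 ratio, the minimum sample count, and the record fields are all assumptions chosen for the example; the article does not disclose the real features or thresholds.

```python
from statistics import median


def price_is_suspicious(recent_prices, price, min_ratio=0.3, min_samples=5):
    """Return True when `price` is below min_ratio times the recent median."""
    if len(recent_prices) < min_samples:
        return False  # too little history to form a baseline
    baseline = median(recent_prices)
    return baseline > 0 and price < baseline * min_ratio


def review_listings(listings, history):
    """Return the ids of listings that should be disabled pending review."""
    disabled = []
    for listing in listings:
        key = (listing["route"], listing["cabin"])
        if price_is_suspicious(history.get(key, []), listing["price"]):
            disabled.append(listing["id"])
    return disabled


if __name__ == "__main__":
    # Hypothetical prices derived from query and order logs.
    history = {("SHA-PEK", "Y"): [900, 950, 1000, 980, 1020]}
    listings = [
        {"id": "a", "route": "SHA-PEK", "cabin": "Y", "price": 960},
        {"id": "b", "route": "SHA-PEK", "cabin": "Y", "price": 96},
    ]
    print(review_listings(listings, history))
```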

Figure 11 visualizes the price‑monitoring pipeline.

7. Conclusion

A complete data warehouse solution encompasses data synchronization, storage, standards, metadata management, quality assurance, and operational tooling. Teams must tailor each component to their own context. Ctrip's flight ticket warehouse continues to evolve toward more comprehensive, standardized, and real‑time capabilities.
