
Ctrip Flight Ticket Data Warehouse: Architecture, Technology Stack, and Practices

This article outlines Ctrip's flight ticket data warehouse evolution, current big‑data technology stack, data synchronization methods, layered architecture, quality monitoring system, and a real‑time price anomaly detection case, providing practical insights for building scalable, reliable data warehousing solutions.

Ctrip Technology

Feb 20, 2020


With the rapid development of big data technologies, massive data storage and computing solutions have emerged, and the data warehouse has become a crucial bridge that transports data from production environments to big‑data platforms for processing and then back to production applications or decision‑making systems.

Key quality indicators for a data warehouse include topic coverage, performance, usability, scalability, and data quality; the Ctrip flight‑ticket department continuously strives to improve these aspects.

Technology evolution history : In 2008, when data volume was still small, the department used SQLServer, Informatica, Kettle, and SAP BO. After production systems introduced Kafka, the limitations of SQLServer became evident. In 2014 the department adopted Hadoop clusters, Zeus scheduling, and DataX, and moved the warehouse to Hive. In 2016 ElasticSearch was added for real-time log storage, and Presto replaced Hive for ad-hoc queries. In 2018 ClickHouse and CrateDB were introduced, which dramatically reduced aggregation latency, and real-time processing migrated from Esper, Storm, and Spark Streaming to Flink.

Figure 1: Data Warehouse Technology Evolution

Current technology stack : Production data falls into three types: business data (MySQL/SQLServer), base data (MySQL/SQLServer with Redis or a local cache), and log data (append-only, stored in Kafka). Data that must be available in real time is synchronized to ElasticSearch, CrateDB, or HBase; near-real-time and daily loads go to Hive.

Figure 2: Ctrip Flight Ticket Data Warehouse Stack

Data synchronization :

• DB → Hive : The pipeline uses Taobao's open-source DataX, together with schema and statistics extraction and automatically generated table-creation scripts, and uses Canal to capture MySQL binlogs. A metadata service detects schema changes in the source tables, and the Hive DDL and sync scripts are updated accordingly.

• Kafka → Hive : The original Camus tool suffered from poor YARN resource acquisition, inconvenient offset storage, and difficult lineage tracking. Ctrip built a custom tool called hamal on top of spark‑sql‑kafka to consume Kafka partitions, decode payloads, convert them to JSON, infer Hive schemas, write to Hive tables, and record consumer offsets for lineage and monitoring.
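The schema-change handling described for the DB → Hive path can be sketched as a diff between the columns reported for the source table and the columns already present in Hive. This is only an illustration: the type map, the function names, and the table name `ods.flt_order` are hypothetical, and the article does not describe how Ctrip's metadata service or script generator is actually implemented.

```python
# Hypothetical, partial mapping from MySQL column types to Hive types.
MYSQL_TO_HIVE = {
    "int": "INT",
    "bigint": "BIGINT",
    "varchar": "STRING",
    "text": "STRING",
    "datetime": "TIMESTAMP",
    "decimal": "DECIMAL(38,10)",
    "double": "DOUBLE",
}


def hive_type(mysql_type):
    """Map a MySQL type such as 'varchar(64)' to a Hive type (STRING if unknown)."""
    base = mysql_type.lower().split("(")[0].strip()
    return MYSQL_TO_HIVE.get(base, "STRING")


def add_columns_ddl(table, source_cols, hive_cols):
    """Build an ALTER TABLE statement for columns that exist in the source
    but not yet in Hive. Returns None when the schemas already agree.

    source_cols / hive_cols: dicts of column name -> type string.
    """
    existing = {name.lower() for name in hive_cols}
    new_cols = [(n, t) for n, t in source_cols.items() if n.lower() not in existing]
    if not new_cols:
        return None
    col_defs = ", ".join(f"`{n}` {hive_type(t)}" for n, t in new_cols)
    return f"ALTER TABLE {table} ADD COLUMNS ({col_defs})"


if __name__ == "__main__":
    source = {"order_id": "bigint(20)", "price": "decimal(10,2)", "channel": "varchar(32)"}
    hive = {"order_id": "BIGINT", "price": "DECIMAL(38,10)"}
    print(add_columns_ddl("ods.flt_order", source, hive))
```

Only added columns are handled here; dropped or retyped columns would need their own policy, which the article does not specify.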

Figure 4: JSON Conversion Example in hamal

Figure 5: hamal Design for Kafka‑to‑Hive Sync
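The JSON conversion and schema inference steps attributed to hamal can be illustrated with a small, standalone sketch. It is not hamal's code: the flattening convention (underscore-joined keys), the type-widening rules, and all function names are assumptions made for the example, and the Spark and Kafka parts are left out entirely.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested JSON objects into one level with underscore-joined keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def infer_hive_type(value):
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"  # strings, lists, and nulls fall back to STRING in this sketch


def infer_schema(messages):
    """Infer column -> Hive type over a batch of JSON strings.

    Conflicting numeric types widen to DOUBLE; any other conflict widens to STRING.
    """
    schema = {}
    for raw in messages:
        for col, value in flatten(json.loads(raw)).items():
            new_type = infer_hive_type(value)
            old_type = schema.get(col)
            if old_type is None or old_type == new_type:
                schema[col] = new_type
            elif {old_type, new_type} == {"BIGINT", "DOUBLE"}:
                schema[col] = "DOUBLE"
            else:
                schema[col] = "STRING"
    return schema


if __name__ == "__main__":
    batch = [
        '{"uid": 1, "query": {"from": "SHA", "to": "PEK", "price": 520}}',
        '{"uid": 2, "query": {"from": "SHA", "to": "CAN", "price": 610.5}}',
    ]
    print(infer_schema(batch))
```

Widening to STRING on conflict keeps the load from failing when upstream producers change a field's type, at the cost of pushing casts into downstream queries.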

Warehouse layering : Following company data standards, the warehouse is divided into four layers: ODS (production mirror), EDW (intermediate), CDM (common data), and ADM (application). ODS data is cleaned and enriched, and the CDM layer aggregates it into wide tables for ad-hoc queries and reporting. Data is also organized into business domains such as traffic, revenue, KPI, and assessment.

Figure 6: Warehouse Layer Design

Data parsing framework : To support requests for expanding ODS tables that contain embedded report fields, a parsing framework was built to encapsulate big‑data component APIs, allowing developers to efficiently implement per‑row parsing logic without deep knowledge of the underlying engines.

Figure 7: Data Parsing Framework
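One way to picture the division of labor in such a parsing framework is an interface where the developer writes only per-row logic and a driver owns everything else. The sketch below is an assumption-laden illustration: `RowParser`, `SegmentParser`, `run_parser`, and the `segments` field are hypothetical names, and a plain Python loop stands in for the engine integration that the real framework encapsulates.

```python
import json
from abc import ABC, abstractmethod


class RowParser(ABC):
    """The only piece a developer implements: logic for a single input row."""

    @abstractmethod
    def parse(self, row):
        """Return a list of zero or more output rows for one input row."""


class SegmentParser(RowParser):
    """Hypothetical example: expand an embedded JSON array into one row per segment."""

    def parse(self, row):
        segments = json.loads(row.get("segments") or "[]")
        return [
            {"order_id": row["order_id"], "seq": i, "dep": s["dep"], "arr": s["arr"]}
            for i, s in enumerate(segments)
        ]


def run_parser(parser, rows):
    """Stand-in for the engine side of the framework.

    A real driver would read from and write to the big-data engine; here it
    just iterates in memory and skips rows the parser cannot handle.
    """
    out = []
    for row in rows:
        try:
            out.extend(parser.parse(row))
        except (KeyError, ValueError, TypeError):
            continue  # a production job would count and log malformed rows
    return out


if __name__ == "__main__":
    ods_rows = [
        {"order_id": 1, "segments": '[{"dep": "SHA", "arr": "PEK"}, {"dep": "PEK", "arr": "CAN"}]'},
        {"order_id": 2, "segments": "not json"},
    ]
    for parsed in run_parser(SegmentParser(), ods_rows):
        print(parsed)
```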

Operations tools : A continuously iterated toolset was developed to handle repetitive warehouse operations such as entity search, batch report recipient updates, dimension table import, on‑call registration, script template generation, and serialization/deserialization, greatly improving data‑engineer productivity.

Data quality system : Manual rule‑by‑rule checks are infeasible for a warehouse of this scale. A lightweight, wide‑coverage monitoring system is built on metadata, capturing entity health without imposing heavy compute costs. Critical processes receive additional business‑level checks.

Metadata management : All entities—databases, tables (SQLServer, MySQL, MongoDB), Kafka topics, ElasticSearch indexes, Hive tables—are governed with basic information, lineage, and tags (layer, security level, importance, business domain).
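A minimal sketch of what one governed entity record might look like, assuming a flat set of tag fields and lineage stored as a list of upstream entity names. The class, field values, and entity names are hypothetical and only mirror the attributes listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    kind: str                    # e.g. "hive_table", "kafka_topic", "mysql_table"
    layer: str = ""              # e.g. "ODS", "EDW", "CDM", "ADM"
    security_level: str = ""
    importance: str = ""
    business_domain: str = ""
    upstream: list = field(default_factory=list)  # lineage: names this entity reads from


def find_by_tags(entities, **tags):
    """Return entities whose tag fields match every given key=value pair."""
    return [e for e in entities if all(getattr(e, k) == v for k, v in tags.items())]


if __name__ == "__main__":
    catalog = [
        Entity("flt_order_topic", "kafka_topic", importance="high"),
        Entity("ods.flt_order", "hive_table", layer="ODS", importance="high",
               business_domain="revenue", upstream=["flt_order_topic"]),
        Entity("cdm.flt_order_wide", "hive_table", layer="CDM",
               business_domain="revenue", upstream=["ods.flt_order"]),
    ]
    print([e.name for e in find_by_tags(catalog, kind="hive_table", business_domain="revenue")])
```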

Quality‑related factors : Execution logs (Spark, MapReduce) provide start/end times, duration, status, byte/row counts, and engine parameters. Feature extraction from these logs enables both real‑time anomaly detection and offline statistical analysis. When an anomaly matches predefined rules, lineage information is used to assess impact and trigger alerts.

Figure 8: Quality Feature Extraction

Figure 9: Spark and MapReduce Log Samples
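The feature-extraction step can be sketched as a function that turns one already-parsed execution record into a small set of quality features. The record layout below (field names such as `target_table`, `rows_written`, and `bytes_written`) is an assumption for illustration; real Spark and MapReduce logs differ in format and would need their own parsers upstream of this function.

```python
from datetime import datetime


def extract_features(record):
    """Derive quality features from one execution record (a dict).

    Assumed fields: target_table, engine, status, start, end (ISO 8601 strings),
    rows_written, bytes_written.
    """
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    rows = int(record.get("rows_written", 0))
    size = int(record.get("bytes_written", 0))
    return {
        "entity": record["target_table"],
        "engine": record["engine"],
        "status": record["status"],
        "duration_s": (end - start).total_seconds(),
        "rows": rows,
        "bytes": size,
        # Average row size; a sudden shift can indicate truncated or malformed data.
        "bytes_per_row": size / rows if rows else 0.0,
    }


if __name__ == "__main__":
    sample = {
        "target_table": "ods.flt_order",
        "engine": "spark",
        "status": "SUCCEEDED",
        "start": "2020-02-20T01:00:00",
        "end": "2020-02-20T01:12:30",
        "rows_written": 1000,
        "bytes_written": 250000,
    }
    print(extract_features(sample))
```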

Real‑time monitoring workflow : Real‑time log parsing extracts quality features, which are compared against historical baselines; lineage determines affected entities, and alerts are issued accordingly.

Figure 10: Real‑time Anomaly Monitoring Solution
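The baseline comparison and lineage lookup in this workflow can be sketched as two small functions. The article does not say which statistical rule Ctrip uses, so the z-score test here is a swapped-in, commonly used choice; the threshold of 3.0, the lineage map shape, and the entity names are likewise assumptions.

```python
from collections import deque
from statistics import mean, stdev


def is_anomalous(value, history, z_threshold=3.0):
    """Flag a value more than z_threshold standard deviations from the historical mean."""
    if len(history) < 2:
        return False  # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold


def affected_entities(entity, downstream):
    """Breadth-first walk over lineage edges.

    downstream: dict mapping an entity name to the names of its direct consumers.
    Returns every transitive consumer of `entity`, sorted by name.
    """
    seen, queue = set(), deque([entity])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


if __name__ == "__main__":
    row_history = [1000, 1020, 980, 1010, 990]
    lineage = {
        "ods.flt_order": ["edw.flt_order"],
        "edw.flt_order": ["cdm.flt_order_wide", "adm.revenue_daily"],
        "cdm.flt_order_wide": ["adm.revenue_daily"],
    }
    if is_anomalous(100, row_history):
        print("alert, affected:", affected_entities("ods.flt_order", lineage))
```

In a streaming job the same check would run per extracted feature (row count, duration, bytes per row), with the history window refreshed from offline statistics.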

Application case – price monitoring : Ctrip captures all flight query and order logs from Kafka, synchronizes them to the warehouse in near‑real time, extracts price‑related features, and detects suspicious low‑price trades. When a suspicious pattern exceeds a threshold, the system automatically disables the corresponding flight listings.

Figure 11: Price Monitoring System
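The threshold logic in the price-monitoring case might look like the following sketch. The article does not define what makes a trade suspicious, so the rule used here (paid price below half the median of recent quotes for the same flight, with more than three such orders triggering a disable) is purely an assumed example, as are the function name and the flight numbers.

```python
from collections import defaultdict
from statistics import median


def flights_to_disable(query_prices, orders, low_ratio=0.5, max_suspicious=3):
    """Return flights whose count of suspiciously cheap orders exceeds the threshold.

    query_prices: dict of flight -> list of recently quoted prices
    orders: iterable of (flight, paid_price) tuples
    An order is suspicious when paid_price < low_ratio * median(quoted prices).
    """
    suspicious = defaultdict(int)
    for flight, paid in orders:
        quotes = query_prices.get(flight)
        if not quotes:
            continue  # no reference price, so the order cannot be judged
        if paid < low_ratio * median(quotes):
            suspicious[flight] += 1
    return sorted(f for f, n in suspicious.items() if n > max_suspicious)


if __name__ == "__main__":
    quotes = {"MU5101": [1000, 1100, 1050], "CA1501": [800, 820]}
    recent_orders = [("MU5101", 100)] * 4 + [("CA1501", 790), ("CA1501", 100)]
    print(flights_to_disable(quotes, recent_orders))
```

Requiring several suspicious orders before acting trades a little detection delay for fewer false positives from one-off promotional fares.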

Conclusion : A complete data‑warehouse solution should encompass synchronization, storage, standards, metadata, quality assurance, and operational tools, with technology choices tailored to specific team needs. The Ctrip flight‑ticket team continues to refine standards, improve usability, and explore real‑time warehouse implementations.
