Design and Implementation of Ctrip's Large-Scale Data Platform

This article details the architectural choices, component selection, performance tuning, and team organization behind Ctrip's big‑data platform, covering Kafka, Presto, Elasticsearch, Gobblin, Zeppelin, REST APIs, and job scheduling to achieve scalable, interactive data analysis and visualization.

Ctrip Technology

The author, Xu Peng, leads Ctrip's ticket big‑data platform. The article opens with his background and the context of his presentation at DAMS 2017.

He outlines three main challenges: selecting suitable technologies for the platform architecture, delivering interactive data analysis experiences, and building a multidisciplinary big‑data team.

1. Data Platform Technology Selection

The overall framework follows a typical pipeline from data sources through message queues, cleaning, and presentation, with specific component choices varying by scenario.

Message queue: Kafka, for high‑throughput messaging in which producers push records and consumers pull them at their own pace.

ETL: LinkedIn's Camus for Kafka‑to‑HDFS synchronization (Section 2 covers Gobblin, LinkedIn's successor to Camus).

Storage: HDFS for batch processing preparation.

Analysis engine: the options include Hive, Spark, Presto, and Impala; the team chose Presto for its interactive SQL capabilities.

The query UI for Presto was built on Airbnb's Airpal, adapted with a custom StatementClient and a jQuery EasyUI front‑end.

Search engine: Elasticsearch for fast, near‑real‑time search, supplemented with the Elasticsearch‑SQL plugin to enable SQL queries.

The web UI and a RESTful BigQuery API support ad‑hoc queries and programmatic access, routing each request to either Elasticsearch or Presto as appropriate.
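The routing step can be pictured with a small sketch. It is only an illustration, not Ctrip's implementation: the `Engine` enum, the `QueryRequest` fields, the `route_query` function, and the rule itself (search‑style or very fresh data goes to Elasticsearch, everything else to Presto) are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Engine(Enum):
    ELASTICSEARCH = "elasticsearch"
    PRESTO = "presto"


@dataclass
class QueryRequest:
    sql: str
    # Hypothetical hints supplied by the caller; not part of any real API.
    needs_full_text: bool = False
    max_data_age_seconds: Optional[int] = None  # None: no freshness requirement


def route_query(req: QueryRequest, batch_lag_seconds: int = 3600) -> Engine:
    """Pick a backend for one request.

    Assumed rule: full-text lookups, or requests that need data fresher
    than the batch pipeline can deliver, go to Elasticsearch; everything
    else goes to Presto.
    """
    if req.needs_full_text:
        return Engine.ELASTICSEARCH
    if (req.max_data_age_seconds is not None
            and req.max_data_age_seconds < batch_lag_seconds):
        return Engine.ELASTICSEARCH
    return Engine.PRESTO
```

Keeping the decision in one function means callers never need to know which engine answered, which is the point of placing a single API in front of both.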

2. ETL Pipeline – Gobblin

Gobblin handles ETL and addresses the HDFS small‑file problem by scheduling writes so that each partition produces files of roughly one full HDFS block (64 MB or 128 MB). The preferred storage format is ORC for its built‑in indexing, with CarbonData mentioned as an emerging alternative.
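The block‑sizing arithmetic behind this can be sketched as follows. This is not Gobblin's API; the `plan_output_files` helper is hypothetical and only shows how many files a partition of a given size would need if each file should hold at most one HDFS block.

```python
import math

MB = 1024 * 1024


def plan_output_files(partition_bytes: int, block_bytes: int = 128 * MB) -> int:
    """Return how many files to write so that each holds at most one block.

    Always returns at least 1, so a tiny partition yields one small file
    instead of being split further.
    """
    if partition_bytes < 0 or block_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    return max(1, math.ceil(partition_bytes / block_bytes))


# A 1000 MB partition with 128 MB blocks needs 8 files; with 64 MB blocks, 16.
files_128 = plan_output_files(1000 * MB)
files_64 = plan_output_files(1000 * MB, block_bytes=64 * MB)
```

The same calculation also shows why frequent small writes are costly: a 10 MB partition still occupies one file and one NameNode entry, however little data it holds.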

3. Analysis Engine – Presto

Presto focuses solely on SQL‑based interactive queries, following the Unix philosophy of doing one thing well. It uses a pipelined architecture in which tasks stream results downstream without waiting for the whole stage to complete, which the talk credits with 5‑20× performance gains over MapReduce.

Compared with Hive (stable but slow) and Spark (where sharing resources is a challenge), Presto stands out for its in‑memory execution and stage‑level parallelism.
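The difference between pipelined and stage‑at‑a‑time execution can be shown with a toy, single‑process example. It says nothing about Presto's internals; the `scan`, `filter_even`, `run_pipelined`, and `run_staged` names are made up for the illustration, and Python generators stand in for streaming operators.

```python
from typing import Iterable, Iterator, List


def scan(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Upstream operator: reads rows one at a time."""
    for r in rows:
        log.append(f"scan {r}")
        yield r


def filter_even(rows: Iterable[int], log: List[str]) -> Iterator[int]:
    """Downstream operator: emits a row as soon as it qualifies."""
    for r in rows:
        if r % 2 == 0:
            log.append(f"emit {r}")
            yield r


def run_pipelined(rows: List[int], log: List[str]) -> List[int]:
    # The filter pulls from the scan row by row; nothing is materialized.
    return list(filter_even(scan(rows, log), log))


def run_staged(rows: List[int], log: List[str]) -> List[int]:
    # The whole scan stage finishes before the filter stage starts.
    scanned = list(scan(rows, log))
    return list(filter_even(scanned, log))


pipelined_log: List[str] = []
staged_log: List[str] = []
pipelined_result = run_pipelined([1, 2, 3, 4], pipelined_log)
staged_result = run_staged([1, 2, 3, 4], staged_log)
```

Both runs return the same rows, but in the pipelined log the first `emit` appears before the scan has finished, while in the staged log every `scan` entry precedes every `emit` entry. That early first result is what makes a pipelined engine feel interactive.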

4. Near‑Real‑Time Search – Elasticsearch

Elasticsearch provides horizontal scalability, high availability, and a rich JSON API; its main competitor is SolrCloud. Proper shard and replica distribution, OS tuning (file handles, vm.dirty_ratio, I/O scheduler), and index‑level settings (shard count, refresh interval, merge policy) are crucial for performance.
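As a rough illustration of the index‑level settings, the sketch below builds the JSON body for an index‑creation request. The `build_index_settings` helper and the values are assumptions chosen for the example, not figures from the talk; `number_of_shards`, `number_of_replicas`, and `refresh_interval` are standard Elasticsearch index settings, and merge‑policy tuning is left out because the source gives no specifics.

```python
import json


def build_index_settings(shards: int, replicas: int,
                         refresh_interval: str = "30s") -> dict:
    """Build a settings body for creating an index.

    A refresh interval longer than the 1s default trades search
    freshness for indexing throughput.
    """
    if shards < 1 or replicas < 0:
        raise ValueError("need at least one shard and zero or more replicas")
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                "refresh_interval": refresh_interval,
            }
        }
    }


# Illustrative values only; the right numbers depend on data volume and nodes.
body = build_index_settings(shards=6, replicas=1)
print(json.dumps(body, indent=2))
```

The shard count is fixed when the index is created, so it is usually chosen up front, while the replica count and refresh interval can be changed later.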

Dashboard screenshots illustrate monitoring of node write/read rates, shard distribution, and cluster health.

5. Data Visualization – Zeppelin

Zeppelin connects to Presto via JDBC, enabling drag‑and‑drop visual reports. Integration with Livy improves Spark resource sharing.

6. Data Micro‑Service – REST Query Interface

A unified BigQuery API provides SQL‑based access with centralized permission management, simplifying auditing and reducing learning overhead.
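A minimal sketch of how one entry point can centralize permission checks and auditing is shown below. It is hypothetical throughout: the `QueryGateway` class, its `grants` mapping, and the caller‑supplied `tables` set are assumptions (a real service would derive the tables by parsing the SQL), and no actual query is executed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


class PermissionDenied(Exception):
    """Raised when a user queries a table they have not been granted."""


@dataclass
class QueryGateway:
    grants: Dict[str, Set[str]]  # user -> tables that user may read
    audit_log: List[dict] = field(default_factory=list)

    def submit(self, user: str, sql: str, tables: Set[str]) -> str:
        granted = self.grants.get(user, set())
        allowed = tables <= granted  # every referenced table must be granted
        # Record every attempt, allowed or not, in one place.
        self.audit_log.append({"user": user, "sql": sql, "allowed": allowed})
        if not allowed:
            missing = sorted(tables - granted)
            raise PermissionDenied(f"{user} lacks access to {missing}")
        return f"accepted: {sql}"  # placeholder for dispatch to an engine


gateway = QueryGateway(grants={"alice": {"orders", "flights"}})
ok = gateway.submit("alice", "SELECT count(*) FROM orders", {"orders"})
```

Because every request passes through `submit`, the audit trail is complete by construction, and access rules live in one mapping instead of being repeated for each engine.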

7. Job Scheduler – Zeus

Zeus (open‑sourced by Ctrip) orchestrates ETL and scheduled tasks, comparable to Airflow, handling data movement from Kafka to HDFS or relational databases.
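The core task of any such scheduler is running jobs in dependency order. The sketch below uses Python's standard `graphlib` module to show the idea; it is not Zeus's API, and the job names are invented for the example.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the jobs in an order that respects their dependencies.

    `dependencies` maps each job to the set of jobs that must finish
    before it can start. Raises CycleError if the jobs depend on each
    other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical pipeline: land Kafka data on HDFS, convert it to ORC,
# then feed a relational database and a report from the ORC tables.
jobs = {
    "kafka_to_hdfs": set(),
    "build_orc": {"kafka_to_hdfs"},
    "load_rdbms": {"build_orc"},
    "refresh_report": {"build_orc"},
}
order = execution_order(jobs)
```

`load_rdbms` and `refresh_report` do not depend on each other, so their relative order is unspecified; a production scheduler would typically run such independent jobs in parallel.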

8. Data Team Capability Building

The team is organized into five areas: engine development, UI design, operations, architecture planning, and language/tool selection, emphasizing cross‑disciplinary skills and continuous learning.
