How Ctrip Built a Scalable Data Platform with Presto, Elasticsearch, and Gobblin
This article summarizes Xu Peng's DAMS 2017 presentation on selecting big‑data platform components, designing ETL pipelines, choosing analysis engines, optimizing Elasticsearch, and building a data‑driven team at Ctrip.
Xu Peng, leader of Ctrip's ticket big‑data platform, shared his experience at the DAMS 2017 China Data Asset Management Summit, covering three main topics: data platform technology selection, user interaction practice, and big‑data team capability building.
1. Data Platform Technology Selection
Overall Architecture : The platform follows a typical pipeline from data sources → message queue (Kafka) → data cleaning → storage (HDFS) → analysis engines (Hive, Spark, Presto, Impala) → presentation. The key decision is which components fill each layer.
Message Queue : Kafka is chosen for its high throughput and combined push‑pull consumption model.
ETL : Gobblin (or LinkedIn Camus) is used to sync data from Kafka to HDFS, with attention to small‑file problems and partition sizing (64 MB–128 MB). ORC is recommended as the default file format for its built‑in indexing; CarbonData is mentioned as an emerging alternative.
Analysis Engine : Presto is preferred over Hive and Spark for interactive SQL queries. Presto’s pipeline architecture processes data in memory, avoids full stage‑to‑disk shuffles, and supports a simple client‑planner‑scheduler‑worker model. The speaker explains why Presto’s CLI lacks a friendly UI and how they built a custom web UI using jQuery EasyUI and Presto’s StatementClient.
Search Engine : Elasticsearch is used for near‑real‑time search. To simplify queries, the team adopts the Elasticsearch‑SQL plugin, enabling SQL‑style point and range queries. They also discuss the challenges of maintaining Elasticsearch clusters, including OS‑level tuning (file handles, vm.dirty_ratio, I/O scheduler) and Elasticsearch‑level settings (shard distribution, replica management, refresh interval, merge policies).
REST API : A unified BigQuery API provides a single entry point for SQL queries, handling permission control and abstracting underlying engines (Presto, Elasticsearch).
2. ETL Pipeline – Gobblin
The Gobblin pipeline handles data ingestion, partitioning, and file format selection. Small‑file issues are mitigated by controlling partition count and using ORC for its columnar indexing. CarbonData is noted as a newer format with finer‑grained indexes, though performance may lag behind ORC for large volumes.
3. Analysis Engine Comparison – Presto vs. MapReduce
Presto processes data in memory using a pipeline model, allowing stages to overlap and delivering sub‑second latency for interactive queries. In contrast, MapReduce (and Spark) follows a batch model where each stage writes to disk before the next begins, leading to higher latency. Presto is thus suited for ad‑hoc queries, while Hive remains preferable for large‑scale batch reporting.
4. Near‑Real‑Time Search – Elasticsearch
Elasticsearch provides fast, horizontally scalable search based on Lucene. The speaker compares it with SolrCloud, noting Elasticsearch’s easier management and active community. They illustrate typical cluster topology with primary and replica shards and discuss tuning parameters such as shard count, replica factor, refresh interval, and merge settings.
5. Data Visualization – Zeppelin
For visualizing query results, the team evaluates Hue and chooses Zeppelin, which connects to Presto via JDBC and supports drag‑and‑drop chart creation. Integration with Livy is mentioned to improve Spark resource sharing.
6. Data Micro‑services – REST Query Interface
The BigQuery API standardizes query access, enforces permission checks, and uses familiar SQL syntax to avoid steep learning curves for users.
7. Job Scheduler – Zeus
Task scheduling is handled by the open‑source Zeus system (https://github.com/ctripcorp/dataworks-zeus). It orchestrates ETL jobs, periodic tasks, and data sync from Kafka to HDFS or relational databases. Competing solutions like Airflow are briefly mentioned.
8. Data Team Capability Building
The team’s capability development is divided into five aspects: engine development (high technical barrier), UI/UX design (ensuring usability), operations (OS and backend expertise), cross‑functional skill exchange between development and ops, and architectural/PM roles for platform planning.
Throughout the talk, the speaker answers audience questions about cluster size (≈10 nodes, 128 GB RAM, 10 Gbps NIC), query latency, connector implementations for Elasticsearch and CarbonData, and performance considerations when loading millions of records.
Overall, the presentation provides a practical roadmap for building a modern big‑data platform, emphasizing component selection, performance tuning, and team organization.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
