
How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

dbaplus Community

Background

Ctrip’s data foundation platform originally consisted of HDFS storage clusters, YARN compute clusters, and the Spark and Hive engines. Version 1.0 was built between 2017 and 2021 and continuously optimized, but by 2023 data was growing by several petabytes a day, exhausting the rack space in the two existing IDCs (data centers).

Challenges in 2022‑2023

Support for a multi‑IDC architecture with cross‑data‑center storage and compute scheduling.

Rapid data growth and long construction cycles for new data halls; cold data needed to be off‑loaded to object storage to relieve capacity pressure.

Compute resources that fell short as data volume grew, requiring mixed offline/online resource pools.

Need to upgrade Spark 2 to Spark 3 smoothly.

Overall Architecture (Data Platform 2.0)

The 2.0 architecture adds a multi‑IDC storage layer with hot, warm, and cold tiers, transparent migration, read‑through caching, and a new shuffle service (Celeborn). Scheduling now supports priority‑based dispatch, NodeManager mixing, and offline/online node co‑location. Spark 3 is used as the query engine with Kyuubi as the front‑end.

Data Platform 2.0 architecture diagram

Storage Enhancements

1. The multi‑IDC storage upgrade enables data‑local reads and writes and eliminates cross‑IDC traffic.

2. Tiered storage (hot data on private‑cloud nodes, warm data on erasure‑coded nodes, cold data migrated to cloud object storage) reduces cost.

3. Transparent migration uses native HDFS tools (Balancer, Mover, Disk Balancer) and, for complex cases, a FastCopy‑based solution that creates hard links on the source DataNodes and reports them to the target cluster, so that only metadata moves and no block data is copied.
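The tiering in item 2 can be pictured as a simple placement rule. The sketch below is only an illustration: the article does not say how data is classified, so the use of last-access age and both thresholds (`HOT_MAX_AGE`, `WARM_MAX_AGE`) are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; the article does not state the actual cut-offs
# or even that age since last access is the criterion.
HOT_MAX_AGE = timedelta(days=30)
WARM_MAX_AGE = timedelta(days=365)


def choose_tier(last_access: datetime, now: datetime) -> str:
    """Return 'hot', 'warm', or 'cold' based on time since last access."""
    age = now - last_access
    if age <= HOT_MAX_AGE:
        return "hot"   # private-cloud nodes
    if age <= WARM_MAX_AGE:
        return "warm"  # erasure-coded nodes
    return "cold"      # cloud object storage
```

A real policy would likely also weigh table importance and access frequency, not just age.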

When a new IDC is added, the mount table of HDFS Router‑Based Federation (RBF) points new data to a fresh namespace, while old data can be migrated gradually or left to expire via TTL. For full‑scale migration, a customized FastCopy (derived from Facebook’s single‑node version) runs on top of DistCp; it handles non‑DISK DataNode storage types and retries failed addBlock RPCs.
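To make the mount‑table idea concrete, here is a minimal sketch of longest‑prefix path resolution, which is how a more specific mount entry can send new data to a new namespace while everything else keeps resolving to the old one. The nameservice names and paths are hypothetical, and this is not the RBF implementation itself.

```python
# Hypothetical mount table: path prefix -> (nameservice, target path).
MOUNT_TABLE = {
    "/warehouse": ("ns-idc1", "/warehouse"),
    "/warehouse/new_db": ("ns-idc3", "/warehouse/new_db"),
}


def resolve(path: str, table=MOUNT_TABLE):
    """Resolve a client path to (nameservice, path) using the longest matching mount prefix."""
    best = None
    for prefix in table:
        # Match whole path components only, so /warehouse/new_dbx does not hit /warehouse/new_db.
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    nameservice, target = table[best]
    return nameservice, target + path[len(best):]
```

With this table, `resolve("/warehouse/new_db/t1")` lands on `ns-idc3`, while any other path under `/warehouse` still resolves to `ns-idc1`.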

Scheduling Improvements

Priority scheduling tags ETL jobs based on table importance, ensuring P0/P1 tasks meet SLA. NodeManager mixing temporarily replaces idle NodeManagers with Presto/Trino/StarRocks during low‑traffic periods, but retains shuffle service threads to avoid task failures. Offline and online nodes are co‑located, and YARN node labels steer Spark jobs to appropriate clusters.
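The ordering effect of priority tagging can be sketched with a small priority queue. This only illustrates the dispatch order, not how YARN itself schedules; the job names and the convention that 0 means P0 are assumptions.

```python
import heapq
from itertools import count


class PriorityDispatcher:
    """Dispatch jobs by priority level (0 = P0, most important), FIFO within a level."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, job_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), job_id))

    def next_job(self):
        """Return the next job id to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Submitting a P3 report job before two P0 ETL jobs still dispatches the P0 jobs first, in the order they arrived.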

Remote Shuffle Service (RSS): Celeborn replaces the pull‑style shuffle with a push‑style design. Mappers push partition data to Celeborn workers, which reduces mapper memory pressure and aggregates I/O. With M mappers and N reducers, shuffle‑read connections drop from M×N to N, and a two‑replica mechanism lowers the fetch‑failure rate. Celeborn also decouples shuffle from local disks, enabling Spark on Kubernetes without a local External Shuffle Service (ESS).
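The connection arithmetic and the aggregation step can be shown with a toy model. This is a simplified sketch, not Celeborn’s API: the function names are made up, and real workers write aggregated partition files rather than Python lists.

```python
from collections import defaultdict


def pull_shuffle_connections(num_mappers: int, num_reducers: int) -> int:
    """Pull style: every reducer fetches a block from every mapper."""
    return num_mappers * num_reducers


def push_shuffle_connections(num_reducers: int) -> int:
    """Push style: every reducer reads one aggregated partition."""
    return num_reducers


def push_shuffle(mapper_outputs):
    """Aggregate per-mapper output by partition id, as a shuffle worker would.

    mapper_outputs: one dict per mapper, mapping partition id -> list of records.
    Returns a dict mapping partition id -> all records for that partition.
    """
    aggregated = defaultdict(list)
    for output in mapper_outputs:
        for partition_id, records in output.items():
            aggregated[partition_id].extend(records)
    return dict(aggregated)
```

For example, a stage with 1,000 mappers and 2,000 reducers needs 2,000,000 fetch connections in the pull model but only 2,000 in the push model.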

Compute Engine Upgrades

Spark 2 was introduced in 2017 and heavily customized for multi‑tenant Thrift Server use. Spark 3.0 (June 2020) added Adaptive Query Execution (AQE), which brings dynamic coalescing of shuffle partitions and runtime join optimization. The migration to Spark 3 involved:

Replaying production SQL with Kyuubi’s kyuubi.operation.plan.only.mode=OPTIMIZE to identify incompatibilities.

Ensuring compatibility with Hive SQL, Hive metastore, and Spark 2 SQL.

Extending BasicWriteTaskStats to capture row, file, and size metrics for both partitioned and non‑partitioned tables.

Fixing ORC zero‑size file creation by patching Hive’s OrcOutputFormat in Spark 3.

Extracting Spark 2 custom rules into plugins injected via SparkSessionExtensions for easier future upgrades.
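The SQL replay step in the list above could be driven by a loop like the one below. It is a minimal sketch: `execute` is a hypothetical callable standing in for a connection to Kyuubi, and setting the mode with a session‑level `SET` statement is an assumption (the article only gives the key and value).

```python
from typing import Callable, Iterable, List, Tuple

# Key and value as given in the article; issuing it via SET is an assumption.
PLAN_ONLY = "SET kyuubi.operation.plan.only.mode=OPTIMIZE"


def replay(statements: Iterable[str],
           execute: Callable[[str], None]) -> List[Tuple[str, str]]:
    """Plan each statement without running it and collect the ones that fail.

    Returns a list of (sql, error message) pairs for incompatible statements.
    """
    execute(PLAN_ONLY)
    failures = []
    for sql in statements:
        try:
            execute(sql)
        except Exception as exc:  # any planning error counts as an incompatibility
            failures.append((sql, str(exc)))
    return failures
```

Because only the plan is produced, a replay of this kind can cover production SQL without reading or writing production data.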

Partition pruning was optimized by first fetching partition names via get_partition_names, filtering with Spark operators, then retrieving details with get_partitions_by_names, reducing pruning time from minutes to seconds. The feature can be enabled with spark.sql.hive.metastorePartitionPruningFastFallback=true.
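The two‑step lookup can be sketched as follows. `FakeMetastore` is an in‑memory stand‑in with simplified signatures, not the real metastore client, and the batch size is an assumption; real partition names may also contain escaped characters that this parser ignores.

```python
class FakeMetastore:
    """In-memory stand-in for a Hive metastore client (hypothetical, simplified signatures)."""

    def __init__(self, partitions):
        self._partitions = partitions  # partition name -> details dict
        self.detail_calls = 0

    def get_partition_names(self, db, table):
        return sorted(self._partitions)

    def get_partitions_by_names(self, db, table, names):
        self.detail_calls += 1
        return [self._partitions[name] for name in names]


def parse_partition_name(name):
    """Turn 'dt=2024-01-01/hour=03' into {'dt': '2024-01-01', 'hour': '03'}."""
    return dict(kv.split("=", 1) for kv in name.split("/"))


def prune_partitions(client, db, table, predicate, batch_size=1000):
    """Fetch cheap partition names first, filter locally, then fetch details only for matches."""
    names = client.get_partition_names(db, table)
    wanted = [n for n in names if predicate(parse_partition_name(n))]
    details = []
    for start in range(0, len(wanted), batch_size):
        batch = wanted[start:start + batch_size]
        details.extend(client.get_partitions_by_names(db, table, batch))
    return details
```

The saving comes from never loading full partition objects for partitions the predicate rejects.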

Data skew detection adds a JoinKeyRecorder to capture key frequencies, feeding a diagnostic platform that reports skewed keys and rows.
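The kind of frequency analysis such a recorder feeds can be sketched in a few lines. This is not the JoinKeyRecorder itself; the function name, the 20% threshold, and the top‑N cut‑off are all assumptions.

```python
from collections import Counter


def find_skewed_keys(join_keys, ratio_threshold=0.2, top_n=3):
    """Return (key, count) pairs for the most frequent keys whose share of
    all rows is at least ratio_threshold."""
    counts = Counter(join_keys)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(key, n) for key, n in counts.most_common(top_n)
            if n / total >= ratio_threshold]
```

In practice the keys would be sampled during the join rather than collected in full, since counting every row of a large table is itself expensive.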

Diagnostics and Lineage

A diagnostic robot parses Spark event logs, extracts join keys, and presents a report to users via a chat‑bot interface. Full‑link SQL lineage is captured using spark.sql.queryExecutionListeners, recording server/engine IPs, session IDs, operation IDs, and YARN application IDs. HDFS audit logs are enriched with the SQL execution ID via a custom CallerContext, enabling end‑to‑end tracing of which SQL accessed which files.
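Once the audit log carries the SQL execution ID, tracing files back to SQL becomes a parsing job. The sketch below assumes the usual tab‑separated key=value layout of HDFS audit lines; the `sql_execution_id:` prefix inside the caller context is hypothetical, since the article does not show Ctrip’s actual format.

```python
def parse_audit_line(line):
    """Extract key=value fields from one tab-separated HDFS audit log line."""
    fields = {}
    for token in line.rstrip("\n").split("\t"):
        key, sep, value = token.partition("=")
        if sep and key.strip():
            # The first token carries a timestamp and logger prefix;
            # the field name is the last word before '='.
            fields[key.split()[-1]] = value
    return fields


def files_by_sql(lines, prefix="sql_execution_id:"):
    """Group opened file paths by the SQL execution ID found in callerContext.

    The prefix is a hypothetical caller-context format.
    """
    result = {}
    for line in lines:
        fields = parse_audit_line(line)
        context = fields.get("callerContext", "")
        src = fields.get("src")
        if fields.get("cmd") == "open" and src and context.startswith(prefix):
            result.setdefault(context[len(prefix):], []).append(src)
    return result
```

Joining this mapping with the lineage records (session ID, operation ID, YARN application ID) gives the end‑to‑end view described above.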

Diagnostic robot UI
SQL lineage diagram
HDFS audit log diagram

Results and Benefits

Architecture now supports three or more data centers, providing resilience and scalability.

Tiered storage and transparent migration cut storage costs and enable cold data off‑load to the cloud.

Priority scheduling guarantees on‑time completion for critical P0/P1 jobs; mixed offline/online resources add tens of thousands of CPU cores daily.

The Spark 2 → Spark 3 upgrade supports more than 600,000 Spark tasks per day and runs them roughly 40% faster.

Kyuubi serves as a unified query gateway, handling more than 300,000 queries daily with dynamic scaling.

Alluxio integration yields 30‑50% faster reads for hot data.

Celeborn shuffle service enables finer‑grained offline/online mixing and reduces shuffle‑related failures.

Future work will continue to enrich the data ecosystem, adopt new technology stacks, and further improve cluster stability and processing efficiency to meet growing business demands.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
