How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons
This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.
1. Hadoop at Ctrip
Ctrip introduced Hadoop in 2014 and has doubled the cluster size each year. In 2019 it deployed an erasure-coded (EC) cluster based on Hadoop 3 to separate hot and cold storage transparently, migrating dozens of petabytes of cold data and cutting the storage that data consumes by about half.
The storage layer now holds hundreds of petabytes across thousands of nodes, organized into four federated namespaces with a custom namenode proxy for RPC routing.
On the compute side, Ctrip runs two offline Yarn clusters and one online Yarn cluster, totaling over 150,000 cores and executing more than 300,000 Hadoop jobs daily, 90% of which are Spark jobs. Nodes are spread across four data centers, with offline clusters in two centers and the online cluster in three.
2. Cross‑datacenter project background
Originally Hadoop machines were in data centers A and B, with 95% of nodes in B. In late 2023 Ctrip built a new data center C, while B reached its rack capacity. Projected growth indicated that by the end of 2024 the cluster would exceed ten thousand machines, and new hardware could only be added to C, necessitating a multi‑datacenter architecture.
The link between the two data centers offers only 200 Gbps of bandwidth, with about 1 ms latency under normal load. When the link saturates, latency rises to 10 ms and packet loss reaches 10%. Reducing cross‑datacenter bandwidth usage therefore became a primary goal.
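To give a feel for why 200 Gbps is tight at this scale, here is a back-of-the-envelope calculation. The 1 PB volume and the 50% utilization figure are illustrative assumptions, not numbers from the article.

```python
# Back-of-the-envelope estimate. Only the 200 Gbps figure comes from the
# article; the data volume and utilization are illustrative assumptions.
LINK_GBPS = 200


def transfer_hours(petabytes: float, link_gbps: float = LINK_GBPS,
                   utilization: float = 1.0) -> float:
    """Hours needed to move `petabytes` (decimal PB) across the link."""
    bits = petabytes * 1e15 * 8
    seconds = bits / (link_gbps * 1e9 * utilization)
    return seconds / 3600


if __name__ == "__main__":
    print(f"{transfer_hours(1):.1f} h for 1 PB at full line rate")
    print(f"{transfer_hours(1, utilization=0.5):.1f} h for 1 PB at 50% utilization")
```

Even with the whole link to itself, a single petabyte takes roughly half a day to cross, so routine reads and shuffles cannot be allowed to span data centers freely.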
3. Architecture options
Ctrip evaluated two solutions:
Multi‑datacenter multi‑cluster: Deploy independent clusters per data center. Advantages: no source‑code changes, simple deployment. Disadvantages: users must configure target clusters, higher operational cost, and data‑consistency challenges when accessing shared data across centers.
Multi‑datacenter single‑cluster: Modify Hadoop core (BlockManager) to make the cluster aware of data‑center topology. Advantages: transparent to users, simpler operations, consistent replica management via a single namenode. Disadvantages: requires risky source‑code changes and thorough testing.
The team chose the single‑cluster approach for its transparency and consistency benefits.
4. Early mixed online/offline trial
Before the final design, Ctrip experimented with a mixed online‑offline deployment. Offline jobs that peaked at night were off‑loaded to an online Kubernetes‑based Yarn cluster spanning data centers A and D. A job‑profiling system collected vcore, memory, shuffle, and HDFS I/O metrics, and a scheduler assigned low‑shuffle jobs to the online cluster.
Label‑based FairScheduler routing ensured that all tasks of an application stayed within a single data center. This eliminated cross‑datacenter shuffle traffic and took about 8% of the compute load off the offline clusters.
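The profiling-based placement described above might look roughly like the following sketch. The thresholds, the `JobProfile` fields, and the label names are hypothetical; the article states only that low-shuffle jobs were sent to the online cluster.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Historical resource profile of a recurring job (fields are assumptions)."""
    job_id: str
    vcores: int
    memory_gb: int
    shuffle_gb: float
    hdfs_io_gb: float


# Hypothetical cut-offs; the article does not give the real values.
MAX_SHUFFLE_GB = 10.0
MAX_HDFS_IO_GB = 50.0


def pick_cluster(profile: JobProfile) -> str:
    """Return the node label that every task of the job is pinned to."""
    light = (profile.shuffle_gb <= MAX_SHUFFLE_GB
             and profile.hdfs_io_gb <= MAX_HDFS_IO_GB)
    return "online" if light else "offline"
```

Because the decision is made once per application rather than per task, all containers of a job carry the same label and its shuffle never crosses the inter-datacenter link.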
5. Chosen single‑cluster solution
The final architecture consists of a single HDFS cluster with data‑center awareness, a federated Yarn deployment, automated migration tools, and bandwidth monitoring.
5.1 HDFS with rack‑and‑room awareness
BlockManager was extended with a three‑level topology (room, rack, datanode), where a room corresponds to a data center. Clients now prefer replicas in their own data center, which sharply reduces cross‑datacenter reads.
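A minimal sketch of replica ordering under a three-level topology, assuming locations are written as /room/rack/node paths. It uses the usual tree-distance idea (hops to the closest common ancestor), not Ctrip's actual patch; the path layout and function names are illustrative.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Tree distance between two /room/rack/node paths.

    0 = same node, 2 = same rack, 4 = same room, 6 = different rooms.
    """
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_replicas(client: str, replicas: List[str]) -> List[str]:
    """Order replica locations so the closest one to the client is read first."""
    return sorted(replicas, key=lambda r: distance(client, r))
```

With the extra room level, a replica in another data center always sorts after any replica in the client's own data center, even one on a different rack.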
Namespace‑level replica policies allow specifying the number of replicas per data‑center. For paths without explicit policies, a Zookeeper‑backed mapping assigns a default data‑center based on the user’s organization.
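One way to picture the policy lookup is the sketch below. The longest-prefix matching rule, the sample paths, and the in-memory dictionaries (standing in for the namenode's policy store and the Zookeeper mapping) are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical per-path policies: data center -> replica count.
PATH_POLICIES: Dict[str, Dict[str, int]] = {
    "/user/hive/warehouse/public.db": {"B": 2, "C": 2},
    "/user/hive/warehouse/bu_a.db": {"C": 3},
}
# Hypothetical stand-in for the Zookeeper-backed organization mapping.
ORG_DEFAULT_DC: Dict[str, str] = {"bu_a": "C", "bu_b": "B"}
DEFAULT_REPLICATION = 3


def resolve_policy(path: str, user_org: str) -> Dict[str, int]:
    """Use the longest matching path policy, else the organization's default DC."""
    best: Optional[str] = None
    for prefix in PATH_POLICIES:
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is not None:
        return dict(PATH_POLICIES[best])
    # No explicit policy: place all replicas in the organization's data center.
    return {ORG_DEFAULT_DC[user_org]: DEFAULT_REPLICATION}
```

A file under the public database resolves to two replicas in each data center regardless of who writes it, while a scratch file written by a user from `bu_b` lands entirely in B.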
During decommission, the ReplicationMonitor prefers source nodes in the same data‑center as the target, limiting bandwidth consumption.
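The source-selection preference can be sketched as follows, again assuming /room/rack/node location paths. The function names are hypothetical and the real ReplicationMonitor weighs more factors (node load, pending transfers) than shown here.

```python
from typing import List


def room_of(location: str) -> str:
    """Extract the room (data center) from a /room/rack/node path."""
    return location.strip("/").split("/")[0]


def choose_source(target: str, live_replicas: List[str]) -> str:
    """Prefer a live replica in the target's room; otherwise take any live replica."""
    if not live_replicas:
        raise ValueError("no live replica to copy from")
    same_room = [r for r in live_replicas if room_of(r) == room_of(target)]
    return (same_room or live_replicas)[0]
```

The copy crosses the inter-datacenter link only when the target's data center holds no live replica of the block.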
Cross‑datacenter replica settings are persisted in a new EditLog operation and a dedicated FSImage section, ensuring failover safety.
5.2 Balancer & Mover enhancements
Balancer now supports multiple instances, each bound to a specific data‑center IP range, balancing only local datanodes.
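Binding a Balancer instance to an IP range amounts to filtering the datanode list before balancing. The sketch below uses Python's standard `ipaddress` module; the CIDR ranges are made up, since the article does not list real subnets.

```python
import ipaddress
from typing import Dict, List

# Hypothetical subnets per data center.
DC_RANGES: Dict[str, List[str]] = {
    "B": ["10.1.0.0/16"],
    "C": ["10.2.0.0/16"],
}


def datanodes_for(dc: str, datanode_ips: List[str]) -> List[str]:
    """Keep only the datanodes whose IP falls inside the data center's ranges."""
    networks = [ipaddress.ip_network(cidr) for cidr in DC_RANGES[dc]]
    return [ip for ip in datanode_ips
            if any(ipaddress.ip_address(ip) in net for net in networks)]
```

Each instance then moves blocks only among the nodes it was given, so balancing traffic stays inside one data center.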
Mover was also made multi‑instance; it selects target replica nodes according to the per‑directory cross‑datacenter policy.
5.3 Cross‑IDC Fsck tool
Because namespaces are rolled out gradually, a Cross‑IDC Fsck utility scans block locations, compares them with the configured policies, and repairs misplaced replicas. It queries the standby namenode to reduce load on the active namenode and throttles RPC calls during peak periods.
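The core comparison such a tool performs can be sketched as a per-block diff between actual and desired placement. This is an illustration under the same /room/rack/node path assumption, not the actual utility; the repair step and RPC throttling are omitted.

```python
from collections import Counter
from typing import Dict, List


def room_of(location: str) -> str:
    """Extract the room (data center) from a /room/rack/node path."""
    return location.strip("/").split("/")[0]


def placement_diff(locations: List[str], policy: Dict[str, int]) -> Dict[str, int]:
    """Per room: positive means replicas are missing, negative means excess.

    An empty result means the block already satisfies the policy.
    """
    actual = Counter(room_of(loc) for loc in locations)
    rooms = set(actual) | set(policy)
    diff = {room: policy.get(room, 0) - actual.get(room, 0) for room in rooms}
    return {room: d for room, d in diff.items() if d != 0}
```

For a block with three replicas in B under a 2:2 policy, the diff reports one excess replica in B and two missing in C, which is exactly the work a repair pass would need to schedule.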
5.4 Multi‑datacenter Yarn federation with RM proxy
Each data‑center runs an independent Yarn cluster. A custom ResourceManager proxy maintains a user‑to‑data‑center mapping (shared with the namenode via Zookeeper). Submissions first hit the proxy, which forwards the job to the appropriate Yarn cluster, guaranteeing that all tasks of an application run within a single data‑center.
The proxy is multi‑instance and includes a fallback cache; if all proxies fail, the client can route locally based on the cached mapping.
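The client-side failover described here might look like the sketch below. The proxy is modeled as a plain callable and `ProxyUnavailable` is a hypothetical exception; the real client presumably speaks Yarn RPC to the proxy instances.

```python
from typing import Callable, Dict, List


class ProxyUnavailable(Exception):
    """Raised when a (hypothetical) proxy endpoint cannot be reached."""


def resolve_cluster(user: str,
                    proxies: List[Callable[[str], str]],
                    cached_mapping: Dict[str, str]) -> str:
    """Ask each proxy in turn; if all fail, fall back to the cached mapping."""
    for lookup in proxies:
        try:
            cluster = lookup(user)
        except ProxyUnavailable:
            continue
        cached_mapping[user] = cluster  # refresh the cache on every success
        return cluster
    if user in cached_mapping:
        return cached_mapping[user]
    raise ProxyUnavailable(f"no proxy reachable and no cached route for {user}")
```

Refreshing the cache on each successful lookup keeps the fallback route reasonably current, at the cost of briefly routing to the old cluster if a user is migrated while every proxy is down.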
Thrift‑based services (Spark‑Thrift, Presto, Hive) were also wrapped by the proxy, making the redirection transparent to users.
6. Automated migration tools
Migration proceeds at BU‑level granularity in four steps:
Set the initial replica distribution for each Hive account (e.g., 3:0, meaning three replicas in data center B and none in C).
Copy databases and user home directories to the new center, raising the distribution to 3:3.
Migrate accounts and queues to data center C.
Monitor cross‑datacenter traffic, then release the replicas in data center B (0:3).
Key precautions include running migrations during low‑traffic windows (10 am–11 pm), throttling transfer rates to protect namenode load, monitoring under‑replicated blocks, and adjusting HDFS quotas dynamically.
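Reading the ratios in the steps above as replica counts in B and C, the physical footprint of a BU's data temporarily doubles during migration. Since HDFS space quotas count replicated bytes, this is one reason quotas need adjusting along the way. The sketch below works through the arithmetic; the phase names and the 100 TB example are illustrative.

```python
from typing import List, Tuple

# (phase, replicas in B, replicas in C), following the ratios in the text.
PHASES: List[Tuple[str, int, int]] = [
    ("initial", 3, 0),
    ("copied to C", 3, 3),
    ("accounts and queues moved", 3, 3),
    ("B released", 0, 3),
]


def physical_tb(logical_tb: float) -> List[Tuple[str, float]]:
    """Physical storage consumed in each phase for a given logical data size."""
    return [(name, logical_tb * (b + c)) for name, b, c in PHASES]


def peak_quota_factor() -> float:
    """Peak physical footprint relative to the footprint before migration."""
    start = PHASES[0][1] + PHASES[0][2]
    peak = max(b + c for _, b, c in PHASES)
    return peak / start


if __name__ == "__main__":
    for name, tb in physical_tb(100):
        print(f"{name}: {tb} TB")
```

For 100 TB of logical data, the footprint goes from 300 TB to 600 TB and back to 300 TB, so the space quota has to be raised before the copy and can be restored once B is released.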
Public tables are placed with a 2:2 replica policy across centers, reducing cross‑center read traffic to 20% of its original volume and overall bandwidth usage to roughly 10%.
7. Bandwidth monitoring and throttling
Since HDFS audit logs lack detailed client‑to‑datanode traffic data, Ctrip instrumented the DFSClient and DataNode code to report per‑operation byte counts, job IDs, and path information to a throttling service.
The service stores metrics in Elasticsearch and HDFS for analysis, and decides whether to allow a cross‑datacenter read or write based on job priority and current bandwidth usage. Clients must obtain a permit before proceeding; if the permit is denied, they back off and retry.
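A minimal sketch of the permit flow follows, assuming a simple rule in which higher-priority jobs may use a larger share of the link, and a client that retries with exponential backoff. The priority ceilings, function names, and backoff parameters are all hypothetical; the article does not describe the actual decision rule.

```python
import time
from typing import Callable

# Hypothetical policy: priority 0 is highest and may push the link hardest.
PRIORITY_CEILING = {0: 0.95, 1: 0.80, 2: 0.60}


def grant_permit(priority: int, link_utilization: float) -> bool:
    """Service side: allow the transfer if the link is below the job's ceiling."""
    ceiling = PRIORITY_CEILING.get(priority, min(PRIORITY_CEILING.values()))
    return link_utilization < ceiling


def acquire_with_backoff(request_permit: Callable[[], bool],
                         max_attempts: int = 5,
                         base_delay_s: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Client side: retry with exponential backoff; False if never granted."""
    for attempt in range(max_attempts):
        if request_permit():
            return True
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))
    return False
```

Under this rule, when the link is 90% utilized a priority-0 job is still admitted while a priority-2 job waits, which lets critical pipelines proceed while low-priority traffic yields.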
After deploying the service and adjusting replica policies, cross‑datacenter block read requests dropped to 20% of the previous level, and bandwidth consumption fell to 10% of its former peak.
8. Summary and future work
The cross‑datacenter transformation delivered:
Room‑aware single HDFS cluster with cross‑datacenter replica control.
Compute scheduling based on RM proxy and Yarn federation.
Automated storage and compute migration tooling.
Real‑time cross‑datacenter traffic monitoring and throttling.
After six months in production, 40% of compute jobs and 50% of data have been migrated to the new center, with bandwidth usage well within limits and users experiencing no disruption.
Future plans include intelligent account‑level migration decisions, finer‑grained replica policies down to partition level, and extending the cross‑datacenter capabilities to the Hadoop‑3 EC cluster.
dbaplus Community
