Design and Implementation of Alibaba Cloud's Cross‑Data‑Center Hadoop Cluster
In 2013, Alibaba Cloud's Hadoop cluster reached full rack capacity in a single IDC, prompting the development of a multi‑NameNode, cross‑data‑center solution that addresses NameNode scalability, inter‑site bandwidth limits, data placement, cross‑site job scheduling, migration of massive datasets, and user transparency.
In April 2013, the Alibaba Cloud "Yunti" (Cloud Ladder) Hadoop cluster ran out of rack space in its IDC, making further expansion impossible. To keep pace with the rapid growth of Alibaba's big‑data workloads, the team launched a cross‑data‑center Hadoop effort with the explicit goal of building a single cluster spanning multiple sites.
Technical challenges
Key difficulties included NameNode scalability, inter‑site network bandwidth constraints, data distribution across sites, cross‑site job scheduling, migration of tens of petabytes of data without loss, providing transparent access to users, and extending the solution to three or more data centers.
Solution steps
1. Upgrade the existing cluster to a Federation‑enabled version, treating the original NameNode as NameNode1, whose namespace contains all data (≈5,000 nodes).
2. Deploy a second NameNode (NameNode2) in the same data center with an empty namespace, creating a separate BlockPool for it on every DataNode.
3. Migrate roughly 50% of the metadata and block data from NameNode1 to NameNode2, producing a multi‑NameNode layout (Figure 1: Multi‑NameNode architecture).
4. Bring the second data center (Data Center B) online, with its slaves reporting to both NameNodes.
5. Physically move NameNode2 to Data Center B, then use CrossNode, a newly developed master component, to copy and rebalance block replicas across the two sites.
6. Implement ViewFS, a client‑side layer that automatically routes file‑system calls to the appropriate NameNode, making the multi‑NameNode setup transparent to applications (Figure 2: Client architecture with ViewFS).
7. For cross‑site job scheduling, run a separate MapReduce framework in each data center; a new JTProxy component decides, based on job‑group configuration, which JobTracker receives each job, and provides a unified web UI for monitoring jobs across both trackers (Figure 3: Cross‑site scheduling architecture).

To migrate data without service interruption, the team first copies the full fsimage from NameNode1 to NameNode2, then rebuilds the block reports on the DataNodes. BlockPools are duplicated using hard links, so both NameNodes see the same physical blocks while only one copy occupies storage. After the migration, each NameNode is configured to manage only its assigned subset of files, and the CrossNode master handles the actual movement of block replicas from Data Center A to B according to a configuration file (Figure 4: CrossNode architecture).
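The client‑side routing in step 6 maps path prefixes to NameNodes. Open‑source Hadoop exposes the same idea through a `viewfs://` mount table in `core-site.xml`; the fragment below is only illustrative (the cluster name, hostnames, and mount points are assumptions, and the article's ViewFS may be an in‑house variant):

```xml
<!-- core-site.xml: client-side mount table routing paths to NameNodes -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://yuntiCluster</value>
</property>
<!-- /user resolves to NameNode1's namespace -->
<property>
  <name>fs.viewfs.mounttable.yuntiCluster.link./user</name>
  <value>hdfs://namenode1:8020/user</value>
</property>
<!-- /group resolves to NameNode2's namespace -->
<property>
  <name>fs.viewfs.mounttable.yuntiCluster.link./group</name>
  <value>hdfs://namenode2:8020/group</value>
</property>
```

With such a table in place, a client opening `/group/log/part-0000` is transparently directed to NameNode2 without any application change.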
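The routing decision in step 7 can be sketched as a lookup from job group to JobTracker address. Everything below is a hypothetical illustration; the article does not describe JTProxy's implementation, and the group names and tracker addresses are invented:

```java
import java.util.HashMap;
import java.util.Map;

public class JTProxyDemo {
    // Hypothetical mapping from job group to JobTracker address; the real
    // JTProxy derives this from job-group configuration.
    static final Map<String, String> GROUP_TO_TRACKER = new HashMap<>();
    static {
        GROUP_TO_TRACKER.put("search", "jobtracker-a.example:9001"); // Data Center A
        GROUP_TO_TRACKER.put("ads", "jobtracker-b.example:9001");    // Data Center B
    }

    // Route a submitted job to the JobTracker configured for its group,
    // falling back to Data Center A for unconfigured groups.
    static String route(String jobGroup) {
        return GROUP_TO_TRACKER.getOrDefault(jobGroup, "jobtracker-a.example:9001");
    }

    public static void main(String[] args) {
        System.out.println(route("ads"));
    }
}
```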
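The hard‑link duplication of BlockPools relies on a basic filesystem property: two directory entries can reference the same on‑disk data, so "copying" a block into a second BlockPool costs no extra storage. A minimal Java sketch of that property (file names and content are illustrative, not actual DataNode layout):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class HardLinkDemo {
    public static void main(String[] args) throws Exception {
        // Simulate a block file belonging to NameNode1's BlockPool.
        Path dir = Files.createTempDirectory("bp-demo");
        Path bp1Block = dir.resolve("blk_1001");
        Files.write(bp1Block, "block bytes".getBytes());

        // "Duplicate" the block into NameNode2's BlockPool with a hard link:
        // both names point at the same on-disk data, so no space is consumed.
        Path bp2Block = dir.resolve("blk_1001_bp2");
        Files.createLink(bp2Block, bp1Block);

        // Reading through the second name yields the same physical block.
        System.out.println(new String(Files.readAllBytes(bp2Block)));
    }
}
```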
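CrossNode's configuration‑driven replica movement can be sketched as a plan mapping each path to per‑site replica targets: 3:3 while both sites hold copies, then 0:3 once the Data Center A replicas can be dropped. The paths and plan structure below are hypothetical; the article does not specify CrossNode's configuration format:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CrossNodePlan {
    // Hypothetical replica plan: path -> {site-A replicas, site-B replicas}.
    static Map<String, int[]> plan() {
        Map<String, int[]> p = new LinkedHashMap<>();
        p.put("/group/ads", new int[]{3, 3});  // phase 1: replicate to both sites
        p.put("/group/done", new int[]{0, 3}); // phase 2: drop site-A replicas
        return p;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, int[]> e : plan().entrySet()) {
            int[] r = e.getValue();
            String action = (r[0] == 0)
                    ? "migration complete, free site-A storage"
                    : "copy replicas to site B";
            System.out.println(e.getKey() + " -> " + r[0] + ":" + r[1] + " (" + action + ")");
        }
    }
}
```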
CrossNode reads a list of files and directories that require cross‑site replication, copies the necessary block replicas, and later adjusts the per‑site replica counts (e.g., from 3:3 to 0:3) to complete the migration, eliminating the bandwidth bottleneck for most jobs.

Summary

Through these steps, the Cloud Ladder Hadoop cluster now spans two IDC sites with nearly ten thousand nodes, offering virtually unlimited scalability, transparent client access, and dynamic data and compute placement. The architecture removes the single‑NameNode performance ceiling, supports future expansion to additional sites, and ensures that Alibaba's big‑data workloads can grow without hitting data‑size or performance limits.

The author, Luo Li (alias "Ghost Li"), was among the first employees of Alibaba's distributed team and now leads the distributed storage group.