How Suning Scaled HDFS with Alluxio: Multi‑Cluster Architecture and Performance Gains
This article details how Suning overcame HDFS Namenode performance bottlenecks by partitioning storage into multiple clusters and using Alluxio's unified namespace as a single entry point. It covers the design decisions, implementation challenges, and performance tests that show significant throughput and latency improvements.
1. Introduction
As Suning's big data platform has grown, the HDFS Namenode has become a performance bottleneck, especially during highly concurrent overnight batch jobs, when RPC latency can exceed 1 second and severely degrade compute performance. This article explains how Suning addressed the issue.
We discuss:
Problems and challenges of a single HDFS cluster and their root causes;
Considerations and comparisons for splitting into multiple clusters;
Using Alluxio's unified namespace to provide a single entry point for multiple HDFS clusters.
2. Bottlenecks and Challenges of a Single HDFS Cluster
Figure 1 shows Suning's big-data software stack, which is similar to that of other vendors. The HDFS cluster has more than 1,500 Datanodes (48 TB each), over 30 PB of data stored, roughly 180 million files and blocks, about 400,000 tasks per day, and over 30,000 concurrent Namenode requests during peak hours.
We run Hadoop 2.4.0; Figure 2 illustrates the HDFS RPC request‑processing flow.
High RPC latency stems from:
Write requests must acquire the global FSNamesystem write lock, causing long waits under high concurrency;
EditLog and audit-log writes are synchronous.
Figure 3 shows write-request latency over a typical day; between 3 am and 5 am it hovers around 500 ms.
We applied the following optimizations (a configuration sketch follows the list):
Reduce HDFS read/write frequency, especially with dynamic partitions;
Make audit log asynchronous;
Set fsImage merge interval to 1 hour;
Disable filesystem access‑time updates.
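As a rough reference, the Namenode-side tweaks correspond to hdfs-site.xml properties in recent Hadoop releases; note that asynchronous audit logging may not be available on 2.4.0 without a patch. The sketch below expresses them programmatically for illustration only, not as Suning's exact settings.

```java
import org.apache.hadoop.conf.Configuration;

public class NamenodeTuningSketch {
    // Returns a Configuration carrying the tuning described above.
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Asynchronous audit logging (available in newer Hadoop releases).
        conf.setBoolean("dfs.namenode.audit.log.async", true);
        // Merge the EditLog into the fsImage every hour (value in seconds).
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // Disable access-time updates to avoid extra write-lock traffic.
        conf.setLong("dfs.namenode.accesstime.precision", 0);
        return conf;
    }
}
```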
These improvements helped but did not meet expectations, prompting a multi‑cluster approach.
3. Multi‑Cluster Solution Investigation
Key design questions:
Transparency to users;
Cross‑cluster data access;
Cluster partitioning criteria;
System stability;
Operational complexity.
We evaluated three options:
3.1 Federation + Viewfs
A client-side mount-table solution already available in our Hadoop 2.4.0 release.
Pros:
User‑transparent;
Viewfs enables cross‑cluster access;
Proven stability in production.
Cons:
Client-side configuration scales poorly (see the sketch below); more accounts mean more mount entries to maintain, increasing operational complexity.
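To make the client-side cost concrete, here is a minimal sketch of the viewfs mount table every client would carry; the mount-table name "suning" and the nameservices "ns1"/"ns2" are hypothetical. Each new account adds another link entry that has to be pushed to every client, which is where the operational burden comes from.

```java
import org.apache.hadoop.conf.Configuration;

public class ViewfsMountTableSketch {
    // Builds a client Configuration that routes paths through viewfs.
    public static Configuration viewfsConf() {
        Configuration conf = new Configuration();
        // Clients address the federated namespace through a viewfs:// URI.
        conf.set("fs.defaultFS", "viewfs://suning");
        // Each link maps a logical path to a concrete HDFS nameservice.
        conf.set("fs.viewfs.mounttable.suning.link./user/bi",
                 "hdfs://ns1/user/bi");
        conf.set("fs.viewfs.mounttable.suning.link./user/ads",
                 "hdfs://ns2/user/ads");
        return conf;
    }
}
```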
3.2 HDFS Router
Introduced in Hadoop 2.9.0; routes RPC requests through a server-side mount table.
Pros: reduces client‑side complexity.
Cons: lacks large‑scale production validation; stability uncertain.
3.3 Alluxio Unified Namespace
Alluxio can mount HDFS paths from multiple clusters into a single namespace, offering user transparency, acceptable operational cost, and proven stability at major internet companies. We selected this approach.
4. Design and Implementation of Alluxio Multi‑Cluster Unified Entry
We partition clusters by account, ensuring an account resides in only one cluster.
Transparency is achieved via same-path mounting: an account's directory appears at the same path in Alluxio as in its home HDFS cluster (e.g., /user/bi), as the mount sketch below shows.
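A minimal sketch of how such same-path mounts could be created with the Alluxio Java client follows; the master addresses and the second account path are hypothetical, and the same result can be achieved with the alluxio fs mount CLI.

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;

public class UnifiedEntrySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.Factory.get();
        // Each account lives in exactly one HDFS cluster; its directory is
        // mounted at the identical path in the Alluxio namespace.
        fs.mount(new AlluxioURI("/user/bi"),
                 new AlluxioURI("hdfs://cluster1-nn:8020/user/bi"));
        fs.mount(new AlluxioURI("/user/ads"),
                 new AlluxioURI("hdfs://cluster2-nn:8020/user/ads"));
    }
}
```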
During testing we identified three issues and their solutions:
4.1 Metadata Volume
For the same volume of metadata, the Alluxio Master consumes more memory than the Namenode, so it would become the new bottleneck.
Solution: separate metadata management between Alluxio and HDFS. Figure 5 illustrates file‑metadata placement.
File A is stored only in Alluxio, with metadata in the Alluxio Master; File B only in HDFS, with metadata in the Namenode; File C in both, with metadata duplicated. Consistency issues do not arise because, in our scenario, Alluxio is not used for actual data storage.
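To keep Alluxio out of the data path in such a scenario (the File B case in Figure 5), the client can be told to write through to HDFS and skip caching on reads. The sketch below uses Alluxio's Hadoop-compatible client; the master hostname is hypothetical and exact property keys may vary across Alluxio versions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataSplitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route alluxio:// paths through Alluxio's Hadoop-compatible client.
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
        // Keep Alluxio out of the data path: write through to the backing
        // HDFS cluster and do not cache blocks in Alluxio workers on read.
        conf.set("alluxio.user.file.writetype.default", "THROUGH");
        conf.set("alluxio.user.file.readtype.default", "NO_CACHE");

        // "alluxio-master" is a hypothetical hostname; 19998 is the default
        // Alluxio master RPC port.
        FileSystem fs = FileSystem.get(
                URI.create("alluxio://alluxio-master:19998/"), conf);
        // Metadata for this path is loaded from the backing HDFS cluster on
        // demand; the bytes themselves stay in HDFS.
        System.out.println(fs.getFileStatus(new Path("/user/bi")));
        fs.close();
    }
}
```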
4.2 Long‑Lived Connections
Alluxio uses Thrift RPC with persistent client‑to‑Master connections, which can be saturated during peak hours.
Solution: clients proactively close idle connections and reconnect as needed. Figure 6 shows the reconnection latency (~1 ms), which is acceptable for offline clusters. A simplified sketch of the pattern follows.
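The sketch below illustrates the close-idle-then-reconnect idea in general terms; it is not Suning's actual Alluxio client patch, and the timeout value is arbitrary.

```java
import java.io.Closeable;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Simplified illustration: wrap any Closeable client, drop it when it has
// been idle too long, and lazily reconnect on the next use.
public class IdleClosingClient<C extends Closeable> {
    private final Supplier<C> factory;
    private final long idleTimeoutNanos;
    private C client;
    private long lastUsedNanos;

    public IdleClosingClient(Supplier<C> factory, long idleTimeout, TimeUnit unit) {
        this.factory = factory;
        this.idleTimeoutNanos = unit.toNanos(idleTimeout);
    }

    /** Returns a live client, reconnecting if the previous one was closed as idle. */
    public synchronized C get() throws Exception {
        long now = System.nanoTime();
        if (client != null && now - lastUsedNanos > idleTimeoutNanos) {
            client.close();          // proactively drop the idle connection
            client = null;
        }
        if (client == null) {
            client = factory.get();  // reconnect on demand (~1 ms observed)
        }
        lastUsedNanos = now;
        return client;
    }
}
```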
4.3 Jar Compatibility
We resolved Alluxio client jar conflicts by relocating (shading) the client's dependencies with the Maven shade plugin.
5. Performance Testing
Before rollout we benchmarked single‑cluster versus two‑cluster setups with 20 and 50 concurrent accounts. Figure 7 presents the results.
With two clusters, HDFS TPS increased by 40 % and response time dropped by 22 %.
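For readers who want to run a similar comparison, a hypothetical load driver along the following lines (not the harness behind Figure 7) has each thread act as one account and issue metadata operations against the unified entry point; the master address is assumed.

```java
import java.net.URI;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcurrentMetadataLoad {
    public static void main(String[] args) throws Exception {
        int accounts = Integer.parseInt(args[0]);   // e.g. 20 or 50
        Configuration conf = new Configuration();
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
        ExecutorService pool = Executors.newFixedThreadPool(accounts);
        for (int i = 0; i < accounts; i++) {
            final int id = i;
            pool.submit(() -> {
                // One thread per "account", each with its own client instance.
                try (FileSystem fs = FileSystem.newInstance(
                        URI.create("alluxio://alluxio-master:19998/"), conf)) {
                    Path dir = new Path("/user/account" + id + "/bench");
                    for (int op = 0; op < 10_000; op++) {
                        // A cheap mix of write and read metadata RPCs.
                        Path d = new Path(dir, "d" + op);
                        fs.mkdirs(d);
                        fs.getFileStatus(d);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```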
6. Future Optimizations
Identified issues to address in Q2:
Clients must still store Namenode addresses; we plan to move the HDFS configuration into the Alluxio Master.
The Alluxio Master cannot tell which Namenode is active and which is standby, which makes failover suboptimal; we will improve its awareness of Namenode status.
Postscript
After we upgraded to Hadoop 2.8.2 in March, RPC latency fell below 50 ms, but the memory-usage and write-lock problems persist, so a multi-cluster approach remains essential as the platform continues to grow.