How Suning Scaled HDFS with Alluxio: Multi‑Cluster Architecture and Performance Gains
This article details how Suning overcame HDFS Namenode performance bottlenecks by partitioning storage into multiple clusters and using Alluxio's unified namespace as a single entry point. It covers the design decisions, implementation challenges, and performance tests that show significant throughput and latency improvements.
1. Introduction
As Suning's big data platform has grown, the HDFS Namenode has become a performance bottleneck, especially during highly concurrent overnight batch jobs, when RPC latency can exceed 1 second and severely degrade compute performance. This article explains how Suning addressed the issue.
We discuss:
Problems and challenges of a single HDFS cluster and their root causes;
Considerations and comparisons for splitting into multiple clusters;
Using Alluxio's unified namespace to provide a single entry point for multiple HDFS clusters.
2. Bottlenecks and Challenges of a Single HDFS Cluster
Figure 1 shows Suning's big-data software stack, which is similar to that of other vendors. The HDFS cluster has more than 1,500 Datanodes (48 TB each), over 30 PB of data stored, roughly 180 million files and blocks, about 400,000 tasks per day, and over 30,000 concurrent Namenode requests during peak hours.
We run Hadoop 2.4.0; Figure 2 illustrates the HDFS RPC request‑processing flow.
High RPC latency stems from:
Write requests must acquire the global FSNamesystem write lock, causing long waits under high concurrency;
EditLog and audit-log writes are synchronous.
Figure 3 shows write-request latency over a typical day; between 3 am and 5 am it hovers around 500 ms.
We applied the following optimizations (a configuration sketch follows the list):
Reduce HDFS read/write frequency, especially with dynamic partitions;
Make audit log asynchronous;
Set fsImage merge interval to 1 hour;
Disable filesystem access‑time updates.
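As a rough reference, the Namenode-side tweaks correspond to hdfs-site.xml properties in recent Hadoop releases; note that asynchronous audit logging may not be available on 2.4.0 without a patch. The sketch below expresses them programmatically for illustration only, not as Suning's exact settings.

```java
import org.apache.hadoop.conf.Configuration;

public class NamenodeTuningSketch {
    // Returns a Configuration carrying the tuning described above.
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Asynchronous audit logging (available in newer Hadoop releases).
        conf.setBoolean("dfs.namenode.audit.log.async", true);
        // Merge the EditLog into the fsImage every hour (value in seconds).
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // Disable access-time updates to avoid extra write-lock traffic.
        conf.setLong("dfs.namenode.accesstime.precision", 0);
        return conf;
    }
}
```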
These improvements helped but did not meet expectations, prompting a multi‑cluster approach.
3. Multi‑Cluster Solution Investigation
Key design questions:
Transparency to users;
Cross‑cluster data access;
Cluster partitioning criteria;
System stability;
Operational complexity.
We evaluated three options:
3.1 Federation + Viewfs
A client-side mount-table solution already available in our Hadoop 2.4.0 release.
Pros:
User‑transparent;
Viewfs enables cross‑cluster access;
Proven stability in production.
Cons:
Client-side configuration scales poorly (see the sketch below); more accounts mean more mount entries to maintain, increasing operational complexity.
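To make the client-side cost concrete, here is a minimal sketch of the viewfs mount table every client would carry; the mount-table name "suning" and the nameservices "ns1"/"ns2" are hypothetical. Each new account adds another link entry that has to be pushed to every client, which is where the operational burden comes from.

```java
import org.apache.hadoop.conf.Configuration;

public class ViewfsMountTableSketch {
    // Builds a client Configuration that routes paths through viewfs.
    public static Configuration viewfsConf() {
        Configuration conf = new Configuration();
        // Clients address the federated namespace through a viewfs:// URI.
        conf.set("fs.defaultFS", "viewfs://suning");
        // Each link maps a logical path to a concrete HDFS nameservice.
        conf.set("fs.viewfs.mounttable.suning.link./user/bi",
                 "hdfs://ns1/user/bi");
        conf.set("fs.viewfs.mounttable.suning.link./user/ads",
                 "hdfs://ns2/user/ads");
        return conf;
    }
}
```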
3.2 HDFS Router
Introduced in Hadoop 2.9.0; routes RPC requests through a server-side mount table.
Pros: reduces client‑side complexity.
Cons: lacks large‑scale production validation; stability uncertain.
3.3 Alluxio Unified Namespace
Alluxio can mount HDFS paths from multiple clusters into a single namespace, offering user transparency, acceptable operational cost, and proven stability at major internet companies. We selected this approach.
4. Design and Implementation of Alluxio Multi‑Cluster Unified Entry
We partition clusters by account, ensuring an account resides in only one cluster.
Transparency is achieved via same-path mounting: an account's directory appears at the same path in Alluxio as in its home HDFS cluster (e.g., /user/bi), as the mount sketch below shows.
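A minimal sketch of how such same-path mounts could be created with the Alluxio Java client follows; the master addresses and the second account path are hypothetical, and the same result can be achieved with the alluxio fs mount CLI.

```java
import alluxio.AlluxioURI;
import alluxio.client.file.FileSystem;

public class UnifiedEntrySketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.Factory.get();
        // Each account lives in exactly one HDFS cluster; its directory is
        // mounted at the identical path in the Alluxio namespace.
        fs.mount(new AlluxioURI("/user/bi"),
                 new AlluxioURI("hdfs://cluster1-nn:8020/user/bi"));
        fs.mount(new AlluxioURI("/user/ads"),
                 new AlluxioURI("hdfs://cluster2-nn:8020/user/ads"));
    }
}
```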
During testing we identified three issues and their solutions:
4.1 Metadata Volume
For the same volume of metadata, the Alluxio Master consumes more memory than the Namenode, so it would become the new bottleneck.
Solution: separate metadata management between Alluxio and HDFS. Figure 5 illustrates file‑metadata placement.
File A is stored only in Alluxio, with metadata in the Alluxio Master; File B only in HDFS, with metadata in the Namenode; File C in both, with metadata duplicated. Consistency issues do not arise because, in our scenario, Alluxio is not used for actual data storage.
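To keep Alluxio out of the data path in such a scenario (the File B case in Figure 5), the client can be told to write through to HDFS and skip caching on reads. The sketch below uses Alluxio's Hadoop-compatible client; the master hostname is hypothetical and exact property keys may vary across Alluxio versions.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetadataSplitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route alluxio:// paths through Alluxio's Hadoop-compatible client.
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
        // Keep Alluxio out of the data path: write through to the backing
        // HDFS cluster and do not cache blocks in Alluxio workers on read.
        conf.set("alluxio.user.file.writetype.default", "THROUGH");
        conf.set("alluxio.user.file.readtype.default", "NO_CACHE");

        // "alluxio-master" is a hypothetical hostname; 19998 is the default
        // Alluxio master RPC port.
        FileSystem fs = FileSystem.get(
                URI.create("alluxio://alluxio-master:19998/"), conf);
        // Metadata for this path is loaded from the backing HDFS cluster on
        // demand; the bytes themselves stay in HDFS.
        System.out.println(fs.getFileStatus(new Path("/user/bi")));
        fs.close();
    }
}
```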
4.2 Long‑Lived Connections
Alluxio uses Thrift RPC with persistent client‑to‑Master connections, which can be saturated during peak hours.
Solution: clients proactively close idle connections and reconnect as needed. Figure 6 shows the reconnection latency (~1 ms), which is acceptable for offline clusters. A simplified sketch of the pattern follows.
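The sketch below illustrates the close-idle-then-reconnect idea in general terms; it is not Suning's actual Alluxio client patch, and the timeout value is arbitrary.

```java
import java.io.Closeable;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Simplified illustration: wrap any Closeable client, drop it when it has
// been idle too long, and lazily reconnect on the next use.
public class IdleClosingClient<C extends Closeable> {
    private final Supplier<C> factory;
    private final long idleTimeoutNanos;
    private C client;
    private long lastUsedNanos;

    public IdleClosingClient(Supplier<C> factory, long idleTimeout, TimeUnit unit) {
        this.factory = factory;
        this.idleTimeoutNanos = unit.toNanos(idleTimeout);
    }

    /** Returns a live client, reconnecting if the previous one was closed as idle. */
    public synchronized C get() throws Exception {
        long now = System.nanoTime();
        if (client != null && now - lastUsedNanos > idleTimeoutNanos) {
            client.close();          // proactively drop the idle connection
            client = null;
        }
        if (client == null) {
            client = factory.get();  // reconnect on demand (~1 ms observed)
        }
        lastUsedNanos = now;
        return client;
    }
}
```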
4.3 Jar Compatibility
We resolved Alluxio client jar conflicts by relocating (shading) the client's dependencies with the Maven shade plugin.
5. Performance Testing
Before rollout we benchmarked single‑cluster versus two‑cluster setups with 20 and 50 concurrent accounts. Figure 7 presents the results.
With two clusters, HDFS TPS increased by 40 % and response time dropped by 22 %.
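For readers who want to run a similar comparison, a hypothetical load driver along the following lines (not the harness behind Figure 7) has each thread act as one account and issue metadata operations against the unified entry point; the master address is assumed.

```java
import java.net.URI;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcurrentMetadataLoad {
    public static void main(String[] args) throws Exception {
        int accounts = Integer.parseInt(args[0]);   // e.g. 20 or 50
        Configuration conf = new Configuration();
        conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem");
        ExecutorService pool = Executors.newFixedThreadPool(accounts);
        for (int i = 0; i < accounts; i++) {
            final int id = i;
            pool.submit(() -> {
                // One thread per "account", each with its own client instance.
                try (FileSystem fs = FileSystem.newInstance(
                        URI.create("alluxio://alluxio-master:19998/"), conf)) {
                    Path dir = new Path("/user/account" + id + "/bench");
                    for (int op = 0; op < 10_000; op++) {
                        // A cheap mix of write and read metadata RPCs.
                        Path d = new Path(dir, "d" + op);
                        fs.mkdirs(d);
                        fs.getFileStatus(d);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```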
6. Future Optimizations
Identified issues to address in Q2:
Clients must still store Namenode addresses; we plan to move the HDFS configuration into the Alluxio Master.
The Alluxio Master cannot tell which Namenode is active and which is standby, which makes failover suboptimal; we will improve its awareness of Namenode status.
Postscript
After we upgraded to Hadoop 2.8.2 in March, RPC latency fell below 50 ms, but the memory-usage and write-lock problems persist, so a multi-cluster approach remains essential as the platform continues to grow.