
Alibaba Data Center Network Architecture HAIL 5.1: High Availability, De‑stacking, and Low‑Latency RDMA Design

The article describes Alibaba's HAIL 5.1 data‑center network architecture introduced for the 2018 Double‑11 event, detailing its high‑availability de‑stacking design, low‑latency RDMA deployment, and future HAIL 2.0 evolution to support larger‑scale, intelligent, and high‑performance cloud networking.

Alibaba Cloud Infrastructure

For engineers, the annual Double-11 shopping festival is a massive technical challenge that pushes system performance and stability to their limits. Between 2009 and 2018, the festival's transaction volume grew to 213.5 billion yuan and order-creation peaks kept setting new records, which demands a high-performance, highly available underlying network.

Data-center networks act as the highway for these large-scale distributed systems, and both higher bandwidth and lower latency are essential to deliver extreme performance for business workloads.

In 2018, the centerpiece of Alibaba's data-center network was a next-generation architecture called HAIL 5.1 (High Availability, High Intelligence, Low latency). The core keywords of this architecture are RDMA, de-stacking, and traffic visualization.

High Availability Key Design – De‑stacking

Traditional dual-uplink designs rely on stacking two TOR switches, which merges them into a single logical device and introduces complexity and additional failure points. HAIL 5.1 replaces this with a de-stacked approach: servers keep dual-active links to two separate TORs, while the network devices implement features that keep all traffic in the Layer 3 routing plane, eliminating the need for hardware stacking.

Network‑side features:

ARP entries are converted to /32 host routes on all ASWs (the access switches, which play the TOR role).

The LACP system ID is configurable, so the two ASWs can present the same system ID and negotiate LACP with the server's dual links as if they were one device.

ARP proxy uses the ASW's MAC address to answer all ARP requests from hosts or virtual hosts, forcing inter-host traffic to be routed at Layer 3.
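
The first and third network-side features can be illustrated with a minimal sketch. It is not Alibaba's switch code: the `ArpEntry` type, the port names, and the MAC addresses are hypothetical, and the model only shows the idea that each learned ARP entry becomes a /32 host route while every ARP request is answered with the switch's own MAC.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ArpEntry:
    """One learned ARP entry on an access switch (hypothetical model)."""
    ip: str
    mac: str
    port: str


def arp_to_host_routes(entries):
    """Turn each ARP entry into a /32 host route: prefix -> egress port."""
    routes = {}
    for entry in entries:
        prefix = ipaddress.ip_network(f"{entry.ip}/32")
        routes[prefix] = entry.port
    return routes


def proxy_arp_reply(asw_mac, requested_ip):
    """Answer any ARP request with the switch MAC, never the real host MAC.

    The requesting host then sends frames to the switch, which routes them
    using the /32 host routes instead of bridging them at Layer 2.
    """
    return {"ip": requested_ip, "mac": asw_mac}


ASW_MAC = "02:00:00:00:00:01"  # hypothetical switch MAC
entries = [
    ArpEntry("10.0.0.11", "02:00:00:00:10:11", "eth1"),
    ArpEntry("10.0.0.12", "02:00:00:00:10:12", "eth2"),
]
routes = arp_to_host_routes(entries)
reply = proxy_arp_reply(ASW_MAC, "10.0.0.12")
```

In this model a host asking for 10.0.0.12 receives the switch MAC, so its traffic reaches the switch first and is forwarded by the /32 route to `eth2`.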

Host‑side feature:

The host sends its ARP packets out on every bond slave interface (TX direction), so that both TOR switches stay synchronized on ARP information.
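
A toy simulation can show why the host-side feature matters. The `Tor` and `Bond` classes below are hypothetical stand-ins, and the assumption that an ordinary bond would send a given ARP packet on only one slave is a simplification made for illustration.

```python
class Tor:
    """A TOR switch that records the ARP information it receives."""

    def __init__(self, name):
        self.name = name
        self.arp_table = {}

    def receive_arp(self, ip, mac):
        self.arp_table[ip] = mac


class Bond:
    """A host bond with one slave link per TOR (simplified model)."""

    def __init__(self, tors, replicate_arp):
        self.tors = tors                    # one TOR behind each slave link
        self.replicate_arp = replicate_arp  # the host-side feature toggle

    def send_arp(self, ip, mac):
        # Without replication, assume the packet leaves on the first slave only.
        targets = self.tors if self.replicate_arp else self.tors[:1]
        for tor in targets:
            tor.receive_arp(ip, mac)


def learned_by(replicate_arp):
    """Return the names of the TORs that learned the host's ARP entry."""
    tors = [Tor("tor-a"), Tor("tor-b")]
    Bond(tors, replicate_arp).send_arp("10.0.0.11", "02:00:00:00:10:11")
    return [tor.name for tor in tors if "10.0.0.11" in tor.arp_table]


without_feature = learned_by(replicate_arp=False)
with_feature = learned_by(replicate_arp=True)
```

With replication enabled, both switches hold the entry, so either one can install the corresponding /32 host route without a stacking link between them.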

Low Latency – RDMA Large‑Scale Deployment

Business workloads care about throughput (requests per unit time) and latency (delay per request). Reducing latency directly improves throughput under the same resource constraints.
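
One way to make the latency and throughput relationship concrete is Little's Law: with a fixed number of requests in flight, throughput equals concurrency divided by latency. The sketch below uses hypothetical numbers (64 outstanding requests, 200 and 100 microseconds per request) that do not come from the article.

```python
def throughput_per_second(in_flight_requests, latency_us):
    """Little's Law: throughput = concurrency / latency.

    Latency is given in microseconds, so scale by 1,000,000 to get
    requests per second.
    """
    return in_flight_requests * 1_000_000 / latency_us


IN_FLIGHT = 64  # hypothetical fixed resource budget (outstanding requests)

slow = throughput_per_second(IN_FLIGHT, latency_us=200)
fast = throughput_per_second(IN_FLIGHT, latency_us=100)
print(f"200 us per request: {slow:,.0f} req/s")
print(f"100 us per request: {fast:,.0f} req/s")
```

Under these assumptions, halving latency doubles throughput without adding any concurrency, which is the effect the paragraph above describes.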

HAIL 5.1 achieves low latency through two main designs:

A scale-out design replaces traditional multi-chip POD switches (PSWs) with single-chip devices, cutting the number of forwarding hops inside a POD and thus reducing intra-POD latency.

RDMA is supported at the POD level, providing industry‑leading high‑performance, low‑latency networking. The simplified single‑chip PSW also eases RDMA deployment.
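
The hop reduction from the single-chip PSW can be sketched with simple arithmetic. The figures are assumptions rather than measurements from the article: a multi-chip chassis is modeled as three internal chip traversals (ingress line card, fabric, egress line card), and the per-chip latency is a hypothetical constant.

```python
def chips_traversed(psw_chips_per_transit):
    """Count switch chips on the path server -> ASW -> PSW -> ASW -> server."""
    asw_chips = 1  # each access switch is modeled as a single chip
    return asw_chips + psw_chips_per_transit + asw_chips


MULTI_CHIP_PSW = 3         # assumed: ingress line card, fabric, egress line card
SINGLE_CHIP_PSW = 1        # one chip forwards the packet directly
PER_CHIP_LATENCY_NS = 800  # hypothetical per-chip forwarding latency

old_path = chips_traversed(MULTI_CHIP_PSW)
new_path = chips_traversed(SINGLE_CHIP_PSW)
saved_ns = (old_path - new_path) * PER_CHIP_LATENCY_NS
print(f"chips traversed: {old_path} -> {new_path}, saving about {saved_ns} ns")
```

Fewer chips on the path also means fewer queues to configure for lossless traffic, which is consistent with the statement that the simplified PSW eases RDMA deployment.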

Alibaba has built the world’s largest RDMA deployment, running on the 5.1 platform for services such as Group DB, Alibaba Cloud ESSD, PolarDB, big‑data PAI, and high‑performance computing. During Double‑11 2018, 100% of DB and ESSD traffic leveraged RDMA.

The RDMA stack uses RoCEv2 with DCQCN congestion control plus lossless flow control, and is integrated with a platform-level RDMA-Service that offers monitoring, parameter validation, and automated fault detection via gRPC/ERSPAN channels.
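
For readers unfamiliar with DCQCN, the sender-side reaction can be sketched as follows. This is a simplified model based on the published DCQCN algorithm, not Alibaba's implementation or tuning: the class and method names are hypothetical, and the timers, byte counters, and additive or hyper increase stages are omitted.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN reaction-point state for one flow."""
    current_rate: float   # sending rate, for example in Gbps
    target_rate: float    # rate to recover toward after a cut
    alpha: float = 1.0    # running estimate of congestion severity
    g: float = 1 / 256    # gain used to update alpha

    def on_cnp(self):
        """A congestion notification packet arrived: cut the rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No notification arrived during the timer window: decay alpha."""
        self.alpha = (1 - self.g) * self.alpha

    def fast_recovery_step(self):
        """Move the current rate halfway back toward the target rate."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender(current_rate=100.0, target_rate=100.0)
sender.on_cnp()              # congestion signaled: the rate is cut
rate_after_cut = sender.current_rate
sender.fast_recovery_step()  # congestion cleared: the rate climbs back
rate_after_recovery = sender.current_rate
```

The lossless flow control mentioned above acts as a backstop underneath this loop: the rate control tries to keep queues short so that pause frames are rarely needed.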

Outlook

Alibaba plans to evolve the architecture to a region‑level HAIL 2.0, further enhancing high availability, intelligence, and low latency across a broader scope. The next generation will feature self‑developed switches, custom RDMA flow‑control, and intelligent NICs, continuing to lead next‑generation network visualization and deployment.

