Evolution of Alibaba Cloud’s Fundamental Network Architecture and the Development of AliNOS
The article traces Alibaba Cloud’s large‑scale data‑center network evolution, explains the motivations behind its self‑developed white‑box platforms, details the design and growth of the AliNOS network operating system built on SONiC, and discusses current capabilities and future challenges such as routing, IPv6, SRv6, and endpoint‑network integration.
Alibaba Cloud has built a massive, self‑developed network platform for its ultra‑large data‑center networks, and that platform has withstood extreme traffic peaks such as the Double‑11 shopping festival. The AliNOS network operating system, part of the HAIL (High Availability, Intelligence, Low‑latency) architecture, is a key enabler of this infrastructure.
Before 2017, the network relied on commercial off‑the‑shelf (COTS) equipment, leading to four major pain points: opaque hardware costs and failure rates, inflexible feature development, fragmented multi‑vendor management, and slow vendor‑driven repair cycles. To address these, Alibaba Cloud adopted a fully self‑designed, white‑box approach, creating modular hardware and integrating it with a custom OS (AliNOS) to achieve high availability, low cost, rapid iteration, and intelligent automation.
The white‑box strategy introduced standardized, modular devices, a DevOps‑driven development process, and an Intent‑Based Networking (IBN) layer that standardizes the northbound and southbound interfaces, enabling scalable, stable network operations. End‑to‑end automation further reduces manual intervention.
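The intent‑based idea can be illustrated with a minimal sketch: an operator declares the intended state, and the automation layer computes the changes needed to bring the observed state in line with it. This is not Alibaba Cloud's implementation; the function name `plan_changes` and the flat key/value state model are assumptions made purely for illustration.

```python
from typing import Any, Dict, List, Tuple

Action = Tuple[str, str, Any]  # (operation, config key, value)


def plan_changes(intended: Dict[str, Any], actual: Dict[str, Any]) -> List[Action]:
    """Diff intended state against observed state (hypothetical helper).

    Returns the ordered list of actions needed so that `actual`
    would match `intended`. Keys are processed in sorted order so
    the plan is deterministic.
    """
    actions: List[Action] = []
    for key in sorted(set(intended) | set(actual)):
        if key not in intended:
            # Present on the device but not declared: remove it.
            actions.append(("delete", key, actual[key]))
        elif key not in actual:
            # Declared but missing on the device: create it.
            actions.append(("create", key, intended[key]))
        elif intended[key] != actual[key]:
            # Present on both sides with different values: update it.
            actions.append(("update", key, intended[key]))
    return actions


if __name__ == "__main__":
    intended = {"Ethernet0/mtu": 9100, "Ethernet4/mtu": 9100}
    actual = {"Ethernet0/mtu": 1500, "Ethernet8/mtu": 9100}
    for action in plan_changes(intended, actual):
        print(action)
```

Because the plan is derived from a diff rather than from a hand‑written sequence of commands, running it again after the device converges yields an empty list, which is what makes this style of automation safe to repeat.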
AliNOS builds on the open‑source SONiC project, which provides a Linux‑based network OS with a full protocol stack (BGP, BFD, LLDP) and the hardware‑agnostic Switch Abstraction Interface (SAI). Alibaba Cloud contributes back to the community, enhancing SONiC with features such as de‑stacking (removing the dependency on stacked switches), performance optimizations, and large‑scale RDMA support.
Current AliNOS deployments support Alibaba Cloud’s 5.2 and 6.0 data‑center architectures and are expanding to 7.0, wide‑area networks, edge gateways, and DPU‑based devices. Ongoing challenges include extending the platform from switches to routers (supporting million‑scale L3VPN routes, BGP scalability, and memory management), simplifying the protocol stack with IPv6, and implementing SRv6‑based traffic control.
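To make the SRv6 part concrete, the sketch below shows two generic mechanics from the SRv6 standards rather than anything specific to AliNOS: a SID is an IPv6 address composed of a locator prefix followed by a function ID (RFC 8986), and the Segment Routing Header stores the path in reverse order with Segments Left pointing at the first segment to visit (RFC 8754). The helper names `make_sid` and `srh_segment_list`, the 16‑bit function width, and the documentation prefix `2001:db8::/32` are assumptions chosen for illustration.

```python
import ipaddress
from typing import List, Tuple


def make_sid(locator: str, function: int, func_bits: int = 16) -> ipaddress.IPv6Address:
    """Compose an SRv6 SID as locator bits followed by a function ID.

    Hypothetical helper: the argument bits after the function are left as zero.
    """
    net = ipaddress.IPv6Network(locator)
    if net.prefixlen + func_bits > 128:
        raise ValueError("locator and function do not fit in 128 bits")
    if not 0 <= function < (1 << func_bits):
        raise ValueError("function ID out of range")
    # Place the function ID immediately after the locator prefix.
    shift = 128 - net.prefixlen - func_bits
    return ipaddress.IPv6Address(int(net.network_address) | (function << shift))


def srh_segment_list(
    path: List[ipaddress.IPv6Address],
) -> Tuple[List[ipaddress.IPv6Address], int]:
    """Encode a path (in visiting order) the way the SRH stores it.

    Segment List[0] holds the last segment of the path, and the initial
    Segments Left value indexes the first segment to visit.
    """
    if not path:
        raise ValueError("path must contain at least one segment")
    segments = list(reversed(path))
    return segments, len(segments) - 1


if __name__ == "__main__":
    path = [
        make_sid("2001:db8:1::/48", 0x100),
        make_sid("2001:db8:2::/48", 0x100),
        make_sid("2001:db8:3::/48", 0x200),
    ]
    segments, segments_left = srh_segment_list(path)
    print(segments, segments_left)
```

Traffic control then amounts to choosing which SIDs go into the path: the ingress device steers a flow by writing a different segment list, without per‑flow state on the transit devices.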
Future work focuses on endpoint‑network convergence (Predictable Network), leveraging telemetry, SRv6, high‑performance flow control, and DPU integration to make the network a programmable compute platform that delivers high bandwidth and ultra‑low tail latency.
The overarching vision is to keep the network simple, open, and pragmatic, fostering collaboration with open‑source communities to drive global cloud networking advancement.
Alibaba Cloud Infrastructure
For uninterrupted computing services