How Alibaba’s µFAB and Solar Redefine Predictable High‑Performance Data Center Networks

This article analyzes Alibaba Cloud’s µFAB and Solar research papers from SIGCOMM 2022, explaining why predictable high‑performance networks are needed, how the two designs achieve bandwidth and latency guarantees through network transparency and host‑network co‑ordination, and presenting theoretical and experimental results that demonstrate significant performance gains.

Alibaba Cloud Infrastructure

Sep 23, 2022

Why Predictable High‑Performance Networks Matter

Modern data centers face mounting pressure from rapidly evolving compute hardware (CPU, GPU, TPU, DPU) and from latency‑sensitive workloads such as ML/HPC, high‑performance storage, and large‑scale databases. Together these make the network the new performance bottleneck. Applications now demand guaranteed bandwidth, microsecond‑level latency, and low tail latency, especially when pooled resources (e.g., GPUs, disks, memory) must be reached over 100 Gbps+ links with sub‑10 µs delays.

µFAB: Predictable vFabric on an Informative Data Plane

µFAB aims to provide per‑tenant minimum bandwidth guarantees, maximize link utilization, and ensure low tail latency. Unlike traditional best‑effort networks that treat the fabric as a black box, µFAB leverages programmable data‑plane information (link state, tenant metadata) and feeds it back to the host for intelligent rate control and path selection.

Each tenant receives a virtual fabric with three Service Level Agreements (SLAs): minimum bandwidth, maximal resource utilization, and low tail latency. The host‑side µFAB‑E module sends probe packets to collect network state, while the switch‑side µFAB‑C module aggregates link and tenant information and embeds it in the probes.
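The probe exchange described above can be sketched in a few lines of Python. This is only an illustration of the idea, not µFAB's packet format: the HopInfo fields, the Probe class, and the function names are hypothetical, and the real design runs in NIC and switch hardware rather than in software.

```python
from dataclasses import dataclass, field


@dataclass
class HopInfo:
    # Hypothetical per-link record a switch might write into a probe.
    link_id: str
    capacity_gbps: float
    tx_rate_gbps: float
    active_weight: float  # sum of weights of tenants active on this link


@dataclass
class Probe:
    tenant_id: str
    hops: list = field(default_factory=list)


def switch_stamp(probe: Probe, hop: HopInfo) -> Probe:
    """Switch-side role (µFAB-C in the text): append local link state."""
    probe.hops.append(hop)
    return probe


def bottleneck(probe: Probe) -> HopInfo:
    """Host-side role (µFAB-E in the text): find the hop with least spare capacity."""
    return min(probe.hops, key=lambda h: h.capacity_gbps - h.tx_rate_gbps)


probe = Probe(tenant_id="tenant-a")
switch_stamp(probe, HopInfo("tor-up", 100.0, 40.0, 4.0))
switch_stamp(probe, HopInfo("agg-up", 100.0, 90.0, 8.0))
print(bottleneck(probe).link_id)
```

The host can then base its rate and path decisions on the most constrained hop instead of inferring congestion from loss or delay alone.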

Bandwidth‑Latency Guarantee Algorithm

µFAB uses a weight‑based allocation scheme. The sending window for a tenant is calculated as:

window = (tenant_weight / total_weight) * link_capacity_adjustment

where tenant_weight reflects the tenant’s priority, total_weight is the sum of the weights of all tenants sharing the link (an aggregate maintained by the switch), and link_capacity_adjustment is the link capacity scaled to the current load, so that utilization stays high without causing congestion.
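As a worked example of the formula, the sketch below computes windows for two tenants. It assumes total_weight is the sum of the weights of the tenants sharing the link, and it treats link_capacity_adjustment as a single number in arbitrary units; both choices are simplifications made for illustration.

```python
def sending_window(tenant_weight: float,
                   total_weight: float,
                   link_capacity_adjustment: float) -> float:
    """Weighted share of the load-adjusted link capacity."""
    if total_weight <= 0:
        raise ValueError("total_weight must be positive")
    return (tenant_weight / total_weight) * link_capacity_adjustment


# Two tenants with weights 1 and 3 share a link whose adjusted
# capacity is 100 units (the unit is an assumption of this sketch).
weights = {"tenant-a": 1.0, "tenant-b": 3.0}
total = sum(weights.values())
windows = {t: sending_window(w, total, 100.0) for t, w in weights.items()}
print(windows)  # {'tenant-a': 25.0, 'tenant-b': 75.0}
```

Because the shares sum to the adjusted capacity, the link is fully used while each tenant's share stays proportional to its weight.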

When multiple tenants generate traffic simultaneously, µFAB always allows each tenant to transmit at its guaranteed minimum bandwidth; only excess bandwidth is allocated gradually, preserving SLA guarantees and preventing long‑tail latency spikes.
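One way to picture "guarantee first, excess gradually" is the update rule below. It is a minimal sketch and not the algorithm from the paper: the ramp factor, splitting spare capacity in proportion to the guarantees, and the assumption that admission control keeps the guarantees within link capacity are all choices made here for illustration.

```python
def step_rates(rates: dict, guarantees: dict, capacity: float,
               ramp: float = 0.5) -> dict:
    """One update round for the tenants sharing a link.

    Every tenant is first raised to at least its guaranteed minimum.
    Only a fraction (ramp) of the remaining spare capacity is then
    handed out, split in proportion to the guarantees.
    """
    floor = {t: max(rates.get(t, 0.0), g) for t, g in guarantees.items()}
    spare = max(capacity - sum(floor.values()), 0.0)
    total_g = sum(guarantees.values())
    return {t: floor[t] + ramp * spare * guarantees[t] / total_g
            for t in guarantees}


guarantees = {"tenant-a": 10.0, "tenant-b": 30.0}  # e.g. Gbps
rates = {}
for _ in range(3):
    rates = step_rates(rates, guarantees, capacity=100.0)
    print(rates)
```

Each round closes half of the remaining gap to full utilization, so the total approaches link capacity without a sudden burst that could build queues and hurt tail latency.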

Path switching is performed when a path’s bandwidth is exhausted or when additional capacity is discovered, with careful throttling to avoid excessive flapping.
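The throttling idea can be illustrated with a hold timer plus a hysteresis margin. The sketch below is an assumption about how such damping might look, not the mechanism from the paper; PathSelector, min_hold, and margin are hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class PathSelector:
    current: str
    min_hold: float = 1.0   # hypothetical minimum time between switches (s)
    margin: float = 0.2     # candidate must offer 20% more headroom
    last_switch: float = float("-inf")

    def choose(self, headroom: dict, now: float) -> str:
        """Return the path to use given per-path spare bandwidth."""
        best = max(headroom, key=headroom.get)
        cur = headroom.get(self.current, 0.0)
        exhausted = cur <= 0.0
        clearly_better = headroom[best] > cur * (1 + self.margin)
        held_long_enough = now - self.last_switch >= self.min_hold
        if (best != self.current and headroom[best] > cur
                and held_long_enough and (exhausted or clearly_better)):
            self.current = best
            self.last_switch = now
        return self.current


sel = PathSelector(current="p1")
print(sel.choose({"p1": 5.0, "p2": 5.5}, now=0.0))   # small gain: stay on p1
print(sel.choose({"p1": 0.0, "p2": 5.0}, now=0.1))   # p1 exhausted: move to p2
print(sel.choose({"p1": 10.0, "p2": 1.0}, now=0.5))  # too soon: stay on p2
```

The margin filters out marginal gains and the hold timer bounds how often a flow can move, which together limit flapping between paths of similar quality.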

Theoretical Analysis and Hardware Experiments

Analytical models show µFAB converges quickly, maintains bandwidth and latency guarantees, and avoids network oscillations during path changes. Implementations on FPGA, SoC NICs, and Tofino switches were evaluated on a three‑tier fat‑tree topology. Results confirm µFAB delivers guaranteed minimum bandwidth, low tail latency, and near‑optimal link utilization even under failures.

Application‑level tests with a latency‑sensitive Memcached tenant and a high‑throughput MongoDB tenant demonstrated up to 2.5× QPS improvement and 21× reduction in tail latency, thanks to intelligent path selection and isolation.

Solar: Storage‑Network Fusion Protocol

Solar extends the predictable network concept to compute‑to‑storage traffic. It offloads storage and network processing to smart NICs, reducing CPU overhead and eliminating protocol‑state bottlenecks. By mapping a jumbo network frame directly to a storage block, Solar eliminates per‑packet‑to‑block state, avoids head‑of‑line blocking after loss, and halves the number of PCIe copies.
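The benefit of making each packet a self‑contained block can be shown with a toy model. The header layout, the 4 KB block size, and the dict standing in for a disk are assumptions made for this sketch; they are not Solar's wire format.

```python
import struct
import zlib

BLOCK_SIZE = 4096                 # assumed block size for this sketch
HEADER = struct.Struct("!QII")    # block address, payload length, CRC32


def encode_block(block_addr: int, data: bytes) -> bytes:
    """Build one packet that fully describes one storage block."""
    if len(data) > BLOCK_SIZE:
        raise ValueError("payload larger than one block")
    return HEADER.pack(block_addr, len(data), zlib.crc32(data)) + data


def apply_packet(packet: bytes, disk: dict) -> bool:
    """Write the block carried by a packet; needs no state from other packets."""
    block_addr, length, crc = HEADER.unpack_from(packet)
    data = packet[HEADER.size:HEADER.size + length]
    if len(data) != length or zlib.crc32(data) != crc:
        return False              # corrupt or truncated: caller retransmits
    disk[block_addr] = data
    return True


disk = {}
packets = [encode_block(i, bytes([i]) * BLOCK_SIZE) for i in range(3)]
# Block 1 is "lost": blocks 2 and 0 are still written immediately.
for pkt in (packets[2], packets[0]):
    apply_packet(pkt, disk)
print(sorted(disk))               # [0, 2]
apply_packet(packets[1], disk)    # the retransmission fills the gap later
```

Since every packet carries its own block address and checksum, the receiver does not have to reassemble a byte stream, so a lost packet delays only its own block rather than everything queued behind it.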

Field measurements show that Solar reduces storage agent tail latency by 40% and doubles network throughput. Longer‑term data indicate a 72% latency reduction and a three‑fold IOPS increase for Alibaba Cloud EBS following the deployment of Luna (the earlier‑generation storage network stack) and Solar.

Conclusion

Predictable high‑performance networking, exemplified by µFAB and Solar, provides microsecond‑level latency and bandwidth guarantees essential for emerging cloud workloads. µFAB’s host‑network co‑ordination and Solar’s storage‑network fusion together raise service quality, improve resource utilization, and lay a foundation for future compute‑storage convergence in cloud infrastructures.
