Hardware‑Software Integrated Innovations in Cloud Computing: FusionEngine, AI Challenges, and Server Hardware AI

The article reviews recent advances in cloud computing hardware‑software integration, including Alibaba's FusionEngine storage engine, AI‑driven reliability and performance challenges, near‑data network acceleration, and the concept of server hardware AI, while also highlighting related research talks and recruitment notices.

Alibaba Cloud Infrastructure

Hardware development has accelerated dramatically, with storage capacities growing from hundreds of GB to tens of TB and latency dropping from milliseconds to nanoseconds, while network speeds have progressed from 10 Gbps to 100 Gbps and beyond.

In the computing domain, AI proliferation has driven exponential increases in compute capability, demanding equally rapid software performance improvements. Servers, as core infrastructure, must integrate new technologies to deliver extreme performance and reliability through tight hardware‑software co‑design.

At the 2018 Hangzhou Cloud Expo, Alibaba researchers and senior experts presented innovations and engineering practices in hardware‑software integration. Notable speakers included Prof. Li Tao from the University of Florida discussing "Computer Architecture Design Challenges and Opportunities" and Dr. Jiang Li from Shanghai Jiao Tong University on "Challenges and Opportunities for Hardware Reliability in the AI Era".

Hardware‑software integration is positioned as a core competitive advantage for Alibaba Cloud, essential for maintaining market leadership over the next five years.

Section 2 – Near‑Data Computing and Storage Integration

FusionEngine is Alibaba's first large‑scale, user‑space, hardware‑software integrated storage engine, designed for massive data‑center workloads such as the Double 11 (Singles' Day) shopping festival. It combines a full user‑space I/O stack, a user‑space file system, SSD performance models, and a custom I/O scheduler to unlock SSD potential, delivering up to 5× ESSD performance and 50% higher IOPS.
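To make the idea of a model‑driven I/O scheduler concrete, here is a minimal sketch in Python. It is not FusionEngine's design, which the talk does not detail. It assumes a hypothetical linear latency model per operation type and dispatches requests in shortest‑estimated‑cost‑first order. The names ModelDrivenScheduler, estimated_cost_us, and all latency constants are invented for illustration.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical SSD latency model in microseconds. These constants are
# illustrative placeholders, not measurements of any real device.
BASE_LATENCY_US = {"read": 80.0, "write": 20.0}
PER_KIB_US = {"read": 0.25, "write": 0.5}


def estimated_cost_us(op: str, size_kib: int) -> float:
    """Estimate service time as a fixed cost plus a per-KiB transfer cost."""
    return BASE_LATENCY_US[op] + PER_KIB_US[op] * size_kib


@dataclass(order=True)
class _Entry:
    cost_us: float
    seq: int  # submission order, used to break ties between equal costs
    request: tuple = field(compare=False)


class ModelDrivenScheduler:
    """Dispatches the request with the lowest modeled cost first."""

    def __init__(self) -> None:
        self._heap: list[_Entry] = []
        self._seq = 0

    def submit(self, op: str, size_kib: int, tag: str) -> None:
        cost = estimated_cost_us(op, size_kib)
        heapq.heappush(self._heap, _Entry(cost, self._seq, (op, size_kib, tag)))
        self._seq += 1

    def next_request(self):
        """Return the next (op, size_kib, tag) tuple, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap).request


scheduler = ModelDrivenScheduler()
scheduler.submit("write", 1024, "bulk-write")
scheduler.submit("read", 4, "small-read")
scheduler.submit("write", 4, "small-write")
dispatch_order = [scheduler.next_request()[2] for _ in range(3)]
```

A production scheduler would also need starvation protection, queue‑depth limits, and a model calibrated against the actual device. The sketch only shows how a performance model can feed the dispatch decision.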

FusionEngine also improves cost‑performance for Redis on Flash by more than 20×, and it reduces CPU utilization and remote storage latency for X‑DB workloads.

The engine has evolved to version 2.0, supporting Storage Class Memory, AliFlash, QLC SSD, SMR, and AliFPGA, and offering a suite of storage solutions such as AliFlash V3 ObjectStore, USSCA, GlacierStore, TierStore, and USSKV.

Section 3 – AI Architecture Opportunities and Challenges

Prof. Li Tao highlighted three "A" challenges for AI in the era of big data and IoT: Anywhere (ubiquitous AI), Adaptive (self‑adjusting models), and Autonomous (automated learning). Together these define the AI 2.0 vision.

Future AI workloads in the cloud require ultra‑low latency and high concurrency, demanding novel hardware (TPU, GPU, NPU) and ecosystem designs that hide underlying complexity from users.

Section 4 – Near‑Network Computing Acceleration

With Moore's Law slowing, Alibaba focuses on architectural innovation for both general‑purpose and heterogeneous computing, employing deep workload tracing, profiling, and custom hardware acceleration to boost network forwarding performance and dramatically reduce end‑to‑end latency.

Section 5 – Server Hardware AI

Alibaba defines "Server Hardware AI" as encompassing reliability awareness, performance awareness, energy‑aware management, and intelligent operations. Key technologies include fault isolation and prediction, performance profiling, energy optimization across the server, IDC, and business layers, and data‑driven operation platforms.
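As a rough illustration of what fault prediction can look like at its simplest, the sketch below smooths a stream of per‑interval correctable error counts with an exponentially weighted moving average (EWMA) and flags intervals where the smoothed value crosses a threshold. This is an assumed toy approach, not the method Alibaba described. The function name ewma_alerts, the choice of signal, and the parameter values are all hypothetical.

```python
def ewma_alerts(error_counts, alpha=0.3, threshold=5.0):
    """Return indices where the EWMA of error counts exceeds the threshold.

    alpha controls how quickly the average reacts to new samples;
    threshold is an illustrative cut-off, not a recommended value.
    """
    alerts = []
    ewma = 0.0
    for i, count in enumerate(error_counts):
        ewma = alpha * count + (1 - alpha) * ewma
        if ewma > threshold:
            alerts.append(i)
    return alerts


# A sustained rise in errors triggers alerts; a single spike does not.
sustained = ewma_alerts([0, 0, 1, 0, 20, 25, 30, 0], alpha=0.5, threshold=5.0)
isolated = ewma_alerts([0, 0, 8, 0, 0], alpha=0.5, threshold=5.0)
```

Smoothing trades detection speed for fewer false alarms: a lower alpha ignores more transient noise but reacts later to a genuine degradation trend.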

Implemented systems such as Service Health Management, Lingjing Performance Diagnosis, Energy Optimization, and Cruiser Intelligent Operations leverage AI algorithms to achieve extreme reliability, performance, and cost‑effectiveness.

Section 6 – AI for Hardware Reliability

Hardware reliability challenges arise from the complex integration of chips and boards. AI offers promising approaches to fault detection, anomaly detection, and reliability improvement, but obstacles such as data sparsity, class imbalance, and high‑dimensional feature spaces remain.
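Class imbalance is easy to see with numbers: if only a small fraction of samples are faults, a model can score well by never predicting one. One common mitigation, shown here as a sketch rather than as the approach taken in the talk, is to weight each class inversely to its frequency during training. The helper name inverse_frequency_weights and the example labels are assumptions; the formula n_samples / (n_classes * class_count) is the same "balanced" heuristic used by scikit‑learn's class_weight option.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


# Illustrative data set: 98 healthy samples and 2 faulty ones.
labels = ["healthy"] * 98 + ["faulty"] * 2
weights = inverse_frequency_weights(labels)
```

With these weights, each faulty sample counts for roughly fifty healthy ones in the loss, so misclassifying a rare fault is no longer cheap for the model.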

Shanghai Jiao Tong University has begun applying deep learning to anomaly detection, aiming to enhance system reliability.

Through continuous hardware‑software integration, Alibaba seeks to capitalize on technological dividends, improve competitiveness, and share further innovations and engineering practices.

Recruitment Notices

Infrastructure Business Group – Server Testing and Data‑Driven Expert (Hiring)

Server R&D Division – Hardware‑Software System Optimization and Innovation Expert (Hiring)

