How Alibaba Scales Real‑Time Computing: Evolution of Its Operations Architecture

This article details Alibaba's real‑time computing platform, outlining its operational challenges, the unified automation platform Aquila, proactive fault‑elimination strategies, and ongoing moves toward intelligent, data‑driven management to support massive workloads during events like Double‑11.

Efficient Ops

Introduction

The speaker shares the evolution of Alibaba's real‑time computing platform operations architecture, covering four main parts: operational challenges, a unified automation platform, proactive fault elimination, and the path toward intelligence.

Operational Challenges of Real‑Time Computing Platform

Real‑time computing, whose popularity has grown alongside AI advances such as AlphaGo, powers search, recommendation, advertising, and monitoring. Latency requirements are strict: seconds for real‑time computing and milliseconds for online services. During Double‑11, the platform ran over 1,000 jobs on nearly 10,000 machines and peaked at 4.72 billion QPS.

The key challenges are heterogeneous clusters, hotspot handling, hardware and software failures, and keeping resource utilization high (70‑90% or more):

Cluster heterogeneity due to yearly hardware upgrades.

Hotspot detection and immediate automated response.

Hardware and software fault impact on availability.

Maintaining high resource utilization while ensuring stability.

Unified Operations Automation Platform

Alibaba built a layered platform. The lowest layer manages physical machines and Docker containers through an internal IDC system. Above it sit the storage layer (HDFS, HBase, Pangu), the scheduling layer (YARN and the custom Fuxi scheduler), and the engine layer (a Blink‑based engine).

The management layer uses Aquila, an enterprise‑grade service built on open‑source components, to handle stack management, configuration management, automation, and generic interfaces.

Aquila provides:

Hardware inspection.

Fault prediction using big‑data algorithms.

Automated fault repair and ticketing.

Hardware operation dashboards for utilization, machine onboarding, and failure‑rate analysis.

Key capabilities include:

Screen‑based operations through a web UI rather than the command line.

Unified operational standards.

Continuous integration.

Full API support for automation.

Design features:

Stack management with versioned service groups.

Configuration management via Git with review and rollback.

Automation workflows for scaling, fault handling, and service auto‑restart.

Generalized interfaces for extensibility.

Aquila’s advantages over open‑source solutions:

An HA architecture for the Aquila servers.

Data stored in an Alibaba Cloud database for data safety.

Configuration review process involving development and QA.

Over 100 bug fixes.

The HA implementation uses two servers coordinated by ZooKeeper; if an agent cannot reach one server, it retries against the other.
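The agent side of this failover can be sketched as below. This is a minimal illustration, not Aquila's actual agent code: the server names, the `send` callback, and the retry count are all hypothetical, and the ZooKeeper coordination between the two servers is left out entirely.

```python
from typing import Callable, Sequence


class AllServersUnavailable(Exception):
    """Raised when no server in the list accepted the report."""


def report_with_failover(
    servers: Sequence[str],
    send: Callable[[str, dict], None],
    payload: dict,
    attempts_per_server: int = 2,  # assumed value, not from the article
) -> str:
    """Try each server in order and return the address that accepted the payload."""
    for server in servers:
        for _ in range(attempts_per_server):
            try:
                send(server, payload)
                return server
            except ConnectionError:
                continue  # retry this server, then fall through to the next one
    raise AllServersUnavailable(f"tried {list(servers)}")


# Fake transport in which the first server is unreachable (hypothetical names).
def fake_send(server: str, payload: dict) -> None:
    if server == "aquila-server-1":
        raise ConnectionError("unreachable")


accepted_by = report_with_failover(
    ["aquila-server-1", "aquila-server-2"],
    fake_send,
    {"host": "node-42", "status": "ok"},
)
print(accepted_by)
```

Passing the transport in as a callback keeps the retry logic independent of how agents actually talk to the servers, which the article does not describe.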

Configuration changes are made via Aquila WebUI or JSON IDE, stored in Git, and reviewed before deployment.
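The review‑then‑deploy flow with rollback can be modeled in a few lines. The sketch below is an in‑memory stand‑in, assuming that both a development and a QA approval are required; it does not use Git, and the class, method, and role names are invented for illustration rather than taken from Aquila.

```python
import json
from dataclasses import dataclass, field

REQUIRED_ROLES = {"dev", "qa"}  # assumed reviewer roles


@dataclass
class ConfigChange:
    version: int
    content: dict
    approvals: set = field(default_factory=set)


class ReviewedConfigStore:
    """Versioned configuration where only fully reviewed versions can be deployed."""

    def __init__(self, initial: dict):
        self._changes = {1: ConfigChange(1, initial, set(REQUIRED_ROLES))}
        self._deploy_log = [1]  # versions in the order they went live

    def propose(self, raw_json: str) -> int:
        content = json.loads(raw_json)  # malformed JSON is rejected here
        version = max(self._changes) + 1
        self._changes[version] = ConfigChange(version, content)
        return version

    def approve(self, version: int, role: str) -> None:
        self._changes[version].approvals.add(role)

    def deploy(self, version: int) -> None:
        missing = REQUIRED_ROLES - self._changes[version].approvals
        if missing:
            raise PermissionError(f"version {version} needs review by {sorted(missing)}")
        self._deploy_log.append(version)

    def rollback(self) -> int:
        if len(self._deploy_log) < 2:
            raise RuntimeError("nothing to roll back to")
        self._deploy_log.pop()
        return self._deploy_log[-1]

    @property
    def active(self) -> dict:
        return self._changes[self._deploy_log[-1]].content


store = ReviewedConfigStore({"heap_mb": 4096})
v2 = store.propose('{"heap_mb": 8192}')
store.approve(v2, "dev")
store.approve(v2, "qa")
store.deploy(v2)
after_deploy = store.active["heap_mb"]
store.rollback()
after_rollback = store.active["heap_mb"]
```

Calling `deploy` before both approvals are recorded raises `PermissionError`, which is the point of the review gate.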

Operational improvements include streamlined stack dependencies, multi‑cluster management, configuration import/export, automatic service registration and Docker deployment, and workflow optimizations (30+ enhancements).

Proactive Fault Elimination

By integrating DAM for hardware fault prediction, Aquila can automatically take services offline, hand machines to DAM for self‑healing, and reintegrate them once repaired, creating a closed‑loop automation.
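One way to picture this closed loop is as a small state machine for each machine. The states and event names below are assumptions made for the sketch; the article does not describe how Aquila or DAM represent them internally.

```python
from enum import Enum


class State(Enum):
    IN_SERVICE = "in_service"
    OFFLINE = "offline"      # services taken off the machine
    REPAIRING = "repairing"  # machine handed to the repair system


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.IN_SERVICE, "fault_predicted"): State.OFFLINE,
    (State.OFFLINE, "handed_to_repair"): State.REPAIRING,
    (State.REPAIRING, "repaired"): State.IN_SERVICE,
}


def step(state: State, event: str) -> State:
    """Apply one event, rejecting anything outside the closed loop."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}") from None


def run(events, state: State = State.IN_SERVICE):
    """Return the full trail of states visited while applying the events."""
    trail = [state]
    for event in events:
        state = step(state, event)
        trail.append(state)
    return trail


trail = run(["fault_predicted", "handed_to_repair", "repaired"])
print([s.name for s in trail])
```

Because every transition is listed explicitly, a machine cannot rejoin the cluster without first passing through the repair state.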

Resource‑aware auto‑scaling is driven by Tesla monitoring watermarks; when utilization exceeds thresholds, resources are requested from the Sigma scheduler, containers are launched, and Aquila agents report back for final deployment.
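The watermark check that triggers a resource request can be sketched as follows. The 80% high watermark and 60% target are assumed values chosen for the example, and the function is hypothetical; the calls to Tesla and Sigma themselves are not shown because the article does not describe their interfaces.

```python
def containers_to_request(
    used: int,
    capacity_per_container: int,
    current_containers: int,
    high_watermark_pct: int = 80,  # assumed threshold
    target_pct: int = 60,          # assumed utilization to return to
) -> int:
    """Return how many extra containers bring utilization back to the target."""
    total = capacity_per_container * current_containers
    # Integer arithmetic avoids floating-point rounding at the threshold.
    if used * 100 <= high_watermark_pct * total:
        return 0
    # Ceiling division: the smallest container count with utilization <= target.
    needed = -(-used * 100 // (target_pct * capacity_per_container))
    return max(needed - current_containers, 0)


# 90 units used on 10 containers of 10 units each is 90% utilization.
extra = containers_to_request(used=90, capacity_per_container=10, current_containers=10)
print(extra)  # 5 more containers give 90 / 150 = 60%
```

Scaling to a target below the watermark, rather than just under it, leaves headroom so the next small increase in load does not immediately trigger another request.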

Towards Intelligence

Alibaba is exploring intelligent fault analysis using time‑series metrics, predictions based on ETS (exponential smoothing) and EWMA (exponentially weighted moving average), and deep‑learning‑based model generation via Sigma.

Unsupervised machine learning pipelines (feature engineering, followed by LOF, the Local Outlier Factor, and DBSCAN clustering) detect anomalies that traditional monitoring may miss, such as mis‑configured machines (e.g., a 64‑core, 256 GB machine mistakenly allocated only 10 cores and 128 GB).
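As an illustration of EWMA‑based prediction on a metric series, the sketch below forecasts each point from the smoothed history and flags values that deviate too far from the forecast. The smoothing factor, the relative tolerance, and the sample data are assumptions for the example, not values from the talk.

```python
def ewma_anomalies(values, alpha=0.3, tolerance=0.5):
    """Return indexes whose value differs from the EWMA forecast by more than
    `tolerance` times the forecast (a simple relative-deviation rule)."""
    forecast = values[0]
    flagged = []
    for i, value in enumerate(values[1:], start=1):
        if abs(value - forecast) > tolerance * abs(forecast):
            flagged.append(i)
        # Standard EWMA update: recent points weigh more than older ones.
        forecast = alpha * value + (1 - alpha) * forecast
    return flagged


# Made-up latency samples with one obvious spike at index 4.
samples = [100, 102, 98, 101, 250, 99, 100]
spikes = ewma_anomalies(samples)
print(spikes)
```

In this simple form the spike also feeds into the forecast; a production detector would likely exclude flagged points from the update or use a more robust baseline.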

Clustering analysis reveals outlier groups, enabling targeted corrective actions.
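To show how clustering can surface a mis‑allocated machine, here is a small pure‑Python DBSCAN applied to (allocated cores, allocated memory in GB) pairs. It is a textbook version of the algorithm written for illustration, not the pipeline described in the talk; the host names, `eps`, and `min_pts` are invented, and the LOF step is omitted.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or NOISE if it belongs to no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels


# Hypothetical fleet: five 64-core/256 GB hosts, one host allocated
# 10 cores/128 GB by mistake, and four 32-core/128 GB hosts.
hosts = [f"node-{n:02d}" for n in range(1, 11)]
allocations = [(64, 256)] * 5 + [(10, 128)] + [(32, 128)] * 4

labels = dbscan(allocations, eps=1.0, min_pts=3)
outliers = [host for host, label in zip(hosts, labels) if label == NOISE]
print(outliers)
```

The mis‑allocated host matches neither hardware group, so it ends up as noise even though its values would look plausible to a threshold‑based monitor.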

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
