
How Alibaba’s ASI Powers Massive Serverless Kubernetes at Scale

This article details Alibaba's Serverless Infrastructure (ASI) built on ACK, explaining its large‑scale Kubernetes architecture, fully managed operations, change‑risk controls, gray‑release pipelines, web‑shell access, taskflow orchestration, node lifecycle management, elasticity, risk mitigation, probing, and self‑healing capabilities that enable reliable cloud‑native services.

Alibaba Cloud Native

Overview

Alibaba Serverless Infrastructure (ASI) is a unified serverless platform built on Alibaba Cloud Container Service for Kubernetes (ACK) and Alibaba Cloud Container Registry (ACR). It provides a “No‑Ops” experience for large‑scale Kubernetes clusters and supports the cloud‑native transformation of both Alibaba Group applications and Alibaba Cloud products.

Technical Architecture

ASI extends ACK with enhanced scheduling, workload, networking, node elasticity, and multi‑tenant security. The architecture is organized into four layers:

Meta‑cluster (KOK) : hosts the core control‑plane components shared by all business clusters.

Control‑Plane : kube‑apiserver, kube‑controller‑manager, kube‑scheduler, and etcd.

Add‑Ons : serverless core components, enhanced scheduler, networking, storage, OpenKruise workloads, CoreDNS, etc.

Data‑Plane : node agents such as containerd, kubelet, kata and other plugins.

ASI architecture diagram

Fully Managed Operations Framework

ASI delivers a fully managed, no‑ops experience for clusters through several core modules:

Unified Change Control : risk‑based change approval and automated rule enforcement.

Cluster Operations : orchestrated large‑scale upgrades, validation, monitoring, and backup.

etcd Operations : performance tuning and high‑availability management for the managed etcd service.

Component Operations : dedicated teams maintain core components with continuous development.

Node Operations : lifecycle management (provisioning, scaling, maintenance) exposed as a service.

1‑5‑10 Capability : detect an incident within 1 minute, locate it within 5 minutes, and recover within 10 minutes, even when thousands of nodes are affected.

Resource Management : capacity planning, OOM prevention, and cost optimization.

Operations framework diagram

Change Risk Control

Change rules are stored in a central repository and enforced via webhooks that invoke business‑level health checks. This prevents unauthorized or risky changes and integrates with the ASIOps system for automated approval.
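
The rule check described above can be sketched as follows. This is a minimal illustration, not ASI's implementation: the `ChangeRequest` and `Rule` types, the rule names, the batch limit of 50, and the `UNHEALTHY_CLUSTERS` set (standing in for a business-level health check that a webhook would call) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical types: the article does not describe ASI's rule schema.
@dataclass(frozen=True)
class ChangeRequest:
    cluster: str
    component: str
    operator: str
    batch_size: int  # number of nodes or pods touched in one step


@dataclass(frozen=True)
class Rule:
    name: str
    # Returns None when the change passes, otherwise a reason for denial.
    check: Callable[[ChangeRequest], Optional[str]]


def evaluate(change: ChangeRequest, rules: list[Rule]) -> tuple[bool, list[str]]:
    """Run every rule; the change is admitted only if no rule objects."""
    denials = []
    for rule in rules:
        reason = rule.check(change)
        if reason is not None:
            denials.append(f"{rule.name}: {reason}")
    return (not denials, denials)


# Stand-in for a business-level health check reached through a webhook.
UNHEALTHY_CLUSTERS = {"cluster-b"}

RULES = [
    Rule("max-batch",
         lambda c: "batch larger than 50" if c.batch_size > 50 else None),
    Rule("cluster-health",
         lambda c: "cluster unhealthy" if c.cluster in UNHEALTHY_CLUSTERS else None),
]

ok, reasons = evaluate(ChangeRequest("cluster-a", "kubelet", "alice", 20), RULES)
blocked, why = evaluate(ChangeRequest("cluster-b", "kubelet", "alice", 80), RULES)
```

Collecting every denial instead of stopping at the first one gives the operator the full list of objections in a single pass, which suits an approval workflow.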

Dynamic Gray‑Release Pipeline

Static pipelines could not scale to thousands of clusters. The Cluster‑Scheduler evaluates cluster attributes (size, GC tier, resource usage) and generates an optimal release order, eliminating manual pipeline maintenance.
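
One way to picture a computed release order is a sort over cluster attributes followed by batching. The sketch below is illustrative only: the `Cluster` fields, the meaning of `tier` (lower is less critical), the sort key, and the geometric batch growth are assumptions, since the article does not describe how the Cluster‑Scheduler weighs attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str
    node_count: int
    tier: int           # hypothetical: 0 = test, larger = more critical
    utilization: float  # fraction of resources in use, 0.0 to 1.0


def release_order(clusters: list[Cluster]) -> list[str]:
    """Order clusters so the least risky ones receive a change first."""
    ranked = sorted(clusters, key=lambda c: (c.tier, c.node_count, c.utilization))
    return [c.name for c in ranked]


def into_batches(names: list[str], first: int = 1, growth: int = 2) -> list[list[str]]:
    """Split an ordered list into batches whose size grows geometrically."""
    batches, size, i = [], first, 0
    while i < len(names):
        batches.append(names[i:i + size])
        i += size
        size *= growth
    return batches


clusters = [
    Cluster("prod-large", 5000, 2, 0.7),
    Cluster("test-1", 20, 0, 0.1),
    Cluster("prod-small", 200, 2, 0.4),
    Cluster("staging", 100, 1, 0.3),
]
order = release_order(clusters)
batches = into_batches(order)
```

Because the order is recomputed from current attributes on every release, adding or resizing a cluster needs no pipeline edit, which is the point the paragraph above makes.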

Cluster scheduler diagram

Cluster WebShell Tool

To avoid certificate leakage and uncontrolled local kubectl usage, ASI provides an online web‑shell that grants time‑bound, fine‑grained access to clusters. All actions are recorded for audit, and dangerous operations trigger risk‑control checks and require online approval.
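
The access model can be sketched as a gate that checks a time-bound grant, records every request, and holds dangerous verbs until they are approved. Everything named here is hypothetical: the `Grant` and `WebShellGate` classes, the `DANGEROUS_VERBS` set, and the three decision strings are assumptions for illustration, not the web‑shell's real interface.

```python
import time
from dataclasses import dataclass, field

# Assumed list of verbs that need an extra approval step.
DANGEROUS_VERBS = {"delete", "drain"}


@dataclass(frozen=True)
class Grant:
    user: str
    cluster: str
    verbs: frozenset
    expires_at: float  # Unix timestamp after which the grant is void


@dataclass
class WebShellGate:
    grants: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def authorize(self, user, cluster, verb, approved=False, now=None):
        """Return 'allow', 'needs-approval', or 'deny' and audit the request."""
        now = time.time() if now is None else now
        decision = "deny"
        for g in self.grants:
            matches = (g.user, g.cluster) == (user, cluster) and verb in g.verbs
            if matches and now < g.expires_at:
                decision = "allow"
                break
        if decision == "allow" and verb in DANGEROUS_VERBS and not approved:
            decision = "needs-approval"
        # Every request is recorded, including denied ones.
        self.audit_log.append((now, user, cluster, verb, decision))
        return decision


gate = WebShellGate()
gate.grants.append(Grant("alice", "c1", frozenset({"get", "delete"}), expires_at=1000.0))
```

With this grant, `gate.authorize("alice", "c1", "get", now=500.0)` is allowed, the same call with `"delete"` returns `"needs-approval"` until `approved=True` is passed, and any call made after the expiry time is denied.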

WebShell interface

Taskflow Orchestration Engine

ASI built a custom Taskflow engine (inspired by open‑source tools) to compose complex operational workflows. It consists of:

PipelineController – maintains task dependencies.

TaskController – tracks task status.

TaskScheduler – schedules execution.

Task/Worker – runs individual executors.

For node expansion, only three executors (expand, initialize, import) are needed; Taskflow links them into a single workflow.
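
The node-expansion workflow can be sketched as three executors linked by a dependency map and run in topological order. This is a simplified stand-in, not the Taskflow engine itself: the executor bodies, the shared `ctx` dictionary, and the `run_pipeline` function are assumptions, and the sketch runs tasks sequentially in one process instead of scheduling them onto workers.

```python
from graphlib import TopologicalSorter


# Placeholder executors; real ones would call cloud and cluster APIs.
def expand(ctx):
    ctx["instances"] = [f"node-{i}" for i in range(ctx["count"])]


def initialize(ctx):
    ctx["initialized"] = list(ctx["instances"])


def import_nodes(ctx):
    ctx["imported"] = list(ctx["initialized"])


EXECUTORS = {"expand": expand, "initialize": initialize, "import": import_nodes}

# Each task maps to the set of tasks it depends on.
DEPENDENCIES = {"expand": set(), "initialize": {"expand"}, "import": {"initialize"}}


def run_pipeline(dependencies, executors, ctx):
    """Run tasks in dependency order and stop at the first failure."""
    status = {}
    for task in TopologicalSorter(dependencies).static_order():
        status[task] = "running"
        try:
            executors[task](ctx)
        except Exception:
            status[task] = "failed"
            break  # downstream tasks are never started
        status[task] = "succeeded"
    return status


ctx = {"count": 2}
status = run_pipeline(DEPENDENCIES, EXECUTORS, ctx)
```

Keeping the dependency map separate from the executors mirrors the split between PipelineController and Task/Worker: a new workflow reuses existing executors and only declares a different graph.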

Taskflow architecture

Node Lifecycle Management

The node lifecycle is divided into five stages:

Pre‑production : resource definition and account configuration.

Import : node creation, scaling, and import.

Runtime : component upgrades, batch scripts, CVE patching, self‑healing.

Decommission : cost‑optimized removal.

Fault handling : diagnosis and rapid recovery.

Node lifecycle diagram

Elastic Node Capability

ASI uses Alibaba Cloud Elastic Compute Service (ECS) to provision and release nodes within minutes. Optimizations include:

Region‑internal image pulls for DaemonSets.

Pre‑installed RPM packages in custom ECS images.

Dedicated bandwidth for yum sources.

Image pre‑heating (≈3 min for 10 GB images).

Node elasticity workflow

Risk Control Mechanisms

Several throttling and protection layers guard the clusters:

KubeDefender : token‑bucket limits on delete operations for critical resources.

UA throttling : QPS caps per User‑Agent.

APF (API Priority and Fairness) : prioritized, fair queuing of kube‑apiserver requests.
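
The token-bucket limit mentioned for KubeDefender can be illustrated with a generic implementation. This is the textbook algorithm, not KubeDefender's code: the `TokenBucket` class, its parameters, and the injectable `clock` (added so the behavior can be checked without sleeping) are assumptions for the example.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A fake clock makes the refill behavior easy to follow.
fake_time = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third call is rejected
fake_time[0] = 1.0                                        # one second later
after_refill = bucket.allow()                             # one token has returned
```

Applied to delete requests, a bucket like this lets routine deletions through while a runaway controller or script that tries to delete many critical resources at once is cut off after the burst allowance.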

Risk control diagram

KubeProbe Inspection Platform

KubeProbe provides two models for high‑frequency health checks:

Central control : a probe Pod is created per cluster to run custom checks.

Resident Operator : a custom resource drives continuous probes via an Operator.

Hundreds of probes run millions of times per year and catch more than 99% of control‑plane issues early.
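
What a single probe run might look like can be sketched as a loop over named checks that records an outcome and duration for each. The `run_probes` function, the result fields, and the two sample checks are hypothetical; the article does not describe KubeProbe's probe interface, and real probes would talk to the cluster API instead of being plain functions.

```python
import time


def run_probes(probes, clock=time.monotonic):
    """Run each named probe once and report pass/fail with its duration."""
    results = {}
    for name, probe in probes.items():
        start = clock()
        try:
            probe()
            outcome, detail = "pass", ""
        except Exception as exc:  # one failing probe must not stop the others
            outcome, detail = "fail", str(exc)
        results[name] = {
            "outcome": outcome,
            "detail": detail,
            "seconds": clock() - start,
        }
    return results


# Sample checks standing in for real cluster probes.
def create_pod_check():
    pass  # would create a pod and wait for it to become ready


def list_nodes_check():
    raise RuntimeError("apiserver timeout")


report = run_probes({"create-pod": create_pod_check, "list-nodes": list_nodes_check})
```

In the central-control model a function like this would run inside the probe Pod; in the resident model an Operator would invoke it repeatedly, driven by a custom resource.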

KubeProbe architecture

Self‑Healing System

Beyond pod‑level liveness, ASI adds node‑level self‑healing with enriched diagnosis rules, fine‑grained flow control, and integration with business logic to decide whether to evict workloads or trigger custom recovery actions.
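
The decision flow described above (diagnose, apply flow control, then choose between eviction and a custom action) can be sketched as follows. The node signals, symptom names, action names, and the `max_concurrent` budget are all assumptions chosen for illustration; ASI's actual diagnosis rules and recovery actions are not given in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeState:
    name: str
    ready: bool
    disk_pressure: bool
    runs_stateful_workload: bool  # stands in for "business logic" input


def diagnose(node):
    """Map raw node signals to a symptom; None means healthy."""
    if not node.ready:
        return "not-ready"
    if node.disk_pressure:
        return "disk-pressure"
    return None


def plan_actions(nodes, max_concurrent):
    """Return (node, action) pairs, healing at most `max_concurrent` nodes."""
    actions = []
    budget = max_concurrent
    for node in nodes:
        symptom = diagnose(node)
        if symptom is None:
            continue
        if budget == 0:
            # Flow control: avoid repairing too many nodes at once.
            actions.append((node.name, "defer"))
            continue
        budget -= 1
        if symptom == "disk-pressure":
            actions.append((node.name, "clean-disk"))
        elif node.runs_stateful_workload:
            # The workload owner decides instead of a blind eviction.
            actions.append((node.name, "notify-owner"))
        else:
            actions.append((node.name, "evict-and-repair"))
    return actions


nodes = [
    NodeState("n1", True, False, False),
    NodeState("n2", False, False, False),
    NodeState("n3", False, False, True),
    NodeState("n4", True, True, False),
]
plan = plan_actions(nodes, max_concurrent=2)
```

The concurrency budget is the important safeguard: when a false diagnosis or a widespread fault marks many nodes unhealthy at once, only a bounded number are acted on and the rest are deferred.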

Self‑healing architecture

Future Outlook

ASI continues to evolve toward fully autonomous Kubernetes clusters, delivering end‑to‑end managed services for clusters, nodes, and components, and sharing operational knowledge with the broader cloud‑native community.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
