How Alibaba Scales DevOps with StarOps: Inside Their Operations Platform
This article explains how Alibaba's DevOps practice has evolved over nearly a decade. It walks through the layered architecture of the StarOps suite, including the foundational StarAgent, the Fortress (jump server), the Qingting file‑distribution system, and intelligent AIOps features, and shows how automation, scalability, and AI‑driven monitoring enable stable, low‑cost operations for massive workloads such as Double 11.
Alibaba has been promoting DevOps for nearly ten years, aiming to improve collaboration, reduce development costs, and achieve more stable, sustainable business operations.
At the 2017 Hangzhou Cloud Expo, the head of Alibaba's infrastructure operations platform described the evolution of their DevOps system from manual operations to simple tools, then automation, systematization, and finally intelligent, unmanned operations.
The evolution follows one principle: whatever machines can do, machines should do. Manual work is avoided wherever possible, and repetitive tasks are automated through a unified operations platform.
The operations capability hierarchy includes multiple layers, each supported by dedicated platforms and systems. These layers handle resource provisioning, high‑availability architecture, cost optimization, runtime changes, horizontal operations (OS updates, network upgrades, DNS, IP changes), and multi‑dimensional monitoring (server, network, IDC, business‑level).
For large‑scale events like Double 11, Alibaba conducts full‑link stress testing and regular fault‑drill exercises to ensure rapid recovery and fault self‑healing.
The StarOps suite consists of two main platforms: the foundational operations platform (StarAgent) and the application operations platform (Normandy). StarAgent provides a unified, secure channel for accessing all servers (physical machines, virtual machines, and containers), supports up to 5,000 concurrent users, and holds ISO 27001 certification.
StarAgent has scaled from 10,000 to over 100,000 servers and handles more than 100 million requests per day with 99.995% stability. It relies on high‑availability design, regular network‑cut‑off drills, and extensive security policies: command range control, whitelists and blacklists, auditing of high‑risk commands, and end‑to‑end encryption.
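The whitelist/blacklist and high‑risk audit policies above can be sketched as a small command filter. This is a minimal illustration, not StarAgent's actual implementation: the command sets, the function name `evaluate_command`, and the decision format are all assumptions made for this example.

```python
import shlex

# Hypothetical policy sets; real policies would be far richer than this.
ALLOWED_COMMANDS = {"ls", "cat", "df", "uptime", "systemctl"}
BLOCKED_COMMANDS = {"mkfs", "dd"}
HIGH_RISK_COMMANDS = {"systemctl"}


def evaluate_command(command_line):
    """Return (decision, needs_audit) for a shell command line.

    decision is "allow" or "deny"; needs_audit marks commands that
    should be written to an audit log.
    """
    tokens = shlex.split(command_line)
    if not tokens:
        return ("deny", False)
    program = tokens[0]
    # The blacklist wins over everything else and is always audited.
    if program in BLOCKED_COMMANDS:
        return ("deny", True)
    # Anything not explicitly whitelisted is rejected.
    if program not in ALLOWED_COMMANDS:
        return ("deny", False)
    # Whitelisted but high-risk commands run, with an audit record.
    return ("allow", program in HIGH_RISK_COMMANDS)
```

Checking the blacklist before the whitelist means a command that ends up in both sets by mistake is still denied, which is the safer failure mode.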
The platform uses a plugin architecture. Plugins can be static scripts or dynamic services, and an Agent Core manages them, enforcing safe execution and resource limits and automatically updating the default plugins. More than 150 plugins are supported, covering monitoring, logging, scheduling, and file distribution.
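As a rough sketch of the idea, a core process can register script plugins and run each one in a child process under a limit. The class name `AgentCore`, its methods, and the result format are hypothetical, and only a wall‑clock timeout is enforced here; the source does not describe which limits the real Agent Core applies.

```python
import subprocess
import sys


class AgentCore:
    """Hypothetical core that runs script plugins under a time limit."""

    def __init__(self, timeout_seconds=5.0):
        self.timeout_seconds = timeout_seconds
        self.plugins = {}

    def register(self, name, argv):
        # A plugin here is just a command line to execute.
        self.plugins[name] = list(argv)

    def run(self, name):
        argv = self.plugins[name]
        try:
            result = subprocess.run(
                argv,
                capture_output=True,
                text=True,
                timeout=self.timeout_seconds,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising.
            return {"plugin": name, "status": "timeout", "output": ""}
        status = "ok" if result.returncode == 0 else "failed"
        return {"plugin": name, "status": status, "output": result.stdout.strip()}


if __name__ == "__main__":
    core = AgentCore(timeout_seconds=10.0)
    core.register("hello", [sys.executable, "-c", "print('hello')"])
    print(core.run("hello"))
```

Running each plugin in its own process keeps a misbehaving plugin from taking down the core, at the cost of process start‑up overhead.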
Qingting is a P2P‑based file distribution system that protects data sources, accelerates delivery, and saves cross‑IDC bandwidth. Tests show it can serve 7,000 clients with about 10 seconds of latency, far outperforming traditional distribution.
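A toy model shows why peer‑to‑peer distribution relieves the source. It assumes, purely for illustration, that every node holding the file can upload to a fixed number of peers per round; this is not Qingting's actual protocol, and the function names are made up for the example.

```python
def rounds_single_source(clients, uploads_per_round):
    """Rounds needed when only the source uploads (ceiling division)."""
    return -(-clients // uploads_per_round)


def rounds_p2p(clients, uploads_per_round):
    """Rounds needed when every node that has the file also uploads."""
    holders = 1  # the source itself
    served = 0
    rounds = 0
    while served < clients:
        newly_served = min(holders * uploads_per_round, clients - served)
        served += newly_served
        holders += newly_served  # served clients become uploaders
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(rounds_single_source(7000, 1))  # 7000
    print(rounds_p2p(7000, 1))            # 13
```

Under this model the number of holders doubles each round, so the round count grows logarithmically with the number of clients rather than linearly.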
Normandy, the application operations platform, is a hybrid‑cloud PaaS built on top of the foundational platform. It manages resources (servers, network, storage, databases, middleware) and supports both on‑premise IDC and Alibaba Cloud deployments, enabling seamless scaling during peak events.
Normandy implements Infrastructure as Code: resource definitions are stored as JSON‑like code in a version‑controlled CMDB. This allows automatic provisioning and self‑healing, and keeps the definitions aligned with the actual resource instances.
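Keeping definitions aligned with real instances is commonly done by comparing desired state with observed state and deriving the actions that close the gap. The sketch below assumes both sides are plain dictionaries keyed by resource name; the function `plan_changes` and the action names are hypothetical and not taken from Normandy.

```python
def plan_changes(desired, actual):
    """Return the actions needed to make actual match desired.

    Both arguments map a resource name to its definition.
    """
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))
        elif name not in desired:
            actions.append(("delete", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    return actions


if __name__ == "__main__":
    desired = {"web": {"count": 4}, "cache": {"count": 2}}
    actual = {"web": {"count": 3}, "db": {"count": 1}}
    print(plan_changes(desired, actual))
    # [('create', 'cache'), ('delete', 'db'), ('update', 'web')]
```

Because the plan is computed from state rather than from a history of commands, running it repeatedly converges on the same result, which is what makes self‑healing possible.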
The suite also provides layered monitoring (IDC, system, business) with intelligent baseline analysis, automated fault detection, root‑cause analysis, auto‑healing, and proactive fault prediction.
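One simple way to picture baseline analysis is a rolling window: each new sample is compared with the mean and standard deviation of the samples just before it. The source does not say which algorithms the suite uses, so the window size, threshold, and function name below are assumptions for illustration only.

```python
import statistics


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates strongly from the recent baseline."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:
            # A perfectly flat baseline: any change at all is flagged.
            if values[i] != mean:
                anomalies.append(i)
            continue
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(find_anomalies(latency_ms))  # [6]
```

A production system would also need to handle seasonality and avoid letting a spike contaminate the following baselines, which this sketch does not attempt.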
Additional features include mobile operations apps (iOS, Android), ChatOps, and AIOps that automate release pipelines, detect anomalies, and enable near‑zero‑touch operations.
In summary, Alibaba’s operations platform emphasizes a solid foundational layer, high stability, extreme efficiency, and a step‑by‑step path toward intelligent, algorithm‑driven automation.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
