Scaling Alibaba's Operations: Inside StarAgent, Qingteng & Normandy
This article details Alibaba's evolution of its operations platform, describing the design, features, and performance of StarAgent, the Qingteng P2P file distribution system, and the Normandy application‑deployment platform, highlighting how these tools enable high‑availability, automation, and massive scalability across global data centers.
Preface
This article originates from the GOPS 2017 Shenzhen conference presentation “Evolution and Construction of Alibaba’s Operations Platform”. The speaker leads the Infrastructure Business Group’s Operations Platform team, responsible for IDC, network, database, big‑data, and application operations.
1. Operations Foundation – StarAgent
StarAgent has been used at Alibaba for over five years; every physical machine, virtual machine, and container runs this agent. It is a platform that manages all agents and provides a plugin system for extending functionality.
Before platformization, each agent bundled all functions into a single executable, causing slow iteration and frequent bugs. Platformization turned each business function into a plug‑in that can be dynamically loaded, improving development speed and reducing deployment complexity.
Key features include:
Command channel supporting synchronous and asynchronous execution, task status queries, and plugin management.
Two plugin types: static scripts/commands and dynamic resident processes with health monitoring.
High availability, high concurrency, security, and self-operation, with the goal of near-zero human intervention even at million-server scale.
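The command channel described above can be illustrated with a minimal sketch. This is not StarAgent's actual interface: the class name `CommandChannel`, its methods, and the result fields are all hypothetical, and a local thread pool stands in for whatever transport the real agent uses. It shows only the three behaviors the list names: synchronous execution, asynchronous execution, and task status queries.

```python
import subprocess
import sys
import uuid
from concurrent.futures import ThreadPoolExecutor


class CommandChannel:
    """Hypothetical agent-side command channel (not StarAgent's real API)."""

    def __init__(self, max_workers=4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._tasks = {}  # task_id -> Future

    def _run(self, argv, timeout):
        # Run the command without a shell and capture its output.
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": done.returncode, "stdout": done.stdout, "stderr": done.stderr}

    def run_sync(self, argv, timeout=30):
        # Synchronous execution: block until the command finishes.
        return self._run(argv, timeout)

    def run_async(self, argv, timeout=30):
        # Asynchronous execution: return a task id immediately.
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = self._pool.submit(self._run, argv, timeout)
        return task_id

    def status(self, task_id):
        # Task status query for a previously submitted async command.
        future = self._tasks.get(task_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "running"}
        error = future.exception()
        if error is not None:
            return {"state": "failed", "error": repr(error)}
        return {"state": "finished", "result": future.result()}


if __name__ == "__main__":
    channel = CommandChannel()
    print(channel.run_sync([sys.executable, "-c", "print('hello from the agent')"]))
```

Returning a task id from `run_async` and polling with `status` keeps long-running commands from holding a connection open, which is one common reason to offer both modes.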
StarAgent's architecture has three tiers: a central control layer, control servers in each data center, and agents with plugins on each machine. The system now handles over 100 million internal requests per day and can process up to 550k QPS, which leaves room for future growth.
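The three tiers can be sketched as three small classes, with a request passing from the central layer to a data-center server and on to an agent plugin. Every name here (`CentralControl`, `DatacenterServer`, `Agent`, `send`, `dispatch`, `handle`) is an assumption made for illustration, and in-process method calls stand in for the network hops of the real system.

```python
class Agent:
    """Bottom tier: one per machine, holding the installed plugins."""

    def __init__(self, hostname):
        self.hostname = hostname
        self.plugins = {}  # plugin name -> callable(payload)

    def handle(self, plugin, payload):
        if plugin not in self.plugins:
            raise KeyError(f"plugin {plugin!r} not installed on {self.hostname}")
        return self.plugins[plugin](payload)


class DatacenterServer:
    """Middle tier: knows the agents inside one data center."""

    def __init__(self, name):
        self.name = name
        self.agents = {}  # hostname -> Agent

    def register(self, agent):
        self.agents[agent.hostname] = agent

    def dispatch(self, hostname, plugin, payload):
        return self.agents[hostname].handle(plugin, payload)


class CentralControl:
    """Top tier: maps each host to the data center that manages it."""

    def __init__(self):
        self.servers = {}     # data center name -> DatacenterServer
        self.host_to_dc = {}  # hostname -> data center name

    def add_server(self, server):
        self.servers[server.name] = server
        for hostname in server.agents:
            self.host_to_dc[hostname] = server.name

    def send(self, hostname, plugin, payload):
        dc_name = self.host_to_dc[hostname]
        return self.servers[dc_name].dispatch(hostname, plugin, payload)
```

In this sketch the central layer only resolves which data center owns a host, so per-host traffic stays inside that data center's server.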
Metrics and monitoring have been introduced to measure stability, performance, and resource usage, enabling data‑driven operations and automated error classification.
2. File Distribution System – Qingteng
Qingteng is a proprietary P2P file‑distribution system developed to replace third‑party solutions that could not meet Alibaba’s diverse scenarios. It supports multi‑threaded download, integrity verification, and whitelist control, optimizing network usage and reducing transfer time.
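To make the three features named above concrete, here is a minimal sketch of a multi-threaded, chunked download with per-chunk integrity verification and a whitelist check. It is not Qingteng's implementation: the function names, the SHA-256 manifest, the tiny chunk size, and the `fetch_chunk` callback (which stands in for fetching a chunk from a peer or origin) are all assumptions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


def make_manifest(data, chunk_size=CHUNK_SIZE):
    """Return the SHA-256 digest of each chunk, in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def download(fetch_chunk, manifest, client, whitelist, workers=4):
    """Fetch all chunks in parallel, verifying each against the manifest."""
    # Whitelist control: refuse clients that are not explicitly allowed.
    if client not in whitelist:
        raise PermissionError(f"{client} is not whitelisted")

    def fetch_verified(index):
        chunk = fetch_chunk(index)
        # Integrity verification: compare the chunk digest with the manifest.
        if hashlib.sha256(chunk).hexdigest() != manifest[index]:
            raise ValueError(f"chunk {index} failed integrity check")
        return chunk

    # Multi-threaded download: map() returns results in chunk order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(fetch_verified, range(len(manifest))))
```

Verifying per chunk rather than per file means a corrupted piece can be detected, and in a fuller design re-fetched from another peer, without discarding the rest of the transfer.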
Performance highlights: ten thousand clients can simultaneously download a 500 MB file in about 5 seconds; monthly download volume ranges from 120k to 300 million, with six-nines (99.9999%) availability.
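A back-of-the-envelope calculation shows what the first figure implies. It takes the stated numbers at face value and assumes decimal units and that every client receives the full file; neither assumption is confirmed by the talk.

```latex
\begin{align*}
\text{total data} &= 10^{4} \times 500\ \text{MB} = 5 \times 10^{6}\ \text{MB} = 5\ \text{TB} \\
\text{aggregate rate} &= \frac{5\ \text{TB}}{5\ \text{s}} = 1\ \text{TB/s} = 8\ \text{Tbit/s} \\
\text{per-client rate} &= \frac{500\ \text{MB}}{5\ \text{s}} = 100\ \text{MB/s} = 800\ \text{Mbit/s}
\end{align*}
```

An aggregate rate of this size is far more than a single origin server would normally deliver, which is the usual argument for having peers serve chunks to one another.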
3. Application Operations Platform – Normandy
Normandy is Alibaba’s PaaS platform for application operations, providing Infrastructure‑as‑Code, deployment, and runtime support. It enables fully automated, unattended releases, integrates with middleware, and ensures that failed releases trigger human intervention.
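The release behavior described above, unattended when healthy and handed to a person on failure, can be sketched as follows. This is an illustration rather than Normandy's actual logic: the batching strategy, the function name `release`, the `deploy` and `health_check` callbacks, and the result fields are all assumptions.

```python
def release(hosts, deploy, health_check, batch_size=2):
    """Deploy in batches; stop and hand over to a human at the first failure."""
    released = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        for host in batch:
            deploy(host)
        # Check every host in the batch before moving on.
        failed = [host for host in batch if not health_check(host)]
        if failed:
            # Unattended rollout ends here; a person decides what happens next.
            return {"status": "needs_human", "failed": failed, "released": released}
        released.extend(batch)
    return {"status": "success", "failed": [], "released": released}
```

For example, with four hosts and a health check that fails on the third, the first batch is released, the second batch halts the rollout, and the result names the failing host for whoever picks it up.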
The platform connects testing and production environments, supporting the majority of Alibaba’s applications—approximately 80 % of transaction‑related services.
Overall, these platforms embody the “middle‑platform” concept: a shared, reusable foundation that allows business teams to innovate quickly without reinventing core operational capabilities.