How GitOps Powers Cloud‑Native Large‑Scale Cluster Management
This article describes the challenges Alibaba Cloud's intelligent operations team faced in managing over a thousand cloud‑native clusters and the solutions it adopted. It covers the team's multi‑layered operations architecture, its GitOps workflow, its infrastructure‑as‑code integration, and the role of AI‑driven intelligent operations in large‑scale environments.
Cloud‑Native Large‑Scale Operations Challenges
The team supports over a thousand cloud‑native clusters across more than ten big‑data and AI products. It has to balance stability, cost, and efficiency while handling diverse node types and high‑frequency deployments (≈500 releases per day). Four challenges stand out:
Frequent releases increase configuration mismatch risks, leading to pod launch failures.
Balancing flexible deployment templates with versioned artifacts is challenging.
Process‑oriented changes can cause service disruptions despite correct desired state.
Choosing between self‑developed tools and open‑source solutions (e.g., Helm) required extensive iteration.
Cloud‑Native Operations Management Practices
The operation solution is layered:
Business products (Flink, DataWorks, PAI) provide application definitions via YAML.
A cloud‑native application platform abstracts these definitions, enabling unified tenant interfaces.
The underlying infrastructure uses Alibaba Cloud ACK clusters, which abstract away Kubernetes master management.
A unified node pool supplies resources to both cloud‑native and legacy clusters.
The application model follows the Open Application Model (OAM), separating component topology from implementation, allowing SREs to focus on component instances while developers define deployment intents.
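As a rough illustration of that separation, the sketch below models an OAM‑style application in Python. The class and field names (`Component`, `Trait`, `Application`, `render`) are hypothetical; they come neither from the team's platform nor from the OAM specification. Only the split between developer‑defined components and operator‑attached traits follows the text.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """What the developer wants to run (the deployment intent)."""
    name: str
    image: str
    replicas: int = 1


@dataclass
class Trait:
    """Operational behaviour attached by an SRE, kept apart from the component definition."""
    kind: str
    properties: dict = field(default_factory=dict)


@dataclass
class Application:
    """Hypothetical application model: components plus per-component traits."""
    name: str
    components: list
    traits: dict = field(default_factory=dict)  # component name -> list of Trait

    def render(self) -> list:
        """Combine each component with its traits into one instance description."""
        rendered = []
        for c in self.components:
            rendered.append({
                "component": c.name,
                "image": c.image,
                "replicas": c.replicas,
                "traits": [t.kind for t in self.traits.get(c.name, [])],
            })
        return rendered


# The developer declares the component; the SRE attaches a trait without editing it.
app = Application(
    name="stream-job",
    components=[Component(name="worker", image="example/worker:1.0", replicas=3)],
    traits={"worker": [Trait(kind="autoscaler", properties={"max": 10})]},
)
instances = app.render()
```

Because traits are keyed by component name rather than embedded in the component, an operator can change scaling or rollout behaviour without touching what the developer declared.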
Cloud‑Native GitOps Practice
GitOps is treated as a two‑sided approach: managing desired state and controlling the execution process. The workflow wraps each change in a MergeRequest that remains open until the change is fully executed, ensuring the final state is truly reached.
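The lifecycle described above can be sketched as a small state machine. This is a minimal sketch under stated assumptions: the state names, the `MergeRequest` class, and the `report` method are invented for illustration, and the comparison of observed and desired state is reduced to a dictionary equality check.

```python
from dataclasses import dataclass
from enum import Enum


class MRState(Enum):
    OPEN = "open"            # change proposed, not yet executing
    EXECUTING = "executing"  # rollout in progress, MR still open
    MERGED = "merged"        # observed state matches desired state


@dataclass
class MergeRequest:
    """Hypothetical MR that closes only once the desired state is reached."""
    desired: dict
    state: MRState = MRState.OPEN

    def start_execution(self) -> None:
        if self.state is not MRState.OPEN:
            raise ValueError(f"cannot start from state {self.state}")
        self.state = MRState.EXECUTING

    def report(self, observed: dict) -> MRState:
        """Record an observation; merge only when it equals the desired state."""
        if self.state is not MRState.EXECUTING:
            raise ValueError(f"cannot report in state {self.state}")
        if observed == self.desired:
            self.state = MRState.MERGED
        return self.state


mr = MergeRequest(desired={"replicas": 3})
mr.start_execution()
mr.report(observed={"replicas": 2})  # rollout not finished, MR stays open
state_mid_rollout = mr.state
mr.report(observed={"replicas": 3})  # desired state reached, MR can close
state_final = mr.state
```

Keeping the request in an executing state until an observation matches is what ties the process to the result: a merged request then means the cluster actually converged, not just that the declaration was accepted.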
Change plans are generated from MergeRequest diffs using Infrastructure‑as‑Code scripts (Terraform HCL, Crossplane, Pulumi) that describe both the target and the actions to perform.
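To make the idea of deriving a plan from a diff concrete, here is a minimal sketch in Python. It is not how Terraform, Crossplane, or Pulumi compute plans; a plain dictionary comparison stands in for them. The function name `plan_changes`, the action labels, and the resource names are assumptions made for illustration.

```python
def plan_changes(old: dict, new: dict) -> list:
    """Turn two desired-state snapshots into an ordered list of actions.

    Each action is a (verb, resource_name, target_spec) tuple, so the plan
    records both the target and the step that reaches it.
    """
    plan = []
    # Create new resources first so replacements exist before anything is removed.
    for name in sorted(new.keys() - old.keys()):
        plan.append(("create", name, new[name]))
    # Update resources whose spec changed.
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            plan.append(("update", name, new[name]))
    # Delete resources that no longer appear in the desired state.
    for name in sorted(old.keys() - new.keys()):
        plan.append(("delete", name, None))
    return plan


# Hypothetical desired state before and after a MergeRequest.
old = {"worker": {"replicas": 2}, "legacy-proxy": {"replicas": 1}}
new = {"worker": {"replicas": 3}, "gateway": {"replicas": 2}}
plan = plan_changes(old, new)
```

The create, update, delete ordering is a design choice made here for safety, not something the source prescribes.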
Cloud‑Native Intelligent Operations Engineering System
The intelligent operations framework spans six scenarios (delivery, monitoring, management, control, operation, service) and integrates AI agents for both read and write operations, reducing manual effort and improving explainability.
AI agents leverage the unified GitOps change description to interact with various tools, enabling low‑code or DSL‑based automation and enhancing the overall operations lifecycle.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
