
How GitOps Powers Cloud‑Native Large‑Scale Cluster Management

This article describes how Alibaba Cloud's intelligent operations team manages more than a thousand cloud‑native clusters: the challenges it faces, its multi‑layered operations architecture, its GitOps workflow, its integration of infrastructure as code, and the role of AI‑driven intelligent operations in large‑scale environments.

Alibaba Cloud Big Data AI Platform

Cloud‑Native Large‑Scale Operations Challenges

The team supports more than a thousand cloud‑native clusters across more than ten big‑data and AI products. It has to balance stability, cost, and efficiency while handling diverse node types and high‑frequency deployments (about 500 releases per day). Four challenges stand out:

Frequent releases increase configuration mismatch risks, leading to pod launch failures.

Balancing flexible deployment templates with versioned artifacts is challenging.

Some changes depend on how they are executed, not only on where they end up, so a change can disrupt service even when its desired state is correct.

Choosing between self‑developed tools and open‑source solutions (e.g., Helm) took extensive iteration.
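The first challenge, configuration mismatches that only surface when a pod fails to launch, can often be caught before release. The following is a minimal sketch of such a pre‑release check. The `${key}` placeholder syntax, the function name `missing_config_keys`, and the sample template are assumptions made for illustration; the article does not describe the team's actual template format.

```python
import re

# Hypothetical placeholder syntax: ${some.key}. Not taken from the article.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def missing_config_keys(template: str, config: dict) -> list:
    """Return placeholder keys used in the template but absent from config."""
    referenced = set(PLACEHOLDER.findall(template))
    return sorted(referenced - set(config))


template = """
image: registry.example.com/app:${image.tag}
env:
  - name: DB_HOST
    value: ${db.host}
"""

# The release ships a config that forgot db.host.
config = {"image.tag": "1.4.2"}

missing = missing_config_keys(template, config)
if missing:
    print("blocking release, missing config keys:", missing)
```

Running a check like this in the release pipeline turns a runtime pod failure into a rejected change.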

Cloud‑Native Operations Management Practices

The operations solution is layered:

Business products (Flink, DataWorks, PAI) provide application definitions via YAML.

A cloud‑native application platform abstracts these definitions, enabling unified tenant interfaces.

The underlying infrastructure runs on Alibaba Cloud ACK clusters, which abstract away the management of Kubernetes masters.

A unified node pool supplies resources to both cloud‑native and legacy clusters.

The application model follows the Open Application Model (OAM), which separates component topology from implementation. Developers define deployment intent, while SREs focus on component instances.
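The split between developer‑owned intent and SRE‑owned instances can be sketched with two small types. This is an illustrative model only, not the OAM specification or the platform's real schema: the class names `Component` and `ComponentInstance`, their fields, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Component:
    """Developer-owned: what should run and what it depends on."""
    name: str
    image: str
    depends_on: tuple = ()


@dataclass(frozen=True)
class ComponentInstance:
    """SRE-owned: one component placed in one cluster with a replica count."""
    component: Component
    cluster: str
    replicas: int


def deploy_order(components):
    """Order component names so that dependencies come first."""
    graph = {c.name: set(c.depends_on) for c in components}
    return list(TopologicalSorter(graph).static_order())


def instantiate(components, cluster, replicas):
    """Bind each component to a cluster; replicas default to 1."""
    by_name = {c.name: c for c in components}
    return [
        ComponentInstance(by_name[name], cluster, replicas.get(name, 1))
        for name in deploy_order(components)
    ]


app = [
    Component("web", "example/web:2.0", depends_on=("api",)),
    Component("api", "example/api:1.3", depends_on=("db",)),
    Component("db", "example/db:5.7"),
]

instances = instantiate(app, cluster="cluster-a", replicas={"api": 3})
for inst in instances:
    print(inst.component.name, inst.cluster, inst.replicas)
```

The topology lives entirely in `Component`, so the same application definition can be instantiated in many clusters with different per‑cluster settings.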

Cloud‑Native GitOps Practice

GitOps is treated as a two‑sided approach: managing desired state and controlling the execution process. The workflow wraps each change in a MergeRequest that remains open until the change is fully executed, ensuring the final state is truly reached.
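One way to picture a MergeRequest that stays open until the change has really landed is a small state machine in which merging is only allowed once the observed state matches the desired state. The sketch below is an assumption about how such a gate could look; the state names, the `ChangeRequest` class, and its `advance` method are hypothetical and do not describe the team's actual tooling.

```python
from enum import Enum


class MRState(Enum):
    OPEN = "open"
    APPROVED = "approved"
    EXECUTING = "executing"
    MERGED = "merged"
    ROLLED_BACK = "rolled_back"


# Allowed transitions; MERGED and ROLLED_BACK are terminal.
ALLOWED = {
    MRState.OPEN: {MRState.APPROVED},
    MRState.APPROVED: {MRState.EXECUTING},
    MRState.EXECUTING: {MRState.MERGED, MRState.ROLLED_BACK},
    MRState.MERGED: set(),
    MRState.ROLLED_BACK: set(),
}


class ChangeRequest:
    def __init__(self, desired):
        self.desired = dict(desired)
        self.state = MRState.OPEN

    def advance(self, new_state, observed=None):
        """Move to new_state; merging requires observed == desired."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.name} to {new_state.name}")
        if new_state is MRState.MERGED and observed != self.desired:
            raise ValueError("observed state does not match desired state")
        self.state = new_state


cr = ChangeRequest({"api": {"replicas": 4}})
cr.advance(MRState.APPROVED)
cr.advance(MRState.EXECUTING)

# Rollout is only half done, so the request must stay open.
try:
    cr.advance(MRState.MERGED, observed={"api": {"replicas": 2}})
except ValueError as err:
    print("still executing:", err)

# Once the cluster reports the desired state, the request can close.
cr.advance(MRState.MERGED, observed={"api": {"replicas": 4}})
print(cr.state.name)
```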

Change plans are generated from MergeRequest diffs using Infrastructure‑as‑Code scripts (Terraform HCL, Crossplane, Pulumi) that describe both the target and the actions to perform.
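The idea of deriving a change plan from a diff can be illustrated without any particular IaC tool. The sketch below compares two desired‑state snapshots and emits an ordered list of actions. It is a plain‑Python stand‑in, not Terraform, Crossplane, or Pulumi; the function name `plan_from_diff`, the action names, and the create‑update‑delete ordering are assumptions made for the example.

```python
def plan_from_diff(old, new):
    """Turn two desired-state snapshots into ordered (action, name, spec) steps.

    Creates come first, then updates, then deletes, each sorted by name,
    so the same diff always yields the same plan.
    """
    actions = []
    for name in sorted(new.keys() - old.keys()):
        actions.append(("create", name, new[name]))
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            actions.append(("update", name, new[name]))
    for name in sorted(old.keys() - new.keys()):
        actions.append(("delete", name, None))
    return actions


# Snapshot before and after the MergeRequest (hypothetical resources).
old = {"api": {"replicas": 2}, "worker": {"replicas": 1}}
new = {"api": {"replicas": 4}, "cache": {"replicas": 1}}

plan = plan_from_diff(old, new)
for step in plan:
    print(step)
```

Because the plan is explicit data, it can be reviewed in the MergeRequest alongside the diff that produced it.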

Cloud‑Native Intelligent Operations Engineering System

The intelligent operations framework spans six scenarios (delivery, monitoring, management, control, operation, service) and integrates AI agents for both read and write operations, reducing manual effort and improving explainability.

AI agents leverage the unified GitOps change description to interact with various tools, enabling low‑code or DSL‑based automation and enhancing the overall operations lifecycle.
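A rough sketch of how an agent might use a single change description for both kinds of operation: read requests are answered directly by a tool, while write requests are queued as changes for review instead of being executed immediately. Everything here is hypothetical, including the `ChangeDescription` fields, the set of read actions, and the `route` function; the article does not specify the agents' interface.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeDescription:
    """Hypothetical unified description of one agent request."""
    target: str
    action: str
    params: dict = field(default_factory=dict)


# Assumed split between read-only and state-changing actions.
READ_ACTIONS = {"get", "list", "describe"}


def route(change, read_tools, pending_changes):
    """Answer reads immediately; queue writes as changes awaiting review."""
    if change.action in READ_ACTIONS:
        return read_tools[change.action](change.target)
    pending_changes.append(change)
    return None


# Stub read tool standing in for a real cluster query.
read_tools = {"get": lambda target: {"target": target, "replicas": 2}}
pending = []

status = route(ChangeDescription("api", "get"), read_tools, pending)
route(ChangeDescription("api", "scale", {"replicas": 4}), read_tools, pending)

print(status)
print([(c.target, c.action, c.params) for c in pending])
```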

[Figure: Cloud‑Native Operations Overview]
[Figure: Deployment Frequency]
[Figure: GitOps Process]
Tags: cloud-native, Kubernetes, large scale, GitOps, IaC