How Alibaba Solves Massive Kubernetes Challenges with OAM and OpenKruise
Alibaba’s journey from early LXC containers to massive Kubernetes clusters reveals scaling, performance, and operational challenges that led to the creation of the Open Application Model (OAM) and OpenKruise, offering a layered, role‑separated approach to cloud‑native application definition and delivery.
Background
Alibaba began containerizing applications in 2011 with LXC, later migrated to Docker and built a large‑scale scheduler. In 2018 the team adopted Kubernetes (K8s) and has since operated dozens of ultra‑large clusters, the biggest with ~10,000 nodes, serving tens of thousands of applications. Alibaba Cloud Kubernetes Service (ACK) also manages tens of thousands of user clusters worldwide.
Challenges in Large‑Scale K8s
The K8s API does not define an explicit “application” concept, so a single object mixes the concerns of developers, operators, and infrastructure engineers. It is often unclear which role owns fields such as replicas or shareProcessNamespace, and that ambiguity causes confusion between roles. The ecosystem’s flexibility also produces many custom plugins and controllers that can conflict with one another. For example, the CronHPA plugin introduced a CronHPA CRD to scale workloads by time and CPU, but it suffered from documentation gaps, made it hard to verify installation, and conflicted with the native HPA; validating it ultimately required more than 20 admission hooks.
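The kind of conflict described above, where two scalers manage the same workload, can be illustrated with a small check. This is only a sketch of the idea, not Alibaba's admission hooks: the ScalerRef shape, the find_scaler_conflicts helper, and all object names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalerRef:
    """A scaling object and the workload it targets (hypothetical shape)."""
    kind: str          # e.g. "HorizontalPodAutoscaler" or "CronHPA"
    name: str
    target_kind: str   # e.g. "Deployment"
    target_name: str
    namespace: str = "default"


def find_scaler_conflicts(scalers):
    """Return workloads that are targeted by more than one kind of scaler."""
    by_target = {}
    for scaler in scalers:
        key = (scaler.namespace, scaler.target_kind, scaler.target_name)
        by_target.setdefault(key, []).append(scaler)

    conflicts = {}
    for key, owners in by_target.items():
        kinds = {owner.kind for owner in owners}
        if len(kinds) > 1:  # e.g. both an HPA and a CronHPA claim this workload
            conflicts[key] = sorted(owner.name for owner in owners)
    return conflicts


# Illustrative data only: "web" is claimed by both a native HPA and a CronHPA.
scalers = [
    ScalerRef("HorizontalPodAutoscaler", "web-hpa", "Deployment", "web"),
    ScalerRef("CronHPA", "web-nightly", "Deployment", "web"),
    ScalerRef("CronHPA", "batch-nightly", "Deployment", "batch"),
]
print(find_scaler_conflicts(scalers))
```

A real validating webhook would run a check like this on every create or update request, which hints at why the number of hooks grows quickly as more plugins are added.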
Complex Application Delivery Scenarios
Alibaba must support public clouds, private clouds, hybrid clouds, and IoT environments where APIs are inconsistent. This forces a dedicated delivery team to manually bridge gaps, contradicting the “once packaged, run anywhere” promise of containerization.
Layered Application Delivery Model
Following the CNCF Application Delivery Layered Model, Alibaba defines four layers:
Layer 1: Application definition tools (Helm, Kustomize, CNAB).
Layer 2: Delivery pipelines and GitOps tools (Tekton, Flagger).
Layer 3: Operators and workload components (Deployment, StatefulSet, etc.).
Layer 4: Platform layer that manages underlying infrastructure.
Open Application Model (OAM)
Co‑created by Alibaba and Microsoft, OAM addresses three core problems:
No runtime lock‑in – an application definition can run unchanged in any environment.
Clear role separation – developers see only the API relevant to them, operators see a modular, declarative view.
Modular definition – the application is described in distinct, reusable parts rather than a single monolithic YAML file.
OAM Architecture
Component: Describes the application‑side concerns (container image, runtime parameters, workload type). Authored by developers.
Trait: Declares operational capabilities (scaling, ingress, monitoring). Authored by operators.
Application Configuration: Binds Components and Traits together to produce a deployable application. Created by operators or automation.
Components expose a WorkloadType (e.g., Server, Worker, Job) and a list of parameters that developers can mark as overridable by operators. Traits are discoverable via kubectl, specify applicable workload types, and define required and optional fields.
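The relationship between the three concepts can be sketched in a few lines of Python. This is a simplified model for illustration, not the OAM schema: the class fields, the bind function, and the "manual-scaler" trait with its replicaCount property are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    name: str
    default: str
    overridable: bool = False  # the developer decides if operators may change it


@dataclass
class Component:
    name: str
    workload_type: str  # e.g. "Server", "Worker", "Job"
    image: str
    parameters: list = field(default_factory=list)


@dataclass
class Trait:
    name: str
    applies_to: set                               # workload types the trait supports
    required: set = field(default_factory=set)    # required property names


def bind(component, traits, overrides, trait_props):
    """Combine a component with traits, as an Application Configuration would."""
    values = {p.name: p.default for p in component.parameters}
    allowed = {p.name for p in component.parameters if p.overridable}
    for key, value in overrides.items():
        if key not in allowed:
            raise ValueError(f"parameter {key!r} is not overridable")
        values[key] = value

    for trait in traits:
        if component.workload_type not in trait.applies_to:
            raise ValueError(
                f"trait {trait.name!r} does not apply to {component.workload_type}"
            )
        missing = trait.required - set(trait_props.get(trait.name, {}))
        if missing:
            raise ValueError(f"trait {trait.name!r} is missing {sorted(missing)}")

    return {
        "component": component.name,
        "parameters": values,
        "traits": {t.name: trait_props.get(t.name, {}) for t in traits},
    }


# Developer side: the component and which parameters operators may override.
web = Component(
    "web", "Server", "example/web:1.0",
    [Parameter("LOG_LEVEL", "info", overridable=True), Parameter("PORT", "8080")],
)
# Operator side: a trait and the configuration that binds everything together.
scaler = Trait("manual-scaler", {"Server", "Worker"}, {"replicaCount"})
app = bind(web, [scaler], {"LOG_LEVEL": "debug"}, {"manual-scaler": {"replicaCount": 3}})
print(app["parameters"])
```

Overriding PORT in this example would raise an error, because the developer did not mark it as overridable. That boundary is the role separation the model is meant to enforce.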
Rudr Implementation
The reference implementation of OAM is the Rudr plugin, written in Rust. Rudr acts both as an admission controller that validates OAM CRDs and as a controller that translates the high‑level OAM spec into native Kubernetes resources (Deployments, Services, Ingress, etc.). It can also provision external cloud resources such as RDS instances.
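The translation step can be pictured as a function from a high‑level description to native manifests. The following is not Rudr's code (Rudr is written in Rust) but a minimal Python sketch of the idea; the render_component function and its parameters are hypothetical, while the emitted Deployment and Service follow the standard Kubernetes manifest layout.

```python
def render_component(name, image, port, replicas=1, expose=False):
    """Expand a high-level description into native Kubernetes manifests."""
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }
    resources = [deployment]
    if expose:
        # Only emit a Service when the configuration asks for the app to be reachable.
        resources.append(
            {
                "apiVersion": "v1",
                "kind": "Service",
                "metadata": {"name": name},
                "spec": {
                    "selector": labels,
                    "ports": [{"port": port, "targetPort": port}],
                },
            }
        )
    return resources


resources = render_component("web", "example/web:1.0", 8080, replicas=3, expose=True)
print([r["kind"] for r in resources])
```

A real controller would submit these objects to the API server and keep reconciling them. The sketch only shows the mapping from one short description to several native resources.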
OpenKruise
Alibaba open‑sourced OpenKruise, an advanced workload management project that provides workloads such as CloneSet and SidecarSet, among others. It fits into the third layer of the delivery model.
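As a rough illustration of what a sidecar‑oriented workload does, the sketch below adds sidecar containers to pods whose labels match a selector. It is a conceptual model only and does not reflect OpenKruise's implementation or its CRD schema; the dictionary shapes and the inject_sidecars helper are assumptions.

```python
import copy


def matches(selector, labels):
    """True when every selector key/value pair is present in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def inject_sidecars(pod, sidecar_set):
    """Return a copy of the pod with any missing sidecar containers appended."""
    if not matches(sidecar_set["selector"], pod["metadata"].get("labels", {})):
        return pod
    patched = copy.deepcopy(pod)  # leave the caller's object untouched
    existing = {c["name"] for c in patched["spec"]["containers"]}
    for container in sidecar_set["containers"]:
        if container["name"] not in existing:
            patched["spec"]["containers"].append(copy.deepcopy(container))
    return patched


# Illustrative objects: one application pod and one sidecar definition.
pod = {
    "metadata": {"name": "web-0", "labels": {"app": "web"}},
    "spec": {"containers": [{"name": "app", "image": "example/web:1.0"}]},
}
sidecar_set = {
    "selector": {"app": "web"},
    "containers": [{"name": "log-agent", "image": "example/log-agent:1.0"}],
}
patched = inject_sidecars(pod, sidecar_set)
print([c["name"] for c in patched["spec"]["containers"]])
```

Keeping the sidecar definition separate from the application pod lets operators manage agents such as log collectors without touching the developer's manifest.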
Benefits of OAM
An OAM YAML file is a self‑contained software package describing containers, parameters, cloud resources, and operational traits.
Enables portable deployments across public clouds, private clouds, hybrid environments, and IoT platforms without rewriting manifests.
Provides role‑based API exposure: developers interact only with Component fields, while operators work with Traits and Application Configuration.
Relevant Open‑Source Resources
Rudr documentation and source code: https://github.com/oam-dev/rudr/tree/master/docs
OpenKruise repository: https://github.com/openkruise/kruise
Alibaba Cloud Native
