
How Alibaba Solves Massive Kubernetes Challenges with OAM and OpenKruise

Alibaba’s journey from early LXC containers to massive Kubernetes clusters reveals scaling, performance, and operational challenges that led to the creation of the Open Application Model (OAM) and OpenKruise, offering a layered, role‑separated approach to cloud‑native application definition and delivery.

Alibaba Cloud Native

Background

Alibaba began containerizing applications in 2011 with LXC, later migrated to Docker and built a large‑scale scheduler. In 2018 the team adopted Kubernetes (K8s) and has since operated dozens of ultra‑large clusters, the biggest with ~10,000 nodes, serving tens of thousands of applications. Alibaba Cloud Kubernetes Service (ACK) also manages tens of thousands of user clusters worldwide.

Challenges in Large‑Scale K8s

The K8s API has no explicit “application” concept, so a single resource mixes the concerns of developers, operators, and infrastructure engineers. Fields such as replicas or shareProcessNamespace do not clearly belong to any one role, and teams often have to ask who owns them. The ecosystem’s flexibility also produces many custom plugins and controllers. For example, the CronHPA plugin introduced a CronHPA CRD to scale workloads by time and CPU, but it came with documentation gaps, was hard to verify after installation, and conflicted with the native HPA, ultimately requiring more than 20 admission hooks for validation.
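The HPA conflict described above is the kind of rule an admission hook has to enforce. The following is a minimal sketch of such a check, not Alibaba's actual hook: it assumes both scaler objects carry a `spec.scaleTargetRef` (as the native HorizontalPodAutoscaler does), and the CronHPA object shape and all resource names are hypothetical.

```python
from collections import defaultdict


def target_key(obj):
    """Return (namespace, kind, name) of the workload a scaler object points at."""
    ref = obj["spec"]["scaleTargetRef"]
    namespace = obj["metadata"].get("namespace", "default")
    return (namespace, ref["kind"], ref["name"])


def find_scaler_conflicts(scalers):
    """Group scaler objects by target and keep targets claimed by more than one."""
    by_target = defaultdict(list)
    for obj in scalers:
        label = f'{obj["kind"]}/{obj["metadata"]["name"]}'
        by_target[target_key(obj)].append(label)
    return {t: names for t, names in by_target.items() if len(names) > 1}


# Illustrative objects only; the CronHPA shape is an assumption.
hpa = {
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "web-cpu", "namespace": "shop"},
    "spec": {"scaleTargetRef": {"kind": "Deployment", "name": "web"}},
}
cron_hpa = {
    "kind": "CronHPA",
    "metadata": {"name": "web-nightly", "namespace": "shop"},
    "spec": {"scaleTargetRef": {"kind": "Deployment", "name": "web"}},
}
worker_hpa = {
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "worker-cpu", "namespace": "shop"},
    "spec": {"scaleTargetRef": {"kind": "Deployment", "name": "worker"}},
}

conflicts = find_scaler_conflicts([hpa, cron_hpa, worker_hpa])
for target, names in conflicts.items():
    print(target, "is scaled by", names)
```

A real hook would run a check like this at admission time and reject the second scaler. Each plugin that needs its own rule of this kind adds another hook, which is how the count grows.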

Complex Application Delivery Scenarios

Alibaba must support public clouds, private clouds, hybrid clouds, and IoT environments where APIs are inconsistent. This forces a dedicated delivery team to manually bridge gaps, contradicting the “once packaged, run anywhere” promise of containerization.

Layered Application Delivery Model

Following the CNCF Application Delivery Layered Model, Alibaba defines four layers:

Layer 1: Application definition tools (Helm, Kustomize, CNAB).

Layer 2: Delivery pipelines and GitOps tools (Tekton, Flagger).

Layer 3: Operators and workload components (Deployment, StatefulSet, etc.).

Layer 4: Platform layer that manages underlying infrastructure.

Open Application Model (OAM)

Co‑created by Alibaba and Microsoft, OAM is built around three core goals:

No runtime lock‑in – an application definition can run unchanged in any environment.

Clear role separation – developers see only the API relevant to them, while operators see a modular, declarative view.

Modular definition – an application is described in distinct parts rather than a single monolithic YAML.

OAM Architecture

Component: Describes the application‑side concerns (container image, runtime parameters, workload type). Authored by developers.

Trait: Declares operational capabilities (scaling, ingress, monitoring). Authored by operators.

Application Configuration: Binds Components and Traits together to produce a deployable application. Created by operators or automation.

Components expose a WorkloadType (e.g., Server, Worker, Job) and a list of parameters that developers can mark as overridable by operators. Traits are discoverable via kubectl, specify applicable workload types, and define required and optional fields.
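The relationship between the three concepts can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the OAM schema: the class names mirror the concepts above, but every field name (`overridable`, `applies_to`, `required`), the `bind` helper, and the sample values are hypothetical. For brevity, one `Trait` class holds both the trait's definition and the values an operator supplies.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    name: str
    default: object
    overridable: bool = False  # set by the developer; False locks the value


@dataclass
class Component:
    name: str
    workload_type: str  # e.g. "Server", "Worker", "Job"
    image: str
    parameters: list = field(default_factory=list)


@dataclass
class Trait:
    name: str
    applies_to: tuple  # workload types this trait supports
    required: tuple  # property names that must be supplied
    properties: dict = field(default_factory=dict)


def bind(component, traits, overrides):
    """Play the Application Configuration role: validate, then merge."""
    params = {p.name: p.default for p in component.parameters}
    overridable = {p.name for p in component.parameters if p.overridable}
    for key, value in overrides.items():
        if key not in overridable:
            raise ValueError(f"parameter {key!r} is not overridable")
        params[key] = value
    for trait in traits:
        if component.workload_type not in trait.applies_to:
            raise ValueError(f"trait {trait.name!r} does not apply to "
                             f"{component.workload_type!r}")
        missing = [f for f in trait.required if f not in trait.properties]
        if missing:
            raise ValueError(f"trait {trait.name!r} is missing {missing}")
    return {
        "component": component.name,
        "workloadType": component.workload_type,
        "image": component.image,
        "parameters": params,
        "traits": {t.name: t.properties for t in traits},
    }


# Developer side: describe the component and what operators may change.
web = Component(
    name="web",
    workload_type="Server",
    image="example.com/web:1.0",  # placeholder image
    parameters=[Parameter("LOG_LEVEL", "info", overridable=True),
                Parameter("DB_NAME", "orders")],
)
# Operator side: attach a trait and override an allowed parameter.
scaler = Trait("manual-scaler", applies_to=("Server", "Worker"),
               required=("replicas",), properties={"replicas": 3})
app = bind(web, [scaler], {"LOG_LEVEL": "debug"})
print(app)
```

The point of the split is visible in `bind`: the operator can change `LOG_LEVEL` because the developer marked it overridable, while an attempt to override `DB_NAME` is rejected before anything is deployed.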

Rudr Implementation

The reference implementation of OAM is Rudr, a Kubernetes plugin written in Rust. Rudr plays two roles: an admission controller that validates OAM CRDs, and a controller that translates the high‑level OAM spec into native Kubernetes resources (Deployments, Services, Ingress, etc.). It can also provision external cloud resources such as RDS instances.
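To make the translation step concrete, here is a minimal Python sketch of how a high‑level description could be expanded into native manifests. It is not Rudr's code (Rudr is written in Rust) and does not follow the OAM CRD schema: the input dictionary, the `render` function, and the trait names `scaler` and `ingress` are assumptions for illustration. The emitted Deployment, Service, and Ingress follow the standard `apps/v1`, `v1`, and `networking.k8s.io/v1` shapes.

```python
def render(app):
    """Expand a simplified application description into Kubernetes manifests."""
    name = app["name"]
    labels = {"app": name}
    traits = app.get("traits", {})
    replicas = traits.get("scaler", {}).get("replicas", 1)

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{
                    "name": name,
                    "image": app["image"],
                    "ports": [{"containerPort": app["port"]}],
                }]},
            },
        },
    }
    manifests = [deployment]

    ingress = traits.get("ingress")
    if ingress:
        # An ingress trait implies a Service for the Ingress to route to.
        manifests.append({
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {"name": name},
            "spec": {"selector": labels,
                     "ports": [{"port": app["port"],
                                "targetPort": app["port"]}]},
        })
        manifests.append({
            "apiVersion": "networking.k8s.io/v1",
            "kind": "Ingress",
            "metadata": {"name": name},
            "spec": {"rules": [{
                "host": ingress["host"],
                "http": {"paths": [{
                    "path": ingress.get("path", "/"),
                    "pathType": "Prefix",
                    "backend": {"service": {"name": name,
                                            "port": {"number": app["port"]}}},
                }]},
            }]},
        })
    return manifests


# Hypothetical input: one component plus two operator-supplied traits.
manifests = render({
    "name": "web",
    "image": "example.com/web:1.0",
    "port": 8080,
    "traits": {"scaler": {"replicas": 2},
               "ingress": {"host": "web.example.com"}},
})
print([m["kind"] for m in manifests])
```

The operator who writes the two traits never touches the Deployment or Ingress fields directly. That separation is what the controller half of an OAM implementation provides.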

OpenKruise

Alibaba open‑sourced OpenKruise, an advanced workload management project that provides workloads such as CloneSet and SidecarSet, among others, and fits into the third layer of the delivery model.

Benefits of OAM

An OAM YAML file is a self‑contained software package describing containers, parameters, cloud resources, and operational traits.

Enables portable deployments across public clouds, private clouds, hybrid environments, and IoT platforms without rewriting manifests.

Provides role‑based API exposure: developers interact only with Component fields, while operators work with Traits and Application Configuration.

Relevant Open‑Source Resources

Rudr documentation and source code: https://github.com/oam-dev/rudr/tree/master/docs

OpenKruise repository: https://github.com/openkruise/kruise

