
Inside Alibaba’s Sigma: How a Cloud‑Native Scheduler Powers 280× Double‑11 Growth

The article details Alibaba’s Sigma scheduling and cluster management platform: its three‑layer architecture, data and state consistency strategies, real‑world case studies, Go‑based redesign, integration with Kubernetes APIs, and lessons on concurrency, high availability, and pod dispersion for massive Double 11 traffic.

Alibaba Cloud Native

Introduction

Alibaba’s Double 11 sales have grown 280‑fold in transaction volume and more than 800‑fold in peak traffic, so the underlying systems have had to scale by orders of magnitude. To meet this demand, Alibaba built Sigma, a scheduling and cluster‑management system. Development started in 2011, the system was rewritten in Go in 2016, and Kubernetes API compatibility was added later.

Architecture Overview

Sigma consists of three coordinated components. Alikernel is deployed on every physical machine; it enhances the kernel, manages resource and time‑slice allocation, and enforces priority policies. SigmaSlave runs locally on each machine, where it handles container CPU allocation and emergency scenarios. SigmaMaster is the central brain that performs global resource scheduling, algorithmic optimization, and high‑availability decisions. The design follows a “final‑state” model: each request is persisted as a desired state, the scheduler determines placement, and the slaves carry out the local deployment until the actual state matches. This gives strong coordination between the layers with eventual consistency.
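The final‑state model described above can be pictured as a small reconciliation step. The following is a minimal sketch, not Sigma's actual code: the types `DesiredState`, `ActualState`, and `Action` and the functions `Reconcile` and `Apply` are hypothetical names, and an in‑memory map stands in for both the persisted requests and the state a slave would report.

```go
package main

import "fmt"

// DesiredState is a hypothetical persisted request: replicas wanted per app.
type DesiredState map[string]int

// ActualState is what a hypothetical slave reports as currently running.
type ActualState map[string]int

// Action is one step that moves the actual state toward the desired state.
type Action struct {
	App   string
	Delta int // positive: start containers, negative: stop containers
}

// Reconcile compares desired and actual state and returns the actions
// that close the gap. It returns nothing once the states match.
func Reconcile(desired DesiredState, actual ActualState) []Action {
	var actions []Action
	for app, want := range desired {
		if have := actual[app]; have != want {
			actions = append(actions, Action{App: app, Delta: want - have})
		}
	}
	// Apps that are running but no longer desired are stopped.
	for app, have := range actual {
		if _, ok := desired[app]; !ok && have > 0 {
			actions = append(actions, Action{App: app, Delta: -have})
		}
	}
	return actions
}

// Apply enacts the actions on the actual state (an in-memory stand-in
// for the local deployment work a slave would do).
func Apply(actual ActualState, actions []Action) {
	for _, a := range actions {
		actual[a.App] += a.Delta
		if actual[a.App] == 0 {
			delete(actual, a.App)
		}
	}
}

func main() {
	desired := DesiredState{"web": 3, "cart": 2}
	actual := ActualState{"web": 1, "legacy": 4}
	Apply(actual, Reconcile(desired, actual))
	fmt.Println(actual) // prints map[cart:2 web:3]
}
```

Because `Reconcile` is driven only by the stored desired state, running it again after a crash or a retry produces no further actions once the two states agree.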

Case 1 – APIServer Design

The APIServer abstracts publishing, scaling, destroying, and upgrading operations, turning them into tasks stored in Redis. Workers consume these tasks statelessly, allowing rapid failover and multi‑master deployment. The design rests on the following principles:

Data consistency via etcd and Redis with both real‑time and full‑sync mechanisms.

State consistency translated into storage consistency to simplify failure handling.

Prioritizing simplicity over perfect engineering solutions.

High‑availability through multi‑master, stateless design and fast failover.

Graceful degradation and preemptive allocation of scarce resources.

Unified internal and external workflows to support hybrid cloud deployments.
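The idea of turning state consistency into storage consistency can be sketched as follows: every task's status lives in shared storage, and workers keep no state of their own, so any worker can pick up where a failed one stopped. This is an illustrative sketch only. The `Store` type is an in‑memory stand‑in for the Redis‑backed task storage, and `Claim`, `Finish`, `RequeueRunning`, and `RunWorkers` are hypothetical names rather than Sigma APIs.

```go
package main

import (
	"fmt"
	"sync"
)

type Status string

const (
	Pending Status = "pending"
	Running Status = "running"
	Done    Status = "done"
)

type Task struct {
	Op     string
	Status Status
}

// Store is an in-memory stand-in for the shared task storage.
type Store struct {
	mu    sync.Mutex
	tasks map[int]*Task
}

func NewStore() *Store { return &Store{tasks: make(map[int]*Task)} }

func (s *Store) Put(id int, op string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tasks[id] = &Task{Op: op, Status: Pending}
}

// Claim atomically moves one pending task to running and returns its ID.
func (s *Store) Claim() (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, t := range s.tasks {
		if t.Status == Pending {
			t.Status = Running
			return id, true
		}
	}
	return 0, false
}

func (s *Store) Finish(id int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tasks[id].Status = Done
}

// RequeueRunning models failover: tasks left running by a crashed worker
// go back to pending. A real system would use leases or heartbeats to
// decide which running tasks are orphaned; this sketch requeues them all.
func (s *Store) RequeueRunning() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for _, t := range s.tasks {
		if t.Status == Running {
			t.Status = Pending
			n++
		}
	}
	return n
}

func (s *Store) Count(st Status) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	n := 0
	for _, t := range s.tasks {
		if t.Status == st {
			n++
		}
	}
	return n
}

// RunWorkers starts n stateless workers that drain the store.
func RunWorkers(s *Store, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				id, ok := s.Claim()
				if !ok {
					return
				}
				s.Finish(id) // the real work would happen before this call
			}
		}()
	}
	wg.Wait()
}

func main() {
	s := NewStore()
	for i, op := range []string{"publish", "scale", "upgrade", "destroy"} {
		s.Put(i, op)
	}
	s.Claim() // a worker claims a task, then crashes before finishing it
	fmt.Println("requeued:", s.RequeueRunning())
	RunWorkers(s, 3)
	fmt.Println("done:", s.Count(Done), "pending:", s.Count(Pending))
}
```

Since the only durable record is the status in the store, recovering from a worker failure reduces to repairing that record, which is the simplification the list above refers to.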

Case 2 – Scheduler Filtering and Weighting

The Scheduler selects the optimal physical machine for container placement using a two‑stage pipeline: a filtering chain that removes machines unable to host the container, followed by a weighting chain that scores the remaining candidates. Because a single cluster can contain tens of thousands of nodes, concurrency is crucial. Two concurrency models were evaluated:

Coarse‑grained locking at a global level.

Fine‑grained per‑machine locking, which proved faster despite added complexity.

Tests showed that fine‑grained concurrency yields better performance, especially when handling hundreds of thousands of scheduling requests per day.
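A filtering chain followed by a weighting chain, combined with per‑machine locking, might look like the sketch below. It is an assumption‑laden illustration, not Sigma's scheduler: the `Machine`, `Request`, `Filter`, and `Weigher` types, the two example filters, and the single weigher are all hypothetical. Each machine carries its own mutex, so scoring one machine never blocks scoring another.

```go
package main

import (
	"fmt"
	"sync"
)

type Request struct{ CPU, MemGB int }

type Machine struct {
	mu      sync.Mutex // guards the free resources of this machine only
	Name    string
	FreeCPU int
	FreeMem int
}

// Filter rejects machines that cannot host the request.
type Filter func(m *Machine, r Request) bool

// Weigher scores a machine that passed every filter; higher is better.
type Weigher func(m *Machine, r Request) int

func fitsCPU(m *Machine, r Request) bool     { return m.FreeCPU >= r.CPU }
func fitsMem(m *Machine, r Request) bool     { return m.FreeMem >= r.MemGB }
func preferFreeCPU(m *Machine, r Request) int { return m.FreeCPU - r.CPU }

func passes(m *Machine, r Request, filters []Filter) bool {
	for _, f := range filters {
		if !f(m, r) {
			return false
		}
	}
	return true
}

// evaluate runs both chains for one machine under that machine's lock.
func evaluate(m *Machine, r Request, fs []Filter, ws []Weigher) (int, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !passes(m, r, fs) {
		return 0, false
	}
	total := 0
	for _, w := range ws {
		total += w(m, r)
	}
	return total, true
}

// Schedule scores all machines concurrently, then reserves resources on
// the best candidate, re-checking the filters under the machine's lock.
func Schedule(ms []*Machine, r Request, fs []Filter, ws []Weigher) (*Machine, bool) {
	scores := make([]int, len(ms))
	ok := make([]bool, len(ms))
	var wg sync.WaitGroup
	for i, m := range ms {
		wg.Add(1)
		go func(i int, m *Machine) {
			defer wg.Done()
			scores[i], ok[i] = evaluate(m, r, fs, ws) // each goroutine writes its own slot
		}(i, m)
	}
	wg.Wait()
	for {
		best := -1
		for i := range ms {
			if ok[i] && (best == -1 || scores[i] > scores[best]) {
				best = i
			}
		}
		if best == -1 {
			return nil, false
		}
		m := ms[best]
		m.mu.Lock()
		if passes(m, r, fs) { // state may have changed since scoring
			m.FreeCPU -= r.CPU
			m.FreeMem -= r.MemGB
			m.mu.Unlock()
			return m, true
		}
		m.mu.Unlock()
		ok[best] = false // try the next best candidate
	}
}

func main() {
	machines := []*Machine{
		{Name: "a", FreeCPU: 4, FreeMem: 8},
		{Name: "b", FreeCPU: 16, FreeMem: 32},
		{Name: "c", FreeCPU: 2, FreeMem: 64},
	}
	m, ok := Schedule(machines, Request{CPU: 4, MemGB: 8},
		[]Filter{fitsCPU, fitsMem}, []Weigher{preferFreeCPU})
	fmt.Println(ok, m.Name, m.FreeCPU, m.FreeMem) // prints true b 12 24
}
```

With a single global lock, the scoring goroutines would run one at a time. The per‑machine lock lets them proceed in parallel, at the cost of the re‑check before reserving.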

Case 3 – Introducing Go in a Java‑Centric Environment

Although Alibaba traditionally favored Java, the team adopted Go for its performance and simplicity. Early Go implementations faced issues such as map iteration bugs and unsafe concurrent access. By iteratively refining the code and aligning it with Kubernetes concepts, the team demonstrated that Go could coexist with existing Java services and eventually contribute back to the open‑source community.

Case 4 – Pod Dispersion Strategy

To avoid single points of failure, Sigma spreads the pods of an application across physical machines, chassis, and core switches. Dispersion is best‑effort: the scheduler spreads pods as widely as it can and co‑locates them only when no more dispersed placement exists. This differs from a hard (required) anti‑affinity rule in Kubernetes, under which a pod that cannot satisfy the rule stays unscheduled. At Alibaba’s scale the distinction matters, because even a 1% outage can affect millions of users.
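Best‑effort dispersion can be expressed as a penalty score rather than a hard constraint. The sketch below is hypothetical: the `Host` type, the penalty weights (100 per shared switch, 10 per shared chassis, 1 per shared host), and the function names are assumptions chosen for illustration, and chassis names are assumed to be unique across switches.

```go
package main

import "fmt"

// Host describes where a machine sits in the (hypothetical) topology.
type Host struct{ Name, Chassis, Switch string }

// Illustrative weights: sharing a larger failure domain costs more.
const (
	switchPenalty  = 100
	chassisPenalty = 10
	hostPenalty    = 1
)

// penalty counts how much h overlaps with pods already placed.
func penalty(h Host, placed []Host) int {
	p := 0
	for _, q := range placed {
		if q.Switch == h.Switch {
			p += switchPenalty
		}
		if q.Chassis == h.Chassis {
			p += chassisPenalty
		}
		if q.Name == h.Name {
			p += hostPenalty
		}
	}
	return p
}

// PickHost returns the host with the lowest penalty. It always returns a
// host, so pods are co-located only when nothing better exists.
// It assumes hosts is non-empty; ties go to the earliest host.
func PickHost(hosts []Host, placed []Host) Host {
	best := hosts[0]
	bestP := penalty(best, placed)
	for _, h := range hosts[1:] {
		if p := penalty(h, placed); p < bestP {
			best, bestP = h, p
		}
	}
	return best
}

// Place schedules n pods of one application and returns the host names.
func Place(hosts []Host, n int) []string {
	var placed []Host
	names := make([]string, 0, n)
	for i := 0; i < n; i++ {
		h := PickHost(hosts, placed)
		placed = append(placed, h)
		names = append(names, h.Name)
	}
	return names
}

func main() {
	hosts := []Host{
		{"h1", "c1", "s1"}, {"h2", "c1", "s1"},
		{"h3", "c2", "s2"}, {"h4", "c2", "s2"},
	}
	fmt.Println(Place(hosts, 5)) // prints [h1 h3 h2 h4 h1]
}
```

The first four pods land on four different hosts, alternating between switches. The fifth is co‑located only because every host is already in use, which is the behavior a hard anti‑affinity rule would instead turn into an unschedulable pod.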

Code Pitfalls Highlighted

Several common Go pitfalls were identified:

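Storing a pointer to the loop variable while iterating over a map, so that every stored pointer ends up referring to the same value (before Go 1.22 the loop variable was reused across iterations).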

Concurrent reads and writes to a shared map without proper locking, causing race conditions.

Goroutine leaks when a parent task times out but its child goroutines keep running, exhausting resources.
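The three pitfalls above each have a common remedy, sketched below. The source does not show the original code, so the function and type names (`PointersTo`, `SafeMap`, `RunWithTimeout`) are hypothetical. The sketch copies the loop variable before taking its address, guards a shared map with `sync.RWMutex`, and has the child goroutine watch `ctx.Done()` so that it exits when the parent's timeout fires.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// PointersTo returns a pointer per map value. The explicit copy gives each
// entry its own variable; without it, Go versions before 1.22 would make
// every pointer refer to the same reused loop variable.
func PointersTo(m map[string]int) map[string]*int {
	out := make(map[string]*int, len(m))
	for k, v := range m {
		v := v // fresh copy per iteration
		out[k] = &v
	}
	return out
}

// SafeMap guards a shared map so concurrent readers and writers do not race.
type SafeMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func NewSafeMap() *SafeMap { return &SafeMap{m: make(map[string]int)} }

func (s *SafeMap) Inc(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key]++
}

func (s *SafeMap) Get(key string) int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.m[key]
}

// RunWithTimeout starts a child goroutine that works until the context is
// cancelled. The parent returns only after the child has exited, so the
// timeout cannot leave the child running in the background.
func RunWithTimeout(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan struct{})
	go func() {
		defer close(done)
		ticker := time.NewTicker(time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done(): // parent timed out or was cancelled
				return
			case <-ticker.C:
				// one unit of work would go here
			}
		}
	}()

	<-ctx.Done()
	<-done // wait for the child to stop
	return ctx.Err()
}

func main() {
	ptrs := PointersTo(map[string]int{"a": 1, "b": 2})
	fmt.Println(*ptrs["a"], *ptrs["b"]) // prints 1 2

	sm := NewSafeMap()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sm.Inc("requests")
		}()
	}
	wg.Wait()
	fmt.Println(sm.Get("requests")) // prints 100

	fmt.Println(RunWithTimeout(20 * time.Millisecond)) // prints context deadline exceeded
}
```

Running the program with `go run -race` is a quick way to confirm that the map access is free of data races.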

Summary

Key takeaways from Sigma’s evolution include:

Architecture decisions should prioritize language‑agnostic design before choosing implementation languages.

Task granularity and concurrency models directly impact performance at massive scale.

Transforming state consistency into storage consistency simplifies reliability.

Understanding Go’s map semantics and goroutine lifecycle is critical for large‑scale systems.

Controlled timeouts and multi‑level concurrency prevent resource leaks.

Pod dispersion is vital for high‑availability in massive clusters.

Open‑source collaboration (e.g., PouchContainer) helps bridge internal innovations with the broader community.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
