
How Alibaba Scales Flink to Millions of Cores: Real‑Time Ops Secrets

This article details Alibaba's decade‑long evolution of its real‑time computing platform, the massive operational challenges of managing Flink clusters at million‑core scale, and the comprehensive strategies—including SLA metrics, self‑healing services, cloud‑native redesign, and job‑level advisory tools—used to ensure stability, cost efficiency, and performance during peak events like Double‑11.


Abstract: This article is compiled from a presentation by Wang Hua, senior operations expert for Alibaba Cloud real‑time computing, at Flink Forward Asia 2021. It covers the platform's evolution history and operational challenges, cluster‑level operations, and job‑level operations, and shares practical tips and resources.

1. Evolution History and Operational Challenges

Alibaba's real‑time computing has rapidly developed over nearly ten years and can be divided into three eras:

Era 1.0 (2013‑2017): Three real‑time engines co‑existed; JStorm and Blink were still referred to as streaming engines.

Era 2.0 (2017‑2021): The three engines were merged, and Blink became the sole engine, driving massive growth from thousands to tens of thousands of nodes.

Era 3.0 (2021 onward): After acquiring Ververica, the German company founded by Flink's original creators, Alibaba built the cloud‑native VVP (Ververica Platform) on the new open‑source Flink engine, which successfully handled the Double‑11 traffic surge.

Today the platform runs millions of cores, tens of thousands of machines, and tens of thousands of jobs, transitioning from an on‑prem Hadoop‑Flink architecture to a fully cloud‑native Kubernetes‑based deployment.

The operational challenges have evolved with the platform size:

Stage 1 – Platform operations: solving ultra‑large‑scale Flink cluster management for SRE teams.

Stage 2 – Application operations: assisting users with complex Flink job maintenance.

Stage 3 – Cloud‑native era: rapidly advancing operations to cloud‑native and intelligent automation.

2. Cluster Operations (Flink Cluster)

Key workloads such as the Double‑11 GMV dashboard demand extremely high stability. The massive scale (tens of thousands of dedicated machines across regions) makes local anomalies frequent, posing two major stability challenges.

Initially, Flink stability was measured by failure count, which missed many short‑duration issues. A minute‑level SLA metric was introduced, using job state (scheduled, running normally, running abnormally) as the Service Level Indicator (SLI). The SLA availability is calculated as:

SLA Unavailable Time = SLA Exception Count × Average Exception Duration
SLA Availability = 1 − SLA Unavailable Time / Total Service Time

Improving stability focuses on reducing exception count (prevention) and shortening exception duration (quick recovery).
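The relationship between the two levers can be made concrete with a small sketch. This is a minimal illustration of the formula above, not Alibaba's actual SLA tooling; the event structure and the monthly window are assumptions.

```python
from dataclasses import dataclass

@dataclass
class ExceptionEvent:
    """One window during which a job was in an abnormal state (minutes)."""
    duration_min: float

def sla_availability(events: list[ExceptionEvent], total_min: float) -> float:
    """Availability = 1 - (unavailable minutes / total service minutes).

    Unavailable time equals exception count x average exception duration,
    which is why stability work targets both factors: fewer exceptions
    (prevention) and shorter exceptions (quick recovery).
    """
    downtime = sum(e.duration_min for e in events)
    return 1.0 - downtime / total_min

# A 30-day month (~43,200 minutes) with three short exceptions:
events = [ExceptionEvent(2), ExceptionEvent(5), ExceptionEvent(3)]
print(round(sla_availability(events, 43_200), 6))  # → 0.999769
```

Halving either the number of exceptions or their average duration halves the unavailable time, so both prevention and recovery investments pay off linearly in this metric.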

Prevention relies on proactive cluster inspection to eliminate hidden risks, such as sudden large‑scale job launches that overload machines or buggy Flink versions affecting thousands of jobs. A self‑healing service scans all job metrics (latency, failover, back‑pressure) and classifies anomalies into user‑side issues (e.g., OOM, back‑pressure) and platform‑side version problems, automatically notifying users or triggering upgrades.
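The classification step of such a self‑healing scan can be sketched as a simple router. The metric names, thresholds, and "known bad versions" set below are hypothetical placeholders, not Alibaba's actual service logic.

```python
# Illustrative sketch of a self-healing scan's anomaly router.
# All field names and thresholds are assumptions for demonstration.
KNOWN_BAD_VERSIONS = {"blink-3.2.1-hotfix"}  # hypothetical buggy build

def classify_anomaly(metrics: dict) -> str:
    """Route an anomalous job to a remediation channel."""
    # User-side symptoms (e.g. OOM, back-pressure): notify the job owner.
    if metrics.get("oom_count", 0) > 0 or metrics.get("backpressure_ratio", 0.0) > 0.5:
        return "user-side: notify job owner"
    # Platform-side symptoms (e.g. a buggy engine build): trigger an upgrade.
    if metrics.get("engine_version") in KNOWN_BAD_VERSIONS:
        return "platform-side: schedule engine upgrade"
    if metrics.get("failover_per_hour", 0) > 3:
        return "user-side: notify job owner"
    return "healthy"

print(classify_anomaly({"backpressure_ratio": 0.8}))
print(classify_anomaly({"engine_version": "blink-3.2.1-hotfix"}))
```

The key design point from the text survives even in this toy version: the scan separates problems the platform team must fix (bad versions) from problems only the job owner can fix (OOM, back‑pressure), so each anomaly reaches the party who can act on it.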

Recovery is complicated by the sheer number of jobs per cluster. To shorten downtime, a two‑level disaster‑recovery strategy is employed:

Same‑city dual‑datacenter deployment for millisecond‑level failover.

Prioritizing high‑importance jobs for resource protection, with automated priority‑based downgrade or migration.

Transparent job migration using shared storage to keep state consistent across clusters.
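The priority‑based protection step above can be sketched as a greedy recovery planner: when capacity is lost, restore the most important jobs first until the surviving cores run out. Job names, priority levels, and capacities here are hypothetical.

```python
# Sketch of priority-based recovery after losing a datacenter.
# Field names and numbers are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int  # lower number = more important (e.g. the GMV dashboard)
    cores: int

def recovery_plan(jobs: list[Job], spare_cores: int) -> list[str]:
    """Admit jobs in priority order until spare capacity is exhausted."""
    plan, used = [], 0
    for job in sorted(jobs, key=lambda j: j.priority):
        if used + job.cores <= spare_cores:
            plan.append(job.name)
            used += job.cores
    return plan

jobs = [Job("gmv-dashboard", 1, 400),
        Job("ad-report", 2, 300),
        Job("batch-backfill", 3, 500)]
print(recovery_plan(jobs, 800))  # → ['gmv-dashboard', 'ad-report']
```

Lower‑priority jobs that do not fit are the ones downgraded or queued for migration once more capacity comes online, matching the automated downgrade/migration behavior described above.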

3. Application Operations (Flink Job)

With tens of thousands of jobs, operational issues such as slow startup, failover, back‑pressure, latency, and resource cost become complex. Alibaba has distilled its operational knowledge into two products:

Flink Job Adviser: detects and diagnoses job anomalies.

Flink Job Operator: automatically repairs identified issues.

Adviser collects lifecycle metrics and logs, builds a decision tree with dozens of rule patterns, and can:

Predict risks before they affect jobs (pre‑emptive detection).

Diagnose issues during execution (e.g., startup errors, performance bottlenecks, data inconsistency).

Provide post‑mortem analysis for historical failures.

Examples include recommending resource expansion for a job that cannot start, indicating platform‑side node failure for a failover, or suggesting memory tuning to avoid OOM.
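A decision tree of rule patterns like the one Adviser builds can be sketched as nested checks keyed on the job's lifecycle phase. The rules below only echo the three examples from the text; the field names and messages are illustrative, not the product's actual rule set.

```python
# Toy rule-pattern decision tree in the spirit of Flink Job Adviser.
# Field names, thresholds, and messages are assumptions for illustration.
def advise(job: dict) -> str:
    if job.get("phase") == "starting":
        # A job that cannot start because it asks for more than its quota.
        if job.get("requested_cores", 0) > job.get("quota_cores", 0):
            return "cannot start: recommend resource/quota expansion"
        return "starting normally"
    if job.get("phase") == "running":
        # A failover traced to a lost node is a platform-side issue.
        if job.get("failover_cause") == "node_lost":
            return "failover caused by platform-side node failure"
        # High heap usage signals an OOM risk worth tuning for.
        if job.get("heap_usage", 0.0) > 0.9:
            return "risk of OOM: suggest memory tuning"
        return "running normally"
    return "no matching rule: needs manual diagnosis"

print(advise({"phase": "starting", "requested_cores": 1000, "quota_cores": 800}))
```

Because every leaf returns an actionable message rather than a raw metric, the same tree serves pre‑emptive detection, in‑flight diagnosis, and post‑mortem analysis simply by replaying it against metrics from different points in the job's lifecycle.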

Operator capabilities are divided into four parts:

Version upgrade and hot configuration updates.

Performance optimization via internal Autopilot.

Cross‑cluster transparent migration for large‑scale job management.

One‑click self‑healing based on Adviser diagnostics.
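The "one‑click self‑healing" handoff from Adviser to Operator amounts to mapping each diagnosis to a repair action, with a human escalation path for anything unrecognized. The diagnosis codes and action strings below are hypothetical placeholders.

```python
# Sketch of the Adviser -> Operator handoff for one-click self-healing.
# Diagnosis codes and actions are illustrative assumptions.
REPAIR_ACTIONS = {
    "oom_risk": "apply memory tuning via hot configuration update",
    "bad_engine_version": "upgrade job to a patched engine version",
    "cluster_degraded": "transparently migrate job to a healthy cluster",
    "underprovisioned": "let Autopilot scale parallelism up",
}

def one_click_heal(diagnosis: str) -> str:
    """Return the automated repair for a diagnosis, or escalate."""
    action = REPAIR_ACTIONS.get(diagnosis)
    return action if action else "escalate to human operator"

print(one_click_heal("bad_engine_version"))
```

Note how the four Operator capabilities listed above each appear as a repair action: hot config updates, Autopilot‑driven optimization, cross‑cluster migration, and version upgrades, with the unmapped default preserving a path back to manual operations.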

Overall, the real‑time computing operations have evolved from manual processes to tool‑driven, platform‑driven, intelligent, and cloud‑native solutions, with a focus on stability, cost, and efficiency.

In short, by building an intelligent, cloud‑native control plane, Alibaba addresses the three core challenges of massive Flink platform and job operations.

Tags: cloud-native, Apache Flink, SLA, real-time computing, cluster operations, job advisory
Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
