
How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming

This article details Alibaba Cloud's Flink Cluster Inspector, explaining the business challenges of hotspot machines, the analysis of resource over‑use, and the four‑stage solution—pre‑profiling, in‑process self‑healing, post‑recovery, and observability—that reduces latency, cuts costs, and improves operational efficiency.

Alibaba Cloud Big Data AI Platform

1.1 Real‑Time Compute Cluster Status

Hotspot machines have long been a pain point in operating Alibaba Cloud's Flink clusters, both under daily workloads and during large‑scale promotions. The clusters see thousands of hotspot events per day, and during peak hours hotspots last over 60 minutes, which severely impacts business and platform stability.

Hotspots affect stability, cost, and efficiency: they increase task latency; they prevent the cluster water‑level (overall resource utilization) from being raised, which undermines cost control; and they require manual, time‑consuming mitigation.

1.2 Hotspot Issue Analysis

Even when the cluster's physical water‑level is moderate, Flink allows tasks to over‑use resources, so some Pods consume more than they requested. The result is a large variance in usage across machines, and the most heavily loaded nodes become hotspots.

The occurrence of hotspots correlates with the degree of task over‑use rather than absolute water‑level.
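The relationship between over‑use, usage variance, and hotspots can be illustrated with a minimal sketch. The function names, the 0.8 hotspot threshold, and the sample numbers below are assumptions made for illustration; they do not come from the Inspector itself.

```python
from statistics import mean, pstdev


def overuse_ratio(usage_cores: float, request_cores: float) -> float:
    """Fraction by which a Pod's CPU usage exceeds its request (0.0 = no over-use)."""
    return max(0.0, usage_cores - request_cores) / request_cores


def find_hotspots(node_util: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return nodes whose utilization exceeds a hypothetical hotspot threshold."""
    return sorted(node for node, util in node_util.items() if util > threshold)


# Illustrative utilization per node: the cluster average is moderate,
# but one node carries Pods that use far more than they requested.
node_util = {"node-a": 0.45, "node-b": 0.50, "node-c": 0.95, "node-d": 0.50}

cluster_level = mean(node_util.values())   # overall water-level
spread = pstdev(node_util.values())        # usage variance across machines
hotspots = find_hotspots(node_util)        # nodes above the threshold
```

In this example the average utilization stays near 60% while one node is clearly hot, which matches the observation that hotspots track over‑use rather than the absolute water‑level.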

2 Cost Optimization – Hotspot Handling

2.1 Hotspot Machine Handling Approach

Hotspot handling is divided into two scenarios: "promotion" (large‑scale events) and "daily" operations.

For promotions, proactive profiling is used to predict and avoid over‑use before the event, based on a job‑portrait system.

For daily operations, an in‑process self‑healing capability quickly suppresses hotspots after they appear, complemented by post‑recovery and observability mechanisms.

2.2 Pre‑Profiling

2.2.1 Business Process

Pre‑profiling aims to prevent over‑use during promotions by automatically iterating resource configurations based on target RPS and full‑link pressure testing, without manual tuning.

2.2.2 Implementation

The Autopilot service samples TM resource usage, calculates the 95th percentile during pressure‑test peaks, filters outliers, and generates configurations that meet promotion peak requirements.
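As a rough sketch of this kind of calculation, the snippet below filters outliers from sampled CPU usage and sizes a TM from the 95th percentile. The IQR‑based filter, the nearest‑rank percentile, the order of the two steps, and the `headroom` factor are all assumptions chosen for illustration; the article does not describe how Autopilot implements them.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty list (q in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_outliers(samples: list[float], k: float = 1.5) -> list[float]:
    """Drop samples outside the IQR fences (an assumed outlier rule)."""
    q1, q3 = percentile(samples, 25), percentile(samples, 75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [s for s in samples if low <= s <= high]


def recommend_cpu(samples: list[float], headroom: float = 1.1) -> float:
    """Suggest a CPU request in cores: P95 of cleaned samples plus hypothetical headroom."""
    clean = filter_outliers(samples)
    return round(percentile(clean, 95) * headroom, 2)


# CPU cores sampled from one TM during a pressure-test peak (made-up values);
# the 9.0 reading is a spike that the filter is meant to discard.
samples = [1.0, 1.2, 1.1, 1.3, 1.2, 1.4, 1.3, 1.2, 1.1, 9.0]
recommended = recommend_cpu(samples)
```

Sizing from a high percentile rather than the maximum keeps a single spike from inflating the configuration for the whole promotion.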

2.2.3 Effect

During the Double‑11 promotion, pre‑profiling reduced hotspot occurrences to zero at 65% cluster water‑level and cut costs by 17%, while enabling thousands of automated job optimizations.

2.3 In‑Process Self‑Healing

2.3.1 Technical Choice

Four possible schemes were considered; the dynamic limitation approach using the Inspector was chosen to locally throttle and reschedule over‑using jobs when hotspots arise.

2.3.2 Business Flow

The flow consists of perception, decision, and execution: detecting anomalies, applying a decision tree based on business impact, and performing actions such as eviction or throttling.
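The perception, decision, and execution stages can be sketched as follows. The thresholds, the priority scale, and the rule that protected jobs are never touched are hypothetical; the actual decision tree is not spelled out in the article.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NONE = "none"
    THROTTLE = "throttle"
    EVICT = "evict"


@dataclass
class PodState:
    name: str
    priority: int          # higher means more business-critical (assumed scale)
    usage_cores: float
    request_cores: float

    @property
    def overuse(self) -> float:
        """CPU cores used beyond the request."""
        return max(0.0, self.usage_cores - self.request_cores)


def decide(node_util: float, pods: list[PodState], hot: float = 0.8,
           severe: float = 0.95, protected_priority: int = 5
           ) -> tuple[Optional[PodState], Action]:
    """Tiny decision tree: pick one over-using, unprotected Pod and an action."""
    if node_util <= hot:                       # perception: no anomaly on this node
        return None, Action.NONE
    candidates = [p for p in pods
                  if p.overuse > 0 and p.priority < protected_priority]
    if not candidates:                         # only protected jobs are over-using
        return None, Action.NONE
    # Lowest business priority first, then the largest over-use.
    target = min(candidates, key=lambda p: (p.priority, -p.overuse))
    action = Action.EVICT if node_util > severe else Action.THROTTLE
    return target, action


def execute(target: Optional[PodState], action: Action) -> str:
    """Stand-in for the execution stage; a real system would call the scheduler."""
    return "no-op" if target is None else f"{action.value} {target.name}"


pods = [
    PodState("etl-low", priority=1, usage_cores=4.0, request_cores=2.0),
    PodState("pay-high", priority=9, usage_cores=3.0, request_cores=2.0),
    PodState("report-low", priority=1, usage_cores=2.5, request_cores=2.0),
]
```

Calling `execute(*decide(0.85, pods))` throttles the low‑priority Pod with the largest over‑use, while a utilization above the assumed `severe` threshold escalates to eviction.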

2.3.3 Challenges & Innovations

Decision‑tree design balances business priority and rapid hotspot elimination. Innovations include white‑box scheduling (directly assigning Pods to nodes) and dynamic throttling via JM patching of TM pod specs.
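To make the dynamic throttling idea concrete, here is a hedged sketch of how a reduced CPU limit might be computed and expressed as a patch body for a TM Pod. The throttle factor, the floor, and the container name are assumptions, and the snippet only builds the payload: how the JM submits it, and whether the cluster applies resource changes in place, depends on the Kubernetes version and configuration and is not described in the article.

```python
import json


def throttled_cpu_millis(current_millis: int, factor: float = 0.5,
                         floor_millis: int = 500) -> int:
    """Scale a CPU limit down by a hypothetical factor, never below a floor."""
    return max(floor_millis, int(current_millis * factor))


def build_cpu_limit_patch(container: str, cpu_millis: int) -> dict:
    """Build a merge-style patch body that lowers one container's CPU limit."""
    return {
        "spec": {
            "containers": [
                {
                    "name": container,
                    "resources": {"limits": {"cpu": f"{cpu_millis}m"}},
                }
            ]
        }
    }


# Assumed container name and starting limit, for illustration only.
new_limit = throttled_cpu_millis(4000)
patch = build_cpu_limit_patch("flink-main-container", new_limit)
patch_json = json.dumps(patch)
```

Keeping a floor on the limit is a design choice in this sketch: it slows a low‑priority job without starving it completely.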

2.3.4 Effect

Inspector’s self‑healing suppresses hotspots within three minutes, reduces low‑priority task usage, and lowers latency for high‑priority jobs, as shown in the CPU usage graphs.

2.4 Post‑Recovery

After a hotspot is cleared, throttled jobs are gradually restored to avoid re‑creating hotspots, ensuring stable resource usage and higher cluster water‑level.
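One way to picture gradual restoration is a stepwise ramp that stops as soon as the node would become hot again. The step count, the threshold, and the `projected_util` callback are hypothetical; the article does not state how the restoration pace is chosen.

```python
from typing import Callable


def restore_schedule(throttled_millis: int, original_millis: int,
                     steps: int = 4) -> list[int]:
    """Evenly spaced CPU limits leading from the throttled value back to the original."""
    delta = (original_millis - throttled_millis) / steps
    return [round(throttled_millis + delta * i) for i in range(1, steps + 1)]


def restore(throttled_millis: int, original_millis: int,
            projected_util: Callable[[int], float],
            threshold: float = 0.8, steps: int = 4) -> int:
    """Raise the limit step by step; stop before a step that would re-create a hotspot.

    projected_util maps a candidate limit to the node utilization expected
    after applying it (a stand-in for real monitoring data).
    """
    current = throttled_millis
    for limit in restore_schedule(throttled_millis, original_millis, steps):
        if projected_util(limit) > threshold:
            break
        current = limit
    return current


# Node with spare capacity: the job gets its full limit back.
full = restore(1000, 2000, lambda m: 0.5 + m / 10000)
# Node still busy: restoration pauses at an intermediate limit.
partial = restore(1000, 2000, lambda m: m / 2000)
```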

2.5 Observability

Two perspectives are covered: SRE‑focused monitoring of Inspector actions and user‑focused dashboards showing job latency or failures caused by Inspector, enabling transparent operation and reducing support tickets.

3 Overall Planning and Future Direction

3.1 Roadmap

Flink Cluster Inspector currently focuses on cost‑related hotspot mitigation (CPU, memory, disk). Future work will extend self‑healing to stability (node and service failures) and efficiency (elastic scaling) using cloud‑native technologies such as Operators, Sidecars, and declarative APIs.

3.2 Top‑Level View

The system abstracts perception, decision, and execution into a unified framework, allowing plug‑in development for various self‑healing scenarios and tight integration with Kubernetes scheduling.
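A unified framework of this shape could look roughly like the sketch below, in which each self‑healing scenario is a plug‑in that supplies its own perception, decision, and execution callables. The class names, callable signatures, and the sample CPU scenario are illustrative assumptions rather than the Inspector's real interfaces, and the Kubernetes integration is left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Scenario:
    """One pluggable self-healing scenario (hypothetical interface)."""
    name: str
    perceive: Callable[[dict], Optional[Any]]   # snapshot -> anomaly or None
    decide: Callable[[Any], str]                # anomaly -> action name
    execute: Callable[[str, Any], str]          # (action, anomaly) -> result


class Framework:
    """Runs every registered scenario through the same three stages."""

    def __init__(self) -> None:
        self._scenarios: list[Scenario] = []

    def register(self, scenario: Scenario) -> None:
        self._scenarios.append(scenario)

    def run_once(self, snapshot: dict) -> list[str]:
        results = []
        for scenario in self._scenarios:
            anomaly = scenario.perceive(snapshot)
            if anomaly is None:
                continue                         # nothing to heal for this scenario
            action = scenario.decide(anomaly)
            results.append(scenario.execute(action, anomaly))
        return results


def hottest_node(snapshot: dict) -> Optional[str]:
    """Perception for a CPU-hotspot scenario: the busiest node if above 0.8."""
    node, util = max(snapshot.items(), key=lambda item: item[1])
    return node if util > 0.8 else None


framework = Framework()
framework.register(Scenario(
    name="cpu-hotspot",
    perceive=hottest_node,
    decide=lambda node: "throttle",
    execute=lambda action, node: f"{action}:{node}",
))
```

Adding a new scenario, for example one for node failures, would then mean registering another `Scenario` without changing the loop in `run_once`.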
