How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming
This article describes Alibaba Cloud's Flink Cluster Inspector: the business challenges caused by hotspot machines, the analysis of resource over‑use behind them, and a four‑stage solution (pre‑profiling, in‑process self‑healing, post‑recovery, and observability) that reduces latency, cuts costs, and improves operational efficiency.
1.1 Real‑Time Compute Cluster Status
Hotspot machines have long been a pain point in operating Alibaba Cloud's Flink clusters, both under daily workloads and during large‑scale promotions. Thousands of hotspot events occur per day, and during peak hours hotspots can persist for over 60 minutes, which severely impacts business and platform stability.
Hotspots affect stability, cost, and efficiency: they increase task latency, keep the cluster water‑level (overall resource utilization) from being raised, which hurts cost control, and require manual, time‑consuming mitigation.
1.2 Hotspot Issue Analysis
Even when the cluster's physical water‑level is moderate, Flink allows tasks to over‑use resources, so some Pods consume more than they requested. This creates large usage variance among machines and turns certain nodes into hotspots.
The occurrence of hotspots correlates with the degree of task over‑use rather than absolute water‑level.
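The distinction between water‑level and over‑use can be illustrated with a small calculation. This is a minimal sketch with made‑up numbers: the node capacity, the 80% hotspot threshold, and the helper names `node_water_level` and `overuse_ratio` are assumptions for illustration, not values or functions from the Inspector.

```python
from statistics import mean

CAPACITY = 16.0          # CPU cores per node (assumed)
HOTSPOT_THRESHOLD = 0.8  # node utilization treated as a hotspot (assumed)

def node_water_level(pods, capacity=CAPACITY):
    """Fraction of the node's CPU capacity actually used by its Pods."""
    return sum(p["used"] for p in pods) / capacity

def overuse_ratio(pods):
    """CPU used above each Pod's request, relative to the total requested."""
    excess = sum(max(0.0, p["used"] - p["request"]) for p in pods)
    return excess / sum(p["request"] for p in pods)

# Every node has the same requests; only actual usage differs.
nodes = {
    "node-a": [{"request": 4, "used": 3.0}, {"request": 4, "used": 3.2}],
    "node-b": [{"request": 4, "used": 7.5}, {"request": 4, "used": 6.9}],
    "node-c": [{"request": 4, "used": 2.0}, {"request": 4, "used": 1.4}],
}

levels = {name: node_water_level(pods) for name, pods in nodes.items()}
cluster_level = mean(list(levels.values()))
hotspots = [name for name, level in levels.items() if level >= HOTSPOT_THRESHOLD]

print(f"cluster water-level: {cluster_level:.0%}")
for name, pods in nodes.items():
    print(name, f"level={levels[name]:.0%}", f"over-use={overuse_ratio(pods):.0%}")
print("hotspots:", hotspots)
```

In this toy data the cluster as a whole sits at a moderate level, yet the one node whose Pods run far above their requests crosses the hotspot threshold.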
2 Cost Optimization – Hotspot Handling
2.1 Hotspot Machine Handling Approach
Hotspot handling is divided into two scenarios: "promotion" (large‑scale events) and "daily" operations.
For promotions, proactive profiling is used to predict and avoid over‑use before the event, based on a job‑portrait system.
For daily operations, an in‑process self‑healing capability quickly suppresses hotspots after they appear, complemented by post‑recovery and observability mechanisms.
2.2 Pre‑Profiling
2.2.1 Business Process
Pre‑profiling aims to prevent over‑use during promotions by automatically iterating resource configurations based on target RPS and full‑link pressure testing, without manual tuning.
2.2.2 Implementation
The Autopilot service samples TaskManager (TM) resource usage, calculates the 95th percentile during pressure‑test peaks, filters outliers, and generates configurations that meet promotion peak requirements.
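The sizing step can be sketched as follows. The source does not say how Autopilot filters outliers or which percentile definition it uses, so the median‑absolute‑deviation filter and the nearest‑rank percentile here are assumptions, as are the function names and sample values.

```python
import math
import statistics

def filter_outliers(samples, k=3.0):
    """Drop samples more than k median-absolute-deviations from the median."""
    med = statistics.median(samples)
    mad = statistics.median(abs(s - med) for s in samples)
    if mad == 0:
        return list(samples)
    return [s for s in samples if abs(s - med) <= k * mad]

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def recommend_cpu(samples):
    """Suggested CPU request: p95 of peak usage after removing outliers."""
    return percentile(filter_outliers(samples), 95)

# CPU cores used by one TaskManager during a pressure-test peak (made-up data);
# the 9.5 reading stands in for a measurement glitch.
peak_samples = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 9.5]
print(recommend_cpu(peak_samples))
```

Filtering before taking the percentile keeps a single spurious reading from inflating the recommended configuration.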
2.2.3 Effect
During the Double‑11 promotion, pre‑profiling reduced hotspot occurrences to zero at 65% cluster water‑level and cut costs by 17%, while enabling thousands of automated job optimizations.
2.3 In‑Process Self‑Healing
2.3.1 Technical Choice
Four possible schemes were considered; the dynamic limitation approach using the Inspector was chosen to locally throttle and reschedule over‑using jobs when hotspots arise.
2.3.2 Business Flow
The flow consists of perception, decision, and execution: detecting anomalies, applying a decision tree based on business impact, and performing actions such as eviction or throttling.
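A toy version of the perception, decision, and execution loop might look like this. The actual decision tree is not given in the source; the rule used here (throttle over‑using low‑priority Pods first, and only evict the largest over‑user when no such Pod exists) is a hypothetical stand‑in, and all names and thresholds are illustrative.

```python
from dataclasses import dataclass

@dataclass
class PodUsage:
    name: str
    priority: str   # "high" or "low" (hypothetical business priority)
    request: float  # requested CPU cores
    used: float     # CPU cores actually used

def perceive(node_util, threshold=0.8):
    """Perception: is the node a hotspot?"""
    return node_util >= threshold

def decide(pods):
    """Decision: pick actions that touch high-priority jobs last."""
    over_users = [p for p in pods if p.used > p.request]
    low = [p for p in over_users if p.priority == "low"]
    if low:
        return [("throttle", p.name) for p in low]
    if over_users:
        worst = max(over_users, key=lambda p: p.used - p.request)
        return [("evict", worst.name)]
    return []

def execute(actions):
    """Execution: here the actions are only rendered as log lines."""
    return [f"{verb}:{pod}" for verb, pod in actions]

def inspect(node_util, pods):
    if not perceive(node_util):
        return []
    return execute(decide(pods))

pods = [
    PodUsage("tm-high", "high", request=4, used=4.5),
    PodUsage("tm-low", "low", request=4, used=7.0),
]
print(inspect(0.9, pods))
```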
2.3.3 Challenges & Innovations
The decision‑tree design balances business priority against rapid hotspot elimination. Innovations include white‑box scheduling (directly assigning Pods to nodes) and dynamic throttling, in which the JobManager (JM) patches TaskManager (TM) Pod specs.
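To make the throttling idea concrete, the sketch below builds a patch body that lowers a container's CPU limit. The nesting follows the Kubernetes Pod spec (`spec.containers[].resources.limits`), but how the JM actually submits such a patch is not described in the source; the function name, the container name `taskmanager`, and the 100m floor are assumptions, and no API call is made.

```python
import json

def build_cpu_limit_patch(container_name, current_limit_millicores, factor):
    """Return a patch body that scales a container's CPU limit by `factor`."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    # Keep a small floor so the container is slowed down, not starved (assumed).
    new_limit = max(100, int(current_limit_millicores * factor))
    return {
        "spec": {
            "containers": [
                {
                    "name": container_name,
                    "resources": {"limits": {"cpu": f"{new_limit}m"}},
                }
            ]
        }
    }

patch = build_cpu_limit_patch("taskmanager", 4000, 0.5)
print(json.dumps(patch, indent=2))
```

Expressing the throttle as a change to the Pod spec keeps the action declarative, so undoing it later is a matter of patching the limit back.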
2.3.4 Effect
Inspector’s self‑healing suppresses hotspots within three minutes, reduces low‑priority task usage, and lowers latency for high‑priority jobs, as the CPU usage graphs in the original article show.
2.4 Post‑Recovery
After a hotspot is cleared, throttled jobs are gradually restored to avoid re‑creating hotspots, ensuring stable resource usage and higher cluster water‑level.
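Gradual restoration can be sketched as a stepwise ramp that stops as soon as the node would heat up again. The source does not specify step sizes or the stop condition; the equal increments, the `node_is_hot` callback, and the function names are assumptions.

```python
def restore_schedule(throttled, original, steps=4):
    """CPU limits that climb from the throttled value back to the original."""
    if steps < 1 or original < throttled:
        raise ValueError("need steps >= 1 and original >= throttled")
    increment = (original - throttled) / steps
    return [round(throttled + increment * i, 3) for i in range(1, steps + 1)]

def restore(throttled, original, node_is_hot, steps=4):
    """Raise the limit step by step; keep the last value that stayed cool."""
    limit = throttled
    for target in restore_schedule(throttled, original, steps):
        if node_is_hot(target):
            break  # stop ramping instead of re-creating the hotspot
        limit = target
    return limit

# Pretend the node becomes hot again once this job is allowed more than 3 cores.
final_limit = restore(2.0, 4.0, node_is_hot=lambda limit: limit > 3.0)
print(restore_schedule(2.0, 4.0), final_limit)
```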
2.5 Observability
Two perspectives are covered: SRE‑focused monitoring of Inspector actions and user‑focused dashboards showing job latency or failures caused by Inspector, enabling transparent operation and reducing support tickets.
3 Overall Planning and Future Direction
3.1 Roadmap
Flink Cluster Inspector currently focuses on cost‑related hotspot mitigation (CPU, memory, disk). Future work will extend self‑healing to stability (node and service failures) and efficiency (elastic scaling) using cloud‑native technologies such as Operators, Sidecars, and declarative APIs.
3.2 Top‑Level View
The system abstracts perception, decision, and execution into a unified framework, allowing plug‑in development for various self‑healing scenarios and tight integration with Kubernetes scheduling.
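One way such a plug‑in framework could be organized is shown below: an abstract base class fixes the three stages, and a registry maps scenario names to implementations. This is a sketch of the pattern only; the class names, the `register` decorator, and the `cpu-hotspot` scenario are hypothetical and do not come from the Inspector codebase.

```python
from abc import ABC, abstractmethod

REGISTRY = {}

def register(plugin_cls):
    """Class decorator that files a plug-in under its scenario name."""
    REGISTRY[plugin_cls.scenario] = plugin_cls
    return plugin_cls

class SelfHealingPlugin(ABC):
    scenario = "base"

    @abstractmethod
    def perceive(self, metrics): ...

    @abstractmethod
    def decide(self, anomalies): ...

    @abstractmethod
    def execute(self, actions): ...

    def run(self, metrics):
        # The framework owns the pipeline; plug-ins only fill in the stages.
        return self.execute(self.decide(self.perceive(metrics)))

@register
class CpuHotspotPlugin(SelfHealingPlugin):
    scenario = "cpu-hotspot"

    def perceive(self, metrics):
        return [node for node, util in metrics.items() if util >= 0.8]

    def decide(self, anomalies):
        return [("throttle", node) for node in anomalies]

    def execute(self, actions):
        return [f"{verb} over-using Pods on {node}" for verb, node in actions]

plugin = REGISTRY["cpu-hotspot"]()
print(plugin.run({"node-a": 0.4, "node-b": 0.93}))
```

A new scenario, such as node failure handling, would then be another registered subclass rather than a change to the pipeline itself.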
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
