How Alibaba’s TPP Intelligent Scheduler Boosts Resource Utilization and Handles Double‑11 Traffic
The article details Alibaba's Taobao Personalization Platform (TPP) intelligent scheduling system, explaining its architecture, optimization algorithms, convergence logic, and performance results that dramatically improve CPU utilization and automate scaling during both regular operation and high‑traffic events like Double‑11.
TPP (Taobao Personalization Platform) Overview
TPP supports more than 300 important recommendation scenarios, handling resource allocation and ensuring stable operation across the platform.
Problem Definition & Challenges
Scene owners should not need to manage underlying resources; the platform must maximize CPU utilization while maintaining stability. Manual scheduling incurs huge labor costs, slow response times, and lacks a global view.
Intelligent Scheduling System
The system uses cluster metrics such as CPU usage, load, degraded QPS, current scene QPS, and single‑machine QPS to decide whether to add or remove machines for each scene, automating the scaling process. It is built to meet four goals:
Maximize resource utilization during normal operation while guaranteeing service quality.
Rapid and accurate cross‑cluster scaling during large promotions.
Support timed events (e.g., red‑packet rain) with pre‑allocation and automatic rollback.
Second‑level scaling for critical scenes.
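The add-or-remove decision described above can be pictured with a minimal sketch. The article names the input metrics but not the decision rule, so the thresholds, the `SceneMetrics` type, and the `scaling_direction` function below are all hypothetical and only illustrate how such metrics might be combined.

```python
from dataclasses import dataclass


@dataclass
class SceneMetrics:
    """Per-scene inputs named in the article; field names are assumptions."""
    cpu_usage: float      # average CPU usage as a fraction, 0.0 to 1.0
    load_per_core: float  # load average divided by core count (assumed normalization)
    degraded_qps: float   # requests per second that were degraded
    current_qps: float    # total requests per second for the scene


def scaling_direction(m: SceneMetrics,
                      cpu_low: float = 0.30,
                      cpu_high: float = 0.50,
                      max_load: float = 1.0,
                      max_degraded_ratio: float = 0.001) -> str:
    """Return "expand", "contract", or "hold" using illustrative thresholds."""
    degraded_ratio = m.degraded_qps / m.current_qps if m.current_qps > 0 else 0.0
    # Any sign of pressure (hot CPU, high load, degraded traffic) asks for more machines.
    if (m.cpu_usage > cpu_high
            or m.load_per_core > max_load
            or degraded_ratio > max_degraded_ratio):
        return "expand"
    # Only give machines back when the scene is clearly idle and nothing is degraded.
    if m.cpu_usage < cpu_low and degraded_ratio == 0.0:
        return "contract"
    return "hold"
```

Checking degradation before contraction is a deliberately conservative choice in this sketch: a scene that is already degrading traffic is never shrunk, even if its CPU looks low.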
System Architecture
The architecture consists of three layers: data input (KMonitor), algorithm decision, and execution. It relies on full containerized deployment, Fiber for second‑level scaling, automatic degradation, and automated load testing.
Algorithm Details
The scheduling problem is formalized as a resource‑allocation optimization. For each scene, the required machine change Ni is computed (positive for expansion, negative for contraction) based on either single‑machine QPS or real‑time metrics.
Calculate Ni using single‑machine benchmark QPS or live performance data.
Iteratively adjust allocations every few seconds, aiming to keep CPU near the optimal target (e.g., 40%).
Constraints ensure that the total machine count is not exceeded, that P1 scenes are served first, and that all P1 expansion demands are met.
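A minimal sketch of the allocation step, assuming Ni is derived from single‑machine benchmark QPS and that machines are granted greedily in priority order. The article does not publish its actual optimizer, so the `Scene` type, `machine_delta`, and `allocate` are hypothetical names, and the greedy strategy is a stand‑in rather than the author's method.

```python
import math
from dataclasses import dataclass


@dataclass
class Scene:
    name: str
    priority: int              # 1 means P1; larger numbers are lower priority
    machines: int              # machines currently assigned
    qps: float                 # current scene QPS
    single_machine_qps: float  # benchmark QPS one machine can serve


def machine_delta(scene: Scene) -> int:
    """Ni: positive means expansion, negative means contraction."""
    required = math.ceil(scene.qps / scene.single_machine_qps)
    return required - scene.machines


def allocate(scenes: list[Scene], free_machines: int) -> dict[str, int]:
    """Grant machine changes without exceeding the available pool."""
    deltas = {s.name: machine_delta(s) for s in scenes}
    plan: dict[str, int] = {}
    pool = free_machines
    # Contractions first: released machines return to the shared pool.
    for s in scenes:
        if deltas[s.name] <= 0:
            plan[s.name] = deltas[s.name]
            pool -= deltas[s.name]
    # Expansions in priority order, so P1 demand is served before the rest.
    expanding = [s for s in scenes if deltas[s.name] > 0]
    for s in sorted(expanding, key=lambda s: s.priority):
        grant = min(deltas[s.name], pool)
        plan[s.name] = grant
        pool -= grant
    return plan
```

For example, with two free machines, a P1 scene needing five more, a non‑P1 scene releasing six, and another non‑P1 scene needing five more, the P1 scene receives its full five and the remaining three go to the non‑P1 scene. Note that this sketch only grants P1 scenes what the pool holds; a real system that must satisfy every P1 demand would need an extra step, such as reclaiming machines from lower‑priority scenes.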
Convergence Logic
The algorithm runs in fixed‑interval iterations (e.g., every 5 seconds), expanding or contracting based on whether a scene’s CPU is outside a steady‑state interval. The process guarantees convergence to a stable state while balancing speed and utilization.
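The convergence behaviour can be illustrated with a small simulation. It assumes CPU usage scales inversely with machine count for a fixed workload, a 40% target with a 35% to 45% steady‑state interval, and a cap on how far one iteration may move; these numbers and the function names are assumptions for illustration, not values taken from the article.

```python
import math


def next_machine_count(machines: int, cpu: float, target: float = 0.40,
                       low: float = 0.35, high: float = 0.45,
                       max_step: float = 0.5) -> int:
    """One iteration: leave the scene alone inside [low, high], else move toward target."""
    if low <= cpu <= high:
        return machines
    # Machine count that would bring CPU to the target under the linear assumption.
    # Rounding first guards against floating-point noise pushing ceil() up by one.
    desired = math.ceil(round(machines * cpu / target, 6))
    # Cap the change per iteration so a single noisy reading cannot cause a large swing.
    max_change = max(1, int(machines * max_step))
    change = max(-max_change, min(max_change, desired - machines))
    return max(1, machines + change)


def simulate(total_cpu_demand: float, machines: int, max_iters: int = 50) -> list[int]:
    """Machine counts per iteration; total_cpu_demand is in fully busy machine units."""
    history = [machines]
    for _ in range(max_iters):
        cpu = total_cpu_demand / machines
        nxt = next_machine_count(machines, cpu)
        if nxt == machines:  # inside the steady-state interval, or no further move possible
            break
        machines = nxt
        history.append(machines)
    return history
```

Under these assumptions, a scene with 10 machines at 80% CPU grows to 15 and then 20 machines, where CPU sits at the 40% target and iteration stops. The width of the steady‑state interval is what prevents oscillation: a narrower interval tracks the target more tightly but makes it more likely that whole‑machine steps overshoot it.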
Performance Experiments
Daily Mode
After deploying the intelligent scheduler, average CPU utilization rose from 8% to 27%, and the number of machines used dropped to about one‑third of the fixed allocation, while timeout QPS stayed below 0.2%.
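The two figures are consistent with each other. As a rough check, assume the workload is unchanged and that the total CPU work is the machine count M times the average utilization u (both symbols are introduced here for illustration):

```latex
M_{\text{after}} \, u_{\text{after}} \approx M_{\text{before}} \, u_{\text{before}}
\quad\Longrightarrow\quad
\frac{M_{\text{after}}}{M_{\text{before}}} \approx \frac{u_{\text{before}}}{u_{\text{after}}}
= \frac{8\%}{27\%} \approx 0.30
```

A ratio of about 0.30 matches the reported drop to roughly one‑third of the fixed allocation.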
Peak Mode (Double‑11)
During the Double‑11 event, the system performed effective peak‑shaving across four key scenes, keeping P1 resource utilization at 30%, non‑P1 at 50%, and timeout QPS under 0.6%.
Future Improvements
Handle unscheduled traffic spikes with limited buffer pools to reduce cache‑misses.
Integrate intelligent scaling with RR‑layer traffic scheduling for smoother warm‑up and reduced degradation.
Conclusion
The TPP intelligent scheduling system enabled fully automated, zero‑manual‑intervention scaling during Double‑11, achieving high resource utilization, low degradation rates, and allowing engineers to focus on business goals rather than resource management.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.