How Alibaba’s TPP Intelligent Scheduler Boosts Resource Utilization and Handles Double‑11 Traffic

The article details Alibaba's Taobao Personalization Platform (TPP) intelligent scheduling system, explaining its architecture, optimization algorithms, convergence logic, and performance results that dramatically improve CPU utilization and automate scaling during both regular operation and high‑traffic events like Double‑11.

Alibaba Cloud Developer

Dec 19, 2017

TPP (Taobao Personalization Platform) Overview

TPP supports more than 300 important recommendation scenarios (called scenes below). The platform handles resource allocation for them and keeps them running stably.

Problem Definition & Challenges

Scene owners should not need to manage underlying resources; the platform must maximize CPU utilization while maintaining stability. Manual scheduling is labor‑intensive, slow to respond, and lacks a global view.

Intelligent Scheduling System

The system uses cluster metrics such as CPU usage, load, degraded QPS, current scene QPS, and single‑machine QPS to decide whether to add or remove machines for each scene, which automates scaling. It has four goals:

Maximize resource utilization during normal operation while guaranteeing service quality.

Rapid and accurate cross‑cluster scaling during large promotions.

Support timed events (e.g., red‑packet rain) with pre‑allocation and automatic rollback.

Second‑level scaling (completed within seconds) for critical scenes.

System Architecture

The architecture consists of three layers: data input (KMonitor), algorithm decision, and execution. It relies on fully containerized deployment, Fiber for second‑level scaling, automatic degradation, and automated load testing.
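
The three layers can be pictured as a small pipeline. The sketch below is only an illustration of that separation of concerns, not TPP's actual interfaces: `SceneMetrics`, `MetricsSource`, `Decider`, `Executor`, and `run_once` are hypothetical names, and the fields simply mirror the metrics the article lists.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class SceneMetrics:
    """One snapshot of the metrics the article lists for a scene."""
    scene: str
    cpu_usage: float           # fraction of CPU in use, 0.0 to 1.0
    load: float
    degraded_qps: float        # requests per second that were degraded
    scene_qps: float           # current total QPS for the scene
    single_machine_qps: float  # QPS one machine is expected to sustain


# Layer 1: data input. In TPP this role is played by KMonitor; here it is
# just any callable that returns fresh snapshots (hypothetical interface).
MetricsSource = Callable[[], Iterable[SceneMetrics]]

# Layer 2: algorithm decision. Maps snapshots to a machine delta per scene.
Decider = Callable[[Iterable[SceneMetrics]], Dict[str, int]]

# Layer 3: execution. Applies one scene's delta (add or remove machines).
Executor = Callable[[str, int], None]


def run_once(source: MetricsSource, decide: Decider, execute: Executor) -> Dict[str, int]:
    """Run a single pass through the three layers and return the plan."""
    plan = decide(source())
    for scene, delta in plan.items():
        if delta != 0:  # skip scenes that are already sized correctly
            execute(scene, delta)
    return plan
```

Keeping the layers behind plain callables means the decision logic can be exercised against recorded metrics without touching real machines, which is one plausible reason for splitting the system this way.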

Algorithm Details

The scheduling problem is formalized as a resource‑allocation optimization. For each scene, the required machine change Ni is computed (positive for expansion, negative for contraction) based on either single‑machine QPS or real‑time metrics.

Calculate Ni using single‑machine benchmark QPS or live performance data.

Iteratively adjust allocations every few seconds, aiming to keep CPU near the optimal target (e.g., 40%).

Constraints ensure that the total machine count is not exceeded, that P1 scenes are served first, and that all P1 expansion demands are satisfied.
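
To make the formulation concrete, here is a minimal sketch of computing Ni from single‑machine benchmark QPS and then granting expansions under a fixed machine budget. The article does not give the exact formula or solver, so the ceiling‑based estimate, the greedy P1‑first order, and the names `Scene`, `machine_delta`, and `allocate` are all assumptions made for illustration. It also assumes that machines released by contracting scenes become available in the same round.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Scene:
    name: str
    is_p1: bool
    machines: int              # machines currently assigned
    scene_qps: float           # current QPS for the whole scene
    single_machine_qps: float  # benchmark QPS one machine can sustain


def machine_delta(scene: Scene) -> int:
    """Ni: positive means expand, negative means contract."""
    needed = math.ceil(scene.scene_qps / scene.single_machine_qps)
    return needed - scene.machines


def allocate(scenes: List[Scene], total_machines: int) -> Dict[str, int]:
    """Return the granted delta per scene without exceeding total_machines."""
    deltas = {s.name: machine_delta(s) for s in scenes}
    # Contractions always apply and return machines to the free pool.
    grants = {name: d for name, d in deltas.items() if d < 0}
    used_after_contraction = sum(s.machines for s in scenes) + sum(grants.values())
    free = total_machines - used_after_contraction
    # Serve P1 scenes first, then the rest (assumed greedy policy).
    for s in sorted(scenes, key=lambda s: not s.is_p1):
        d = deltas[s.name]
        if d < 0:
            continue
        granted = min(d, max(free, 0))
        grants[s.name] = granted
        free -= granted
    return grants


scenes = [
    Scene("p1_feed", True, 10, 1500.0, 100.0),   # needs 15 machines, has 10
    Scene("banner", False, 10, 500.0, 100.0),    # needs 5, has 10
    Scene("related", False, 5, 1000.0, 100.0),   # needs 10, has 5
]
grants = allocate(scenes, total_machines=27)
```

With a budget of 27 machines, the P1 scene receives its full expansion of 5, the over‑provisioned scene gives back 5, and the remaining non‑P1 scene gets only the 2 machines that are left. A greedy pass like this cannot guarantee that every P1 demand is met if the pool is too small; the article states that as a constraint the real system satisfies.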

Convergence Logic

The algorithm runs in fixed‑interval iterations (e.g., every 5 seconds), expanding or contracting based on whether a scene’s CPU is outside a steady‑state interval. The process guarantees convergence to a stable state while balancing speed and utilization.
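
The convergence behavior can be illustrated with a toy simulation. The article does not state the update rule, so the proportional resize below, the 30% to 50% steady‑state interval, and the names `step` and `simulate` are assumptions; only the 40% target comes from the text. The sketch also assumes CPU scales inversely with machine count and omits the wait between iterations.

```python
import math
from typing import List


def step(machines: int, cpu: float, target: float = 0.40,
         low: float = 0.30, high: float = 0.50) -> int:
    """Return the machine count for the next iteration.

    Inside [low, high] the scene is considered steady and is left alone.
    Outside it, resize so that CPU would land near `target`.
    """
    if low <= cpu <= high:
        return machines
    return max(1, math.ceil(machines * cpu / target))


def simulate(work: float, machines: int, max_iters: int = 20) -> List[int]:
    """Iterate until steady. `work` is total CPU demand in machine units."""
    history = [machines]
    for _ in range(max_iters):
        cpu = min(1.0, work / machines)  # observed CPU saturates at 100%
        nxt = step(machines, cpu)
        if nxt == machines:
            break
        machines = nxt
        history.append(machines)
    return history


overloaded = simulate(work=40.0, machines=10)   # starts far under-provisioned
idle = simulate(work=10.0, machines=100)        # starts far over-provisioned
```

In the overloaded case the observed CPU is pinned at 100%, so the first resizes underestimate the true demand and it takes a few iterations to reach the steady interval, while the idle case contracts in a single step. A wider interval converges faster but tolerates lower utilization, which is the speed versus utilization trade‑off the section describes.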

Performance Experiments

Daily Mode

After deploying the intelligent scheduler, average CPU utilization rose from 8% to 27%, and the number of machines used dropped to about one‑third of the fixed allocation, while timeout QPS stayed below 0.2%.
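
The two daily‑mode figures are consistent with each other, as a quick back‑of‑the‑envelope check shows. It assumes the total CPU work is the same before and after and that all machines are identical, neither of which the article states.

```python
before, after = 0.08, 0.27  # average CPU utilization reported in the article

# With the same total work, machines needed scale as 1 / utilization.
machine_ratio = before / after
print(f"machines needed relative to before: {machine_ratio:.2f}")
```

The ratio comes out at roughly 0.30, close to the "about one‑third" reported for the machine count.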

Peak Mode (Double‑11)

During the Double‑11 event, the system performed effective peak‑shaving across four key scenes, keeping P1 resource utilization at 30%, non‑P1 at 50%, and timeout QPS under 0.6%.

Future Improvements

Handle unscheduled traffic spikes with limited buffer pools to reduce cache‑misses.

Integrate intelligent scaling with RR‑layer traffic scheduling for smoother warm‑up and reduced degradation.

Conclusion

The TPP intelligent scheduling system enabled fully automated, zero‑manual‑intervention scaling during Double‑11, achieving high resource utilization, low degradation rates, and allowing engineers to focus on business goals rather than resource management.
