How Alibaba Entertainment Automates Capacity Management and Elastic Scaling
Alibaba Entertainment transformed its capacity management from manual, experience‑based decisions into a fully automated system. The system continuously evaluates single‑machine performance, identifies performance and success‑rate breakpoints, and drives elastic scaling, dramatically improving resource utilization, availability, and development efficiency across its applications.
Overview
As business models evolve, Alibaba Entertainment needed to upgrade its R&D capabilities to support faster, more stable, and lower‑cost software delivery. Traditional capacity assessment relied on experience and manual decisions, leading to inefficiency and risk. The goal was to first understand the capacity of each application and then enable elastic scaling, ultimately improving overall resource utilization.
Technical Challenges and Solutions
Single‑machine performance
Existing load‑testing methods (link‑level or traffic‑simulation) were insufficient, prompting the development of a custom single‑machine traffic‑driven testing solution that offers:
Closer alignment with real‑world load.
Elimination of manual decision‑making; bottlenecks are identified automatically.
Full automation, e.g., triggering tests after a release.
Elasticity
Relying solely on cluster CPU usage is inadequate; incorporating QPS provides finer‑grained scaling. The solution supports:
Multi‑dimensional, customizable elasticity metrics.
Composite elasticity strategies using conditional orchestration to coordinate multiple metrics.
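A composite strategy of this kind can be sketched as conditional orchestration over two metrics. The sketch below is illustrative only: the thresholds (CPU 60 %/20 %, and the per‑host QPS bounds) are assumed values, not the production configuration, and the asymmetric AND/OR logic is one common way to avoid scale‑in flapping.

```python
def should_scale_out(cpu_pct: float, qps_per_host: float,
                     cpu_high: float = 60.0, qps_high: float = 800.0) -> bool:
    """Composite rule: scale out if EITHER CPU or per-host QPS is high.

    Thresholds are hypothetical examples, not Alibaba's production values.
    """
    return cpu_pct > cpu_high or qps_per_host > qps_high

def should_scale_in(cpu_pct: float, qps_per_host: float,
                    cpu_low: float = 20.0, qps_low: float = 200.0) -> bool:
    """Scale in only when BOTH metrics are low, to avoid flapping."""
    return cpu_pct < cpu_low and qps_per_host < qps_low
```

Requiring both metrics to be low before shrinking is a deliberately conservative choice: a cluster that is CPU‑idle but still serving high QPS should keep its machines.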
Technical Scheme
1. Fully Automated Single‑Machine Performance Exploration
The system integrates with each access layer, automatically configures traffic weights, and uses a performance‑turning‑point algorithm to detect and stop tests when a bottleneck is reached, recording the machine’s capacity value. Scheduled daily tests or post‑release tests capture performance changes over time.
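The exploration loop described above can be sketched as follows. All callbacks (`get_rt_sample`, `set_weight`, `detect_turning_point`) are hypothetical placeholders for the access‑layer integration, monitoring, and turning‑point algorithm; the weight step and range are assumed values.

```python
def explore_single_machine(get_rt_sample, set_weight, detect_turning_point,
                           start_weight=10.0, step=10.0, max_weight=100.0):
    """Exploration loop sketch (all callbacks are hypothetical):
    raise the test machine's traffic weight until a response-time
    turning point appears, then record the last safe weight as a
    proxy for the machine's capacity."""
    weight = start_weight
    history = []  # (weight, response_time) pairs observed so far
    while weight <= max_weight:
        set_weight(weight)                    # reconfigure the access layer
        rt = get_rt_sample()                  # observe response time at this weight
        history.append((weight, rt))
        if detect_turning_point([r for _, r in history]):
            # the value just before the turning point approximates capacity
            return history[-2][0] if len(history) > 1 else weight
        weight += step
    return max_weight  # no bottleneck found within the explored range
```

A scheduler can invoke this loop daily or after each release, matching the cadence described above.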
2. Elastic Scaling
By linking with the underlying delivery system, a self‑driven resource‑orchestration platform manages planning, delivery, billing, scaling, reclamation, and scheduling. It unifies business resources, discovers idle capacity, shares elastic compute power, and reduces per‑unit cost while ensuring service availability.
3. Combined Framework
Integrating automated performance exploration with elastic scaling yields a combined framework that provides:
Seamless application onboarding.
Default "smart exploration" configuration for most scenarios.
Traffic diversion from other machines to the test machine.
Intelligent turning‑point detection using time‑series trend algorithms.
Single‑machine performance prediction (see technical details).
Basic elasticity configuration (see technical details).
Technical Details
1. Single‑Machine Traffic Weight Optimization
Weight adjustments directly control traffic volume; higher weight means more traffic. Automated strategies include weight optimization (smoothly approach maximum weight after a detected turning point) and weight increment (increase traffic when no turning point is found).
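The two strategies can be sketched as a single weight‑scheduling function. The step size, smoothing factor, and maximum weight below are assumed values for illustration, not the article's production parameters.

```python
def next_weight(current_weight: float, turning_point_found: bool,
                max_weight: float = 100.0, step: float = 10.0,
                smooth_factor: float = 0.5) -> float:
    """Weight scheduling sketch (parameters are illustrative):
    - weight increment: no turning point yet, so raise traffic by a fixed step;
    - weight optimization: after a turning point, back off and converge
      smoothly toward the suspected capacity instead of jumping."""
    if turning_point_found:
        return current_weight - step * smooth_factor
    return min(current_weight + step, max_weight)
```

Shrinking the effective step after a turning point lets the test approach the true maximum weight smoothly, as described above.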
2. Response‑Time Turning‑Point Detection
Algorithm: box‑plot based on IQR with multiple k values (upper bound = Q3 + k·IQR). This method reliably locates most turning points, where the value just before a turning point approximates the machine’s capacity.
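A minimal version of this detector can be written with a sliding baseline window: flag the first sample whose response time exceeds Q3 + k·IQR of the recent history. The window size and the default k = 1.5 are illustrative assumptions; the text notes that multiple k values are used in practice (and a tighter k for the more sensitive success‑rate signal).

```python
import statistics

def detect_rt_turning_point(rt_series, k=1.5, baseline=30):
    """Return the index of the first response-time sample exceeding the
    box-plot upper bound Q3 + k*IQR over a sliding baseline window,
    or None if no turning point is found.

    k and the window size are assumed example values."""
    for i in range(baseline, len(rt_series)):
        window = rt_series[i - baseline:i]
        q1, _, q3 = statistics.quantiles(window, n=4)  # [Q1, Q2, Q3]
        if rt_series[i] > q3 + k * (q3 - q1):
            return i
    return None
```

The sample just before the returned index then approximates the machine's capacity, per the description above.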
3. Success‑Rate Turning‑Point Detection
Using the same IQR‑based box‑plot, but with a tighter k because success‑rate is more sensitive; any fluctuation triggers test termination.
4. Reference to Time‑Series Trend Turning‑Point Extraction
The method follows the "Time‑Series Data Trend Turning‑Point Extraction" paper: each step between consecutive points is classified as rise, fall, or steady, so a three‑point sequence yields one of 3 × 3 = 9 possible trend patterns, each mapped to an appropriate elasticity action.
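The classification step can be sketched directly; the tolerance `eps` used to decide "steady" is an assumed parameter, and mapping patterns to elasticity actions is left to the caller.

```python
def classify_step(delta: float, eps: float = 0.01) -> str:
    """Classify a single step as 'rise', 'fall', or 'steady'.

    eps is an assumed tolerance for treating small changes as steady."""
    if delta > eps:
        return "rise"
    if delta < -eps:
        return "fall"
    return "steady"

def three_point_trend(a: float, b: float, c: float, eps: float = 0.01):
    """Map a three-point sequence to one of the 3x3 = 9 trend patterns,
    e.g. ('rise', 'rise') for monotone growth."""
    return (classify_step(b - a, eps), classify_step(c - b, eps))
```

A pattern such as `('rise', 'rise')` on a load metric would argue for expansion, while `('fall', 'steady')` suggests the cluster can hold or shrink.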
5. Single‑Machine Performance Prediction
Features include system metrics, JVM/Tomcat/DB/Redis/queue pool limits. PCA reduces dimensionality, after which a linear regression model with high fit is used for prediction. Prediction parameters must be realistic (e.g., daily CPU range 2‑60 %, prediction cap set to 80 %).
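Assuming PCA has already reduced the feature set to a single dominant component, the regression step reduces to ordinary least squares, and predictions are clamped to the stated realistic range. The sketch below uses pure‑Python OLS on one feature; the 80 % cap comes from the text, while the single‑feature simplification is an assumption made for brevity.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b on one feature
    (e.g. the first principal component after PCA)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def predict_capacity(model, cpu_pct: float, cap: float = 80.0) -> float:
    """Predict throughput at a target CPU level, clamping the input to a
    realistic cap (80% per the text) so extrapolation stays bounded."""
    a, b = model
    return a * min(cpu_pct, cap) + b
```

Clamping the prediction input mirrors the constraint above: if daily CPU only ranges from 2–60 %, extrapolating far beyond the observed regime is unreliable, so predictions are capped at 80 %.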
6. Traffic‑Driven Elasticity Scheme
CPU‑based rules: expand when CPU > 60 %, shrink when CPU < 20 %; scaling can be proportional (e.g., 5 % of machines). QPS‑based scaling is also supported, allowing rapid adjustments at hour‑ or minute‑level granularity.
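The rules above can be combined into one decision function. The CPU thresholds (60 %/20 %) and the 5 % step come from the text; the QPS headroom fractions (0.8 and 0.3 of the single‑machine capacity obtained from exploration) are assumed values added for illustration.

```python
import math

def scaling_decision(cluster_size: int, cpu_pct: float,
                     qps_per_host: float, capacity_qps: float,
                     cpu_high: float = 60.0, cpu_low: float = 20.0,
                     step_ratio: float = 0.05):
    """Return ('expand'|'shrink'|'hold', machine_count) using the rules in
    the text: expand when CPU > 60%, shrink when CPU < 20%, stepping by
    ~5% of machines. QPS headroom against the explored single-machine
    capacity also triggers expansion (the 0.8/0.3 fractions are assumed)."""
    step = max(1, math.ceil(cluster_size * step_ratio))
    if cpu_pct > cpu_high or qps_per_host > 0.8 * capacity_qps:
        return ("expand", step)
    if cpu_pct < cpu_low and qps_per_host < 0.3 * capacity_qps:
        return ("shrink", step)
    return ("hold", 0)
```

Because `capacity_qps` is refreshed by the daily or post‑release exploration tests, the QPS rule automatically tracks performance drift without manual re‑tuning.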
Conclusion
Deep integration of automated capacity management with elastic scaling resolves current capacity estimation challenges, enabling rational resource usage. Users can focus on business‑level capacity planning while the system handles fine‑grained, second‑level elasticity, increasing cluster density and reducing unit costs.