
MagicScaler: Achieving High QoS and Low Cost with Uncertainty‑Aware Autoscaling

MagicScaler, a framework from Alibaba Cloud’s big‑data engineering team and collaborators, pairs a multi‑scale attention Gaussian process predictor with an uncertainty‑aware elastic scaling decision engine. On real MaxCompute workloads, it delivers significantly higher quality of service (QoS) at lower operational cost than traditional autoscaling methods.

Alibaba Cloud Big Data AI Platform

Opening

Alibaba Cloud's big data engineering team, together with the MaxCompute team, East China Normal University, and DAMO Academy, recently published the paper “MagicScaler: Uncertainty‑aware, Predictive Autoscaling”, which was accepted at VLDB 2023.

Background

As cloud demand grows, allocating resources to match user demand is crucial for both stability and cost control. Figure 1 shows three typical scaling strategies: Conservative, which over‑provisions (high cost, low QoS risk); Passive, which reacts only after demand has changed (low cost, high QoS risk); and Predictive Autoscaling, which forecasts demand and scales ahead of it.
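The trade-off between the three strategies can be made concrete with a toy simulation. The sketch below is illustrative only: the policy functions, the `evaluate` helper, the demand trace, and the assumed peak of 30 units are all hypothetical, and the "predictive" policy uses a naive linear extrapolation as a stand-in for a real forecaster.

```python
from typing import Callable, List, Tuple

Policy = Callable[[List[float], float], float]

def conservative(history: List[float], peak: float) -> float:
    # Always provision for an assumed peak, whatever recent demand looks like.
    return peak

def passive(history: List[float], peak: float) -> float:
    # Provision exactly what was last observed (reacts after the fact).
    return history[-1]

def predictive(history: List[float], peak: float) -> float:
    # Placeholder forecast: extrapolate the last observed change one step ahead.
    if len(history) < 2:
        return history[-1]
    return max(0.0, 2 * history[-1] - history[-2])

def evaluate(policy: Policy, demand: List[float], peak: float) -> Tuple[float, int]:
    """Return (total provisioned resources, number of steps with supply < demand)."""
    cost, violations = 0.0, 0
    for t in range(1, len(demand)):
        supply = policy(demand[:t], peak)  # decide using only past observations
        cost += supply
        if supply < demand[t]:
            violations += 1
    return cost, violations

# Hypothetical, steadily growing demand trace.
DEMAND = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
for name, policy in [("conservative", conservative),
                     ("passive", passive),
                     ("predictive", predictive)]:
    print(name, evaluate(policy, DEMAND, peak=30.0))
```

On this trace the conservative policy never violates demand but provisions the most, the passive policy is cheapest but lags demand at every step, and the predictive policy sits in between on cost with a single violation during warm-up.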

Challenges

Real workloads exhibit high complexity, uncertainty, and granularity‑sensitive temporal dependencies (Figure 2), making accurate demand prediction difficult and posing challenges for proactive scaling.

Solution: MagicScaler

MagicScaler consists of a predictor and a scheduler.

Predictor

The predictor builds a multi‑scale attention Gaussian process regression model. It first extracts multi‑scale features from historical demand sequences (the MAFE module), then feeds them into a Gaussian Process Regression (GPR) model, which produces demand forecasts with quantified uncertainty (Figure 4).
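To illustrate the idea of "features at several time scales, then GPR with an uncertainty estimate", here is a minimal pure-Python sketch. It is not the paper's model: simple trailing averages stand in for the attention-based feature extraction, the kernel is a plain RBF, and the names `multi_scale_features`, `gp_predict`, `forecast_next`, as well as the scales, length scale, and noise level, are assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple

def multi_scale_features(window: Sequence[float], scales=(1, 3, 6)) -> List[float]:
    # Stand-in for multi-scale feature extraction: trailing mean at each scale.
    return [sum(window[-s:]) / s for s in scales]

def rbf(a: Sequence[float], b: Sequence[float], length_scale: float) -> float:
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length_scale ** 2)

def cholesky(A: List[List[float]]) -> List[List[float]]:
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    # Forward substitution for L y = b.
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    # Back substitution for L^T x = y.
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, y, x_new, length_scale=5.0, noise=1e-2) -> Tuple[float, float]:
    """Posterior mean and standard deviation of a zero-mean GP on standardized targets."""
    n = len(y)
    mean_y = sum(y) / n
    sd = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n) or 1.0
    z = [(v - mean_y) / sd for v in y]
    K = [[rbf(X[i], X[j], length_scale) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, z))          # K^{-1} z
    k_star = [rbf(x, x_new, length_scale) for x in X]
    v = solve_lower(L, k_star)
    var = max(1.0 - sum(t * t for t in v), 0.0)          # prior variance is 1
    mean = sum(k * a for k, a in zip(k_star, alpha))
    return mean_y + sd * mean, sd * math.sqrt(var)

def forecast_next(series: Sequence[float], lookback: int = 6) -> Tuple[float, float]:
    # Build (features of past window -> next value) pairs, then predict one step ahead.
    X = [multi_scale_features(series[t - lookback:t]) for t in range(lookback, len(series))]
    y = [series[t] for t in range(lookback, len(series))]
    return gp_predict(X, y, multi_scale_features(series[-lookback:]))
```

Calling `forecast_next(history)` returns a `(mean, std)` pair. The standard deviation grows when the current feature vector is far from anything seen in training, which is the kind of signal a downstream scaling decision can use to add headroom.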

Scheduler

The scheduler treats elastic scaling as a Markov Decision Process (MDP). Using rolling‑horizon optimization, it approximates the solution of an infinite‑horizon Bellman equation, balancing resource cost against QoS violation risk (Figure 5).
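A rolling-horizon controller of this kind can be sketched as follows: at each step, solve a short finite-horizon dynamic program over the forecast window, apply only the first action, then re-plan when new forecasts arrive. The cost model here (linear resource cost, a switching cost, and a penalty times the Gaussian probability that demand exceeds capacity) and all parameter values are assumptions made for illustration, not the formulation used in the paper.

```python
import math
from typing import List, Sequence, Tuple

Forecast = Tuple[float, float]  # (mean, std) of predicted demand for one step

def violation_prob(capacity: float, mean: float, std: float) -> float:
    # P(demand > capacity) under a Gaussian forecast.
    if std <= 0:
        return 1.0 if mean > capacity else 0.0
    return 0.5 * math.erfc((capacity - mean) / (std * math.sqrt(2)))

def plan_first_action(current: float, forecasts: Sequence[Forecast],
                      levels: Sequence[float], unit_cost: float = 1.0,
                      switch_cost: float = 0.5, penalty: float = 500.0) -> float:
    """Backward induction over the forecast window; return the first-step capacity."""
    if not forecasts:
        return current
    value = {lvl: 0.0 for lvl in levels}  # terminal value after the horizon
    policy = {}
    for mean, std in reversed(forecasts):
        # Per-step cost of running at each level: resources plus expected QoS penalty.
        stage = {a: unit_cost * a + penalty * violation_prob(a, mean, std)
                 for a in levels}
        new_value, policy = {}, {}
        for prev in set(levels) | {current}:
            best = min(levels,
                       key=lambda a: switch_cost * abs(a - prev) + stage[a] + value[a])
            policy[prev] = best
            new_value[prev] = switch_cost * abs(best - prev) + stage[best] + value[best]
        value = new_value
    return policy[current]  # policy from the last iteration is for the first step

def rolling_horizon(windows: Sequence[Sequence[Forecast]],
                    levels: Sequence[float], start: float) -> List[float]:
    # One forecast window per decision step; only the first planned action is applied.
    current, applied = start, []
    for window in windows:
        current = plan_first_action(current, window, levels)
        applied.append(current)
    return applied
```

For example, with `levels = [80, 100, 120, 140]` and a starting capacity of 100, a window of three forecasts `(100, 5)` leads the planner to choose 120, while the same mean with a standard deviation of 20 pushes it to 140. The forecast mean is identical in both cases; only the uncertainty changes the decision.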

Evaluation

Experiments on three MaxCompute clusters with real workloads show that MagicScaler outperforms classic autoscaling algorithms in both cost and QoS.

Future Work

Further research will explore integrating MagicScaler with existing MaxCompute scheduling strategies.
