
DeepScaling: An Automated Capacity Evaluation System for Stable CPU Utilization in Large‑Scale Cloud Services

DeepScaling is a deep‑learning‑driven autoscaling framework that predicts workload, estimates CPU usage, and makes reinforcement‑learning‑based scaling decisions to keep microservice CPU utilization at a target level, thereby reducing resource waste while meeting SLOs in large‑scale cloud environments.

AntTech

Online service providers such as Google, Facebook, Ant Group, and Tencent often adopt conservative resource allocation policies that keep CPU utilization low to avoid SLO violations, which leads to significant resource and energy waste.

To address this, Ant Group developed DeepScaling, an automated capacity‑evaluation system that stabilizes CPU utilization at a target level using deep learning, ensuring SLO compliance while reducing unnecessary resource consumption.

System Overview

DeepScaling consists of several cooperating modules:

- **Load Balancer** — evenly distributes incoming requests across instances.
- **Service Monitor** — collects per‑service workload and resource metrics.
- **Workload Forecaster** — predicts future workload using a spatio‑temporal graph neural network (STGNN).
- **CPU Utilization Estimator** — models the non‑linear relationship between workload and CPU usage with a deep probabilistic regression network.
- **Scaling Decider** — employs a model‑based DQN to generate scaling actions.
- **Target Level Controller** — maintains the CPU target water‑mark with a safety buffer.
- **SLO Monitor** — detects response‑time or error‑rate violations.
- **Instance Controller** — adds or removes pods to enact scaling decisions.
- **Vertical Pod Autoscaler (VPA)** — tunes per‑service resource limits.
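The end‑to‑end flow can be pictured as a control loop: forecast workload, estimate the CPU it implies, and pick the instance count that puts per‑pod utilization at the target. The sketch below is a minimal illustration only — it substitutes a linear stand‑in for the learned models (the real system uses an STGNN forecaster and a deep probabilistic CPU estimator), and all function names and parameters here are hypothetical:

```python
import math

def target_instances(predicted_rps, cores_per_request, cores_per_instance, target_util):
    """Smallest instance count that keeps per-instance CPU utilization
    at or below the target level (linear stand-in for the CPU estimator)."""
    total_cores = predicted_rps * cores_per_request
    return max(1, math.ceil(total_cores / (cores_per_instance * target_util)))

def scaling_action(current, desired):
    """Map the desired count onto the Scaling Decider's three discrete actions."""
    if desired > current:
        return "scale_out", desired - current
    if desired < current:
        return "scale_in", current - desired
    return "hold", 0

# Example: 1,000 req/s, 0.04 cores per request, 16-core pods, 50% target utilization
desired = target_instances(1000, 0.04, 16, 0.5)   # -> 5 pods
print(scaling_action(3, desired))                  # -> ('scale_out', 2)
```

In the actual system the per‑request cost is not a constant: the CPU Estimator learns it per service, which is why the stable target can be held even as traffic mix shifts.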

Model Details

The Workload Forecaster captures inter‑service traffic dependencies via the STGNN, achieving higher accuracy than traditional time‑series models. The CPU Estimator ingests seven workload metrics plus the service ID, timestamp, and instance count to predict CPU consumption accurately even in the presence of periodic background tasks. The Scaling Decider uses a DQN with a custom loss that penalizes deviation from the target CPU level, selecting actions to increase, decrease, or keep the instance count unchanged.
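The Decider's objective can be illustrated with a toy reward that peaks when utilization sits exactly at the target and drops sharply on an SLO violation. This is a hedged sketch, not the paper's exact formulation; `slo_violated`, the penalty weight, and the one‑step utilization model standing in for the DQN's learned Q‑values are all hypothetical:

```python
def scaling_reward(cpu_util, target=0.5, slo_violated=False, violation_penalty=10.0):
    """Toy reward: zero at the target utilization, increasingly negative as
    utilization drifts away, with a large fixed penalty on SLO violation."""
    if slo_violated:
        return -violation_penalty
    return -abs(cpu_util - target)

def best_action(cpu_util, instances, target=0.5):
    """Greedy choice over the three discrete actions, using a simple
    inverse-proportional model of how utilization responds to pod count."""
    candidates = {
        "scale_out": cpu_util * instances / (instances + 1),
        "hold": cpu_util,
        "scale_in": cpu_util * instances / max(1, instances - 1),
    }
    return max(candidates, key=lambda a: scaling_reward(candidates[a], target))

print(best_action(0.8, 4))  # overloaded -> 'scale_out'
print(best_action(0.5, 4))  # on target  -> 'hold'
```

The real Decider is model‑based: it rolls decisions forward against the Forecaster and Estimator rather than against a closed‑form utilization model, which is what makes proactive (rather than reactive) scaling possible.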

Evaluation

DeepScaling was compared against rule‑based scaling, FIRM (an RL‑based autoscaler), and Google Autopilot on two representative microservices. It consistently kept CPU utilization closest to the target level without violating SLOs, achieving up to 24.6% higher resource‑saving efficiency (RRU) and 14.0% better SLO compliance (RCS) than the best baseline.

In online deployment since mid‑2021, DeepScaling has saved on average 30 k CPU core‑days and 60 k GB‑days of memory per day across Ant Group’s payment services.

Conclusion and Future Work

DeepScaling demonstrates that deep‑learning‑driven autoscaling can balance resource efficiency with strict SLO guarantees in large‑scale cloud systems. Future research will explore heterogeneous machine‑type modeling to further refine pod‑count and instance‑type recommendations.

Tags: Cloud Computing, Microservices, Deep Learning, Resource Management, Autoscaling
Written by

AntTech

Technology is the core driver of Ant's future creation.
