How Dada’s Intelligent Elastic Scaling Cuts Costs and Boosts Delivery Performance
This article details how Dada Group built an intelligent elastic scaling architecture that automatically adjusts capacity between peak promotions and low-traffic periods. The system improves delivery reliability, reduces cloud costs, and supports multi-cloud and multi-runtime environments through layered monitoring and an AutoScaler.
1. The Importance of Elastic Capacity and an Initial Look at Auto‑Scaling
During a 2019 promotion, a sudden surge in order volume caused queue backlogs and CPU overloads, leading to manual scaling attempts that failed and resulted in service outages. The incident highlighted the need for proactive capacity planning, automated scaling SOPs, and efficient failure handling.
To address this, Dada adopted an elastic capacity approach that dynamically adjusts resources based on real‑time demand, replacing static capacity planning with adaptive scaling curves.
1.2 Exploring Automatic Scaling
Automatic scaling was enabled by introducing Apollo for configuration management, Consul for service discovery, and an OpenResty+Consul gateway for stateless upstream updates.
Dada defined a baseline instance count for each service and built the first version of the AutoScaler using Falcon water‑level alerts and elastic configurations:
Minimum instances: default 2, adjustable.
Scaling up: for P0/P1 core services, CPU >30% triggers a 50% increase, CPU >50% triggers a 100% increase; for non‑core services, CPU >50% triggers a 50% increase.
Scaling down: CPU <5% triggers an alert, and StackStorm reduces instances by 50%, never dropping below the minimum.
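The threshold rules above can be sketched as a single decision function. This is an illustrative reconstruction, not Dada's actual code; the tier labels and the `desired_instances` helper are assumptions.

```python
# Hypothetical sketch of the first-generation AutoScaler rules described
# above. Tier names and function signature are illustrative.
import math

MIN_INSTANCES = 2  # default minimum, adjustable per service

def desired_instances(tier: str, cpu_pct: float, current: int) -> int:
    """P0/P1 core services: >30% CPU adds 50%, >50% CPU adds 100%.
    Non-core services: >50% CPU adds 50%.
    Below 5% CPU: shrink by 50%, never under the minimum."""
    if tier in ("P0", "P1"):
        if cpu_pct > 50:
            return current * 2               # +100%
        if cpu_pct > 30:
            return math.ceil(current * 1.5)  # +50%
    elif cpu_pct > 50:
        return math.ceil(current * 1.5)      # +50% for non-core
    if cpu_pct < 5:
        return max(MIN_INSTANCES, current // 2)  # halve, respect floor
    return current

print(desired_instances("P0", 55.0, 4))  # 8
print(desired_instances("P2", 3.0, 6))   # 3
```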
The initial AutoScaler automatically expanded instances of high-traffic ("head") services during peaks and shrank them during low-traffic periods, revealing over-provisioned capacity and striking a balance between stability and cost.
However, relying solely on CPU metrics proved insufficient for scenarios such as queue backlogs, connection pool saturation, disk I/O, error logs, or high QPS, prompting the design of a more flexible elastic architecture.
2. Designing an Intelligent Elastic Architecture
Fine‑grained capacity management is crucial for system stability and cloud cost control, with elastic architecture at its core. The design follows a "Perception – Decision – Execution" model.
2.1 Perception – Observing System Metrics
Dada integrates various monitoring sources to collect metrics:
Falcon: CPU, memory, disk I/O, network packets, and variance analysis.
InfluxDB: Middleware request latency and throughput.
Loki: Log‑based metrics such as error class frequencies.
Prometheus: Container and Kubernetes core metrics.
OpenTSDB: Unified time‑series format for aggregated metrics.
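The Collector's normalization step can be sketched as mapping any source record into OpenTSDB's datapoint shape. The field names follow OpenTSDB's `/api/put` JSON format; the helper and its inputs are illustrative, not Dada's actual code.

```python
# Illustrative sketch: normalize one metric sample (from Falcon, InfluxDB,
# Loki, or Prometheus) into an OpenTSDB-style datapoint.
import json
import time

def to_opentsdb(metric, value, tags, ts=None):
    """Build one datapoint in OpenTSDB's /api/put JSON shape."""
    return {
        "metric": metric,
        "timestamp": ts if ts is not None else int(time.time()),
        "value": value,
        "tags": tags,  # e.g. {"service": "order-core", "host": "vm-01"}
    }

point = to_opentsdb("cpu.busy", 42.5, {"service": "order-core"}, ts=1700000000)
print(json.dumps(point))
```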
2.2 Decision – Core of Elastic Scaling
The decision layer consists of several modules:
Configuration: Self‑service rule definition for minimum instances, linked services, cloud, metrics, thresholds, rates, and switches.
Dashboard: Real‑time visualization of instance counts, desired scaling, and cost.
Notification: Alerts to enterprise WeChat and daily summary emails.
Aggregator: Computes aggregated metrics such as average CPU water‑level per service.
Collector: Normalizes time‑series data to OpenTSDB format.
Judge: Decision engine, inspired by the Kubernetes HPA algorithm.
Rule: Central control storing configuration in CMDB.
TSA: Time‑Series Analysis using MA3, MA5, TP50, TP90.
CMDB+Consul: Service metadata for metric calculation.
Cache: Stores historical collector and decision data for predictive modeling.
Decision logic evaluates metrics against configured rules to trigger scaling actions.
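Since the Judge module draws on Kubernetes HPA, its core step can be sketched with the HPA formula: desired = ceil(current × metric / target). The tolerance band that suppresses flapping mirrors HPA's default of 0.1, but that value is an assumption here.

```python
# HPA-style decision sketch (assumption: 0.1 tolerance, as in Kubernetes HPA).
import math

def judge(current_replicas: int, metric_value: float, target_value: float,
          tolerance: float = 0.1) -> int:
    """desired = ceil(current * metric / target), skipped when the
    metric is within the tolerance band around its target."""
    ratio = metric_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: no action
    return math.ceil(current_replicas * ratio)

print(judge(4, 80.0, 50.0))  # 7: scale out under pressure
print(judge(4, 48.0, 50.0))  # 4: within tolerance, hold steady
```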
2.3 Execution – Enforcing Elastic Capacity
Execution is handled by Dispatch and Providers modules:
Dispatch: Concurrently runs scaling workflows, handles retries, logs, and safeguards against terminating active nodes.
Providers: Abstracts scaling interfaces to support multiple deployment platforms (deployng, Tars, Kubernetes, Serverless, etc.).
The AutoScaler reliably captures rising pressure trends and scales out appropriate instance counts, with observed scaling curves matching expectations.
3. Practical Implementation of Elastic Scaling
During rollout, Dada enhanced the Rule engine with fields and switches to meet developer needs, supporting base and benchmark links, service grouping, multi‑cloud, time‑windowed extreme scaling, and multi‑metric coordination. Example query:
SELECT sum("value") FROM "*_queue_*" WHERE ("queue" = 'XXXX') AND time >= now() - 5m GROUP BY time(1m) fill(null)
Key practices include:
Regular scaling drills to verify functionality, metric accuracy, and multi‑metric coordination.
Ensuring scaling respects availability zone distribution and avoids over‑loading dependent resources.
Implementing rate‑limited shrinking to prevent abrupt capacity drops.
Establishing hard limits on maximum instances per service.
Monitoring database connections and planning vertical sharding.
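Rate-limited shrinking can be sketched as capping how much capacity one decision cycle may remove. The step size and floor here are illustrative assumptions, not Dada's actual configuration.

```python
# Sketch of rate-limited shrinking: move toward the desired count in
# bounded steps so one cycle can never drop capacity abruptly.
def shrink_step(current: int, desired: int, min_instances: int,
                max_step_pct: float = 0.25) -> int:
    """Remove at most max_step_pct of current instances per cycle,
    never going below the configured minimum."""
    floor = max(desired, min_instances)
    max_removable = max(1, int(current * max_step_pct))
    return max(floor, current - max_removable)

print(shrink_step(20, 5, 2))  # 15: one bounded step, not straight to 5
print(shrink_step(15, 5, 2))  # 12: next cycle removes 3 more
```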
3.2 Extreme Scaling
During low-traffic windows, extreme scaling aggressively lowers the minimum instance count, then restores it before peak periods (e.g., adding back 10 instances at 06:00, a restore that must succeed every time). Optimizations include pre-warming VMs, using base images, and pinging nodes in advance to reduce provisioning latency.
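Time-windowed extreme scaling can be sketched as a minimum-instance floor that depends on the clock. The window boundaries and instance counts below are illustrative assumptions; only the 06:00 restore time comes from the text.

```python
# Sketch: the effective minimum instance count drops inside an assumed
# low-traffic window and is restored at 06:00 before peak traffic.
from datetime import time as dtime

LOW_TRAFFIC_WINDOW = (dtime(1, 0), dtime(6, 0))  # assumed 01:00-06:00

def effective_minimum(now: dtime, normal_min: int, extreme_min: int) -> int:
    """Return the lowered floor inside the window, the normal floor outside."""
    start, end = LOW_TRAFFIC_WINDOW
    return extreme_min if start <= now < end else normal_min

print(effective_minimum(dtime(3, 0), 12, 2))  # 2 during the window
print(effective_minimum(dtime(6, 0), 12, 2))  # 12 once capacity is restored
```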
Container, Kubernetes, and Serverless technologies further improve scaling reliability.
3.3 Supporting Multiple Runtimes
Dada supports VM, VM+Container, Kubernetes, and Serverless runtimes. A custom init process (dinit) provides multi‑process management, traffic control, and in‑place restarts, enabling graceful scaling across diverse environments.
4. Summary and Outlook
The elastic scaling system has been stable for nearly 20 months, delivering daily auto‑scaling notifications via WeChat. Future work includes predictive scaling using Facebook Prophet, enhancing TSA for anomaly detection, and moving toward fully automated fault self‑healing.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.