
How Dada Achieved Seamless Elastic Scaling for Massive Delivery Peaks

Facing traffic surges during holidays and major shopping events, Dada’s DevOps team built a cloud-native elastic scaling system that combines fine-grained capacity management, multi-cloud support, metric-driven auto-scaling, and aggressive scale-down strategies, keeping delivery services stable while cutting costs.

dbaplus Community

1. Importance of Elastic Capacity and Auto‑Scaling Overview

1.1 Incident that motivated elasticity

During a 2019 promotion, the flower-order service saw a sudden surge that backed up the message queue. Engineers manually added instances to upstream services A and B, which then sent more requests than service C could handle. Service C’s CPU saturated, requests timed out, and the system remained unavailable for over 30 minutes despite multiple scaling attempts. The post-mortem highlighted the need for automatic capacity planning, reliable scaling SOPs, and fault-tolerant handling.

1.2 Early Auto‑Scaling prototype

Key enablers:

Apollo for configuration management

Consul for service discovery and high‑availability data sources

OpenResty‑Consul gateway for stateless upstream updates

Each service was assigned a baseline (minimum instance count). Falcon alerts and elastic configuration drove the first version of the AutoScaler.

AutoScaler configuration example:

Minimum instances: default 2, adjustable per service

Scaling rules (CPU water‑level):

> 30 % → add 50 % of current instances

> 50 % → double the instance count

< 5 % → trigger StackStorm to recycle half of the instances, respecting the minimum
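The thresholds above can be read as a small step function. The following is a minimal sketch, not the original AutoScaler code: it assumes the higher threshold takes precedence, that fractional instance counts round up when expanding, and the function name and signature are hypothetical.

```python
import math


def next_instance_count(current: int, cpu_percent: float, min_instances: int = 2) -> int:
    """Return the target instance count for one evaluation of the CPU rules."""
    if cpu_percent > 50:
        # Above 50 %: double the instance count.
        return current * 2
    if cpu_percent > 30:
        # Above 30 %: add 50 % of the current instances (rounding up is an assumption).
        return current + math.ceil(current * 0.5)
    if cpu_percent < 5:
        # Below 5 %: recycle half of the instances, never dropping below the minimum.
        return max(min_instances, current - current // 2)
    # Between 5 % and 30 %: leave the service as it is.
    return current
```

For example, a service with 4 instances at 35 % CPU would grow to 6, and the same service at 3 % CPU would shrink to the default minimum of 2.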

The initial deployment expanded automatically during peaks and contracted aggressively during troughs, which exposed over-provisioned capacity and brought cost into better balance with actual load. CPU alone proved to be an insufficient signal, so queue length, connection counts, disk I/O, error logs, and response latency were incorporated later.

2. Intelligent Elastic Architecture Design

2.1 Perceive – Metric collection

Metrics are ingested from multiple back‑ends to support heterogeneous runtimes and clouds:

Falcon – CPU, memory, disk I/O, network packets, variance‑based outlier detection

InfluxDB – Custom middleware stores service‑to‑service, cache, and queue latency statistics

Loki – Aggregates info/error logs into Grafana‑derived metrics (e.g., exception counts)

Prometheus – Core Docker/Kubernetes metrics

OpenTSDB – Unified time‑series format for downstream processing
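To illustrate what a unified time-series format might look like, here is a minimal sketch that converts a host-metric record into the JSON shape accepted by OpenTSDB's `/api/put` endpoint (`metric`, `timestamp`, `value`, `tags`). The input record layout and both function names are assumptions made for illustration; the article does not describe the real Collector's input schemas.

```python
from typing import Any, Dict


def to_opentsdb_point(metric: str, value: float, timestamp: int,
                      tags: Dict[str, str]) -> Dict[str, Any]:
    """Build one data point in the JSON shape used by OpenTSDB's /api/put."""
    if not tags:
        # OpenTSDB requires at least one tag per data point.
        raise ValueError("at least one tag is required")
    return {
        "metric": metric,
        "timestamp": int(timestamp),
        "value": float(value),
        "tags": dict(tags),
    }


def normalise_host_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a hypothetical host-metric record into an OpenTSDB data point."""
    # Keep the originating back-end as a tag so downstream rules can filter on it.
    tags = {"host": sample["endpoint"], "source": sample["source"]}
    return to_opentsdb_point(sample["metric"], sample["value"], sample["timestamp"], tags)
```

Keeping the source system as a tag is one possible design choice: it lets later stages treat all metrics uniformly while still telling a Falcon sample apart from a Prometheus one.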

2.2 Decide – Decision engine

The core engine consists of modular components:

Configuration : self‑service portal where developers define minimum instances, linked services, cloud environment, scaling metrics, thresholds, rates, and toggles

Dashboard : real‑time view of current vs. desired instances, scaling actions, and cost impact

Notification : pushes scaling events to enterprise WeChat and daily summary emails

Aggregator : computes aggregated metrics (e.g., average CPU water‑level per service group)

Collector : normalises time‑series data to OpenTSDB format

Judge : decision logic inspired by Kubernetes HPA algorithm
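As an illustration of the Aggregator's role, the sketch below averages per-instance samples into one value per service group. The input shape (pairs of group name and metric value) and the function name are hypothetical; the real Aggregator may weight or filter samples differently.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def average_by_group(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average metric values (for example CPU percent) per service group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for group, value in samples:
        grouped[group].append(value)
    # One aggregated value per group is what the decision logic would consume.
    return {group: mean(values) for group, values in grouped.items()}
```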

Periodically the Collector and a Time‑Series Analyzer (TSA) evaluate rules and emit actions (scale‑up, scale‑down, or no‑op). The scaling formula is:

desiredInstances = ceil(currentInstances * (currentMetricValue / desiredMetricValue))
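The formula can be sketched in Python as follows. The clamping to a minimum and a maximum is an assumption that reflects the per-service baseline and instance caps mentioned elsewhere in the article; the default cap of 100 and the function name are hypothetical.

```python
import math


def desired_instances(current: int, current_metric: float, target_metric: float,
                      min_instances: int = 2, max_instances: int = 100) -> int:
    """Apply the HPA-style ratio formula, then clamp to the allowed range."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # desiredInstances = ceil(currentInstances * currentMetricValue / desiredMetricValue)
    raw = math.ceil(current * current_metric / target_metric)
    return max(min_instances, min(max_instances, raw))
```

For example, 4 instances running at 60 % CPU against a 40 % target yield ceil(4 × 60 / 40) = 6 instances.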

Supporting artefacts:

Rule repository stored in CMDB

TSA uses moving averages (MA3, MA5) combined with TP50/TP90 for trend analysis

CMDB + Consul provides service metadata for metric generation

Cache stores recent Collector and Judge results to accelerate predictions
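The article does not say how the TSA combines moving averages with percentiles, so the following is only a sketch of the building blocks: MA3 and MA5 over the most recent samples, and nearest-rank TP50/TP90. Treating "MA3 above MA5" as a rising trend is an assumption, and all names are hypothetical.

```python
import math
from typing import Dict, Sequence


def moving_average(values: Sequence[float], window: int) -> float:
    """Mean of the last `window` samples (MA3 uses window=3, MA5 uses window=5)."""
    if len(values) < window:
        raise ValueError("not enough samples for the window")
    return sum(values[-window:]) / window


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=90 for TP90."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(values: Sequence[float]) -> Dict[str, object]:
    """Collect trend indicators for one metric series."""
    ma3 = moving_average(values, 3)
    ma5 = moving_average(values, 5)
    return {
        "ma3": ma3,
        "ma5": ma5,
        "tp50": percentile(values, 50),
        "tp90": percentile(values, 90),
        # Assumption: a short average above the long average signals rising load.
        "rising": ma3 > ma5,
    }
```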

2.3 Execute – Action layer

Execution relies on two abstractions:

Dispatch : runs scaling workflows concurrently, handles retries, logs audit trails, and integrates with scheduled jobs to avoid terminating active nodes

Providers : unified scaling interface supporting various platforms (deployng, TARS, Kubernetes, Serverless, etc.)
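One way to picture the two abstractions is a common provider interface plus a dispatcher that runs jobs concurrently and retries failures. This is a sketch under assumptions: the `Provider` protocol, the `scale` method, the retry count, and the worker count are all hypothetical, and audit logging and scheduled-job checks are omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Protocol, Tuple


class Provider(Protocol):
    """Hypothetical unified interface that each platform adapter would implement."""

    def scale(self, service: str, target: int) -> None: ...


def scale_with_retry(provider: Provider, service: str, target: int, attempts: int = 3) -> bool:
    """Try one scaling action up to `attempts` times; report whether it succeeded."""
    for _ in range(attempts):
        try:
            provider.scale(service, target)
            return True
        except Exception:
            # A real dispatcher would log the failure for the audit trail here.
            continue
    return False


def dispatch(jobs: List[Tuple[Provider, str, int]], attempts: int = 3) -> Dict[str, bool]:
    """Run all scaling jobs concurrently and return success per service."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            service: pool.submit(scale_with_retry, provider, service, target, attempts)
            for provider, service, target in jobs
        }
        return {service: future.result() for service, future in futures.items()}
```

Because the dispatcher only depends on the `scale` method, supporting a new platform would mean adding one adapter class rather than touching the workflow code.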

In production, the AutoScaler reliably detects pressure spikes and adjusts instance counts so that capacity tracks the load curve as expected.

3. Deployment Practices and Lessons Learned

3.1 Regular scaling drills

Frequent drills validate metric accuracy, multi‑metric coordination, zone‑aware instance distribution, and SOP robustness. Key practices include:

Enforce maximum instance caps per service to prevent connection‑pool exhaustion

Monitor database connections and load; prepare vertical sharding plans

Observe cache, queue, and DB pressure during large‑scale events

Track daily scaling success rate, efficiency, and cost trends

3.2 Extreme shrinkage for cost saving

During low-traffic windows the system temporarily lowers the minimum instance count and restores it when the window ends. Example: at 06:00 the system scales up by 10 instances, and this scale-up is expected to succeed 100 % of the time; pre-warming VMs, initializing from base images, and running network ping tests improve provisioning reliability.
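The time-window behaviour can be sketched as a function that picks the effective minimum for the current time of day. Only the 06:00 restore time comes from the article; the window start, the reduced minimum, and the function name are assumptions.

```python
from datetime import time


def effective_minimum(now: time, baseline_min: int, trough_min: int,
                      window_start: time, window_end: time) -> int:
    """Return the reduced minimum inside the low-traffic window, else the baseline."""
    if window_start <= window_end:
        in_window = window_start <= now < window_end
    else:
        # The window wraps past midnight, for example 23:00 to 06:00.
        in_window = now >= window_start or now < window_end
    return trough_min if in_window else baseline_min
```

With a hypothetical window of 23:00 to 06:00, a baseline of 10, and a trough minimum of 2, the function returns 2 at 02:00 and 10 again from 06:00 onward, which is the point at which the scale-up has to be reliable.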

3.3 Multi‑runtime support

The platform supports VM, VM + Container, Kubernetes, and Serverless runtimes. A custom PID‑1 process (“dinit”)—evolved from dumb_init to a Go implementation—provides:

Process management for multi‑process containers

Traffic control: automatic Consul registration after health checks and graceful traffic drain before termination

In‑place restarts, addressing Kubernetes’s lack of native pod restart capability

These capabilities enable seamless scaling across heterogeneous environments and contribute to significant cost reductions for algorithmic services.

4. Outlook

Future improvements focus on:

Reducing decision latency by predicting capacity needs with a Facebook Prophet‑based Predict module

Enhancing the TSA module for earlier anomaly detection and moving toward fully automated fault self‑healing

Continued collaboration between operations and algorithm teams aims to further balance performance, reliability, and cost.

[Figure: Incident diagram]
[Figure: AutoScaler architecture]
[Figure: Scaling rule illustration]