How Dada Achieved Seamless Elastic Scaling for Massive Delivery Peaks
Facing surges during holidays and major shopping events, Dada’s DevOps team built a cloud‑native elastic scaling system that combines fine‑grained capacity management, multi‑cloud support, metric‑driven auto‑scaling, and extreme scale‑down strategies, keeping delivery performance stable while cutting costs.
1. Importance of Elastic Capacity and Auto‑Scaling Overview
1.1 Incident that motivated elasticity
During a 2019 promotion the flower‑order service experienced a sudden surge, causing message‑queue back‑pressure. Engineers manually added instances to upstream services A and B, which then over‑requested service C. Service C’s CPU saturated, time‑outs occurred, and the system remained unavailable for over 30 minutes despite multiple scaling attempts. The post‑mortem highlighted the need for automatic capacity planning, reliable scaling SOPs, and fault‑tolerant handling.
1.2 Early Auto‑Scaling prototype
Key enablers:
Apollo for configuration management
Consul for service discovery and high‑availability data sources
OpenResty‑Consul gateway for stateless upstream updates
Each service was assigned a baseline (minimum instance count). Falcon alerts and elastic configuration drove the first version of the AutoScaler.
AutoScaler configuration example:
Minimum instances: default 2, adjustable per service
Scaling rules (CPU water‑level):
> 30 % → add 50 % of current instances
> 50 % → double the instance count
< 5 % → trigger StackStorm to recycle half of the instances, respecting the minimum
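The rules above can be sketched as a single function. This is a minimal illustration, not the AutoScaler's actual code: the function name, the order in which the thresholds are checked, and the rounding (round up when adding 50 %, round down when recycling half) are assumptions.

```python
import math


def cpu_rule_target(current: int, cpu_percent: float, minimum: int = 2) -> int:
    """Return the target instance count after one evaluation of the CPU rules."""
    if cpu_percent > 50:
        # > 50 %: double the instance count
        target = current * 2
    elif cpu_percent > 30:
        # > 30 %: add 50 % of the current instances (rounding up is assumed)
        target = current + math.ceil(current * 0.5)
    elif cpu_percent < 5:
        # < 5 %: recycle half of the instances (rounding down is assumed)
        target = current - current // 2
    else:
        target = current
    # Never drop below the per-service minimum
    return max(target, minimum)
```

With four instances, a CPU water‑level of 35 % yields six instances, 60 % yields eight, and 3 % yields two, which is also the default minimum.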
The initial deployment expanded automatically during peaks and contracted aggressively during troughs, which exposed over‑provisioned capacity and brought cost closer to actual demand. CPU alone proved an insufficient signal, so queue length, connection counts, disk I/O, error logs, and response latency were later added as scaling metrics.
2. Intelligent Elastic Architecture Design
2.1 Perceive – Metric collection
Metrics are ingested from multiple back‑ends to support heterogeneous runtimes and clouds:
Falcon – CPU, memory, disk I/O, network packets, variance‑based outlier detection
InfluxDB – Custom middleware stores service‑to‑service, cache, and queue latency statistics
Loki – Aggregates info/error logs into Grafana‑derived metrics (e.g., exception counts)
Prometheus – Core Docker/Kubernetes metrics
OpenTSDB – Unified time‑series format for downstream processing
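As a rough sketch of what a unified format means in practice, the function below maps a sample from any back‑end onto the data‑point shape used by OpenTSDB's HTTP put API (metric, timestamp, value, tags). The function name, the `source` tag, and the example metric name are assumptions made for illustration.

```python
from typing import Any


def to_opentsdb_point(
    source: str, metric: str, value: float, ts: float, tags: dict[str, str]
) -> dict[str, Any]:
    """Normalise one sample into an OpenTSDB-style data point."""
    return {
        "metric": metric,
        "timestamp": int(ts),      # seconds since the epoch
        "value": float(value),
        # Record the originating back-end as a tag (hypothetical convention)
        "tags": {**tags, "source": source},
    }
```

For example, `to_opentsdb_point("falcon", "cpu.busy", 42, 1700000000.9, {"service": "order"})` produces a point tagged with both the service and the back‑end it came from.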
2.2 Decide – Decision engine
The core engine consists of modular components:
Configuration : self‑service portal where developers define minimum instances, linked services, cloud environment, scaling metrics, thresholds, rates, and toggles
Dashboard : real‑time view of current vs. desired instances, scaling actions, and cost impact
Notification : pushes scaling events to enterprise WeChat and daily summary emails
Aggregator : computes aggregated metrics (e.g., average CPU water‑level per service group)
Collector : normalises time‑series data to OpenTSDB format
Judge : decision logic inspired by Kubernetes HPA algorithm
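The Aggregator's role can be illustrated with a small example that averages per‑instance samples by service group. The input shape (pairs of group name and value) and the function name are assumptions; the real component presumably supports more aggregation types.

```python
from collections import defaultdict
from collections.abc import Iterable
from statistics import fmean


def average_by_group(samples: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Average metric values per service group, e.g. the CPU water-level."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for group, value in samples:
        grouped[group].append(value)
    return {group: fmean(values) for group, values in grouped.items()}
```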
Periodically the Collector and a Time‑Series Analyzer (TSA) evaluate rules and emit actions (scale‑up, scale‑down, or no‑op). The scaling formula is:
desiredInstances = ceil(currentInstances * (currentMetricValue / desiredMetricValue))
Supporting artefacts:
Rule repository stored in CMDB
TSA uses moving averages (MA3, MA5) combined with TP50/TP90 for trend analysis
CMDB + Consul provides service metadata for metric generation
Cache stores recent Collector and Judge results to accelerate predictions
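The scaling formula and the moving‑average smoothing can be combined in a short sketch. The formula is the one given above; feeding it a moving average rather than the latest raw sample is an assumption about how the TSA and Judge interact, and the function names are hypothetical.

```python
import math
from collections.abc import Sequence


def moving_average(samples: Sequence[float], window: int) -> float:
    """Mean of the last `window` samples (MA3 uses window=3, MA5 uses window=5)."""
    if window <= 0 or len(samples) < window:
        raise ValueError("need at least `window` samples")
    return sum(samples[-window:]) / window


def desired_instances(current: int, current_metric: float, desired_metric: float) -> int:
    """desiredInstances = ceil(currentInstances * (currentMetricValue / desiredMetricValue))"""
    if desired_metric <= 0:
        raise ValueError("desired_metric must be positive")
    return math.ceil(current * (current_metric / desired_metric))
```

With CPU samples of 30, 50, 70, and 60 %, MA3 is 60 %. Against a target of 40 %, four instances become `ceil(4 * 60 / 40) = 6`.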
2.3 Execute – Action layer
Execution relies on two abstractions:
Dispatch : runs scaling workflows concurrently, handles retries, logs audit trails, and integrates with scheduled jobs to avoid terminating active nodes
Providers : unified scaling interface supporting various platforms (deployng, TARS, Kubernetes, Serverless, etc.)
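One way to picture the two abstractions is a provider interface with a single scaling method and a dispatcher that runs actions concurrently with retries. Everything here is hypothetical: the `scale_to` method, the in‑memory provider used as a stand‑in for a real platform adapter, and the retry policy are illustrative assumptions, and audit logging is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Protocol


class Provider(Protocol):
    """Unified scaling interface (hypothetical signature)."""

    def scale_to(self, service: str, instances: int) -> None: ...


class InMemoryProvider:
    """Stand-in for a platform adapter; records the requested instance counts."""

    def __init__(self, failures_before_success: int = 0) -> None:
        self.counts: dict[str, int] = {}
        self._failures_left = failures_before_success

    def scale_to(self, service: str, instances: int) -> None:
        if self._failures_left > 0:
            self._failures_left -= 1
            raise RuntimeError("transient provider error")
        self.counts[service] = instances


def run_with_retries(provider: Provider, service: str, instances: int, attempts: int = 3) -> bool:
    """Try one scaling action up to `attempts` times; report success or failure."""
    for _ in range(attempts):
        try:
            provider.scale_to(service, instances)
            return True
        except RuntimeError:
            continue  # a real dispatcher would likely back off before retrying
    return False


def dispatch(provider: Provider, actions: dict[str, int], attempts: int = 3) -> dict[str, bool]:
    """Run all scaling actions concurrently and collect per-service results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            service: pool.submit(run_with_retries, provider, service, count, attempts)
            for service, count in actions.items()
        }
        return {service: future.result() for service, future in futures.items()}
```

Keeping the platform behind one method means the decision engine never needs to know which runtime a service lives on; adding a platform means adding a provider.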
In production, the AutoScaler reliably detects pressure spikes and adjusts instance counts so that capacity tracks the load curve as expected.
3. Deployment Practices and Lessons Learned
3.1 Regular scaling drills
Frequent drills validate metric accuracy, multi‑metric coordination, zone‑aware instance distribution, and SOP robustness. Key practices include:
Enforce maximum instance caps per service to prevent connection‑pool exhaustion
Monitor database connections and load; prepare vertical sharding plans
Observe cache, queue, and DB pressure during large‑scale events
Track daily scaling success rate, efficiency, and cost trends
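To make the first practice concrete, the sketch below derives an instance cap from a database connection budget and clamps a desired count to it. The derivation (database connection limit divided by per‑instance pool size) is an assumption about how such a cap might be chosen, not a rule stated by the article, and all names and numbers are hypothetical.

```python
def instance_cap(db_max_connections: int, pool_size_per_instance: int, reserved: int = 0) -> int:
    """Largest instance count whose connection pools fit the database limit."""
    if pool_size_per_instance <= 0:
        raise ValueError("pool_size_per_instance must be positive")
    usable = db_max_connections - reserved  # keep some connections for other clients
    return max(usable // pool_size_per_instance, 0)


def clamp_instances(desired: int, minimum: int, maximum: int) -> int:
    """Keep the desired count within the per-service minimum and maximum."""
    return max(minimum, min(desired, maximum))
```

With 1000 database connections, 100 of them reserved, and a pool of 20 connections per instance, the cap is 45 instances, so a request for 60 would be clamped to 45.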
3.2 Extreme shrinkage for cost saving
During low‑traffic windows the system temporarily lowers the minimum instance count and restores it when the window ends. Example: at 06:00 the system scales up 10 instances with a 100 % success guarantee; pre‑warmed VMs, base‑image initialization, and network ping tests make provisioning reliable enough to meet it.
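The windowed minimum can be expressed as a small function. This is a sketch under assumptions: the function name, the window boundaries, and the instance counts are hypothetical, with only the 06:00 restore time taken from the example above.

```python
from datetime import time


def effective_minimum(
    now: time, baseline: int, reduced: int, window_start: time, window_end: time
) -> int:
    """Return the reduced minimum inside the low-traffic window, else the baseline."""
    if window_start <= window_end:
        in_window = window_start <= now < window_end
    else:
        # The window wraps past midnight, e.g. 23:00 to 06:00
        in_window = now >= window_start or now < window_end
    return reduced if in_window else baseline
```

For a window from 23:00 to 06:00 with a baseline of 12 and a reduced minimum of 2, the function returns 2 at 03:00 and 12 from 06:00 onwards, at which point the scaler has to add the missing instances.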
3.3 Multi‑runtime support
The platform supports VM, VM + Container, Kubernetes, and Serverless runtimes. A custom PID‑1 process (“dinit”), which evolved from dumb_init into a Go implementation, provides:
Process management for multi‑process containers
Traffic control: automatic Consul registration after health checks and graceful traffic drain before termination
In‑place restarts, addressing Kubernetes’s lack of native pod restart capability
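The traffic‑control ordering can be sketched as follows. dinit itself is written in Go and its internals are not shown in the article, so this Python outline only illustrates the sequence (register after a passing health check, deregister and drain before stopping); the callback names and the retry parameters are hypothetical.

```python
import time
from typing import Callable


def start_serving(
    health_check: Callable[[], bool],
    register: Callable[[], None],
    retries: int = 5,
    interval: float = 0.0,
) -> bool:
    """Register the instance for traffic only after a health check passes."""
    for _ in range(retries):
        if health_check():
            register()
            return True
        time.sleep(interval)
    return False  # never became healthy, so it never receives traffic


def shut_down(
    deregister: Callable[[], None],
    wait_for_drain: Callable[[], None],
    stop_process: Callable[[], None],
) -> None:
    """Remove the instance from discovery, let in-flight requests finish, then stop."""
    deregister()
    wait_for_drain()
    stop_process()
```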
These capabilities enable seamless scaling across heterogeneous environments and contribute to significant cost reductions for algorithmic services.
4. Outlook
Future improvements focus on:
Reducing decision latency by predicting capacity needs with a Facebook Prophet‑based Predict module
Enhancing the TSA module for earlier anomaly detection and moving toward fully automated fault self‑healing
Continued collaboration between operations and algorithm teams aims to further balance performance, reliability, and cost.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
