How We Built DAGOR: A Scalable Overload Control System for Massive Microservices
This article presents DAGOR, a decentralized overload control framework designed for large‑scale microservice architectures such as WeChat’s backend. It details DAGOR’s service‑agnostic design, priority‑based admission policies, and adaptive algorithms, and summarizes an experimental evaluation demonstrating improved success rates, fairness, and robustness under heavy load.
Abstract
Overload control is crucial for large‑scale online services: without it, load spikes can render the backend unresponsive. Traditional overload control targets individual services, but in architectures with complex dependencies, overload in one service can cause system‑wide damage. DAGOR is a service‑agnostic, system‑centric overload control scheme for account‑oriented microservice architectures. It monitors load at the microservice granularity, triggers load shedding collaboratively across services, and has run in WeChat’s backend for over five years, sustaining high service success rates and fairness under overload.
Keywords
overload control, service admission control, microservice architecture, WeChat
1. Introduction
Overload control keeps services responsive during load spikes, which is essential for online applications that must run 24×7. Cloud elasticity alone cannot solve the problem: a service operator’s capacity is bounded by the resources it has provisioned, so the backend itself must implement overload control.
Traditional overload control assumes few components and simple dependencies, focusing on OS, runtime, or application level. Modern services adopt SOA and microservice architectures with hundreds to thousands of services, making traditional designs insufficient.
Large microservice systems face three challenges: (1) all services must be monitored to avoid cascading overload; (2) combinatorial overload can drastically reduce throughput; (3) centralized admission control slows updates and coordination, so a decentralized, adaptive solution is needed.
2. Background
2.1 WeChat Service Architecture
WeChat’s backend is built on a microservice DAG where vertices are services and edges are call paths. Services are classified as basic (no outgoing edges) or leap (non‑zero outgoing edges). Entry services have zero indegree; all requests start at an entry service and end at a basic service.
WeChat’s entry services receive between 10¹⁰ and 10¹¹ requests per day.
2.2 Service Deployment
Over 3,000 services run on 20,000 machines, with frequent updates (≈1,000 changes per day). Centralized or SLA‑based overload control cannot keep up with such rapid evolution.
2.3 Dynamic Workload
Workloads vary dramatically; peak traffic can be 3× daily average, and up to 10× during events like Chinese New Year. Over‑provisioning is uneconomic, so adaptive overload control is required.
3. Overload in WeChat
Common overload causes include traffic spikes, service version changes, network failures, configuration changes, bugs, and hardware faults. Three basic overload forms are identified (see Figure 2): simple overload of a single service, and two combinatorial overloads where a service is called multiple times or multiple overloaded services are involved.
3.1 Overload Scenarios
Form 1: a basic service M becomes overloaded, affecting its upstream services. Form 2: a single service A calls the overloaded M multiple times within one request (subsequent overload). Form 3: an upstream service calls multiple overloaded services within one request. Combinatorial overload (Forms 2 and 3) reduces throughput dramatically because a request succeeds only if all of its downstream calls succeed.
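The throughput collapse under combinatorial overload can be illustrated with a simple probability sketch (this assumes independent admission decisions at each call, which is an idealization):

```python
# Sketch: if an overloaded service M admits each incoming call
# independently with probability p, an upstream request that must make
# k successful calls to M only succeeds with probability p**k.
def upstream_success_rate(p: float, k: int) -> float:
    """Probability that all k downstream calls are admitted."""
    return p ** k

# With 50% shedding at M, calling M twice quarters A's success rate.
print(upstream_success_rate(0.5, 1))  # 0.5
print(upstream_success_rate(0.5, 2))  # 0.25
```

This is also why DAGOR keys its shedding on request priority: when both calls belonging to the same upstream request carry the same priority tag, they tend to be admitted or rejected together rather than independently, avoiding the p^k collapse.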
3.2 Challenges of Large‑Scale Overload Control
Unlike traditional three‑tier systems, WeChat has no single entry point, and request paths vary per request, making global admission control ineffective. Decentralized overload control must also avoid excessive request aborts that waste bandwidth and CPU.
4. DAGOR Overload Control
DAGOR satisfies the requirements with three design goals:
Service‑agnostic: Works for any microservice without needing service‑specific information.
Independent yet collaborative: Each server makes local decisions, while upstream servers cooperate via shared admission levels.
Efficient and fair: Minimizes wasted computation and ensures similar throughput for upstream services regardless of how many times they call overloaded downstream services.
4.1 Overload Detection
DAGOR monitors the average waiting time of requests in the server queue (not response time) because waiting time reflects server load without being polluted by downstream latency. A sliding window of either 1 s or 2,000 requests triggers an update. The overload threshold is 20 ms; the default request timeout is 500 ms.
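The detection rule above can be sketched as follows (the class and method names are illustrative, not WeChat’s actual code; the 20 ms threshold and the 1 s / 2,000‑request window come from the text):

```python
import time

OVERLOAD_THRESHOLD_MS = 20   # average queuing-time threshold
WINDOW_REQUESTS = 2000       # window size in requests
WINDOW_SECONDS = 1.0         # or in seconds, whichever elapses first

class QueueingTimeDetector:
    """Declares overload when the average queuing time of requests in
    the current window exceeds the threshold. The window closes every
    1 s or every 2,000 requests, whichever comes first."""
    def __init__(self):
        self.samples = []
        self.window_start = time.monotonic()

    def record(self, queuing_time_ms):
        """Returns True/False when a window closes, None otherwise."""
        self.samples.append(queuing_time_ms)
        now = time.monotonic()
        if (len(self.samples) >= WINDOW_REQUESTS
                or now - self.window_start >= WINDOW_SECONDS):
            overloaded = (sum(self.samples) / len(self.samples)
                          > OVERLOAD_THRESHOLD_MS)
            self.samples.clear()
            self.window_start = now
            return overloaded
        return None  # window not complete yet
```

Note that only queuing time is sampled, never downstream response time, so a slow dependency does not make this server look overloaded.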
4.2 Service Admission Control
4.2.1 Business‑oriented Admission
Requests receive a business priority (e.g., login > payment > instant messaging). All downstream requests inherit this priority, and low‑priority requests are shed first during overload.
4.2.2 User‑oriented Admission
User priority is generated by a per‑hour hash of the user ID, providing fairness across users while allowing occasional priority changes.
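A minimal sketch of the per‑hour hashing idea (the function name, SHA‑256 choice, and the 128‑bucket count are illustrative assumptions, not WeChat’s actual parameters):

```python
import hashlib

def user_priority(user_id: str, hour_epoch: int, num_levels: int = 128) -> int:
    """Derive a user priority from a hash of the user ID salted with the
    current hour: stable within an hour, reshuffled across hours.
    num_levels=128 is an illustrative bucket count."""
    digest = hashlib.sha256(f"{user_id}:{hour_epoch}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_levels
```

Because every service computes the same hash for the same hour, a given user is consistently admitted or rejected across the whole request path, which is what makes combinatorial overload tractable.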
4.2.3 Session‑oriented Admission
Similar to user‑oriented, but based on session ID. In practice, users can game the system by logging out and back in, so DAGOR prefers user‑oriented admission.
4.2.4 Adaptive Admission
DAGOR adjusts the composite admission level (B, U) based on load. It maintains a histogram of accepted requests per (B,U) pair and uses a binary‑search‑like algorithm to find the highest level that keeps accepted requests below the expected load.
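The histogram‑driven level adjustment can be sketched as a scan over composite levels (a simplification: this linear scan finds the same cut‑off that the deployed binary‑search‑style algorithm locates more cheaply; names are illustrative):

```python
def next_admission_level(histogram, expected_load):
    """Pick the admission level at which to start shedding: the first
    composite (B, U) level, scanning from highest priority (smallest
    tuple) downward, whose cumulative admitted-request count would
    exceed expected_load. Returns None if everything fits."""
    admitted = 0
    for level in sorted(histogram):      # tuples sort priority-first
        if admitted + histogram[level] > expected_load:
            return level                 # shed this level and below
        admitted += histogram[level]
    return None                          # capacity for everything
```

For example, with 50 recent requests at (1, 1), 60 at (1, 2), and 40 at (2, 1), an expected load of 100 cuts off at (1, 2), since admitting that level would push the total to 110.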
4.2.5 Collaborative Admission
Downstream servers attach their current admission level to responses; upstream servers use this information to locally reject requests that would be dropped downstream, reducing wasted network traffic.
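A sketch of the upstream side of this piggybacking (the class, method names, and the "smaller-or-equal tuple is admitted" convention are illustrative assumptions):

```python
class DownstreamLevelCache:
    """Early-rejection cache: remembers the admission level each
    downstream service piggybacked on its last response, and drops
    requests locally that the downstream would shed anyway."""
    def __init__(self):
        self.levels = {}                      # service -> (B, U) threshold

    def on_response(self, service, admission_level):
        """Record the level attached to a downstream response."""
        self.levels[service] = admission_level

    def should_send(self, service, request_level):
        """Send only if the request would pass downstream admission."""
        threshold = self.levels.get(service)
        if threshold is None:
            return True                       # no feedback yet: send
        return request_level <= threshold     # smaller tuple = higher priority
```

Rejecting locally costs one table lookup instead of a network round trip that is doomed to fail, which is where the bandwidth and CPU savings come from.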
4.3 Overload Control Workflow
1. A client request arrives at an entry service, which assigns its business and user priorities.
2. The service invokes downstream services, propagating the priorities.
3. Each service applies admission control based on its current admission level and adjusts that level periodically.
4. Downstream responses carry the responder’s current admission level.
5. Upstream services update their records with the received level.
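The tagging step at the entry service can be sketched as follows (the table contents, field names, and 128 user‑priority buckets are hypothetical; only the ordering login > payment > messaging comes from the text):

```python
# Hypothetical sketch of entry-service tagging: a request is tagged once
# with (business priority, user priority) and every downstream call
# reuses the same tag, so admission decisions are consistent path-wide.
BUSINESS_PRIORITY = {"login": 1, "payment": 2, "message": 3}  # smaller = higher

def tag_request(request: dict, current_hour: int) -> dict:
    B = BUSINESS_PRIORITY[request["action"]]
    U = hash((request["user_id"], current_hour)) % 128  # per-hour user hash
    request["priority"] = (B, U)                        # inherited downstream
    return request
```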
5. Evaluation
5.1 Experimental Setup
Experiments run on an internal cluster (Intel Xeon E5‑2698, 64 GB RAM, 10 GbE). Workloads simulate an encryption service (M) that saturates at ~750 QPS and a messaging service (A) that calls M multiple times. Four workload types (M1–M4) represent simple and combinatorial overload.
5.2 Overload Detection
DAGOR‑q (queue‑time based) is compared with DAGOR‑r (response‑time based). Results show DAGOR‑q delays shedding until higher input rates, avoiding false overload alarms.
5.3 Service Admission Control
DAGOR is compared with CoDel, SEDA, and a random shedding baseline. Under combinatorial overload (M2–M4), DAGOR achieves ~50 % higher success rates and approaches the theoretical optimum.
5.4 Fairness
Using a mixed workload of M1–M4, DAGOR maintains similar success rates across overload types, while CoDel favors simple overload, demonstrating DAGOR’s fairness.
6. Related Work
Prior overload control research focuses on databases, stream processing, sensor networks, and web services, which target single‑service architectures. DAGOR is the first to address overload control for large‑scale microservice systems. Compared with Varys, Baraat, session‑based admission, and hierarchical overload control, DAGOR is service‑agnostic, non‑intrusive, and fully decentralized.
7. Conclusion
DAGOR provides a lightweight, service‑agnostic, independent‑yet‑collaborative, efficient, and fair overload control solution for massive microservice deployments. Deployed in WeChat for over five years, it has proven effective and offers design insights for other large‑scale systems.
Lessons Learned:
Overload control must be decentralized and autonomous per service.
Algorithms should combine multiple feedback signals rather than rely on open‑loop heuristics.
Effective designs stem from thorough analysis of real‑world workloads.
Acknowledgments
We thank the anonymous reviewers for their valuable feedback.
Footnotes
Examples of account services: login, personalization, contacts, message inbox.
Forms 2 and 3 also appear in GFS‑style systems where large files are split across storage servers.
Response time is defined as the interval between request arrival and response transmission.
Operation‑priority hash tables are rarely modified (≈once or twice per year).
Earlier APIs allowing developers to set service‑specific business priorities proved hard to manage and were removed.
Unless otherwise noted, “admission level” refers to the composite (B,U) level.
Ordering of admission levels: (B1,U1) < (B2,U2) if B1 < B2 or B1 = B2 and U1 < U2.
Requests may be retried up to three times after rejection.
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.