How We Built DAGOR: A Scalable Overload Control System for Massive Microservices
This article presents DAGOR, a decentralized overload control framework designed for large‑scale microservice architectures such as WeChat’s backend. It details DAGOR’s service‑agnostic design, priority‑based admission policies, and adaptive algorithms, and summarizes an experimental evaluation demonstrating improved success rates, fairness, and robustness under heavy load.
Abstract
Overload control is crucial for large‑scale online services: without it, load spikes can render the backend unresponsive. Traditional overload control targets individual services, but in architectures with complex dependencies, overload in one service can cause system‑wide damage. DAGOR is a service‑agnostic, system‑centric overload control scheme for account‑oriented microservice architectures. It monitors load at the microservice granularity, triggers load shedding collaboratively across services, and has run in WeChat’s backend for over five years, sustaining high service success rates and fairness under overload.
Keywords
overload control, service admission control, microservice architecture, WeChat
1. Introduction
Overload control keeps services responsive during load spikes, which is essential for online applications that must run 24×7. Cloud elasticity alone cannot solve the problem: a service operator’s capacity is bounded by the resources it has provisioned, so the backend itself must implement overload control.
Traditional overload control assumes few components and simple dependencies, focusing on OS, runtime, or application level. Modern services adopt SOA and microservice architectures with hundreds to thousands of services, making traditional designs insufficient.
Large microservice systems face three challenges: (1) all services must be monitored to avoid cascading overload; (2) combinatorial overload can drastically reduce throughput; (3) centralized admission control slows updates and coordination, so a decentralized, adaptive solution is needed.
2. Background
2.1 WeChat Service Architecture
WeChat’s backend is built on a microservice DAG where vertices are services and edges are call paths. Services are classified as basic (no outgoing edges) or leap (non‑zero outgoing edges). Entry services have zero indegree; all requests start at an entry service and end at a basic service.
WeChat’s entry services receive between 10¹⁰ and 10¹¹ requests per day.
2.2 Service Deployment
Over 3,000 services run on 20,000 machines, with frequent updates (≈1,000 changes per day). Centralized or SLA‑based overload control cannot keep up with such rapid evolution.
2.3 Dynamic Workload
Workloads vary dramatically; peak traffic can be 3× daily average, and up to 10× during events like Chinese New Year. Over‑provisioning is uneconomic, so adaptive overload control is required.
3. Overload in WeChat
Common overload causes include traffic spikes, service version changes, network failures, configuration changes, bugs, and hardware faults. Three basic overload forms are identified (see Figure 2): simple overload of a single service, and two combinatorial overloads where a service is called multiple times or multiple overloaded services are involved.
3.1 Overload Scenarios
Form 1: a basic service M becomes overloaded, affecting its upstream services. Form 2: a single service A calls the overloaded M multiple times within one request (subsequent overload). Form 3: an upstream service calls multiple overloaded services within one request. Combinatorial overload (Forms 2 and 3) reduces throughput dramatically because a request succeeds only if all of its downstream calls succeed.
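The throughput collapse under combinatorial overload can be illustrated with a simple probability sketch (this assumes independent admission decisions at each call, which is an idealization):

```python
# Sketch: if an overloaded service M admits each incoming call
# independently with probability p, an upstream request that must make
# k successful calls to M only succeeds with probability p**k.
def upstream_success_rate(p: float, k: int) -> float:
    """Probability that all k downstream calls are admitted."""
    return p ** k

# With 50% shedding at M, calling M twice quarters A's success rate.
print(upstream_success_rate(0.5, 1))  # 0.5
print(upstream_success_rate(0.5, 2))  # 0.25
```

This is also why DAGOR keys its shedding on request priority: when both calls belonging to the same upstream request carry the same priority tag, they tend to be admitted or rejected together rather than independently, avoiding the p^k collapse.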
3.2 Challenges of Large‑Scale Overload Control
Unlike traditional three‑tier systems, WeChat has no single entry point, and request paths vary per request, making global admission control ineffective. Decentralized overload control must also avoid excessive request aborts that waste bandwidth and CPU.
4. DAGOR Overload Control
DAGOR satisfies the requirements with three design goals:
Service‑agnostic: Works for any microservice without needing service‑specific information.
Independent yet collaborative: Each server makes local decisions, while upstream servers cooperate via shared admission levels.
Efficient and fair: Minimizes wasted computation and ensures similar throughput for upstream services regardless of how many times they call overloaded downstream services.
4.1 Overload Detection
DAGOR monitors the average waiting time of requests in the server queue (not response time) because waiting time reflects server load without being polluted by downstream latency. A sliding window of either 1 s or 2,000 requests triggers an update. The overload threshold is 20 ms; the default request timeout is 500 ms.
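The detection rule above can be sketched as follows (the class and method names are illustrative, not WeChat’s actual code; the 20 ms threshold and the 1 s / 2,000‑request window come from the text):

```python
import time

OVERLOAD_THRESHOLD_MS = 20   # average queuing-time threshold
WINDOW_REQUESTS = 2000       # window size in requests
WINDOW_SECONDS = 1.0         # or in seconds, whichever elapses first

class QueueingTimeDetector:
    """Declares overload when the average queuing time of requests in
    the current window exceeds the threshold. The window closes every
    1 s or every 2,000 requests, whichever comes first."""
    def __init__(self):
        self.samples = []
        self.window_start = time.monotonic()

    def record(self, queuing_time_ms):
        """Returns True/False when a window closes, None otherwise."""
        self.samples.append(queuing_time_ms)
        now = time.monotonic()
        if (len(self.samples) >= WINDOW_REQUESTS
                or now - self.window_start >= WINDOW_SECONDS):
            overloaded = (sum(self.samples) / len(self.samples)
                          > OVERLOAD_THRESHOLD_MS)
            self.samples.clear()
            self.window_start = now
            return overloaded
        return None  # window not complete yet
```

Note that only queuing time is sampled, never downstream response time, so a slow dependency does not make this server look overloaded.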
4.2 Service Admission Control
4.2.1 Business‑oriented Admission
Requests receive a business priority (e.g., login > payment > instant messaging). All downstream requests inherit this priority, and low‑priority requests are shed first during overload.
4.2.2 User‑oriented Admission
User priority is generated by a per‑hour hash of the user ID, providing fairness across users while allowing occasional priority changes.
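A minimal sketch of the per‑hour hashing idea (the function name, SHA‑256 choice, and the 128‑bucket count are illustrative assumptions, not WeChat’s actual parameters):

```python
import hashlib

def user_priority(user_id: str, hour_epoch: int, num_levels: int = 128) -> int:
    """Derive a user priority from a hash of the user ID salted with the
    current hour: stable within an hour, reshuffled across hours.
    num_levels=128 is an illustrative bucket count."""
    digest = hashlib.sha256(f"{user_id}:{hour_epoch}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_levels
```

Because every service computes the same hash for the same hour, a given user is consistently admitted or rejected across the whole request path, which is what makes combinatorial overload tractable.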
4.2.3 Session‑oriented Admission
Similar to user‑oriented, but based on session ID. In practice, users can game the system by logging out and back in, so DAGOR prefers user‑oriented admission.
4.2.4 Adaptive Admission
DAGOR adjusts the composite admission level (B, U) based on load. It maintains a histogram of accepted requests per (B,U) pair and uses a binary‑search‑like algorithm to find the highest level that keeps accepted requests below the expected load.
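The histogram‑driven level adjustment can be sketched as a scan over composite levels (a simplification: this linear scan finds the same cut‑off that the deployed binary‑search‑style algorithm locates more cheaply; names are illustrative):

```python
def next_admission_level(histogram, expected_load):
    """Pick the admission level at which to start shedding: the first
    composite (B, U) level, scanning from highest priority (smallest
    tuple) downward, whose cumulative admitted-request count would
    exceed expected_load. Returns None if everything fits."""
    admitted = 0
    for level in sorted(histogram):      # tuples sort priority-first
        if admitted + histogram[level] > expected_load:
            return level                 # shed this level and below
        admitted += histogram[level]
    return None                          # capacity for everything
```

For example, with 50 recent requests at (1, 1), 60 at (1, 2), and 40 at (2, 1), an expected load of 100 cuts off at (1, 2), since admitting that level would push the total to 110.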
4.2.5 Collaborative Admission
Downstream servers attach their current admission level to responses; upstream servers use this information to locally reject requests that would be dropped downstream, reducing wasted network traffic.
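A sketch of the upstream side of this piggybacking (the class, method names, and the "smaller-or-equal tuple is admitted" convention are illustrative assumptions):

```python
class DownstreamLevelCache:
    """Early-rejection cache: remembers the admission level each
    downstream service piggybacked on its last response, and drops
    requests locally that the downstream would shed anyway."""
    def __init__(self):
        self.levels = {}                      # service -> (B, U) threshold

    def on_response(self, service, admission_level):
        """Record the level attached to a downstream response."""
        self.levels[service] = admission_level

    def should_send(self, service, request_level):
        """Send only if the request would pass downstream admission."""
        threshold = self.levels.get(service)
        if threshold is None:
            return True                       # no feedback yet: send
        return request_level <= threshold     # smaller tuple = higher priority
```

Rejecting locally costs one table lookup instead of a network round trip that is doomed to fail, which is where the bandwidth and CPU savings come from.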
4.3 Overload Control Workflow
1. A client request arrives at an entry service, which assigns its business and user priorities.
2. The service invokes downstream services, propagating the priorities.
3. Each service applies admission control based on its current admission level and adjusts that level periodically.
4. Downstream responses carry the responder’s current admission level.
5. Upstream services update their records with the received level.
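The tagging step at the entry service can be sketched as follows (the table contents, field names, and 128 user‑priority buckets are hypothetical; only the ordering login > payment > messaging comes from the text):

```python
# Hypothetical sketch of entry-service tagging: a request is tagged once
# with (business priority, user priority) and every downstream call
# reuses the same tag, so admission decisions are consistent path-wide.
BUSINESS_PRIORITY = {"login": 1, "payment": 2, "message": 3}  # smaller = higher

def tag_request(request: dict, current_hour: int) -> dict:
    B = BUSINESS_PRIORITY[request["action"]]
    U = hash((request["user_id"], current_hour)) % 128  # per-hour user hash
    request["priority"] = (B, U)                        # inherited downstream
    return request
```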
5. Evaluation
5.1 Experimental Setup
Experiments run on an internal cluster (Intel Xeon E5‑2698, 64 GB RAM, 10 GbE). Workloads simulate an encryption service (M) that saturates at ~750 QPS and a messaging service (A) that calls M multiple times. Four workload types (M1–M4) represent simple and combinatorial overload.
5.2 Overload Detection
DAGOR‑q (queue‑time based) is compared with DAGOR‑r (response‑time based). Results show DAGOR‑q delays shedding until higher input rates, avoiding false overload alarms.
5.3 Service Admission Control
DAGOR is compared with CoDel, SEDA, and a random shedding baseline. Under combinatorial overload (M2–M4), DAGOR achieves ~50 % higher success rates and approaches the theoretical optimum.
5.4 Fairness
Using a mixed workload of M1–M4, DAGOR maintains similar success rates across overload types, while CoDel favors simple overload, demonstrating DAGOR’s fairness.
6. Related Work
Prior overload control research focuses on databases, stream processing, sensor networks, and web services, which target single‑service architectures. DAGOR is the first to address overload control for large‑scale microservice systems. Compared with Varys, Baraat, session‑based admission, and hierarchical overload control, DAGOR is service‑agnostic, non‑intrusive, and fully decentralized.
7. Conclusion
DAGOR provides a lightweight, service‑agnostic, independent‑yet‑collaborative, efficient, and fair overload control solution for massive microservice deployments. Deployed in WeChat for over five years, it has proven effective and offers design insights for other large‑scale systems.
Lessons Learned:
Overload control must be decentralized and autonomous per service.
Algorithms should combine multiple feedback signals rather than rely on open‑loop heuristics.
Effective designs stem from thorough analysis of real‑world workloads.
Acknowledgments
We thank the anonymous reviewers for their valuable feedback.
Footnotes
Examples of account services: login, personalization, contacts, message inbox.
Forms 2 and 3 also appear in GFS‑style systems where large files are split across storage servers.
Response time is defined as the interval between request arrival and response transmission.
Operation‑priority hash tables are rarely modified (≈once or twice per year).
Earlier APIs allowing developers to set service‑specific business priorities proved hard to manage and were removed.
Unless otherwise noted, “admission level” refers to the composite (B,U) level.
Ordering of admission levels: (B1,U1) < (B2,U2) if B1 < B2 or B1 = B2 and U1 < U2.
Requests may be retried up to three times after rejection.
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.