
Mastering High‑Availability: Traffic Governance Techniques for Resilient Systems

This article explains how traffic governance—including circuit breaking, isolation, retry, downgrade, timeout, and rate‑limiting—keeps microservice architectures highly available, performant, and easy to scale by balancing load, preventing cascading failures, and optimizing resource usage.

dbaplus Community

Jun 10, 2024


Background and Definition of Availability

High availability (HA) is commonly quantified as Availability = MTBF / (MTBF + MTTR) × 100%, where MTBF (Mean Time Between Failures) is the average time the system runs between failures and MTTR (Mean Time To Repair) is the average time needed to restore service after a failure. A larger MTBF and a smaller MTTR yield higher availability.
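As a quick worked example of the formula, the sketch below computes availability and the yearly downtime it implies. The function names and the sample figures (999 hours between failures, 1 hour to repair) are illustrative assumptions, not values from the O2 system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail: float) -> float:
    """Convert an availability fraction into expected downtime per 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60


if __name__ == "__main__":
    a = availability(mtbf_hours=999.0, mttr_hours=1.0)
    print(f"availability: {a:.3%}")
    print(f"downtime per year: {yearly_downtime_minutes(a):.1f} minutes")
```

Halving MTTR in this example removes roughly half of the downtime, which is why fast recovery matters as much as avoiding failures.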

The article uses Tencent’s internal O2 advertising‑delivery system as a concrete example. O2 aims for three "highs"—high performance, high availability, and easy scalability—similar to the three health metrics for humans.

Purpose of Traffic Governance

Traffic governance ensures that data flow remains balanced and efficient, acting like a nutritionist for system health. Its goals include:

Optimising network performance through load balancing and traffic distribution.

Guaranteeing service‑level quality by prioritising critical traffic.

Providing fault tolerance and resilience via dynamic routing and failover.

Enhancing security with traffic encryption, access control, and intrusion detection.

Improving cost‑effectiveness by reducing bandwidth consumption and overall system load.

Key Mechanisms

1. Circuit Breaker

Two main designs are discussed:

Traditional circuit breaker : Three states—Closed, Open, Half‑Open. When the error‑rate threshold is exceeded, the breaker opens, sleeps for a configurable period, then enters Half‑Open to probe the downstream service.

Google SRE circuit breaker : Uses client‑side adaptive throttling. The client tracks requests (all attempts, including those it rejects locally) and accepts (requests the backend accepted) over a sliding window. Once requests exceeds K * accepts, the client starts dropping new requests locally with probability p = max(0, (requests - K * accepts) / (requests + 1)). The multiplier K (commonly 2) controls aggressiveness: a lower K throttles sooner, while a higher K lets more traffic through to the backend.
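A minimal sketch of the client‑side adaptive throttling idea follows, using the rejection probability given in the Google SRE book. The class name, the two‑minute default window, and the injectable `clock` and `rng` parameters are assumptions made for illustration and testing.

```python
import random
import time
from collections import deque


class AdaptiveThrottle:
    """Client-side throttle: drop locally once requests outpace K * accepts."""

    def __init__(self, k=2.0, window_seconds=120.0,
                 clock=time.monotonic, rng=random.random):
        self.k = k
        self.window = window_seconds
        self.clock = clock
        self.rng = rng
        self.events = deque()  # (timestamp, accepted) pairs inside the window
        self.requests = 0
        self.accepts = 0

    def _expire(self):
        # Forget events that have slid out of the window.
        cutoff = self.clock() - self.window
        while self.events and self.events[0][0] < cutoff:
            _, accepted = self.events.popleft()
            self.requests -= 1
            self.accepts -= int(accepted)

    def reject_probability(self):
        self._expire()
        excess = self.requests - self.k * self.accepts
        return max(0.0, excess / (self.requests + 1))

    def record(self, accepted):
        """Record the outcome of one attempt (True if the backend accepted it)."""
        self.events.append((self.clock(), accepted))
        self.requests += 1
        self.accepts += int(accepted)

    def allow(self):
        """Return False when the request should be dropped without being sent."""
        if self.rng() < self.reject_probability():
            self.record(accepted=False)  # local rejections still count as requests
            return False
        return True
```

A caller would check `allow()` before sending, then call `record(True)` or `record(False)` depending on whether the backend accepted the request. This sketch is single‑threaded; a shared instance would need a lock.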

Both approaches aim to fail fast and prevent a local hotspot from causing a system‑wide avalanche.
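The traditional three‑state breaker can be sketched as a small state machine. The class below is an illustrative, single‑threaded assumption: it uses cumulative counters rather than a sliding window, and the parameter names (`error_rate_threshold`, `min_calls`, `sleep_seconds`) are hypothetical.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, error_rate_threshold=0.5, min_calls=10,
                 sleep_seconds=5.0, clock=time.monotonic):
        self.threshold = error_rate_threshold
        self.min_calls = min_calls
        self.sleep_seconds = sleep_seconds
        self.clock = clock
        self.state = self.CLOSED
        self.calls = 0
        self.failures = 0
        self.opened_at = 0.0

    def allow(self):
        """Return True if the caller may send a request downstream."""
        if self.state == self.OPEN:
            if self.clock() - self.opened_at < self.sleep_seconds:
                return False  # still sleeping: fail fast
            self.state = self.HALF_OPEN
            return True  # this caller becomes the single probe
        if self.state == self.HALF_OPEN:
            return False  # a probe is already in flight
        return True

    def record(self, success):
        """Report the outcome of a request that allow() let through."""
        if self.state == self.HALF_OPEN:
            if success:
                self._reset()  # probe succeeded: close the breaker
            else:
                self._trip()   # probe failed: open again and sleep
            return
        if self.state == self.OPEN:
            return  # late result from before the trip; ignore it
        self.calls += 1
        if not success:
            self.failures += 1
        if (self.calls >= self.min_calls
                and self.failures / self.calls > self.threshold):
            self._trip()

    def _trip(self):
        self.state = self.OPEN
        self.opened_at = self.clock()

    def _reset(self):
        self.state = self.CLOSED
        self.calls = 0
        self.failures = 0
```

The `min_calls` guard keeps a single early failure from opening the breaker before there is enough data to estimate an error rate.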

2. Isolation

Isolation limits the blast radius of failures. Common strategies include:

Static vs. dynamic content isolation : Separate handling of dynamic data (real‑time queries, DB reads) and static assets (images, CSS, JS).

Read/Write isolation (CQRS) : Separate services for reads and writes, allowing independent scaling.

Core vs. non‑core isolation : Prioritise resources for core business services.

Hotspot isolation : Cache or route high‑frequency data separately.

User isolation : Partition tenants so a failure affects only a subset of users.

Process, thread, cluster, and data‑center isolation : Use containers, separate thread pools, dedicated clusters, or geographically distinct data centers to contain faults.
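Thread‑level isolation is often implemented as a bulkhead: each dependency gets its own bounded pool of concurrency so that one slow dependency cannot consume every worker. The sketch below uses a semaphore for this; the `Bulkhead` and `BulkheadFull` names are hypothetical.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a compartment has no free slots left."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing, so callers fail fast
        # and a saturated dependency cannot tie up more threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"bulkhead '{self.name}' is saturated")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Separate compartments for core and non-core work (limits are illustrative).
core_calls = Bulkhead("core", max_concurrent=50)
report_calls = Bulkhead("reports", max_concurrent=5)
```

With this layout, a stalled reporting backend can occupy at most five concurrent callers, while core traffic keeps its own fifty slots.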

3. Retry

Retry mitigates transient network glitches but must be carefully controlled to avoid amplification:

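Perception of errors : Distinguish client‑side errors (4xx) from server‑side errors (5xx) and network‑level failures such as timeouts or connection resets.

Retry decision : Skip retries for non‑idempotent operations and for client‑error responses, since repeating them either risks duplicate side effects or fails the same way again.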

Retry strategy : Choose between synchronous (immediate retry) and asynchronous (queue‑based) retries.

Maximum attempts : Prevent infinite loops; typical limits are 2‑3 attempts.

Back‑off policies : Linear, linear‑with‑jitter, exponential, exponential‑with‑jitter, and exponential‑with‑jitter‑plus‑random (gRPC style). The article provides a gRPC back‑off pseudocode example:

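ConnectWithBackoff() {
  // Start with the initial back-off; the first wait carries no jitter.
  current_backoff = INITIAL_BACKOFF;
  current_deadline = now() + INITIAL_BACKOFF;
  // Each attempt is given at least MIN_CONNECT_TIMEOUT to complete.
  while (TryConnect(max(current_deadline, now() + MIN_CONNECT_TIMEOUT)) != SUCCESS) {
    SleepUntil(current_deadline);
    // Grow the back-off geometrically, capped at MAX_BACKOFF.
    current_backoff = min(current_backoff * MULTIPLIER, MAX_BACKOFF);
    // Spread the next deadline by +/- JITTER * current_backoff so clients do not retry in lockstep.
    current_deadline = now() + current_backoff + UniformRandom(-JITTER * current_backoff, JITTER * current_backoff);
  }
}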

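In this pseudocode, INITIAL_BACKOFF is the wait after the first failure, MULTIPLIER is the factor by which the back‑off grows after each failed attempt, JITTER is the fraction by which each back‑off is randomised, MAX_BACKOFF is the upper bound on the back‑off, and MIN_CONNECT_TIMEOUT is the minimum time allowed for a single connection attempt.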
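The same back‑off schedule can be applied to request retries with a bounded number of attempts. The sketch below is an adaptation, not gRPC's implementation: the constants are the values proposed in gRPC's connection back‑off document, `RetryableError` and `call_with_retry` are hypothetical names, and MIN_CONNECT_TIMEOUT is omitted because it only applies to connection attempts.

```python
import random
import time

INITIAL_BACKOFF = 1.0   # seconds
MULTIPLIER = 1.6
JITTER = 0.2
MAX_BACKOFF = 120.0     # seconds


class RetryableError(Exception):
    """Hypothetical marker for transient failures that are safe to retry."""


def backoff_delays(max_attempts, uniform=random.uniform):
    """Yield the wait before each retry (max_attempts - 1 values in total)."""
    backoff = INITIAL_BACKOFF
    delay = INITIAL_BACKOFF  # first wait has no jitter, as in the pseudocode
    for _ in range(max_attempts - 1):
        yield delay
        backoff = min(backoff * MULTIPLIER, MAX_BACKOFF)
        delay = backoff + uniform(-JITTER * backoff, JITTER * backoff)


def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn, retrying only on RetryableError, up to max_attempts times."""
    delays = backoff_delays(max_attempts)
    while True:
        try:
            return fn()
        except RetryableError:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error
            sleep(delay)
```

Any exception other than `RetryableError` propagates immediately, which is where the earlier retry decision (no retries for client errors or non‑idempotent calls) would be enforced.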

4. Downgrade (Degrade)

When load exceeds capacity, downgrade disables or simplifies non‑critical features to preserve core functionality. Strategies include:

Automatic downgrade : Triggered by measurable thresholds (e.g., error rate, latency).

Manual downgrade : Human‑initiated based on business impact.

Execution flow : Gradual reduction from shallow to deep impact, with clear rollback procedures.

Downgrade differs from rate limiting: downgrade sacrifices functionality, while rate limiting sacrifices traffic volume.
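One way to picture automatic and manual downgrade working together is a function that maps health metrics to a downgrade level, with a manual override taking precedence. Everything in this sketch is an assumption for illustration: the level names, the thresholds, and the `HealthSnapshot` type are not from the original article.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical downgrade levels, ordered from shallow to deep impact.
LEVELS = ("normal", "hide_recommendations", "serve_cached_results")


@dataclass
class HealthSnapshot:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def pick_level(health: HealthSnapshot, manual_override: Optional[int] = None) -> int:
    """Return an index into LEVELS; a manual override wins over automatic rules."""
    if manual_override is not None:
        return max(0, min(manual_override, len(LEVELS) - 1))
    # Illustrative thresholds: deeper downgrade for worse health.
    if health.error_rate > 0.20 or health.p99_latency_ms > 2000:
        return 2
    if health.error_rate > 0.05 or health.p99_latency_ms > 800:
        return 1
    return 0
```

Rolling back is then a matter of the metrics recovering or the override being cleared, at which point the function returns to level 0.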

5. Timeout

Timeouts prevent long‑running requests from exhausting resources. Two main approaches:

Fixed timeout : Static threshold per RPC.

EMA‑based dynamic timeout : Uses an Exponential Moving Average of response times to adapt the timeout. If the EMA exceeds a hard limit Thwm, the dynamic timeout Tdto shrinks toward Thwm; otherwise it can grow up to a maximum Tmax. The algorithm is open‑sourced at github.com/jiamao/ema-timeout.
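The following sketch illustrates the general shape of an EMA‑driven timeout. It is not the formula from the referenced repository: the linear interpolation between Thwm and Tmax, the smoothing factor `alpha`, and the choice to start at Tmax before any samples arrive are all simplifying assumptions.

```python
class EmaTimeout:
    """Adapt a timeout between thwm_ms and tmax_ms from an EMA of latencies."""

    def __init__(self, thwm_ms, tmax_ms, alpha=0.2):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        if thwm_ms <= 0 or tmax_ms < thwm_ms:
            raise ValueError("require 0 < thwm_ms <= tmax_ms")
        self.thwm_ms = thwm_ms
        self.tmax_ms = tmax_ms
        self.alpha = alpha
        self.ema_ms = None

    def observe(self, latency_ms):
        """Fold one observed response time into the moving average."""
        if self.ema_ms is None:
            self.ema_ms = float(latency_ms)
        else:
            self.ema_ms = self.alpha * latency_ms + (1.0 - self.alpha) * self.ema_ms

    def timeout_ms(self):
        """Return the timeout to use for the next request."""
        if self.ema_ms is None:
            return self.tmax_ms  # assumption: be lenient until data arrives
        if self.ema_ms >= self.thwm_ms:
            return self.thwm_ms  # backend is slow on average: clamp to Thwm
        # Assumed interpolation: the faster the backend, the more headroom
        # is allowed for occasional slow requests, up to Tmax.
        headroom = 1.0 - self.ema_ms / self.thwm_ms
        return self.thwm_ms + (self.tmax_ms - self.thwm_ms) * headroom
```

The intent is that a healthy backend tolerates occasional slow responses, while a backend that is slow on average is cut off at Thwm so callers do not pile up.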

Timeout propagation across RPC chains is essential: each downstream service should respect the remaining time budget to avoid wasteful work.
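Timeout propagation can be sketched as a deadline object that is created once at the entry point and consulted before every downstream call. The `Deadline` and `DeadlineExceeded` names are hypothetical; RPC frameworks with built‑in deadline support carry the same information in request metadata.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when no time budget is left for further downstream work."""


class Deadline:
    def __init__(self, budget_seconds, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the overall budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_call(self, per_call_cap):
        """Timeout for the next downstream call: the smaller of cap and budget."""
        remaining = self.remaining()
        if remaining <= 0.0:
            raise DeadlineExceeded("request budget exhausted; skip the call")
        return min(per_call_cap, remaining)
```

A handler would create `Deadline(1.0)` on arrival and pass `deadline.timeout_for_call(0.5)` to each client call, so a service late in the chain never starts work the original caller has already given up on.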

6. Rate Limiting

Rate limiting protects services from overload and controls user behaviour. Two categories:

Client‑side limiting : Quotas assigned by the provider, often enforced with token‑bucket or leaky‑bucket algorithms.

Server‑side limiting : The service drops or delays excess requests based on resource usage, success rate, or latency. Implementations include sliding‑window, token‑bucket, and algorithms used by Sentinel (BBR‑style) and WeChat backend (average queue‑time threshold).
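Of the algorithms mentioned above, the token bucket is the simplest to show in a few lines. This is a minimal single‑threaded sketch; the class and parameter names are assumptions, and a production limiter would add locking and usually keep one bucket per client or API key.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Consume `cost` tokens if available; return False to reject the request."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The bucket permits short bursts while enforcing the long‑run average rate, whereas a leaky bucket smooths output to a constant rate.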

Summary of Strategies

Circuit Breaker : Traditional and Google SRE models to stop cascading failures.

Isolation : Physical and logical separation (static/dynamic, read/write, core/non‑core, hotspot, tenant, process, thread, cluster, data‑center).

Retry : Synchronous/asynchronous with back‑off policies, avoiding retry storms.

Downgrade : Automatic and manual, balancing user experience vs. system load.

Timeout : Fixed and EMA‑based dynamic timeouts with proper propagation.

Rate Limiting : Client‑side and server‑side mechanisms to keep the system stable under burst traffic.

By combining these techniques, a system can achieve the "three‑high" goals—high performance, high availability, and easy scalability—while remaining resilient to network fluctuations, component failures, and traffic spikes.


