
Mastering Traffic Governance: From Circuit Breakers to Rate Limiting for High‑Availability Systems

This article explains how traffic governance—through circuit breaking, isolation, retry strategies, degradation, timeout handling, and rate limiting—keeps distributed systems highly available, performant, and scalable, using concrete examples, formulas, and best‑practice patterns for modern microservice architectures.

Sanyou's Java Diary

1. Availability Definition

Availability is calculated as Availability = MTBF / (MTBF + MTTR) × 100%, where MTBF (Mean Time Between Failures) is the average time between failures and MTTR (Mean Time To Repair) is the average time needed to recover. A longer MTBF and a shorter MTTR yield higher overall availability.
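Plugging sample numbers into the formula makes the trade-off concrete. The figures below (one failure every 30 days, a 30-minute recovery) are illustrative, not from the article:

```java
public class Availability {
    // Availability = MTBF / (MTBF + MTTR), expressed as a percentage
    public static double availability(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours) * 100.0;
    }

    public static void main(String[] args) {
        // Example: one failure every 720 hours (30 days), 0.5 hours to recover
        System.out.printf("%.4f%%%n", availability(720, 0.5)); // prints 99.9306%
    }
}
```

Halving MTTR (e.g. through faster detection and automated rollback) moves availability up just as effectively as doubling MTBF.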

2. Purpose of Traffic Governance

Traffic governance ensures balanced and efficient data flow, improves system adaptability to network conditions and failures, and protects service continuity.

3. Traffic Governance Techniques

3.1 Circuit Breaker

Three states: Closed (normal traffic, counting successes/failures), Open (immediate failure response), and Half‑Open (limited trial traffic). Traditional circuit breakers switch to Open when the error rate exceeds a threshold, move to Half‑Open after a sleep period, and return to Closed once trial requests succeed.
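The three-state machine can be sketched in a few lines. The failure threshold and sleep window below are illustrative defaults, not values from the article:

```java
// Minimal sketch of the three circuit-breaker states; thresholds are illustrative.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private final int failureThreshold = 5;        // failures before tripping
    private final long sleepWindowMillis = 10_000; // how long to stay OPEN

    public synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN;               // let a trial request through
        }
        return state != State.OPEN;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;                      // trial succeeded: close again
    }

    public synchronized void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;                    // trip: fail fast
            openedAt = nowMillis;
        }
    }
}
```

Time is passed in explicitly so the state machine is easy to unit-test; production libraries typically read the clock internally and use a sliding error-rate window rather than a consecutive-failure counter.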

Google SRE introduces client‑side adaptive throttling: when requests > K × accepts, the client starts dropping requests locally with probability p, computed as:

p = max(0, (requests − K × accepts) / (requests + 1))

Adjusting K makes the algorithm more aggressive (lower K) or more conservative (higher K).
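The adaptive-throttling formula translates directly into code. The sliding-window bookkeeping (Google SRE uses roughly a two-minute window) is omitted here; the counters simply accumulate:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of Google SRE client-side adaptive throttling:
// p = max(0, (requests - K * accepts) / (requests + 1))
public class AdaptiveThrottler {
    private final double k;
    private long requests; // all requests the client attempted
    private long accepts;  // requests the backend actually accepted

    public AdaptiveThrottler(double k) { this.k = k; }

    public synchronized double rejectionProbability() {
        return Math.max(0, (requests - k * accepts) / (requests + 1));
    }

    // True if the client should drop this request locally, without
    // bothering the overloaded backend.
    public synchronized boolean shouldReject() {
        return ThreadLocalRandom.current().nextDouble() < rejectionProbability();
    }

    public synchronized void record(boolean accepted) {
        requests++;
        if (accepted) accepts++;
    }
}
```

With K = 2 a healthy backend (accepts ≈ requests) yields p = 0; as the accept rate drops below 1/K, the client sheds load before it ever leaves the process.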

Circuit breaker state diagram

3.2 Isolation

Isolation limits the impact of a single service failure. Common strategies include:

Static/Dynamic Isolation: separate static resources (images, CSS) from dynamic services.

Read/Write Isolation (CQRS): separate read and write workloads into different services or databases.

Core/Non‑Core Isolation: prioritize resources for critical business services.

Hotspot Isolation: cache frequently accessed data to reduce backend pressure.

User Isolation: route tenants to dedicated service instances.

Process, Thread, Cluster, and Data‑Center Isolation: use containers, thread pools, separate clusters, or different data centers to contain failures.
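Thread-level isolation (the "bulkhead" pattern) is the easiest of these to show in code: each downstream dependency gets its own bounded pool, so a slow dependency can saturate only its own threads. The pool sizes and service names below are illustrative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Thread-pool (bulkhead) isolation sketch: one bounded executor per
// dependency, so a core service never competes with a non-core one.
public class Bulkhead {
    private final ExecutorService orderPool  = newBoundedPool(8); // core service
    private final ExecutorService reportPool = newBoundedPool(2); // non-core service

    private static ExecutorService newBoundedPool(int threads) {
        return new ThreadPoolExecutor(threads, threads, 0, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(32),           // bounded queue: no unbounded backlog
                new ThreadPoolExecutor.AbortPolicy());  // reject fast when saturated
    }

    public <T> Future<T> callOrderService(Callable<T> task)  { return orderPool.submit(task); }
    public <T> Future<T> callReportService(Callable<T> task) { return reportPool.submit(task); }
}
```

The bounded queue plus AbortPolicy is what makes this a bulkhead rather than just a pool: when the reporting dependency hangs, its 2 threads and 32 queue slots fill up and further calls fail immediately, while order traffic is untouched.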

Isolation strategies diagram

3.3 Retry

Retry improves reliability but must be controlled to avoid retry amplification. A typical flow: detect the error, decide whether it is retryable (do not retry client errors such as 4xx), apply a retry policy (interval, maximum count), and optionally hedge by sending parallel requests and using the first response.

Synchronous retry: immediate re‑attempt on failure.

Asynchronous retry: enqueue failed requests for background processing.

Backoff strategies: linear, linear + jitter, exponential, exponential + jitter.
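A synchronous retry with exponential backoff plus jitter can be sketched as follows. Treating only RuntimeException as retryable is an assumption for illustration; real code would classify errors explicitly (and never retry 4xx-style client errors):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Synchronous retry sketch: exponential backoff plus full jitter.
public class Retry {
    public static <T> T withBackoff(Callable<T> call, int maxAttempts,
                                    long baseDelayMillis) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (RuntimeException e) {  // treated as retryable (assumption)
                if (attempt >= maxAttempts) throw e;
                long backoff = baseDelayMillis << (attempt - 1);       // 1x, 2x, 4x, ...
                long jitter = ThreadLocalRandom.current().nextLong(backoff + 1);
                Thread.sleep(backoff + jitter);                        // spread out retries
            }
        }
    }
}
```

The jitter term is what prevents synchronized retry storms: without it, every client that failed at the same instant retries at the same instant, re-creating the original load spike.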

Retry flow diagram

3.4 Degradation

Degradation sacrifices non‑critical functionality to preserve core services under overload. Strategies include automatic degradation based on error thresholds and manual degradation with graded impact levels.
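Automatic degradation based on an error threshold might look like the sketch below. The service name, threshold, and static fallback are all illustrative assumptions:

```java
// Automatic degradation sketch: once the observed error rate crosses a
// threshold, serve a cheap fallback instead of calling the real service.
public class Degrader {
    private int calls, errors;
    private final double errorRateThreshold = 0.5; // illustrative threshold
    private final int minSamples = 10;             // avoid deciding on tiny samples

    public synchronized String recommend(String userId) {
        if (calls >= minSamples && (double) errors / calls > errorRateThreshold) {
            return fallback();                     // degraded: skip the real call entirely
        }
        try {
            String result = callRecommendationService(userId);
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback();                     // per-request fallback on failure
        }
    }

    private void record(boolean error) { calls++; if (error) errors++; }

    private String fallback() { return "default-popular-items"; } // static fallback content

    // Placeholder for the real downstream call (hypothetical; always fails here).
    protected String callRecommendationService(String userId) {
        throw new RuntimeException("service unavailable");
    }
}
```

Manual degradation is usually layered on top of this: an operator-controlled switch forces the fallback path for a whole grade of non-core features at once.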

Degradation decision flow

3.5 Timeout

Timeout prevents long‑running requests from exhausting resources. Two main strategies:

Fixed timeout: static threshold per RPC.

EMA dynamic timeout: adjust the timeout based on an exponential moving average of response times, with upper bound Thwm and elastic limit Tmax.
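One way the EMA dynamic timeout can be sketched: clamp to Thwm when the latency EMA is degraded, and grant extra elastic budget toward Tmax while latency is healthy. The smoothing factor alpha and the linear headroom rule are assumptions for illustration; Thwm and Tmax follow the article's naming:

```java
// EMA-based dynamic timeout sketch. Thwm is the high-water-mark timeout;
// Tmax is the elastic upper bound granted while latencies stay healthy.
public class EmaTimeout {
    private final double alpha = 0.2; // EMA smoothing factor (assumption)
    private final long thwm;          // high-water-mark timeout (ms)
    private final long tmax;          // elastic maximum timeout (ms)
    private double ema = -1;          // -1 = no samples observed yet

    public EmaTimeout(long thwm, long tmax) { this.thwm = thwm; this.tmax = tmax; }

    public synchronized void observe(long latencyMillis) {
        ema = (ema < 0) ? latencyMillis : alpha * latencyMillis + (1 - alpha) * ema;
    }

    public synchronized long currentTimeoutMillis() {
        if (ema < 0) return thwm;     // no data: use the static threshold
        if (ema >= thwm) return thwm; // latency degraded: clamp hard
        // Healthy latency: grant elastic budget between Thwm and Tmax,
        // proportionally larger the further the EMA sits below Thwm.
        double headroom = 1 - ema / thwm;
        return Math.round(thwm + headroom * (tmax - thwm));
    }
}
```

The point of the elastic budget is to tolerate occasional slow-but-valid requests while the system is otherwise healthy, without ever letting a degraded system hold connections open for longer than Thwm.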

Timeout propagation ensures downstream services respect the remaining time budget.

Timeout propagation diagram

3.6 Rate Limiting

Rate limiting protects services from overload. Two categories:

Client‑side limiting: each caller respects a quota, often using token‑bucket or leaky‑bucket algorithms.

Server‑side limiting: the service drops or delays excess requests based on resource usage, success rate, or queue length. Implementations include Sentinel's BBR‑like algorithm and WeChat's queue‑time‑based throttling.
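The token bucket mentioned above is small enough to show in full. Capacity and refill rate below are illustrative; time is injected so the refill logic is testable:

```java
// Minimal token-bucket rate limiter sketch (client-side limiting).
// Tokens refill continuously; a burst up to `capacity` is allowed.
public class TokenBucket {
    private final double capacity;        // maximum burst size
    private final double refillPerMillis; // tokens added per millisecond
    private double tokens;
    private long lastRefill;

    public TokenBucket(double capacity, double tokensPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMillis = tokensPerSecond / 1000.0;
        this.tokens = capacity;           // start full: allow an initial burst
        this.lastRefill = nowMillis;
    }

    // Returns true if this request may proceed; false means it is rate-limited.
    public synchronized boolean tryAcquire(long nowMillis) {
        tokens = Math.min(capacity, tokens + (nowMillis - lastRefill) * refillPerMillis);
        lastRefill = nowMillis;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }
}
```

A leaky bucket differs only in what it smooths: it drains requests at a fixed rate and so forbids bursts entirely, whereas the token bucket lets short bursts through as long as the long-run average stays within the quota.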

Rate limiting algorithms overview

4. Summary

Combining circuit breaking, isolation, retry, degradation, timeout, and rate limiting creates a resilient, high‑performance, and scalable system that maintains high availability even under adverse network conditions and traffic spikes.

Designing for failure—from fault detection to graceful fallback—ensures continuous service delivery and a superior user experience.

Tags: microservices, High Availability, system design, Retry, Rate Limiting, circuit breaker, traffic governance