
Mastering High Availability: Traffic Governance, Circuit Breakers, Isolation, Retries, Timeouts and Rate Limiting

This article explains how to achieve the three‑high goals of high performance, high availability and high scalability in microservice systems by using traffic governance techniques such as circuit breaking, isolation strategies, retry mechanisms, timeout controls, degradation tactics and rate limiting, illustrated with practical examples and diagrams.

Sanyou's Java Diary

Guide

In human health, the "three highs" (high blood pressure, high blood sugar, high cholesterol) are dangerous, while in computing the "three highs" (high performance, high availability, high scalability) are the ultimate health goals. This article shows how traffic governance helps keep a system healthy.

Table of Contents

1. Definition of Availability

2. Purpose of Traffic Governance

3. Traffic Governance Methods

4. Summary

1. Definition of Availability

Taking the O2 advertising system as an example, availability is the proportion of time a service is operational. The standard formula is Availability = MTBF / (MTBF + MTTR) × 100%, where MTBF is Mean Time Between Failures and MTTR is Mean Time To Repair.

Longer MTBF and shorter MTTR lead to higher availability.
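As a quick sanity check, the formula can be computed directly; this is a minimal sketch with illustrative numbers:

```java
public class Availability {
    // Availability = MTBF / (MTBF + MTTR) * 100%
    static double availability(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours) * 100.0;
    }

    public static void main(String[] args) {
        // e.g. a service that runs 720 h between failures and takes 0.5 h to repair
        System.out.printf("%.3f%%%n", availability(720, 0.5));
    }
}
```

Halving MTTR (faster repair) improves availability just as much as doubling MTBF (fewer failures), which is why monitoring and fast rollback matter as much as defect prevention.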

2. Purpose of Traffic Governance

Traffic governance balances and optimizes data flow, improves system adaptability to network conditions and failures, and ensures continuous, efficient service.

Network performance optimization: load balancing, resource utilization, latency reduction.

Service quality assurance: prioritize critical traffic.

Fault tolerance and resilience: dynamic routing, traffic redirection, self‑recovery.

Security: traffic encryption, access control, intrusion detection.

Cost efficiency: reduce bandwidth usage and related costs.

3. Traffic Governance Methods

3.1 Circuit Breaker

Traditional circuit breakers have three states: Closed (normal operation), Open (reject all requests immediately), and Half‑Open (let a limited number of test requests through). When the failure rate exceeds a threshold, the breaker opens; after a sleep window it moves to Half‑Open and probes the service, closing again only if the probes succeed.
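The three-state machine can be sketched as follows; the class and threshold names are illustrative, not taken from any particular library:

```java
import java.time.Duration;
import java.time.Instant;

public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failures = 0;
    private final int failureThreshold;  // consecutive failures before tripping
    private final Duration sleepWindow;  // how long to stay OPEN before probing
    private Instant openedAt;

    CircuitBreaker(int failureThreshold, Duration sleepWindow) {
        this.failureThreshold = failureThreshold;
        this.sleepWindow = sleepWindow;
    }

    synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            // After the sleep window, let one probe request through (HALF_OPEN)
            if (Instant.now().isAfter(openedAt.plus(sleepWindow))) {
                state = State.HALF_OPEN;
                return true;
            }
            return false; // still OPEN: reject fast
        }
        return true; // CLOSED or HALF_OPEN
    }

    synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED; // a successful probe closes the breaker
    }

    synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN; // a failed probe, or too many failures, trips it
            openedAt = Instant.now();
        }
    }

    State state() { return state; }
}
```

Callers wrap each remote call in `allowRequest()` / `onSuccess()` / `onFailure()`; production libraries add failure-rate windows rather than simple counters, but the state transitions are the same.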

The Google SRE approach replaces the hard open/closed switch with client‑side adaptive throttling: the client rejects requests probabilistically based on the recent accept rate, so a small portion of traffic still reaches the service even under heavy failure, letting it recover gradually.

<code>/* Pseudo‑code: reconnect with exponential backoff and jitter */
ConnectWithBackoff()
  current_backoff = INITIAL_BACKOFF
  current_deadline = now() + INITIAL_BACKOFF
  while (TryConnect(max(current_deadline, now() + MIN_CONNECT_TIMEOUT)) != SUCCESS) {
    SleepUntil(current_deadline)
    current_backoff = min(current_backoff * MULTIPLIER, MAX_BACKOFF)
    current_deadline = now() + current_backoff +
      UniformRandom(-JITTER * current_backoff, JITTER * current_backoff)
  }
</code>
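The client-side adaptive throttling itself can be sketched with the rejection-probability formula from the Google SRE book, max(0, (requests − K·accepts) / (requests + 1)); the class below is a minimal illustration with K = 2, ignoring the two-minute sliding window a real implementation would use:

```java
import java.util.concurrent.ThreadLocalRandom;

public class AdaptiveThrottler {
    private final double k;   // multiplier; ~2 balances availability vs backend protection
    private double requests;  // requests attempted by the client (including throttled ones)
    private double accepts;   // requests actually accepted by the backend

    AdaptiveThrottler(double k) { this.k = k; }

    // Probability that the client rejects a request locally, without sending it
    double rejectionProbability() {
        return Math.max(0, (requests - k * accepts) / (requests + 1));
    }

    boolean allowRequest() {
        double p = rejectionProbability();
        requests++; // every attempt counts, throttled or not
        return ThreadLocalRandom.current().nextDouble() >= p;
    }

    void onAccepted() { accepts++; }
}
```

While the backend accepts everything, the rejection probability stays at zero; as accepts fall behind requests, the client sheds an increasing fraction of traffic locally instead of hammering a struggling service.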

3.2 Isolation

Isolation prevents a single service failure from cascading through the system.

Dynamic vs static isolation: separate dynamic (computed) and static (cached) content.

Read‑write isolation: use CQRS to separate read and write services.

Event‑driven isolation: publish events after writes for read services to update.

Core isolation: prioritize core business services in resource allocation.

Hotspot isolation: cache top‑K hot data to reduce pressure on back‑end storage.

User isolation: per‑tenant or per‑group service instances.

Cluster, data‑center, process, thread isolation: deploy services in separate clusters, containers, processes, or thread pools.
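Thread-pool isolation (the "bulkhead" pattern) can be sketched by giving each downstream dependency its own bounded executor, so a hung dependency exhausts only its own pool; the pool sizes and dependency names below are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class Bulkhead {
    // One bounded pool per dependency: a hung "inventory" call cannot
    // consume the threads reserved for "payment".
    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();

    ExecutorService poolFor(String dependency) {
        return pools.computeIfAbsent(dependency, name ->
                new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS,
                        new ArrayBlockingQueue<>(16),            // bounded queue
                        new ThreadPoolExecutor.AbortPolicy()));  // fail fast when saturated
    }

    // Run the task on the dependency's own pool and wait for the result
    <T> T call(String dependency, Callable<T> task) {
        try {
            return poolFor(dependency).submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() {
        pools.values().forEach(ExecutorService::shutdown);
    }
}
```

The bounded queue plus `AbortPolicy` is the key design choice: when one dependency saturates its pool, new calls to it fail immediately instead of queuing unboundedly, while calls to every other dependency proceed untouched.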

3.3 Retry

Retry helps recover from transient network failures but must be carefully controlled to avoid overload.

Retry decision: retry only errors that are likely transient (timeouts, 5xx); client‑side errors (4xx) indicate a bad request and should not be retried.

Retry strategies: fixed interval, linear backoff, exponential backoff, each optionally with jitter to avoid thundering‑herd problems.

Synchronous retry: immediately retry in the calling thread after a failure.

Asynchronous retry: push failed requests to a message queue for later processing.

Retry storm mitigation: limit per‑service retries, use sliding‑window counters, and propagate special status codes so upstream callers stop retrying.

Hedging: fire multiple parallel requests and use the first successful response.
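The backoff strategies above reduce to a delay calculator; this sketch uses "full jitter" (a random delay in [0, cappedBackoff]) to spread retries out, with illustrative base and cap values:

```java
import java.util.concurrent.ThreadLocalRandom;

public class RetryBackoff {
    // Delay for attempt n (0-based): min(cap, base * 2^n)
    static long exponentialDelayMillis(long baseMillis, long capMillis, int attempt) {
        long exp = baseMillis * (1L << Math.min(attempt, 30)); // clamp shift to avoid overflow
        return Math.min(capMillis, exp);
    }

    // Full jitter: pick uniformly in [0, delay] so concurrent clients desynchronize
    static long withFullJitter(long delayMillis) {
        return ThreadLocalRandom.current().nextLong(delayMillis + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            long delay = exponentialDelayMillis(100, 10_000, attempt);
            System.out.println("attempt " + attempt
                    + " -> backoff " + delay + " ms, jittered: " + withFullJitter(delay));
        }
    }
}
```

Without jitter, every client that failed at the same moment retries at the same moment, recreating the spike that caused the failure; the randomization is what breaks the thundering herd.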

3.4 Degradation

Degradation sacrifices non‑essential functionality to preserve core capacity during overload.

Automatic degradation: trigger when failure rates or latency exceed thresholds.

Manual degradation: operator‑driven, with graded impact levels.

Execution: prioritize simple, automated checks, define degradation levels, and practice drills.

Difference from rate limiting: degradation reduces features, rate limiting reduces traffic volume.
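In practice degradation usually comes down to a switch plus a fallback; this is a minimal sketch, assuming a real system would read the switch from a config center rather than a local flag:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public class Degradation {
    // In production this flag would be pushed from a config center by operators
    private final AtomicBoolean degraded = new AtomicBoolean(false);

    void setDegraded(boolean on) { degraded.set(on); }

    // Run the full-feature path unless degraded; on degradation (or failure)
    // serve the cheap fallback instead of erroring out.
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (degraded.get()) return fallback.get();
        try {
            return primary.get();
        } catch (RuntimeException e) {
            return fallback.get(); // automatic degradation on failure
        }
    }
}
```

For example, a recommendation service might degrade from personalized results to a cached best-seller list: the page still renders, just with a simpler feature.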

3.5 Timeout

Timeout prevents long‑running requests from exhausting resources.

Fixed timeout: a static threshold.

EMA dynamic timeout: adjust the timeout based on an exponential moving average of response times; if the average exceeds a threshold, shrink the timeout, otherwise allow longer waits.

Timeouts should be propagated downstream so each service knows the remaining time budget.
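The EMA-based dynamic timeout can be sketched as follows; the smoothing factor, threshold, and bounds are illustrative parameters, not prescribed values:

```java
public class EmaTimeout {
    private final double alpha;        // EMA smoothing factor, e.g. 0.2
    private final double thresholdMs;  // average above this => service is struggling
    private final double minTimeoutMs; // shrunken budget while struggling
    private final double maxTimeoutMs; // generous budget while healthy
    private double emaMs = 0;

    EmaTimeout(double alpha, double thresholdMs, double minTimeoutMs, double maxTimeoutMs) {
        this.alpha = alpha;
        this.thresholdMs = thresholdMs;
        this.minTimeoutMs = minTimeoutMs;
        this.maxTimeoutMs = maxTimeoutMs;
    }

    // Fold each observed response time into the moving average
    void record(double responseMs) {
        emaMs = (emaMs == 0) ? responseMs : alpha * responseMs + (1 - alpha) * emaMs;
    }

    // Healthy service: allow longer waits. Struggling service: cut the budget
    // so slow calls are abandoned quickly instead of piling up.
    double currentTimeoutMs() {
        return emaMs > thresholdMs ? minTimeoutMs : maxTimeoutMs;
    }

    double emaMs() { return emaMs; }
}
```

The counterintuitive part is that the timeout shrinks when the service slows down: waiting longer for a degraded service only ties up more caller threads, which is exactly the resource exhaustion timeouts exist to prevent.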

3.6 Rate Limiting

Rate limiting protects services from sudden traffic spikes.

Client‑side limiting: each caller respects a quota allocated by the callee.

Server‑side limiting: monitor CPU, success rate, latency, queue time; drop or delay requests when overload is detected.

Algorithms: token bucket, leaky bucket, sliding window, each with trade‑offs.
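Of these, the token bucket is the most widely used because it permits short bursts while capping the sustained rate; this is a minimal sketch with illustrative capacity and refill rate:

```java
public class TokenBucket {
    private final double capacity;      // max burst size
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefill;

    TokenBucket(double capacity, double tokensPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = tokensPerSecond / 1_000_000_000.0;
        this.tokens = capacity; // start full: allows an initial burst
        this.lastRefill = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Lazily add the tokens that accrued since the last call, up to capacity
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerNano);
        lastRefill = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // over the limit: caller drops or delays the request
    }
}
```

A leaky bucket would instead drain requests at a strictly constant rate, smoothing bursts away entirely; the choice between them is whether bursts are a feature (token bucket) or a threat (leaky bucket).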

4. Summary

Achieving the three‑high goals requires a combination of circuit breaking, isolation, retry, degradation, timeout and rate‑limiting strategies. Together they form a resilient, scalable, and high‑performance system that can gracefully handle failures and traffic spikes.

High availability is fundamentally “designing for failure”: every component is assumed to fail, and the architecture must provide fault tolerance, self‑recovery, and graceful degradation.

Tags: microservices, High Availability, Rate Limiting, timeout, circuit-breaker, retry strategy, traffic governance
Written by

Sanyou's Java Diary

Passionate about technology, though not great at solving problems; eager to share, never tire of learning!
