
Mastering High Availability: 11 Essential Design Techniques for Scalable Systems

This article outlines eleven practical techniques for designing highly available, resilient architectures for large-scale internet applications: system splitting, decoupling, asynchronous processing, retries, compensation, backup, multi-active deployment, isolation, rate limiting, circuit breaking, and degradation.

Sanyou's Java Diary

Hello, I am Sanyou.

Large-scale internet architecture design rests on four pillars: high concurrency, high performance, high availability, and high scalability.

If you master these four aspects, tackling big‑company interviews and everyday architectural design becomes straightforward.

Today we focus on design techniques for high availability.

1. System Splitting

When a monolithic system grows, a single mistake can cascade into a disaster. In a traditional monolith (for example, an e-commerce platform where membership, product, order, logistics, and marketing all live in one application), a traffic spike in one module can bring down the whole system.

System splitting is therefore a common remedy: microservice architectures separate business domains along DDD boundaries, isolating each service and limiting how far a failure can propagate.

2. Decoupling

The principle of “high cohesion, low coupling” applies from interface abstraction and MVC layers to SOLID principles and the 23 design patterns. Reducing coupling prevents a change in one module from affecting the whole system.

For example, the Open-Closed Principle keeps code open for extension but closed for modification. Spring's AOP (Aspect-Oriented Programming) uses dynamic proxies to intercept method calls, allowing extra logic to run before or after execution without touching the original method.

Event mechanisms (publish/subscribe) also enable non‑intrusive extensions: new features subscribe to events without modifying existing code.
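
To make the publish/subscribe idea concrete, here is a minimal in-process event-bus sketch in plain Java. The EventBus and OrderPaidEvent names are invented for this example; in a Spring application the same pattern is typically implemented with ApplicationEventPublisher and @EventListener.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal in-process publish/subscribe sketch (illustrative names only).
public class EventBus {

    // Subscribers registered for OrderPaidEvent; a new feature adds a
    // subscriber here instead of modifying the order code itself.
    private final List<Consumer<OrderPaidEvent>> subscribers = new ArrayList<>();

    public void subscribe(Consumer<OrderPaidEvent> subscriber) {
        subscribers.add(subscriber);
    }

    public void publish(OrderPaidEvent event) {
        subscribers.forEach(s -> s.accept(event));
    }

    record OrderPaidEvent(String orderId) {}

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        // Existing feature: notify the user when an order is paid.
        bus.subscribe(e -> System.out.println("notify user for order " + e.orderId()));
        // New feature added later: award loyalty points, no change to existing code.
        bus.subscribe(e -> System.out.println("award points for order " + e.orderId()));

        bus.publish(new OrderPaidEvent("order-1001"));
    }
}
```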

3. Asynchronous Processing

Synchronous calls block the thread until a response arrives, reducing efficiency. Asynchronous processing (e.g., thread pools, message queues) lets the thread continue with other work while the response is pending.
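
A minimal sketch of asynchronous processing with a thread pool and CompletableFuture, assuming a slow remote call we do not want to block on (all names below are illustrative):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// The caller hands the slow work to a pool and keeps going instead of
// blocking until the response arrives.
public class AsyncDemo {

    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    static String slowRemoteCall() {
        try {
            Thread.sleep(500); // pretend this is a slow RPC
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "remote result";
    }

    public static void main(String[] args) {
        // Submit the slow call to the pool and attach a callback for the result.
        CompletableFuture<String> future =
                CompletableFuture.supplyAsync(AsyncDemo::slowRemoteCall, POOL)
                        .thenApply(result -> "processed: " + result);

        System.out.println("main thread is free to do other work");

        System.out.println(future.join()); // wait only when the result is actually needed
        POOL.shutdown();
    }
}
```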

4. Retry

Network jitter or blocked threads can cause RPC calls to time out. Retrying the request improves the user experience, but blind retries can cause their own problems (for example, a bank transfer executed twice), so retries should always be combined with idempotency safeguards such as the following (a minimal retry-with-idempotency sketch appears after the list):

- Query before insert to avoid duplicates
- Add unique indexes
- Use a "dead-letter" table
- Introduce state machines (e.g., an order status of "paid" guarded by conditional updates)
- Apply distributed locks
- Use token mechanisms to ensure a request is processed only once
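
Here is a minimal sketch of a retry loop paired with an idempotency check. The in-memory map stands in for a unique index or deduplication table, and names such as requestId and transfer are illustrative rather than taken from the article.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Retry combined with an idempotency check so the same logical request
// can be retried safely after a timeout.
public class RetryWithIdempotency {

    // Records which request ids have already been processed
    // (a unique index or dedup table in a real system).
    private final Map<String, Boolean> processed = new ConcurrentHashMap<>();

    // Executes the transfer only the first time a given requestId is seen,
    // so retries cannot move the money twice.
    boolean transfer(String requestId, String account, long amountCents) {
        if (processed.putIfAbsent(requestId, Boolean.TRUE) != null) {
            return true; // already done: treat the retry as a success
        }
        System.out.println("transfer " + amountCents + " cents to " + account);
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        RetryWithIdempotency service = new RetryWithIdempotency();
        int maxAttempts = 3;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // The same requestId is reused on every attempt, so a retry
                // after a timeout cannot create a duplicate transfer.
                service.transfer("req-42", "account-A", 10_000L);
                break;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) throw e;
                Thread.sleep(200L * attempt); // simple backoff before retrying
            }
        }
    }
}
```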

5. Compensation

When retries are insufficient, compensation techniques achieve eventual consistency. Compensation can be forward (completing partially failed distributed transactions) or reverse (rolling back to the initial state).

Note: Compensation assumes the business can tolerate short‑term data inconsistency.

Implementation examples include local tables with scheduled scans, or simple message‑queue‑driven compensation tasks that leverage MQ retry mechanisms.
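
One possible shape of the "local table plus scheduled scan" approach is sketched below, with an in-memory queue standing in for the database table; CompensationTask and notifyDownstream are invented names for this example.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically re-drives pending tasks until the downstream call succeeds,
// achieving eventual consistency.
public class CompensationJob {

    record CompensationTask(String orderId, int attempts) {}

    private final Queue<CompensationTask> pendingTasks = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    void start() {
        // Scan for unfinished tasks every few seconds and retry them.
        scheduler.scheduleWithFixedDelay(this::scanAndRetry, 5, 5, TimeUnit.SECONDS);
    }

    private void scanAndRetry() {
        CompensationTask task;
        while ((task = pendingTasks.poll()) != null) {
            if (!notifyDownstream(task)) {
                // Put it back for the next scan instead of losing it.
                pendingTasks.add(new CompensationTask(task.orderId(), task.attempts() + 1));
            }
        }
    }

    private boolean notifyDownstream(CompensationTask task) {
        System.out.println("retrying downstream call for order " + task.orderId());
        return task.attempts() >= 1; // pretend the call succeeds on the second try
    }

    public static void main(String[] args) throws InterruptedException {
        CompensationJob job = new CompensationJob();
        job.pendingTasks.add(new CompensationTask("order-1001", 0));
        job.start();
        Thread.sleep(12_000);
        job.scheduler.shutdown();
    }
}
```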

6. Backup

Any server may crash, risking data loss, so disaster-recovery backup is a fundamental capability. For Redis, RDB persistence produces full point-in-time snapshots, while AOF keeps an incremental append-only log of write commands that can be replayed. Sentinel adds automatic master-replica failover on top.
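
For reference, here is a minimal, illustrative excerpt of the redis.conf and sentinel.conf settings that enable those mechanisms; the concrete values are assumptions and should be tuned to your own durability and performance requirements.

```
# redis.conf (persistence; values are illustrative)
# RDB: write a full snapshot if at least 1 key changed within 900 seconds
save 900 1
# AOF: append every write command and fsync it roughly once per second
appendonly yes
appendfsync everysec

# sentinel.conf (automatic failover; name, address and quorum are placeholders)
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 30000
```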

Other storage systems (MySQL, Kafka, HBase, Elasticsearch) also provide backup mechanisms to prevent data loss.

7. Multi‑Active Strategy

Beyond backup, multi‑active deployments (e.g., same‑city dual‑active, two‑region three‑center, three‑region five‑center, cross‑region dual‑active) reduce risk from catastrophic events like power outages or natural disasters.

8. Isolation

Physical isolation separates low‑coupling systems into independent deployments, preventing faults from cascading. Each subsystem has its own codebase, development, and release pipeline, communicating via RPC.

9. Rate Limiting

During traffic spikes, unrestricted requests can exhaust CPU, memory, and connections. Rate limiting caps the number of concurrent requests reaching the system, keeping it responsive for the requests it does accept while rejecting or queueing the excess, so overall availability is preserved.

Rate limiting can be implemented on a single machine (in-memory counters) or across a cluster (distributed coordination). It supports dimensions such as total system QPS, per-API limits, per-IP or per-user limits, and per-appkey rules. Common algorithms include the following (a token-bucket sketch appears after the list):

- Counter-based limiting
- Sliding-window limiting
- Leaky-bucket limiting
- Token-bucket limiting
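
Below is a minimal single-machine token-bucket limiter sketch; capacity and refill rate are illustrative, and a distributed limiter would move this state into shared storage such as Redis so all nodes draw from one budget.

```java
// Tokens are refilled lazily based on elapsed time; a request is admitted
// only if a whole token is available.
public class TokenBucketLimiter {

    private final long capacity;        // maximum burst size
    private final double refillPerNano; // tokens added per nanosecond
    private double tokens;
    private long lastRefillNanos;

    public TokenBucketLimiter(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill according to how much time has passed, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // over the limit: reject or queue the request
    }

    public static void main(String[] args) {
        TokenBucketLimiter limiter = new TokenBucketLimiter(5, 5.0); // roughly 5 requests/second
        for (int i = 1; i <= 10; i++) {
            System.out.println("request " + i + (limiter.tryAcquire() ? " accepted" : " rejected"));
        }
    }
}
```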

10. Circuit Breaking

Circuit breakers detect unstable resources (high latency or error rates) and quickly fail subsequent calls, preventing cascading failures. They have three states: Closed (normal), Open (reject requests), and Half‑Open (test recovery).
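
A minimal hand-rolled sketch of those three states follows; the thresholds and names are illustrative, and production systems usually rely on a library such as Resilience4j or Sentinel rather than writing this by hand.

```java
import java.util.function.Supplier;

// Closed: calls pass through. Open: calls fail fast to a fallback.
// Half-Open: after a cool-down, one trial call probes for recovery.
public class SimpleCircuitBreaker {

    private enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openTimeoutMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    public SimpleCircuitBreaker(int failureThreshold, long openTimeoutMillis) {
        this.failureThreshold = failureThreshold;
        this.openTimeoutMillis = openTimeoutMillis;
    }

    public synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt >= openTimeoutMillis) {
                state = State.HALF_OPEN;          // let one trial request through
            } else {
                return fallback.get();            // fail fast while open
            }
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;                 // call succeeded: close again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;               // too many errors: trip the breaker
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 10_000);
        String result = breaker.call(
                () -> { throw new RuntimeException("downstream timeout"); },
                () -> "fallback response");
        System.out.println(result);
    }
}
```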

11. Degradation

Degradation temporarily disables non‑core features (e.g., product reviews, transaction logs) during peak load, preserving critical functions like order creation and payment.
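
A minimal degradation-switch sketch: when the flag is flipped during peak load, the non-core reviews call is skipped and an empty list is returned, so core flows keep the resources. Class and method names are invented for this example; in practice the flag usually lives in a configuration center so it can be toggled at runtime.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Non-core feature guarded by a degradation switch.
public class ProductPageService {

    // In practice this flag would come from a config center, not a field.
    private final AtomicBoolean reviewsDegraded = new AtomicBoolean(false);

    public List<String> loadReviews(String productId) {
        if (reviewsDegraded.get()) {
            return Collections.emptyList(); // degraded: skip the non-core call
        }
        return fetchReviewsFromService(productId);
    }

    private List<String> fetchReviewsFromService(String productId) {
        return List.of("Great product!", "Would buy again.");
    }

    public static void main(String[] args) {
        ProductPageService service = new ProductPageService();
        System.out.println(service.loadReviews("sku-1"));   // normal mode
        service.reviewsDegraded.set(true);                  // peak load: degrade
        System.out.println(service.loadReviews("sku-1"));   // empty, but the page still renders
    }
}
```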

Different businesses adopt varied degradation strategies, requiring collaboration with product owners to define acceptable trade‑offs.

In summary, degradation protects core system availability by shutting down optional services when resources are constrained.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: microservices, system design, fault tolerance
Written by Sanyou's Java Diary

Passionate about technology, though not great at solving problems; eager to share, never tire of learning!