11 Essential Techniques to Build Highly Available Systems

Learn the eleven key strategies—including system splitting, decoupling, asynchronous processing, retries, compensation, backups, multi‑active deployment, isolation, rate limiting, circuit breaking, and degradation—that together form a robust high‑availability architecture for large‑scale internet services, ensuring reliability and scalability.

Sanyou's Java Diary

Aug 18, 2022

Large‑scale internet architecture rests on four pillars: high concurrency, high performance, high availability, and high scalability. A solid grasp of all four makes both interview questions and real design work easier.

Below are the eleven design techniques for achieving high availability.

1. System Splitting

In a monolithic system, a single failure can cascade across the entire service. Splitting the system into independent microservices along domain‑driven design (DDD) boundaries lets each sub‑system own one business function, which limits how far a fault can spread.

2. Decoupling

Apply the principle of high cohesion and low coupling: use abstract interfaces, MVC layering, the SOLID principles, and design patterns to minimize dependencies between modules. Example: under the Open/Closed principle, a module is open for extension but closed for modification, so new behavior is added without touching existing code.

Spring AOP provides aspect‑oriented programming to inject cross‑cutting concerns without invasive code changes. Event‑driven architecture using publish/subscribe further isolates modules.
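The publish/subscribe idea can be sketched with a small in‑process event bus. This is only an illustration: `SimpleEventBus`, its methods, and the topic name used below are hypothetical and are not part of Spring or any other library. A real system would more likely use a framework's event mechanism or a message broker.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical in-process event bus; not a Spring or library API. */
class SimpleEventBus {
    // topic name -> handlers registered for that topic
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    /** Registers a handler for a topic. */
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    /** Delivers the payload to every handler of the topic and returns how many were called. */
    int publish(String topic, String payload) {
        List<Consumer<String>> handlers =
                subscribers.getOrDefault(topic, Collections.emptyList());
        for (Consumer<String> handler : handlers) {
            handler.accept(payload);
        }
        return handlers.size();
    }
}
```

The publisher only knows the topic name, so a new subscriber can be added without changing the publishing module.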

3. Asynchrony

Synchronous calls block the thread until a response arrives, reducing throughput. Asynchronous processing (e.g., thread pools, message queues) allows the thread to continue while background tasks handle non‑real‑time actions.

Example: after an order is created, a message is published to a queue; downstream tasks handle SMS, email, snapshot creation, etc., without delaying the user.
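A minimal sketch of this pattern follows. It uses a thread pool as an in‑process stand‑in for the message queue, and `AsyncOrderService`, `createOrder`, and `completedTasks` are hypothetical names chosen for illustration. The follow‑up tasks only record that they ran; real SMS or email calls are omitted.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical order service; all names are illustrative. */
class AsyncOrderService {
    private final ExecutorService background = Executors.newFixedThreadPool(2);
    // Records which follow-up tasks have run, so the effect is observable.
    final List<String> completedTasks = new CopyOnWriteArrayList<>();

    /** Creates the order and returns at once; follow-up work runs on the pool. */
    CompletableFuture<Void> createOrder(String orderId) {
        // ... persist the order synchronously here (omitted) ...
        CompletableFuture<Void> sms = CompletableFuture.runAsync(
                () -> completedTasks.add("sms:" + orderId), background);
        CompletableFuture<Void> email = CompletableFuture.runAsync(
                () -> completedTasks.add("email:" + orderId), background);
        // The request thread can ignore this future; it exists so callers may wait if they wish.
        return CompletableFuture.allOf(sms, email);
    }

    /** Stops accepting new background work. */
    void shutdown() {
        background.shutdown();
    }
}
```

Unlike a durable message queue, an in‑memory pool loses pending tasks if the process crashes, which is one reason the queue‑based design is usually preferred for work that must not be lost.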

4. Retry

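Network jitter or thread blockage can cause RPC timeouts. Retrying the request improves user experience, but the timed‑out request may already have been processed, so retries must be combined with idempotency to avoid duplicate operations (e.g., a bank transfer applied twice). Common ways to make an operation idempotent: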

Check existence before insert.

Add unique indexes.

Use a status flag (e.g., paid) with conditional updates.

Introduce distributed locks.

Apply token mechanisms to ensure a single successful request.
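The combination of retry and a token‑based idempotency check can be sketched as follows. `Retry` and `TransferService` are hypothetical classes written for this illustration, the fixed delay is an arbitrary choice, and the in‑memory token set stands in for what would normally be a database unique index or a shared store.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical helper: bounded retry with a fixed delay between attempts. */
class Retry {
    static <T> T call(Supplier<T> task, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
            }
        }
        throw last; // every attempt failed
    }
}

/** Hypothetical transfer service that deduplicates requests by token. */
class TransferService {
    private final Set<String> processedTokens = ConcurrentHashMap.newKeySet();
    private long balance;

    TransferService(long initialBalance) {
        this.balance = initialBalance;
    }

    /** Applies the debit once per token; a repeated token is a no-op. Returns true if applied. */
    synchronized boolean debit(String token, long amount) {
        if (!processedTokens.add(token)) {
            return false; // already handled, for example by an earlier attempt
        }
        balance -= amount;
        return true;
    }

    synchronized long balance() {
        return balance;
    }
}
```

Because the caller reuses the same token on every attempt, a retry that follows a timed‑out but successful debit leaves the balance unchanged.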

5. Compensation

When a request cannot be completed, compensation mechanisms achieve eventual consistency. Compensation can be forward (completing a partially failed transaction) or backward (rolling back to the initial state).

Note: Compensation assumes the business can tolerate short‑term data inconsistency.

Implementation examples include a local table of pending operations that a scheduled job scans and re‑executes, or message‑driven workflows that retry on failure.
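Backward compensation can be sketched as a runner that undoes completed steps in reverse order when a later step fails. `Step` and `CompensatingRunner` are hypothetical names for this illustration. The sketch assumes each compensation succeeds, whereas a production system would persist progress and retry failed compensations.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** One step of a multi-step business operation, paired with its undo action. */
interface Step {
    void execute();    // may throw a RuntimeException on failure
    void compensate(); // undoes a previously successful execute()
}

/** Hypothetical runner illustrating backward compensation. */
class CompensatingRunner {
    /** Runs steps in order; on failure, compensates completed steps in reverse. Returns true on success. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.execute();
                completed.push(step); // most recent step ends up on top
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensate(); // roll back, newest first
                }
                return false;
            }
        }
        return true;
    }
}
```

Forward compensation would instead keep retrying the failed step until the whole operation completes.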

6. Backup

Disaster recovery is essential. For Redis, RDB persistence takes point‑in‑time snapshots of the full dataset, while AOF logs each write operation so the data can be rebuilt by replaying the log. Redis Sentinel adds automatic master‑slave failover.

7. Multi‑Active Strategy

Beyond backup, multi‑active deployments (same‑city dual‑active, two‑region three‑center, etc.) mitigate risks from data‑center failures, ensuring 24‑hour service availability.

8. Isolation

Physical isolation separates low‑coupling systems into independent deployments, preventing faults from cascading. Each subsystem maintains its own codebase and releases, communicating via RPC.

9. Rate Limiting

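To protect against traffic spikes, limit how many requests the system accepts within a given time window. Limiting can be done on a single node with an in‑process counter (e.g., AtomicLong) or across a cluster, where all nodes enforce a shared limit.

Limits can be applied at several granularities: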

Global request count per time window.

Per‑API request limits.

User/IP/Device‑level quotas.

App‑key specific rules for open platforms.

Common limiting algorithms:

Counter‑based limiting.

Sliding‑window limiting.

Leaky‑bucket limiting.

Token‑bucket limiting.
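As one example of these algorithms, here is a minimal single‑node token bucket. `TokenBucket` is a hypothetical class written for this sketch. The current time is passed in as a parameter, which is an assumption made to keep the behavior deterministic; production code would typically read a clock internally or use an existing limiter library.

```java
/** Hypothetical single-node token bucket; the caller supplies the current time. */
class TokenBucket {
    private final long capacity;         // maximum burst size
    private final double refillPerMilli; // tokens added per millisecond
    private double tokens;
    private long lastRefillMillis;

    TokenBucket(long capacity, double tokensPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMilli = tokensPerSecond / 1000.0;
        this.tokens = capacity; // start with a full bucket
        this.lastRefillMillis = nowMillis;
    }

    /** Takes one token if available; returns false when the request should be rejected. */
    synchronized boolean tryAcquire(long nowMillis) {
        long elapsed = Math.max(0L, nowMillis - lastRefillMillis);
        // Refill in proportion to elapsed time, never beyond capacity.
        tokens = Math.min(capacity, tokens + elapsed * refillPerMilli);
        lastRefillMillis = Math.max(lastRefillMillis, nowMillis);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The bucket allows short bursts up to `capacity` while holding the long‑run rate to `tokensPerSecond`. A leaky bucket, by contrast, releases requests at a constant rate.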

10. Circuit Breaking

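Circuit breakers protect a caller from unstable downstream dependencies: once failures pass a threshold, calls to the unstable resource are halted and fail fast, which prevents cascading errors and gives the dependency time to recover. A breaker has three states: Closed (calls pass through), Open (calls are rejected immediately), and Half‑Open (trial calls test whether the dependency has recovered).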
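The three states can be sketched as a small state machine. `SimpleCircuitBreaker` is a hypothetical class for illustration and is not the API of any library. The consecutive‑failure threshold, the fixed open window, and the caller‑supplied clock are all simplifying assumptions. The sketch also holds a lock during the call, which real implementations avoid.

```java
import java.util.function.Supplier;

/** Hypothetical minimal circuit breaker; not a library API. */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures that open the breaker
    private final long openMillis;      // how long to stay open before a trial call
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized State state() {
        return state;
    }

    /** Runs the call, or returns the fallback immediately while the breaker is open. */
    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback.get(); // fail fast without touching the dependency
            }
            state = State.HALF_OPEN;   // window elapsed: allow one trial call
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;      // success closes the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;    // trip, or re-trip after a failed trial
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }
}
```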

Alibaba's open‑source Sentinel provides a dashboard for defining resources and rules.

11. Degradation

When resources are scarce, temporarily disable non‑core features (e.g., product reviews, transaction logs) to preserve critical functions like order creation and payment.

Degradation plans must be tailored to each business scenario and agreed upon with stakeholders.

In summary, degradation protects core system availability by shutting down optional services during overload.
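A degradation switch can be as simple as a runtime flag checked before each non‑core feature. The sketch below assumes a hypothetical `DegradationSwitch` class and feature names. In practice the flags would usually live in a configuration center so operators can flip them without a redeploy.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical runtime switchboard for non-core features. */
class DegradationSwitch {
    private final Set<String> degraded = ConcurrentHashMap.newKeySet();

    /** Turns a non-core feature off. */
    void degrade(String feature) {
        degraded.add(feature);
    }

    /** Turns the feature back on once load has subsided. */
    void restore(String feature) {
        degraded.remove(feature);
    }

    /** Runs the normal path unless the feature is degraded, in which case the cheap fallback is used. */
    <T> T run(String feature, Supplier<T> normal, Supplier<T> fallback) {
        return degraded.contains(feature) ? fallback.get() : normal.get();
    }
}
```

Core paths such as order creation and payment are simply never registered as degradable features.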
