Design Techniques for High Availability in Large‑Scale Internet Architecture

This article explains the essential high‑availability design techniques for large‑scale internet systems, covering system splitting, decoupling, asynchronous processing, retry mechanisms, compensation, backup, multi‑active strategies, isolation, rate limiting, circuit breaking, and graceful degradation to ensure robust, scalable backend services.

IT Services Circle

Large-scale internet architecture design revolves around four goals, often called the "four-piece set": high concurrency, high performance, high availability, and high scalability.

If you master these four aspects, handling big‑company interviews and everyday architectural design becomes straightforward.

Today, Tom explains the design techniques for high availability.

1. System Splitting

When a monolithic system grows, a single mistake can cascade into a disaster. Early systems bundled modules like membership, product, order, logistics, and marketing into one codebase, causing whole‑system failures during traffic spikes.

System splitting therefore became popular and led to the microservice architecture, which separates business domains according to DDD (Domain-Driven Design) principles, draws clear boundaries between them, and limits how far a failure can propagate.

2. Decoupling

The principle of “high cohesion, low coupling” applies from interface abstraction and MVC layers to SOLID principles and the 23 design patterns, all aiming to lower inter‑module coupling.

For example, the Open/Closed Principle says a module should be open for extension but closed for modification, while Spring's AOP (Aspect-Oriented Programming) uses dynamic proxies to run extra logic before or after method calls without changing the method itself.
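The idea of wrapping a method with extra logic can be illustrated outside Spring. The following is a minimal Python sketch, not Spring AOP itself: a decorator plays the role of the proxy, and the names `audited`, `audit_log`, and `create_order` are hypothetical.

```python
import functools
from typing import Any, Callable, List

audit_log: List[str] = []


def audited(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap func so that extra logic runs before and after every call."""

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        audit_log.append(f"before {func.__name__}")
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returned or raised.
            audit_log.append(f"after {func.__name__}")

    return wrapper


@audited
def create_order(order_id: int) -> str:
    # Business logic stays free of auditing code.
    return f"order {order_id} created"


result = create_order(7)
print(result)
print(audit_log)
```

The business function is extended without being modified, which is the same trade the Open/Closed Principle describes.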

Event mechanisms based on the publish/subscribe pattern also allow new features to subscribe to events without invasive code changes.
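As a rough illustration of the publish/subscribe idea, here is a minimal in-process event bus. The `EventBus` class, the `order.created` topic, and the handlers are assumptions made for this sketch; a production system would more likely use a framework event mechanism or a message broker.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """A tiny in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: Dict) -> None:
        # The publisher does not know which subscribers exist.
        for handler in list(self._handlers[topic]):
            handler(event)


bus = EventBus()
log: List[str] = []

# Existing feature: send a confirmation when an order is created.
bus.subscribe("order.created", lambda e: log.append(f"sms:{e['order_id']}"))
# Feature added later: award points. The order code is not touched.
bus.subscribe("order.created", lambda e: log.append(f"points:{e['order_id']}"))

bus.publish("order.created", {"order_id": 42})
print(log)
```

Adding the second subscriber required no change to the code that publishes the event.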

3. Asynchrony

Synchronous calls block a thread until a response arrives, reducing efficiency. Asynchronous processing (e.g., thread pools, message queues) lets the thread continue with other work.

Non‑real‑time actions such as sending SMS, emails, or generating order snapshots can be handled asynchronously.
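A minimal sketch of the thread-pool variant, assuming a hypothetical `send_sms` function that stands in for a slow gateway call. The order path returns immediately and the notification runs in the background.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

executor = ThreadPoolExecutor(max_workers=4)
sent: List[int] = []


def send_sms(order_id: int) -> None:
    time.sleep(0.05)  # stand-in for a slow SMS gateway call
    sent.append(order_id)


def create_order(order_id: int) -> str:
    # Core path: persist the order (omitted here), then hand off
    # the non-real-time work instead of waiting for it.
    executor.submit(send_sms, order_id)
    return f"order {order_id} created"


result = create_order(1001)
print(result)

# For the demo only: wait for background work before exiting.
executor.shutdown(wait=True)
print(sent)
```

Tasks queued in an in-memory pool are lost if the process crashes, which is one reason a message queue is often preferred when the work must not be dropped.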

4. Retry

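Network jitter or thread blockage can cause RPC timeouts. Retrying the request improves user experience, but a timeout does not tell the caller whether the first attempt was actually executed, so retries must be combined with idempotency to avoid duplicate operations (e.g., a bank transfer applied twice).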

Idempotent solutions include pre‑check queries, unique indexes, distributed locks, state machines, or token mechanisms.
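The sketch below combines a retry helper with one of these approaches, a deduplication record keyed by a request ID (similar in spirit to a unique index). `retry`, `TransferService`, and the simulated lost response are all hypothetical and exist only to show why the two techniques belong together.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.01) -> T:
    """Retry on TimeoutError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


class TransferService:
    def __init__(self) -> None:
        self.balance = 100
        self._processed: Dict[str, int] = {}  # request_id -> resulting balance
        self._calls = 0

    def transfer(self, request_id: str, amount: int) -> int:
        self._calls += 1
        if request_id in self._processed:
            # Duplicate request: return the stored result, do not debit again.
            return self._processed[request_id]
        self.balance -= amount
        self._processed[request_id] = self.balance
        if self._calls == 1:
            # Simulate a response lost after the work was already done.
            raise TimeoutError("response lost")
        return self.balance


service = TransferService()
final_balance = retry(lambda: service.transfer("req-1", 30))
print(final_balance, service.balance)
```

Without the request-ID check, the retried call would debit the account a second time.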

5. Compensation

When retries are insufficient, compensation techniques achieve eventual consistency. Compensation can be forward (pushing failed tasks to success) or reverse (rolling back to the initial state).

Note: Compensation assumes the business can tolerate short‑term data inconsistency.

Implementation methods include local tables with scheduled scans, or using message middleware with retry capabilities.
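The local-table approach might look roughly like the following sketch, where an in-memory list stands in for the database table and `scan_once` stands in for the scheduled job. The `Task` fields, status names, and `max_attempts` policy are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: int
    payload: str
    status: str = "PENDING"  # PENDING -> DONE or ROLLED_BACK
    attempts: int = 0


def scan_once(
    tasks: List[Task],
    execute: Callable[[Task], None],
    rollback: Callable[[Task], None],
    max_attempts: int = 3,
) -> None:
    """One pass of the scheduled scan over unfinished tasks."""
    for task in tasks:
        if task.status != "PENDING":
            continue
        task.attempts += 1
        try:
            execute(task)
            task.status = "DONE"  # forward compensation succeeded
        except Exception:
            if task.attempts >= max_attempts:
                rollback(task)  # reverse compensation
                task.status = "ROLLED_BACK"


rolled_back: List[int] = []


def execute(task: Task) -> None:
    # "coupon" always fails; other tasks fail on their first attempt only.
    if task.payload == "coupon" or task.attempts < 2:
        raise RuntimeError("downstream failure")


tasks = [Task(1, "ship"), Task(2, "coupon")]
for _ in range(3):  # three scheduler ticks
    scan_once(tasks, execute, lambda t: rolled_back.append(t.task_id))

print([t.status for t in tasks], rolled_back)
```

Between ticks the data is inconsistent, which is why the approach only suits businesses that tolerate short-term inconsistency.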

6. Backup

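Disaster recovery backup is a basic internet capability. For Redis, RDB persistence writes point-in-time snapshots of the full dataset, while AOF logs each write command so the dataset can be rebuilt by replaying the log. Redis Sentinel adds monitoring and automatic failover from a failed master to a replica.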

Other storage systems like MySQL, Kafka, HBase, and Elasticsearch also have backup mechanisms.

7. Multi‑Active Strategy

Beyond backup, multi‑active strategies (e.g., same‑city dual‑active, two‑site three‑center, three‑site five‑center, cross‑region dual‑active, cross‑region multi‑active) reduce risk from catastrophic events.

8. Isolation

Physical isolation separates low‑coupling systems into independent deployments, preventing faults from spreading. Each microservice has its own codebase and communicates via RPC.

9. Rate Limiting

To handle traffic spikes, rate limiting caps how many requests are accepted in a given period (or how many are in flight at once) and rejects or queues the rest, protecting system stability. Limits can be applied per system, per API, per IP/device/user, or per appkey.

Common algorithms include counter, sliding window, leaky bucket, and token bucket.
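As an example of the last algorithm, here is a minimal single-threaded token bucket sketch. The `TokenBucket` class and its parameters are hypothetical, and the clock is injectable only so the demo is deterministic; a shared limiter would also need locking or a central store.

```python
import time
from typing import Callable, List


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request spends one."""

    def __init__(
        self,
        rate: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so short bursts are allowed
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill according to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Demo with a fake clock: 2 tokens per second, bursts of up to 3.
now: List[float] = [0.0]
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: now[0])

burst = [bucket.allow() for _ in range(4)]  # fourth request exceeds the burst
now[0] = 0.5                                # half a second later: one token refilled
after_wait = [bucket.allow(), bucket.allow()]
print(burst, after_wait)
```

Unlike a fixed counter, the bucket permits short bursts up to its capacity while still holding the long-run rate to `rate` requests per second.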

10. Circuit Breaking

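Circuit breaking makes calls to an unstable downstream resource fail fast, which keeps the caller from piling up blocked threads and gives the downstream service room to recover. A breaker has three states: Closed (calls pass through), Open (calls are rejected immediately), and Half-Open (trial calls test whether the resource has recovered), with transitions driven by failure counts and timers.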
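The three states can be sketched as a small state machine. This is a simplified, single-threaded illustration with hypothetical names (`CircuitBreaker`, `failure_threshold`, `reset_timeout`); it counts consecutive failures, whereas real libraries typically use windowed error ratios and are thread-safe.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"  # timer expired: allow a trial call
            else:
                raise CircuitOpenError("failing fast")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self.state = "OPEN"
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self.state = "CLOSED"
        return result


def failing() -> str:
    raise ConnectionError("downstream unavailable")


now: List[float] = [0.0]  # fake clock for a deterministic demo
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])

for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True  # the downstream was not called at all

now[0] = 10.0  # reset timeout elapsed
recovered = breaker.call(lambda: "ok")
print(state_after_failures, rejected, recovered, breaker.state)
```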

Alibaba's open-source Sentinel (a flow-control and circuit-breaking library, not to be confused with Redis Sentinel) provides a dashboard for defining resources and rules.

11. Degradation

Degradation temporarily disables non‑core features during high load, preserving limited resources for critical functions like order creation and payment.

Implementation varies by business and requires coordination with product owners.
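One common shape for this is a feature switch that operations can flip at runtime. The sketch below assumes a hypothetical in-memory `switches` table and a product page whose recommendations are optional; in practice the switch would usually live in a configuration center.

```python
from typing import Dict, List

# Hypothetical switch table; True means the feature is on.
switches: Dict[str, bool] = {"recommendations": True}


def is_enabled(feature: str) -> bool:
    return switches.get(feature, True)


def load_recommendations(product_id: int) -> List[int]:
    # Stand-in for an expensive call to a non-core service.
    return [product_id + 1, product_id + 2]


def product_page(product_id: int) -> Dict:
    # The core path is always served.
    page: Dict = {"product_id": product_id, "can_order": True}
    if is_enabled("recommendations"):
        page["recommendations"] = load_recommendations(product_id)
    else:
        page["recommendations"] = []  # degraded: cheap default, no downstream call
    return page


normal = product_page(7)
switches["recommendations"] = False  # flipped during a traffic peak
degraded = product_page(7)
print(normal, degraded)
```

Which features count as non-core, and what their degraded default looks like, is a product decision rather than a purely technical one.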

In summary, degradation protects core system availability by shutting down optional services when resources are scarce.
