Design Techniques for High Availability in Large‑Scale Internet Architecture
This article explains the essential high‑availability design techniques for large‑scale internet systems, covering system splitting, decoupling, asynchronous processing, retry mechanisms, compensation, backup, multi‑active strategies, isolation, rate limiting, circuit breaking, and graceful degradation to ensure robust, scalable backend services.
Large‑scale internet architecture design is often summed up as a set of four goals: high concurrency, high performance, high availability, and high scalability.
If you master these four aspects, handling big‑company interviews and everyday architectural design becomes straightforward.
Today, Tom explains the design techniques for high availability.
1. System Splitting
When a monolithic system grows, a single mistake can cascade into a disaster. Early systems bundled modules like membership, product, order, logistics, and marketing into one codebase, causing whole‑system failures during traffic spikes.
Therefore, system splitting became popular, leading to microservice architectures that separate business domains according to DDD (domain‑driven design) principles, draw clear boundaries between them, and limit how far a failure can propagate.
2. Decoupling
The principle of “high cohesion, low coupling” applies from interface abstraction and MVC layers to SOLID principles and the 23 design patterns, all aiming to lower inter‑module coupling.
For example, the Open/Closed Principle keeps code open for extension but closed for modification, while Spring’s AOP (Aspect‑Oriented Programming) uses dynamic proxies to inject extra logic before or after method calls.
Event mechanisms based on the publish/subscribe pattern also allow new features to subscribe to events without invasive code changes.
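As a rough illustration of the publish/subscribe idea, the sketch below uses a tiny in‑process event bus. The `EventBus` class, the `order.created` topic, and the two subscribers are hypothetical names chosen for this example; a production system would more likely rely on a framework event mechanism or a message broker.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Minimal in-process publish/subscribe hub (hypothetical helper)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        # New features register here; the publisher's code does not change.
        self._handlers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many were called."""
        handlers = list(self._handlers.get(topic, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = EventBus()
sent_sms: List[str] = []
granted_points: List[str] = []

# Two independent features subscribe to the same (assumed) topic.
bus.subscribe("order.created", lambda e: sent_sms.append(e["order_id"]))
bus.subscribe("order.created", lambda e: granted_points.append(e["order_id"]))

# The order module only publishes; it knows nothing about the subscribers.
delivered = bus.publish("order.created", {"order_id": "A1001"})
```

Adding a third feature later means adding one more `subscribe` call, while the code that publishes the event stays untouched.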
3. Asynchrony
Synchronous calls block a thread until a response arrives, reducing efficiency. Asynchronous processing (e.g., thread pools, message queues) lets the thread continue with other work.
Non‑real‑time actions such as sending SMS, emails, or generating order snapshots can be handled asynchronously.
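A minimal sketch of this idea, assuming a thread pool is enough for the workload: the core order path returns right away, while the SMS and snapshot work is handed to worker threads. The function names and the in‑memory `notifications` list are placeholders for real integrations.

```python
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import List, Tuple

notifications: List[Tuple[str, str]] = []  # stand-in for real SMS/snapshot systems


def send_sms(order_id: str) -> str:
    notifications.append(("sms", order_id))
    return f"sms:{order_id}"


def save_snapshot(order_id: str) -> str:
    notifications.append(("snapshot", order_id))
    return f"snapshot:{order_id}"


executor = ThreadPoolExecutor(max_workers=2)


def create_order(order_id: str) -> Tuple[str, List[Future]]:
    # The core, real-time work (persisting the order) would happen here.
    # Non-real-time follow-ups are submitted and not awaited by the caller.
    futures = [
        executor.submit(send_sms, order_id),
        executor.submit(save_snapshot, order_id),
    ]
    return order_id, futures


order_id, futures = create_order("A1001")

# Only this demo waits for completion; a real request handler would not.
wait(futures)
results = sorted(f.result() for f in futures)
executor.shutdown()
```

A message queue plays the same role across processes: the producer enqueues the task and moves on, and a separate consumer does the slow work.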
4. Retry
Network jitter or thread blockage can cause RPC timeouts. Retrying the request improves user experience, but a timed‑out request may already have been processed, so retries must be combined with idempotency to avoid duplicate operations (e.g., debiting a bank transfer twice).
Idempotency techniques include pre‑check queries, unique indexes, distributed locks, state machines, or token mechanisms.
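The sketch below combines a retry loop with the token mechanism, under simplified assumptions: `TransferService` is a hypothetical in‑memory service that remembers each request token, and the simulated failure is a lost response after the debit has already happened. The backoff delay is an illustrative addition, not something the text above prescribes.

```python
import time
from typing import Callable, Dict, Optional


class TransferService:
    """Hypothetical downstream service that deduplicates by request token."""

    def __init__(self, lose_first_responses: int = 0) -> None:
        self.balance = 100
        self._processed: Dict[str, int] = {}
        self._lose_remaining = lose_first_responses

    def transfer(self, token: str, amount: int) -> int:
        if token in self._processed:
            # Same token seen before: return the recorded result, do not debit again.
            return self._processed[token]
        self.balance -= amount
        self._processed[token] = self.balance
        if self._lose_remaining > 0:
            # Simulate a timeout: the debit happened but the caller never hears back.
            self._lose_remaining -= 1
            raise TimeoutError("response lost")
        return self.balance


def call_with_retry(fn: Callable[[], int], attempts: int = 3,
                    base_delay: float = 0.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # simple exponential backoff
    assert last_error is not None
    raise last_error


service = TransferService(lose_first_responses=1)
# The caller generates one token per logical transfer and reuses it on every retry.
remaining = call_with_retry(lambda: service.transfer("req-42", 30))
```

Without the token check, the second attempt would debit the account again; with it, the retry only fetches the result of the first attempt.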
5. Compensation
When retries are insufficient, compensation techniques achieve eventual consistency. Compensation can be forward (pushing failed tasks to success) or reverse (rolling back to the initial state).
Note: Compensation assumes the business can tolerate short‑term data inconsistency.
Implementation methods include local tables with scheduled scans, or using message middleware with retry capabilities.
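One way to picture the local‑table approach is the sketch below, where an in‑memory list stands in for the database table and `scan_once` stands in for the scheduled job. Pending tasks are pushed forward until they succeed; after an assumed limit of three attempts the task is compensated in reverse by calling its rollback. All names and the attempt limit are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    task_id: str
    action: Callable[[], bool]    # forward step; returns True on success
    rollback: Callable[[], None]  # reverse step; restores the initial state
    attempts: int = 0
    status: str = "PENDING"       # PENDING -> DONE or ROLLED_BACK


def scan_once(table: List[Task], max_attempts: int = 3) -> None:
    """One pass of the scheduled scan over the (in-memory) task table."""
    for task in table:
        if task.status != "PENDING":
            continue
        task.attempts += 1
        if task.action():
            task.status = "DONE"            # forward compensation succeeded
        elif task.attempts >= max_attempts:
            task.rollback()                 # give up and compensate in reverse
            task.status = "ROLLED_BACK"


calls: Dict[str, int] = {"n": 0}


def flaky_shipping() -> bool:
    # Fails on the first call and succeeds afterwards.
    calls["n"] += 1
    return calls["n"] >= 2


rolled_back: List[str] = []
table = [
    Task("ship-1", flaky_shipping, lambda: None),
    Task("coupon-1", lambda: False, lambda: rolled_back.append("coupon-1")),
]

# A real scheduler would run this periodically; three passes are enough here.
for _ in range(3):
    scan_once(table)
```

Between scans the data is temporarily inconsistent, which is exactly the window the business has to tolerate.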
6. Backup
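Disaster recovery backup is a basic capability for internet systems. For Redis, RDB persists point‑in‑time snapshots of the full dataset, while AOF logs each write command so the data can be rebuilt by replaying the log. Redis Sentinel adds monitoring and automatic master‑replica failover.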
Other storage systems like MySQL, Kafka, HBase, and Elasticsearch also have backup mechanisms.
7. Multi‑Active Strategy
Beyond backup, multi‑active strategies (e.g., same‑city dual‑active, three data centers across two locations, five data centers across three locations, cross‑region dual‑active, cross‑region multi‑active) reduce the risk from catastrophic events such as losing an entire data center or region.
8. Isolation
Physical isolation separates low‑coupling systems into independent deployments, preventing faults from spreading. Each microservice has its own codebase and communicates via RPC.
9. Rate Limiting
To handle traffic spikes, rate limiting caps the request rate or the number of concurrent requests, protecting system stability. Limits can be applied per system, per API, per IP/device/user, or per appkey.
Common algorithms include counter, sliding window, leaky bucket, and token bucket.
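Of the algorithms listed, the token bucket is sketched below as one possible single‑process implementation. Tokens refill at `rate` per second up to `capacity`, and each request spends one token. The class name, parameters, and the injectable `clock` (used here to make the demo deterministic) are assumptions of this example; the sketch is not thread‑safe.

```python
import time
from typing import Callable, List


class TokenBucket:
    """Single-process token bucket; not thread-safe (illustrative sketch)."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = capacity     # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Demo with a fake clock so the outcome does not depend on real time.
fake_now: List[float] = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(3)]  # two tokens available, third is rejected
fake_now[0] = 1.0                           # one second later, one token has refilled
after_refill = bucket.allow()
```

Compared with a fixed counter, the bucket tolerates short bursts up to `capacity` while still holding the long‑run average to `rate`.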
10. Circuit Breaking
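Circuit breakers fail calls fast when a downstream resource becomes unstable, which keeps the caller from piling up blocked requests and gives the downstream service room to recover. They have three states: Closed (calls pass through), Open (calls are rejected immediately), and Half‑Open (a trial call is let through to probe recovery), with transitions driven by failure counts and timers.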
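The three‑state behavior can be sketched as a small state machine. This is a generic illustration rather than the API of any particular library: the class name, the consecutive‑failure threshold, and the recovery timeout are assumptions, and the clock is injectable so the demo runs deterministically.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self._clock() - self._opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # timer elapsed: allow a trial call
            else:
                raise CircuitOpenError("failing fast")
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self) -> None:
        self._failures += 1
        # A failed trial call, or too many consecutive failures, opens the circuit.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()

    def _on_success(self) -> None:
        self._failures = 0
        self.state = "CLOSED"


def failing_call() -> str:
    raise ConnectionError("downstream unavailable")


fake_now: List[float] = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=10.0,
                         clock=lambda: fake_now[0])

for _ in range(2):
    try:
        breaker.call(failing_call)
    except ConnectionError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True  # rejected while open, even though the call would succeed

fake_now[0] = 10.0   # recovery timeout elapsed; the next call is the trial
recovered = breaker.call(lambda: "ok")
```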
Alibaba’s open‑source Sentinel (a flow‑control and circuit‑breaking component, unrelated to Redis Sentinel) provides a dashboard for defining resources and rules.
11. Degradation
Degradation temporarily disables non‑core features during high load, preserving limited resources for critical functions like order creation and payment.
Implementation varies by business and requires coordination with product owners.
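A very small sketch of a degradation switch follows, assuming feature flags held in memory. In practice the flags would usually live in a configuration center so they can be flipped at runtime. The `recommendations` feature and the `product_page` function are hypothetical; the core feature names follow the examples above.

```python
from typing import Dict, Set

# Core features that must never be switched off.
CORE_FEATURES: Set[str] = {"order_creation", "payment"}


class DegradationSwitch:
    """In-memory feature switch (illustrative; real flags would be externalized)."""

    def __init__(self) -> None:
        self._disabled: Set[str] = set()

    def degrade(self, feature: str) -> None:
        if feature in CORE_FEATURES:
            raise ValueError(f"core feature cannot be degraded: {feature}")
        self._disabled.add(feature)

    def restore(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


switch = DegradationSwitch()


def product_page(product_id: str) -> Dict[str, object]:
    page: Dict[str, object] = {"product_id": product_id, "can_order": True}
    if switch.is_enabled("recommendations"):
        page["recommendations"] = ["P2", "P3"]  # placeholder for a costly call
    else:
        page["recommendations"] = []            # cheap fallback during high load
    return page


normal_page = product_page("P1")
switch.degrade("recommendations")               # load is high: shed the optional feature
degraded_page = product_page("P1")
```

Which features count as optional, and what the fallback looks like, is the part that needs agreement with product owners.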
In summary, degradation protects core system availability by shutting down optional services when resources are scarce.
IT Services Circle
