11 Essential Techniques to Build Highly Available Systems
Learn the eleven key strategies—including system splitting, decoupling, asynchronous processing, retries, compensation, backups, multi‑active deployment, isolation, rate limiting, circuit breaking, and degradation—that together form a robust high‑availability architecture for large‑scale internet services, ensuring reliability and scalability.
Large-scale internet architecture is often summarized by four goals: high concurrency, high performance, high availability, and high scalability. Understanding how they fit together helps in both interviews and real system design.
Below are the eleven design techniques for achieving high availability.
1. System Splitting
In a monolithic system, a single failure can cascade across the entire service. Splitting the system into independent microservices along Domain-Driven Design (DDD) boundaries lets each subsystem own one business function, which limits how far a fault can spread.
2. Decoupling
Apply the principle of high cohesion and low coupling: use abstract interfaces, MVC layering, the SOLID principles, and design patterns to minimize dependencies between modules. For example, the Open/Closed principle says a module should be open for extension but closed for modification, so new behavior is added without editing existing code.
Spring AOP provides aspect‑oriented programming to inject cross‑cutting concerns without invasive code changes. Event‑driven architecture using publish/subscribe further isolates modules.
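The publish/subscribe idea can be sketched in a few lines of plain Java. This is only an illustration, not a Spring API: the class name SimpleEventBus, the topic strings, and the String payload are all assumptions made for the example. The point is that the publisher knows only a topic name, never the modules that react to it.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Minimal in-process publish/subscribe bus (hypothetical class, for illustration only). */
final class SimpleEventBus {
    // topic name -> handlers; concurrent collections so modules can register from any thread
    private final Map<String, List<Consumer<String>>> subscribers = new ConcurrentHashMap<>();

    /** Registers a handler for a topic. */
    void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new CopyOnWriteArrayList<>()).add(handler);
    }

    /** Delivers the payload to every handler and returns how many completed without error. */
    int publish(String topic, String payload) {
        int delivered = 0;
        for (Consumer<String> handler : subscribers.getOrDefault(topic, List.of())) {
            try {
                handler.accept(payload);
                delivered++;
            } catch (RuntimeException e) {
                // A failing subscriber must not break the publisher or the other subscribers.
            }
        }
        return delivered;
    }
}
```

A module would call subscribe("order.created", ...) once at startup, and the order module would call publish("order.created", orderId) without referencing any subscriber class.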
3. Asynchrony
Synchronous calls block the thread until a response arrives, reducing throughput. Asynchronous processing (e.g., thread pools, message queues) allows the thread to continue while background tasks handle non‑real‑time actions.
Example: after an order is created, a message is published to a queue; downstream tasks handle SMS, email, snapshot creation, etc., without delaying the user.
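As a rough sketch of the order example, the code below uses an in-process thread pool in place of a real message queue. AsyncOrderService, its log list, and the pool size are assumptions made for illustration; a production system would more likely publish to a broker and persist the order in a database.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical order service: the core step is synchronous, notifications are asynchronous. */
final class AsyncOrderService {
    // Background pool standing in for a message queue and its consumers.
    private final ExecutorService background = Executors.newFixedThreadPool(2);
    // Records what happened, so the behavior is observable in this sketch.
    private final List<String> log = new CopyOnWriteArrayList<>();

    /**
     * Saves the order on the caller's thread, then schedules follow-up work.
     * The returned future completes when all follow-ups finish; callers may ignore it.
     */
    CompletableFuture<Void> createOrder(String orderId) {
        log.add("saved:" + orderId); // core path: the user only waits for this
        CompletableFuture<Void> sms =
                CompletableFuture.runAsync(() -> log.add("sms:" + orderId), background);
        CompletableFuture<Void> email =
                CompletableFuture.runAsync(() -> log.add("email:" + orderId), background);
        return CompletableFuture.allOf(sms, email);
    }

    List<String> log() {
        return log;
    }

    /** Stops accepting new background work; already submitted tasks still run. */
    void shutdown() {
        background.shutdown();
    }
}
```

The request thread returns right after the "saved" step. Unlike a durable message queue, an in-memory pool loses pending tasks if the process crashes, which is one reason the article's queue-based design is usually preferred.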
4. Retry
Network jitter or thread blocking can cause RPC timeouts. Retrying the request improves the user experience, but the operation must be idempotent so that a retry does not repeat a side effect (for example, executing a bank transfer twice). Common ways to achieve idempotency:
Check existence before insert.
Add unique indexes.
Use a status flag (e.g., paid) with conditional updates.
Introduce distributed locks.
Apply token mechanisms to ensure a single successful request.
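The token approach combined with retries might look like the sketch below. IdempotentTransfers and Retry are hypothetical names, and the in-memory map stands in for a unique index or deduplication table; real code would keep that state in the database, inside the same transaction as the balance update.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical account that applies each request token at most once. */
final class IdempotentTransfers {
    // token -> result of the first execution (stands in for a unique-index table)
    private final Map<String, String> processed = new HashMap<>();
    private long balance;

    IdempotentTransfers(long initialBalance) {
        this.balance = initialBalance;
    }

    /** Debits the amount once per token; repeated calls return the first result unchanged. */
    synchronized String transfer(String token, long amount) {
        String earlier = processed.get(token);
        if (earlier != null) {
            return earlier; // duplicate request: no second debit
        }
        balance -= amount;
        String result = "OK balance=" + balance;
        processed.put(token, result);
        return result;
    }

    synchronized long balance() {
        return balance;
    }
}

/** Hypothetical retry helper: retries on any RuntimeException up to maxAttempts times. */
final class Retry {
    static <T> T withRetry(int maxAttempts, Supplier<T> call) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

If the first call debits the account but the response is lost, the retry sends the same token and receives the stored result instead of debiting again.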
5. Compensation
When a request cannot be completed, compensation mechanisms achieve eventual consistency. Compensation can be forward (completing a partially failed transaction) or backward (rolling back to the initial state).
Note: Compensation assumes the business can tolerate short‑term data inconsistency.
Implementation examples include a local task table that a scheduled job scans and re-executes, or message-driven workflows that retry on failure.
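A local-table-plus-scheduled-job design could be sketched roughly as follows. CompensationJob, its in-memory list (standing in for a database table), and the maxAttempts policy are assumptions for illustration. Each pass first tries forward compensation, and falls back to backward compensation once the attempts are used up.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical compensation job; a scheduler would call runOnce() periodically. */
final class CompensationJob {
    private static final class Task {
        final String orderId;
        int attempts;

        Task(String orderId) {
            this.orderId = orderId;
        }
    }

    private final List<Task> pending = new ArrayList<>();      // stands in for a local task table
    private final List<String> rolledBack = new ArrayList<>(); // orders undone after giving up
    private final int maxAttempts;
    private final Predicate<String> forwardStep;               // returns true once the step succeeds

    CompensationJob(int maxAttempts, Predicate<String> forwardStep) {
        this.maxAttempts = maxAttempts;
        this.forwardStep = forwardStep;
    }

    /** Records an incomplete operation so that a later pass can finish or undo it. */
    synchronized void record(String orderId) {
        pending.add(new Task(orderId));
    }

    /** One pass over the table: retry each task, roll back those that exhausted their attempts. */
    synchronized void runOnce() {
        Iterator<Task> it = pending.iterator();
        while (it.hasNext()) {
            Task task = it.next();
            task.attempts++;
            boolean done;
            try {
                done = forwardStep.test(task.orderId);
            } catch (RuntimeException e) {
                done = false; // treat an exception like a failed attempt
            }
            if (done) {
                it.remove();                     // forward compensation succeeded
            } else if (task.attempts >= maxAttempts) {
                rolledBack.add(task.orderId);    // backward compensation: undo the operation
                it.remove();
            }
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized List<String> rolledBack() {
        return new ArrayList<>(rolledBack);
    }
}
```

Between passes the data is temporarily inconsistent, which is the trade-off the note above describes.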
6. Backup
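Disaster recovery is essential. For Redis, RDB persistence writes point-in-time snapshots of the full dataset, while AOF logs each write operation so the dataset can be rebuilt by replaying the log. Redis Sentinel monitors the master and its replicas and performs automatic failover when the master fails.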
7. Multi‑Active Strategy
Beyond backup, multi-active deployments (for example, dual-active data centers in the same city, or three data centers across two regions) mitigate the risk of a data-center failure and keep the service available around the clock.
8. Isolation
Physical isolation separates low‑coupling systems into independent deployments, preventing faults from cascading. Each subsystem maintains its own codebase and releases, communicating via RPC.
9. Rate Limiting
To protect against traffic spikes, limit how many requests the system admits. On a single node a simple in-memory counter (e.g., AtomicLong) is enough; across a cluster the limit has to be coordinated between nodes.
Limits can be applied at several granularities:
Global request count per time window.
Per-API request limits.
User/IP/Device-level quotas.
App-key specific rules for open platforms.
Common limiting algorithms:
Counter-based (fixed-window) limiting.
Sliding-window limiting.
Leaky-bucket limiting.
Token-bucket limiting.
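Of the algorithms listed, the token bucket is compact enough to sketch here for a single node. TokenBucket and its injectable clock are assumptions made for illustration (the clock makes the class easy to test); this is not the API of any particular library.

```java
import java.util.function.LongSupplier;

/** Hypothetical single-node token bucket: allows bursts up to capacity, refills at a fixed rate. */
final class TokenBucket {
    private final long capacity;
    private final double tokensPerSecond;
    private final LongSupplier nanoClock; // e.g. System::nanoTime in production
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double tokensPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.tokensPerSecond = tokensPerSecond;
        this.nanoClock = nanoClock;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true and consumes one token if the request is admitted, false if it is rejected. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        double refill = (now - lastRefillNanos) / 1e9 * tokensPerSecond;
        tokens = Math.min(capacity, tokens + refill); // never exceed the bucket size
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production code the bucket would be created as new TokenBucket(100, 50.0, System::nanoTime), and a request is rejected or queued whenever tryAcquire() returns false. A cluster-wide limit would need the token state kept in shared storage rather than in one JVM.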
10. Circuit Breaking
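A circuit breaker protects the calling service by cutting off calls to an unstable downstream resource, so requests fail fast instead of piling up and causing cascading errors. It has three states: Closed (calls pass through), Open (calls are rejected immediately), and Half-Open (a trial call tests whether the resource has recovered).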
Alibaba's open-source Sentinel (a different project from Redis Sentinel) provides a dashboard for defining resources and rules.
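To make the three states concrete, here is a minimal hand-rolled breaker. It is a sketch and does not use Sentinel's API: SimpleCircuitBreaker, the consecutive-failure threshold, and the fixed open interval are all assumptions chosen for illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical circuit breaker driven by consecutive failures and a fixed open interval. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clockMillis; // e.g. System::currentTimeMillis in production
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    /**
     * Runs the action unless the breaker is open; on rejection or failure returns the fallback.
     * The method is synchronized for simplicity, which serializes calls; a real breaker would not.
     */
    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (clockMillis.getAsLong() - openedAt < openMillis) {
                return fallback.get();  // fail fast without touching the resource
            }
            state = State.HALF_OPEN;    // open interval elapsed: allow one trial call
        }
        try {
            T result = action.get();
            consecutiveFailures = 0;
            state = State.CLOSED;       // success closes the breaker again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;     // trip (or re-trip) the breaker
                openedAt = clockMillis.getAsLong();
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

While the breaker is open, the failing resource receives no traffic at all, which gives it time to recover and keeps caller threads from blocking on timeouts.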
11. Degradation
When resources are scarce, temporarily disable non‑core features (e.g., product reviews, transaction logs) to preserve critical functions like order creation and payment.
Degradation plans must be tailored to each business scenario and agreed upon with stakeholders.
In summary, degradation protects core system availability by shutting down optional services during overload.
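One simple way to implement such a plan is a runtime feature switch, sketched below. DegradationSwitch and the feature names are hypothetical; in practice the switch state would usually come from a configuration center so that operators can flip it without a redeploy.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical degradation switch: non-core features can be turned off at runtime. */
final class DegradationSwitch {
    // Names of features that are currently degraded.
    private final Set<String> disabled = ConcurrentHashMap.newKeySet();

    void degrade(String feature) {
        disabled.add(feature);
    }

    void restore(String feature) {
        disabled.remove(feature);
    }

    /** Runs the normal logic, or returns the cheap degraded value if the feature is switched off. */
    <T> T run(String feature, Supplier<T> normal, T degradedValue) {
        return disabled.contains(feature) ? degradedValue : normal.get();
    }
}
```

For example, a product page could wrap its review lookup in run("reviews", ..., "") so that the page renders without reviews during overload, while order creation is never wrapped in a switch that could disable it.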
Sanyou's Java Diary
Passionate about technology, though not great at solving problems; eager to share, never tire of learning!
