Mastering High Availability and Concurrency: Core Principles and Practical Techniques
This article distills essential guiding principles, high‑availability strategies, and high‑concurrency techniques for building resilient, scalable systems, covering stateless design, fault‑handling phases, replication, isolation, rate limiting, caching, async processing, multithreading, and scaling approaches.
Guiding Principles
This article is inspired by Zhang Kaitao’s book Core Technologies of Billion‑Scale Traffic Websites. It is divided into three parts (guiding principles, high availability, and high concurrency) and largely reflects the author’s own thinking.
High‑Concurrency Principles
Stateless design – avoiding state to prevent lock contention.
Reasonable granularity – controlling service granularity to disperse requests and improve manageability.
Cache, queue, and concurrency techniques – to be used as appropriate for the scenario.
High‑Availability Principles
Every deployment must be rollback‑capable.
External dependencies must be measured so that they can be degraded gracefully, and each must provide a degradation switch.
Public interfaces must be rate‑limited with accurate limits.
Business Design Principles
Security – anti‑scraping, anti‑duplicate submissions, etc.
Idempotent design where appropriate.
Dynamically configurable business processes and rules.
Ownership, backup personnel, on‑call rotation.
Comprehensive documentation.
Traceable backend operations.
These principles represent only a fraction of the vast design space; practitioners should accumulate experience over time.
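To make the idempotent-design principle above concrete, here is a minimal Python sketch. It assumes each client request carries a unique idempotency key. The names `PaymentService` and `charge`, and the in-memory dictionary, are hypothetical; a real system would more likely keep the keys in a shared store with an expiry.

```python
import threading


class PaymentService:
    """Hypothetical service that applies each request at most once."""

    def __init__(self):
        self._results = {}             # idempotency key -> stored result
        self._lock = threading.Lock()
        self.balance = 0

    def charge(self, idempotency_key, amount):
        # The lock makes "check, then apply" atomic, so two concurrent
        # copies of the same request cannot both be applied.
        with self._lock:
            if idempotency_key in self._results:
                # Duplicate submission: return the stored result unchanged.
                return self._results[idempotency_key]
            self.balance += amount
            result = {"status": "ok", "balance": self.balance}
            self._results[idempotency_key] = result
            return result


service = PaymentService()
first = service.charge("req-1", 100)
duplicate = service.charge("req-1", 100)   # retried or double-submitted request
```

The second call returns the stored result instead of charging again, which is also what makes retries safe for this operation.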
High Availability
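High availability means resisting uncertainty so that the service stays healthy around the clock. Uncertainties include natural disasters, staff turnover, downstream failures, and hardware faults. Availability is often expressed as “N 9s” (for example, three 9s is 99.9%); each additional 9 costs more, so balancing cost against the availability the business actually needs is crucial. Two related metrics are Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR): steady‑state availability is commonly estimated as MTBF / (MTBF + MTTR), so it improves both when failures become rarer and when repairs become faster.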
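The arithmetic behind "N 9s" can be sketched in a few lines of Python. The function names are illustrative, and the year is assumed to have 365 days. The MTBF/MTTR relation used here is the common steady-state estimate, availability = MTBF / (MTBF + MTTR).

```python
SECONDS_PER_YEAR = 365 * 24 * 3600   # assumes a 365-day year


def availability_from_nines(n):
    """Availability for "n 9s", e.g. n=3 gives 0.999."""
    return 1 - 10 ** (-n)


def downtime_per_year_seconds(availability):
    """Allowed downtime per year for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


def availability_from_mtbf_mttr(mtbf_hours, mttr_hours):
    """Steady-state estimate: fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

Under these assumptions, three 9s allow roughly 8.76 hours of downtime per year, and four 9s roughly 52.6 minutes. A service that fails on average every 999 hours and takes 1 hour to repair sits at about three 9s.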
Fault‑handling can be divided into four phases:
Pre‑incident – replication, isolation, quota, pre‑planning, probing.
During incident (detection) – monitoring and alerting.
Mid‑incident (mitigation) – degradation, rollback, emergency plans, fail‑XXX series.
Post‑incident – post‑mortem, reflection, technical improvement.
Pre‑incident Techniques
Replication
Replication is a powerful weapon against uncertainty, used in stateless service clusters, storage systems (e.g., MySQL master‑slave, RAID, distributed NoSQL partitions), and many other high‑availability components.
Isolation
Various forms of isolation (thread, process, cluster, data‑center, read/write, hot/cold) essentially provide resource isolation, protecting each resource from failures in others.
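One simple form of resource isolation is to cap how many concurrent calls each dependency may hold, often called a bulkhead. The sketch below is a minimal illustration using a semaphore. The class name `Bulkhead` and the reject-when-full behavior are assumptions, not something the source prescribes.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency so that a slow dependency
    cannot use up the capacity other dependencies need."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once instead of queueing,
        # so callers are not tied up waiting for a free slot.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means that a hung service can occupy only its own slots.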
Quota
Quota limits resource consumption to protect the system; rate limiting is a common quota technique, implemented either cluster‑wide or per‑instance.
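A token bucket is one common way to implement per-instance rate limiting. The sketch below is a minimal, single-threaded version; the class name, parameters, and injectable `clock` are illustrative assumptions. A cluster-wide limit would need shared state, which is not shown here.

```python
import time


class TokenBucket:
    """Per-instance limiter: allows `rate` requests per second on average,
    with bursts of up to `capacity` requests. Not thread-safe."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start with a full bucket
        self._last = clock()

    def allow(self, cost=1):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would check `bucket.allow()` before handling each request and reject or queue the request when it returns `False`.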
Probing
Probing (stress testing, drills) assesses current availability but does not improve it directly; it includes full‑link load testing and various disaster‑recovery drills.
During Incident
Monitoring and Alerting
Effective monitoring and alerting answer three key questions: why was the fault not detected earlier, why was it not resolved sooner, and what is the impact?
Mid‑incident Techniques
Degradation
Degradation sacrifices non‑critical functionality to keep the overall system alive, typically via circuit breaking or fallback paths, with decisions driven by thresholds or manual intervention.
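Threshold-driven circuit breaking with a fallback path might look like the following minimal sketch. The class name, the default threshold, and the timeout are illustrative assumptions. The breaker is single-threaded and counts consecutive failures only; production libraries usually track failure rates over a time window.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0        # consecutive failures seen so far
        self._opened_at = None    # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()   # open: skip the downstream call entirely
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call re-opens at once; otherwise the circuit
            # opens once the consecutive-failure threshold is reached.
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The `fallback` callable is the degraded path, for example returning cached or default data instead of calling the failing dependency.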
Rollback
Rollback restores a previous stable state; it requires that changes be designed to be rollback‑able, leveraging database transactions, version control, or deployment tools.
Fail‑XXX Series
fail‑retry – retry with back‑off.
fail‑over – switch to alternative instances or replicas.
fail‑safe – silent fallback when the downstream is weakly dependent.
fail‑fast – immediate error reporting.
fail‑back – delayed compensation (e.g., replay via message queue).
Retry policies must balance back‑off intervals and retry counts to avoid overwhelming downstream services.
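As a sketch of fail‑retry with a bounded retry count, the function below uses exponential back-off with random jitter, one common way to keep retries from arriving downstream in synchronized waves. The function name, the defaults, and the injectable `sleep` and `rng` parameters are illustrative assumptions.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on any exception up to max_attempts in total."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise   # retry budget exhausted: surface the error
            # The delay ceiling doubles with each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: sleep for a random fraction of the ceiling.
            sleep(rng() * ceiling)
```

Retrying is only safe for idempotent operations. Non-retryable errors, such as validation failures, would normally be excluded rather than caught by the broad `except` used here.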
Post‑incident
Post‑mortem analysis, reflection, and technical improvements complete the fault‑handling cycle.
High Concurrency
Beyond high availability, systems must sustain large request volumes without sacrificing reliability. The challenge is to maintain service quality under high load.
An everyday analogy is a checkout line: speed up cashiers, add more cashiers, or reduce the number of customers. Similarly, high concurrency can be addressed by:
Increasing processing speed – caching and asynchronous processing.
Adding processing “hands” – multithreading/multiprocessing and scaling.
Reducing incoming traffic – pre‑processing (out of scope).
Increasing Processing Speed
Cache
Caching improves speed by storing frequently accessed data closer to the consumer. Key design considerations are the cache hit rate, the eviction policy (LRU, FIFO, LFU), and placement (in‑process, off‑heap, distributed). Common challenges include cache penetration (repeated lookups for keys that do not exist, also called null‑penetration), cache stampede, hot keys, consistency with the backing store, and read‑write amplification.
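The sketch below shows an in-process cache with LRU eviction and hit/miss counters. It also caches `None` results, which is one simple mitigation for lookups of nonexistent keys. The class name and the `loader` callback are illustrative assumptions. Expiry, thread safety, and stampede protection are deliberately left out.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity, loader):
        self.capacity = capacity
        self._loader = loader          # fetches a value from the backing store
        self._data = OrderedDict()     # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self.hits += 1
            self._data.move_to_end(key)      # mark as most recently used
            return self._data[key]
        self.misses += 1
        value = self._loader(key)            # may be None for absent keys
        self._data[key] = value              # None is cached as well
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict the least recently used
        return value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Cached `None` entries should normally carry a short expiry so that a key created later becomes visible. That detail is omitted here.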
Asynchronous Processing
Asynchrony can be achieved by:
Converting multiple synchronous calls into parallel asynchronous calls (reducing total latency to the max of individual latencies).
Using async I/O provided by frameworks.
Offloading work to message‑queue middleware for later processing, which adds throughput, peak‑shaving, eventual consistency, and decoupling.
Typical queue types include buffer queues, task queues, message queues, request queues, data‑bus queues, priority queues, replica queues, and mirror queues.
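The first approach listed above, turning several independent synchronous calls into parallel asynchronous ones, can be sketched with Python's `asyncio`. The `fetch` coroutine and its latencies are stand-ins for real network calls.

```python
import asyncio
import time


async def fetch(name, latency):
    await asyncio.sleep(latency)   # stands in for a network call
    return name


async def fetch_all():
    # The three calls run concurrently, so total time is close to the
    # slowest call (0.2 s) rather than the sum of all three (0.4 s).
    return await asyncio.gather(
        fetch("user", 0.1),
        fetch("orders", 0.2),
        fetch("inventory", 0.1),
    )


def timed_run():
    start = time.perf_counter()
    results = asyncio.run(fetch_all())
    return results, time.perf_counter() - start
```

`asyncio.gather` returns results in the order the calls were passed in, regardless of which one finishes first. The latency gain applies only when the calls do not depend on each other's results.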
Adding Processing Hands
Multithreading
Thread (or process) pools are widely used in web servers, gateways, RPC services, and queue consumers. Thread count should be calculated based on average processing time, peak concurrency, blocking rate, acceptable response time, and CPU cores.
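Two widely cited heuristics for a starting thread count are sketched below. Neither comes from the source article. The first scales the core count by the ratio of waiting time to computing time. The second applies Little's law: concurrency equals arrival rate times average processing time. Function and parameter names are illustrative, and the results are starting points to validate under load testing.

```python
import math


def pool_size_by_cpu(cores, wait_time, compute_time, target_utilization=1.0):
    """threads = cores * utilization * (1 + wait / compute).
    wait_time and compute_time only need to share the same unit."""
    size = cores * target_utilization * (1 + wait_time / compute_time)
    return max(1, math.ceil(size))


def pool_size_by_load(peak_qps, avg_processing_seconds):
    """Little's law: requests in flight = arrival rate * time in system."""
    return max(1, math.ceil(peak_qps * avg_processing_seconds))
```

For example, on 8 cores where a request spends 90 ms blocked and 10 ms computing, the first heuristic suggests 80 threads. A peak of 200 requests per second at 0.25 s each implies about 50 requests in flight.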
Scaling
Scaling can be vertical (scale‑up) or horizontal (scale‑out). Stateless services scale horizontally simply by adding machines. Stateful services scale through sharding or replication, and replicated state typically relies on consensus algorithms such as Paxos or Raft to stay consistent.
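For sharding, one common routing technique is consistent hashing, which the source does not name but which illustrates the goal well: when a node is added, only a fraction of the keys move. The sketch below is a minimal ring with virtual nodes. The class name, the `vnodes` default, and the choice of SHA-256 are assumptions.

```python
import bisect
import hashlib


class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._hashes = []    # sorted positions on the ring
        self._owners = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node, vnodes=100):
        # Each node owns many virtual positions to even out the load.
        for i in range(vnodes):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self._hashes, h)
            self._owners[h] = node

    def get_node(self, key):
        # The first ring position clockwise from the key's hash owns the key;
        # the modulo wraps around past the end of the ring.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]
```

With plain `hash(key) % node_count` routing, by contrast, adding a node remaps most keys, which is why resharding stateful services is expensive.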
Source: http://kriszhang.com/high_performance/
21CTO