How to Build Highly Available Backend APIs: 10 Essential Design Principles
This article explains why high availability is crucial for backend services and outlines ten practical design principles—including dependency control, avoiding single points, load balancing, isolation, rate limiting, circuit breaking, async processing, degradation, gray release, and chaos engineering—to help developers create resilient APIs.
Preface
As a backend developer, creating service interfaces is routine, whether they serve front‑end HTTP requests or other services via RPC. Although the code may look simple, ensuring high availability is far from easy. This article discusses the key considerations for building highly available APIs and welcomes constructive feedback.
What Is High Availability?
In simple terms, high availability means a system keeps serving requests when things go wrong; that is, its ability to handle and mitigate risk.
Why Pursue High Availability?
Development errors can cause online incidents.
System operation depends on CPU, memory, disk, network, etc., any of which may fail.
Failures in user-facing flows such as registration directly hurt the user experience.
Big‑sale events (e.g., Double‑11, 618) can overload order services, hurting GMV.
Other unknown factors.
Therefore, we must design for high availability to cope with these uncontrollable factors.
Key Factors of High Availability
The essence of high availability is the system's capacity to confront and avoid risks. From this perspective, four critical factors shape a highly available interface design: dependency, probability, scope, and time.
Minimize dependent resources.
Keep risk probability low.
Limit the impact scope.
Shorten the impact duration.
Design Principles for Highly Available Interfaces
Based on the above factors, consider the following practical guidelines.
1. Control Dependencies
Reduce dependencies whenever possible and avoid strong coupling.
Less Dependency
For example, if an interface handles only ten requests per minute, querying MySQL directly is sufficient; introducing Redis would waste resources and add complexity for no benefit.
Weak Dependency
When a user-registration service strongly depends on a coupon-issuing service, a failure in the latter makes registration unavailable. Issuing the coupon asynchronously turns this into a weak dependency, so a coupon-service outage no longer blocks registration.
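The weak-dependency idea can be sketched as follows. This is a minimal illustration in Python, not the article's implementation: `register`, `issue_coupon`, and `failed_coupons` are hypothetical names, and the coupon call is hard-coded to fail in order to simulate an outage.

```python
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)
users = {}            # stand-in for the user table
failed_coupons = []   # coupon tasks to retry later


def issue_coupon(user_id):
    # Hypothetical downstream call; it always fails here to simulate an outage.
    raise ConnectionError("coupon service unavailable")


def _issue_coupon_safely(user_id):
    # Runs off the request path; a failure is recorded instead of propagated.
    try:
        issue_coupon(user_id)
    except Exception:
        failed_coupons.append(user_id)


def register(user_id):
    users[user_id] = {"registered": True}            # core path, must succeed
    executor.submit(_issue_coupon_safely, user_id)   # weak dependency
    return True
```

Registration returns as soon as the user record is written; the failed coupon is only noted for a later retry.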
2. Avoid Single Points of Failure
Mitigate single‑point failures through redundancy and backup.
Deploy applications across multiple data centers and machines so that if one server fails, others continue serving.
Retain the previous version after each release to enable quick rollback.
Ensure at least two people understand each business interface for rapid incident response.
Use master‑slave setups for databases and caches like MySQL or Redis.
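On the calling side, redundancy only helps if the client can move to another copy. A minimal failover sketch, assuming each replica is represented by a plain callable (the `call_with_failover` helper is hypothetical):

```python
def call_with_failover(replicas, request):
    """Try each redundant replica in order and return the first success."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            last_error = exc   # this node is down, try the next one
    raise RuntimeError("all replicas failed") from last_error
```

With at least one healthy replica the caller never sees the failure; only when every copy is down does the error surface.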
3. Load Balancing
Distribute risk by spreading traffic across multiple nodes.
For instance, Nginx or JSF load balancers disperse requests to avoid bottlenecks on a single server.
When caching with JIMDB, hotspot keys can overload a shard, causing high CPU usage and timeouts. Interface design should consider data‑store balance and monitor hotspots for dynamic rebalancing.
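Both ideas can be illustrated in a few lines. The sketch below is not how Nginx, JSF, or JIMDB work internally; `RoundRobinBalancer` and `hot_keys` are hypothetical helpers showing round-robin dispatch and a naive hotspot check over a sample of accessed keys.

```python
import itertools
from collections import Counter


class RoundRobinBalancer:
    """Hands out nodes in a fixed rotation so no single node takes all traffic."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def pick(self):
        return next(self._cycle)


def hot_keys(accessed_keys, share_threshold):
    """Return keys whose share of sampled accesses exceeds share_threshold."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    return [key for key, n in counts.items() if n / total > share_threshold]
```

Keys flagged by a check like this are candidates for splitting, local caching, or moving to a less loaded shard.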
4. Resource Isolation
Isolate resources to contain failures.
Physical separation of service deployments prevents a single‑machine or single‑room failure from affecting the whole system.
Sharding databases and tables confines a server crash to the data on that shard instead of bringing down the entire service.
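A minimal sketch of hash-based shard routing, assuming user IDs are strings and the shard count is fixed (`shard_for` is a hypothetical helper):

```python
import hashlib


def shard_for(user_id, shard_count):
    """Map a user ID to a shard index in [0, shard_count) deterministically."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count
```

Because each user always maps to the same shard, losing one shard affects only the users routed to it.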
5. Rate Limiting
Rate limiting protects both the service itself and its downstream dependencies.
The current JSF platform already provides flow‑control capabilities, and custom limits can be added as needed.
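A custom limit is often built on a token bucket. The sketch below is one common approach, not JSF's implementation; the `TokenBucket` class and its parameters are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock            # injectable so the limiter can be tested
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False                   # caller should reject or queue the request
```

A request handler would call `allow()` first and return a "too many requests" response when it yields `False`.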
6. Service Circuit Breaking
Circuit breaking isolates failing downstream services to prevent cascading failures.
When service A calls B, C, and D, a failure in any of them can degrade A. Tools such as Hystrix or DUCC can downgrade these strong dependencies to weak ones.
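The core of a circuit breaker fits in a small class. This is a simplified sketch of the general pattern, not the Hystrix or DUCC API; the `CircuitBreaker` class, its thresholds, and the `fallback` argument are assumptions.

```python
import time


class CircuitBreaker:
    """Fails fast to a fallback after repeated downstream failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None         # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()      # open: skip the downstream call entirely
            # Timeout elapsed: half-open, let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()   # open (or re-open) the circuit
            return fallback()
        self._failures = 0             # success closes the circuit
        self._opened_at = None
        return result
```

While the circuit is open, service A stops waiting on the failing dependency and serves the fallback instead, which keeps its own threads and latency under control.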
7. Asynchronous Processing
Convert synchronous operations to asynchronous ones.
During high‑traffic promotions, user reward requests can be queued via MQ and processed later, reducing load and limiting incident impact.
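The queue-and-process-later flow might look like the sketch below. An in-process `queue.Queue` stands in for a real message queue, and `submit_reward`, `reward_worker`, and `granted` are hypothetical names.

```python
import queue

reward_queue = queue.Queue(maxsize=1000)   # bounded so bursts cannot exhaust memory
granted = []                               # stand-in for the slow reward logic's output


def submit_reward(user_id):
    """Request path: enqueue and return immediately instead of doing the work."""
    try:
        reward_queue.put_nowait(user_id)
        return True
    except queue.Full:
        return False                       # shed load rather than block the caller


def reward_worker():
    """Consumer: drains the queue at its own pace until it sees the None sentinel."""
    while True:
        user_id = reward_queue.get()
        if user_id is None:
            break
        granted.append(user_id)
```

The request path only pays the cost of an enqueue; the worker can run on a separate thread or process and catch up after the traffic peak.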
8. Degradation Plans
Degradation is a post‑incident mitigation that narrows the impact scope.
Critical interfaces should have well‑defined fallback strategies, ensuring non‑core functions can be disabled while core services remain operational.
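A fallback strategy is often wired to a runtime switch. The sketch below assumes a product page whose recommendations are a non-core feature; the `SWITCHES` dictionary and all function names are hypothetical, and a real system would read the switch from a configuration service.

```python
# Hypothetical runtime switches; operators flip these during an incident.
SWITCHES = {"recommendations_enabled": True}


def load_product(product_id):
    return {"id": product_id, "name": "demo product"}   # core data


def load_recommendations(product_id):
    return ["r1", "r2"]                                  # non-core data


def product_page(product_id):
    page = {"product": load_product(product_id), "recommendations": []}
    if SWITCHES["recommendations_enabled"]:
        try:
            page["recommendations"] = load_recommendations(product_id)
        except Exception:
            pass   # non-core failure: keep the empty default, still serve the page
    return page
```

Turning the switch off removes the non-core call entirely, while the core product data is still returned.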
9. Gray Release
Gradual rollout limits risk exposure.
Deploy a new service to a subset of users, collect feedback on performance and stability, then expand or roll back based on results.
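One simple way to pick the subset of users is a stable hash bucket. This is a sketch under the assumption that users are identified by a string ID; `in_gray_release` is a hypothetical helper.

```python
import hashlib


def in_gray_release(user_id, percent):
    """Return True if the user falls inside the first `percent` of 100 hash buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket depends only on the user ID, a user who sees the new version at 10% keeps seeing it as the rollout expands to 50% or 100%, and setting the percentage back to 0 acts as an immediate rollback.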
10. Chaos Engineering
Proactively inject failures to uncover hidden issues.
Complex systems with many dependencies can exhibit butterfly effects. Use a platform such as the Tai Shan chaos-engineering tool to simulate failures and prepare response plans, keeping risk within controllable bounds.
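At the smallest scale, failure injection can be a wrapper around a dependency call. The sketch below is unrelated to how the Tai Shan platform works; `with_fault_injection` and its parameters are assumptions for illustration.

```python
import random


def with_fault_injection(func, failure_rate, rng=random.random):
    """Wrap func so that a fraction `failure_rate` of calls raises an injected error."""

    def wrapper(*args, **kwargs):
        if rng() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a downstream client this way in a test environment shows whether timeouts, circuit breakers, and fallbacks actually behave as planned.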
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
