Common Architectural Design Risks and Mitigation Strategies for System Stability
This article analyses fifteen typical architectural design risks—such as duplicate interactions, high‑frequency calls, redundant requests, non‑reentrant interfaces, unreasonable timeouts, retry misconfigurations, IP direct‑connect, cross‑datacenter calls, weak/strong dependencies, third‑party reliance, cache penetration, cache avalanche, and coupling issues—explaining their definitions, impacts, detection methods, and concrete mitigation measures with real‑world Baidu cases to help engineers improve system stability.
Background
The online failures shown in the figure often recur in product lines that have suffered similar issues before. Such faults are hard to catch in offline testing and even in online verification, yet they erupt on supposedly safe "no-change days" and severely degrade system stability metrics.
Case‑by‑case analysis is inefficient and cannot systematically eliminate blind spots in stability testing. The author clusters problems, proposes generic testing methods, validates them across multiple systems, and shares the experience for others to adopt.
System availability is measured as MTBF/(MTBF+MTTR). The table below illustrates how five-nines availability allows only about 25 seconds of downtime per month, while three-nines availability still permits over 40 minutes, underscoring the need for rigorous risk control.
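As a quick sanity check of the figures quoted above, the downtime budget at a given availability level is simply (1 - A) times the period length. A minimal sketch, assuming a 30-day month:

```python
# Monthly downtime budget implied by an availability target (30-day month).
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 s

def allowed_downtime_seconds(availability: float) -> float:
    """Availability = MTBF / (MTBF + MTTR); downtime budget = (1 - A) * period."""
    return (1 - availability) * SECONDS_PER_MONTH

print(round(allowed_downtime_seconds(0.99999), 1))       # five nines: ~25.9 s
print(round(allowed_downtime_seconds(0.999) / 60, 1))    # three nines: ~43.2 min
```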
Risk types are grouped by lifecycle stage: architecture design risk, coding risk, security risk, process‑norm risk, operation risk, and monitoring risk.
Architecture Design Risks
Architecture design risks are easily overlooked during early development. Early detection reduces later maintenance cost, while design flaws can affect entire modules and incur huge repair costs.
Typical dimensions: interaction, dependency, coupling.
Interaction risks include duplicate interactions, high‑frequency calls, redundant/unused interactions, non‑reentrant interfaces, unreasonable timeout settings, and improper retry configurations.
Dependency risks cover unreasonable strong/weak dependencies and invalid dependencies.
Coupling risks involve unreasonable architecture or cache coupling.
1. Duplicate Interaction
Risk Definition: The system issues multiple identical network calls within a single business request, either across the whole request or within the same layer.
Impact: Increases interface latency, reduces performance, and multiplies downstream pressure.
Identification: If two calls within the same request target the same service (e.g., MySQL, Redis) and carry identical request data and responses, they are duplicates. Trace systems can automatically flag such patterns.
Mitigation: Cache the result of the first query when real‑time requirements allow.
Real‑world case 1: 11 duplicate session requests caused massive traffic spikes and degraded session service.
Real‑world case 2: Repeated MySQL/Redis calls accelerated performance bottlenecks.
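The mitigation above, caching the first result for the remainder of the request, can be sketched with a hypothetical per-request context (`RequestContext` and `fetch_fn` are illustrative names, not a specific framework API):

```python
class RequestContext:
    """Per-request cache so identical downstream queries run only once."""
    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # the real network/DB call
        self._cache = {}
        self.call_count = 0      # how many times the backend was actually hit

    def get(self, key):
        if key not in self._cache:
            self.call_count += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]

# Stand-in for a session-service lookup.
ctx = RequestContext(fetch_fn=lambda k: {"session": k})
for _ in range(11):           # the 11 duplicate session requests from case 1...
    ctx.get("session:42")
print(ctx.call_count)         # ...collapse into a single downstream call: 1
```

This only applies when the business can tolerate reading a value fetched earlier in the same request, i.e., when real-time freshness within the request is not required.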
2. High‑Frequency Interaction
Risk Definition: The number of interactions depends on the size of upstream data; loops that trigger network calls cause a surge in downstream load.
Impact: Excessive loops amplify downstream pressure, destabilize interfaces, and can cause system snowballing under large data volumes.
Identification: Use trace data to detect loops where each array element triggers a network request; set sensible limits on data size.
Mitigation: Impose data size caps and replace per‑item calls with batch requests.
Real‑world case: A merchant material query generated 156 DB calls, leading to >3 seconds latency.
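Both mitigations, capping data size and batching, can be combined in one helper. A minimal sketch; `query_fn` stands in for a hypothetical bulk endpoint (e.g., `SELECT ... WHERE id IN (...)`), and the cap of 1000 is an illustrative limit:

```python
def fetch_materials_batch(ids, batch_size=50, query_fn=None):
    """Replace one-call-per-item loops with capped batch requests."""
    if len(ids) > 1000:                         # hard cap on upstream data size
        raise ValueError("request exceeds allowed data size")
    results = []
    for i in range(0, len(ids), batch_size):    # one round trip per chunk
        results.extend(query_fn(ids[i:i + batch_size]))
    return results

calls = []
def fake_bulk_query(chunk):
    calls.append(chunk)
    return [{"id": x} for x in chunk]

rows = fetch_materials_batch(list(range(156)), batch_size=50, query_fn=fake_bulk_query)
print(len(calls))   # the 156 per-item DB calls from the case become 4 batches
```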
3. Redundant/Unused Interaction
Risk Definition: The system continues to issue downstream interactions even after the data they depend on has failed or come back empty, so the calls serve no purpose.
Impact: Wastes resources and degrades performance.
Identification: If interaction A depends on data B and B is abnormal (null/empty) yet A still executes, it is redundant.
Mitigation: Add guard logic to skip interactions when dependent data is invalid.
Real‑world case: A session list was empty, yet the system still queried marketing discounts, unnecessarily loading the service.
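The guard logic is a one-line check before the dependent call. A sketch of the session/discount case above, with hypothetical function names:

```python
def build_order_page(sessions, fetch_discounts):
    """Skip downstream calls when the data they depend on is empty."""
    if not sessions:          # guard: empty session list means no discount query
        return {"sessions": [], "discounts": []}
    return {"sessions": sessions, "discounts": fetch_discounts(sessions)}

called = []
page = build_order_page([], fetch_discounts=lambda s: called.append(s))
print(called)   # [] -- the marketing service was never hit
```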
4. Non‑Reentrant Interface
Risk Definition: The same request processed multiple times may yield inconsistent results or duplicate writes.
Identification: Record‑replay tools can verify idempotency.
Mitigation: Front‑end debounce, interface‑level locks, and database unique constraints.
Real‑world case: A merchant card recharge lacked idempotency; repeated NMQ retries caused multiple credits.
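One common way to make an interface like the recharge endpoint idempotent is to deduplicate on a unique request ID. A minimal in-memory sketch; in production the dedup set would be a database unique constraint or an atomic Redis operation, as the mitigation list suggests:

```python
class RechargeService:
    """Idempotent credit: a unique request_id ensures retries apply at most once."""
    def __init__(self):
        self.balance = 0
        self._seen = set()    # stand-in for a DB unique key on request_id

    def recharge(self, request_id: str, amount: int) -> bool:
        if request_id in self._seen:   # duplicate delivery (e.g., an MQ retry)
            return False
        self._seen.add(request_id)
        self.balance += amount
        return True

svc = RechargeService()
for _ in range(3):                     # three redeliveries of the same request
    svc.recharge("req-001", 100)
print(svc.balance)                     # credited exactly once: 100
```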
5. Unreasonable Timeout Settings
Risk Definition: Timeout values are not aligned with actual service performance.
Impact: Overly long timeouts cause hanging connections; downstream timeouts longer than upstream lead to cascading delays and possible snowballing.
Identification: Look for timeout configurations that exceed typical service latency (e.g., DB connect 1 s, read 5 s).
Mitigation: Set downstream timeout < upstream timeout, tune values based on real measurements.
Real‑world case: Redis timeout set to >2 s caused thread blockage and system crash.
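The rule "downstream timeout < upstream timeout" can be enforced mechanically by deriving each downstream timeout from the caller's remaining deadline budget. A sketch under illustrative numbers (the 0.5 s per-call cap and 1 s budget are assumptions, not values from the case):

```python
import time

def call_with_deadline(deadline, rpc_fn, default=None):
    """Derive each downstream timeout from the remaining upstream budget,
    so a chain of calls can never exceed what the caller allows."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        return default                              # budget exhausted: degrade, don't hang
    return rpc_fn(timeout=min(remaining, 0.5))      # also cap each individual call

deadline = time.monotonic() + 1.0                   # upstream allows 1 s in total
timeouts = []
def fake_rpc(timeout):
    timeouts.append(timeout)
    time.sleep(0.4)                                 # simulate a slow downstream
    return "ok"

for _ in range(5):
    call_with_deadline(deadline, fake_rpc)
print(all(t <= 0.5 for t in timeouts))              # every timeout fits the budget: True
```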
6. Improper Retry Configuration
Risk Definition: Retry count or interval does not reflect system capacity.
Impact: Too few retries cause business failures; too many amplify load and can trigger snowballing.
Identification: Inspect framework retry settings and code‑level limits.
Mitigation: Align retry numbers with business needs, avoid retries for weak dependencies, and cap retries for high‑cost operations.
Real‑world case: Uniform 3‑retry policy across all upstream‑downstream calls caused a 27× QPS surge during a BS hang.
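The 27x surge in the case is what uncapped retries do across layers: three retries at each of three hops multiplies to 3^3 = 27 requests per original call. A minimal sketch of a bounded retry helper with jittered exponential backoff (the attempt count and delays are illustrative defaults):

```python
import random
import time

def call_with_retry(rpc_fn, max_attempts=2, base_delay=0.05):
    """Bounded retries with jittered exponential backoff.
    max_attempts stays small so a hung downstream cannot multiply its own load."""
    for attempt in range(max_attempts):
        try:
            return rpc_fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                       # budget spent: fail fast upstream
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 2:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retry(flaky)
print(result)   # succeeds on the second attempt, never exceeds the cap: ok
```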
7. Direct IP Connection
Risk Definition: Services connect to each other using hard‑coded IP lists.
Impact: When a single node fails, traffic cannot be failed over automatically, which widens the outage scope.
Mitigation: Use service discovery (BNS) or group‑based connections.
Real‑world case: Redis proxy IP list required manual traffic cut‑over; recovery took hours.
8. Cross‑Datacenter Requests
Risk Definition: Modules deployed in different data centers communicate directly.
Impact: Network latency degrades performance and success rate, harming stability.
Identification: Check for mis-configured service tags or IDC mismatches that cause traffic to fall back to default, cross-datacenter routes.
Mitigation: Verify configuration correctness and perform traffic‑switch drills.
9. Unreasonable Strong/Weak Dependencies
Risk Definition: Services are incorrectly classified as strong or weak dependencies.
Impact: Treating an unstable service as strong makes the whole chain fragile.
Mitigation: Re‑classify dependencies based on business criticality and add fallback or degradation logic.
Real‑world case: Cloud push was a strong dependency; its failure blocked order processing and caused massive NMQ retries.
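Reclassifying cloud push as a weak dependency means its failure is logged and degraded rather than propagated. A sketch of that fallback pattern, with hypothetical function names standing in for the order store and push service:

```python
def submit_order(save_order, push_notify, logger=print):
    """Treat cloud push as a weak dependency: its failure must not block orders."""
    order_id = save_order()            # strong dependency: must succeed
    try:
        push_notify(order_id)          # weak dependency: best-effort only
    except Exception as exc:
        logger(f"push failed for {order_id}, degraded: {exc}")  # log, don't retry
    return order_id

def broken_push(order_id):
    raise TimeoutError("push service down")

oid = submit_order(save_order=lambda: "order-1", push_notify=broken_push)
print(oid)   # the order still completes: order-1
```

Note this also removes the retry amplification from the case: a failed weak dependency is not re-queued, so a push outage no longer floods the message queue.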
10. Invalid Dependency Interference
Risk Definition: A service establishes a connection that is never used in the business flow.
Impact: Unnecessary dependency adds instability.
Mitigation: Remove such connections after code review.
11. Third‑Party Dependency
Risk Definition: External services (internal or external) are required to complete a request.
Impact: Their instability directly propagates to the host service.
Mitigation: Avoid strong third‑party reliance, set appropriate timeouts/retries, and implement circuit‑breaker and degradation.
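The circuit-breaker mitigation can be sketched in a few lines: after repeated failures, stop calling the third party for a cooldown period and serve a fallback instead. The thresholds below are illustrative, and production systems would normally use a mature breaker library rather than this minimal version:

```python
import time

class CircuitBreaker:
    """Minimal breaker: after `threshold` consecutive failures, reject calls
    for `cooldown` seconds instead of hammering the third-party service."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at and time.monotonic() - self.opened_at < self.cooldown:
            return fallback()                  # circuit open: degrade immediately
        try:
            result = fn()
            self.failures, self.opened_at = 0, None   # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()

cb = CircuitBreaker(threshold=2)
def down():
    raise ConnectionError("third party unavailable")

for _ in range(5):
    cb.call(down, fallback=lambda: "cached")
print(cb.opened_at is not None)   # breaker opened after repeated failures: True
```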
12. Cache Penetration
Risk Definition: Requests for non‑existent keys bypass the cache and hit the backend repeatedly.
Impact: High concurrency can overwhelm the backend.
Mitigation: Cache empty results briefly, use a bitmap to filter impossible keys, and design fallback strategies.
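Caching empty results is the simplest of these defenses: a non-existent key gets a short-lived null marker so repeated lookups stop reaching the backend. A minimal sketch (TTL handling is omitted for brevity, and in practice the null marker would carry a short expiry):

```python
_NULL = object()   # sentinel marking "key known not to exist"

class PenetrationSafeCache:
    """Cache misses too, so bad keys cannot repeatedly punch through."""
    def __init__(self, backend_get):
        self._backend = backend_get
        self._cache = {}
        self.backend_hits = 0      # how often the backend was actually queried

    def get(self, key):
        if key in self._cache:
            val = self._cache[key]
            return None if val is _NULL else val
        self.backend_hits += 1
        val = self._backend(key)
        self._cache[key] = _NULL if val is None else val
        return val

cache = PenetrationSafeCache(backend_get=lambda k: None)  # key never exists
for _ in range(100):
    cache.get("missing-key")
print(cache.backend_hits)   # backend queried once, not 100 times: 1
```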
13. Cache Expiration / Avalanche
Risk Definition: Many cache entries expire simultaneously, causing a surge of backend traffic.
Impact: Backend overload and possible system crash.
Mitigation: Stagger TTLs, use locking or queues for cache rebuild, and employ a two‑level cache.
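Staggering TTLs is usually done by adding random jitter at write time, so entries populated together do not all expire together and stampede the backend. A sketch with an illustrative 20% spread:

```python
import random

def jittered_ttl(base_ttl: float, spread: float = 0.2) -> float:
    """Stagger expirations: random jitter keeps entries written at the same
    moment from expiring at the same moment."""
    return base_ttl * (1 + random.uniform(-spread, spread))

ttls = [jittered_ttl(600) for _ in range(1000)]
print(min(ttls) >= 480 and max(ttls) <= 720)  # expirations spread over 480-720 s: True
```

Jitter addresses the synchronized-expiry trigger; the locking/queue mitigation additionally ensures that when an entry does expire, only one request rebuilds it while the rest wait or serve stale data.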
14. Architecture Coupling
Risk Definition: Tight coupling between modules, interfaces, or message queues.
Impact: Unimportant functions can drag down critical ones.
Mitigation: Separate important/unimportant, real‑time/off‑line, online/offline components; use operational controls when refactoring is costly.
15. Cache Coupling
Risk Definition: Similar to architecture coupling, but at the cache layer: multiple modules or functions improperly share the same cache.
Impact: A failure or overload in the shared cache cascades across every module that depends on it.
Conclusion
The article lists fifteen common architectural design risks and provides definitions, impacts, detection methods, and mitigation strategies with concrete Baidu cases, encouraging engineers to audit their own systems early and improve overall stability.
Baidu Intelligent Testing