Why Did Xi’an’s Health‑Code App Crash? A Deep Dive into the Failure

The article analyzes the Xi’an “Yima Tong” health‑code system outage, detailing the symptoms, root‑cause factors such as rate‑limiting gaps, server overload, architectural coupling, and ISP differences, and then offers short‑term, long‑term, design, high‑availability, and testing recommendations to prevent future crashes.

Programmer DD

Dec 22, 2021

Problem Description

1. The health code page was blank after scanning the QR code.

2. Some users saw 502 Bad Gateway errors.

3. Nucleic acid test reports were not displayed.

4. During recovery, users on China Telecom’s network could open the health code while users on China Mobile’s network could not.

Root Cause Analysis

Main Issues

Rate limiting: Users kept refreshing the page, and every refresh added load. Nothing throttled these repeated requests, which points to missing rate limiting.

Server overload: Peak traffic exceeded server capacity, and the service crashed.

Architecture: Modules appear to be tightly coupled, and the system may not be built on micro‑services, so a failure in one module can take others down with it.

Performance bottlenecks: The database or the network became a bottleneck under load.

Scenario problems: Large data queries monopolized resources, and peak‑hour traffic spikes overwhelmed the database.

Design flaws: There was apparently no high‑concurrency or stress testing before release.
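
The rate limiting gap described above is commonly closed with a token bucket per user. The following is a minimal sketch, not the Yima Tong implementation: the class names, the per-user keying, and the rate and capacity values are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerUserLimiter:
    """One bucket per user ID (hypothetical keying; an IP or device ID would also work)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, user_id):
        bucket = self.buckets.get(user_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[user_id] = bucket
        return bucket.allow()
```

With `rate=1.0` and `capacity=2`, a user who refreshes three times in a row gets two responses and one rejection, while other users are unaffected. A rejected request would typically receive an HTTP 429 or a "please wait" page instead of reaching the backend.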

Other Issues

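The backend servers behind nginx likely crashed under high concurrency, which would explain the 502 Bad Gateway responses. Cache breakdown (a hot cache entry expiring so that every request hits the database at once) is a possible trigger.

The load balancer was overloaded; without dynamic DNS to spread traffic, a single machine’s network card may have been saturated.

Different ISPs resolve DNS along different paths, which caused inconsistent access (Telecom vs. Mobile).

Disaster recovery and fault isolation were insufficient; recovery took more than 12 hours.

The hardware load balancer (F5) may have failed, and there was no gateway‑level rate limiting.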
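
Cache breakdown, mentioned in the list above, can be mitigated by letting only one request reload an expired hot key while the others wait for its result. Here is a rough sketch of that idea using per-key locks; `SingleFlightCache`, `slow_lookup`, and the `demo` helper are hypothetical names, and a production cache would also need expiry and eviction.

```python
import threading
import time

_MISSING = object()


class SingleFlightCache:
    """Cache in which concurrent misses on one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader            # hypothetical slow backend lookup
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()   # protects _key_locks

    def get(self, key):
        value = self._values.get(key, _MISSING)
        if value is not _MISSING:        # fast path: cache hit
            return value
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check after acquiring the lock: another thread may have
            # finished loading this key while we were waiting.
            value = self._values.get(key, _MISSING)
            if value is _MISSING:
                value = self._loader(key)
                self._values[key] = value
            return value

    def expire(self, key):
        """Drop a key, as a TTL expiry would."""
        self._values.pop(key, None)


def demo(threads=20):
    """Fire concurrent requests for one key and count backend loads."""
    loads = []

    def slow_lookup(key):
        loads.append(key)
        time.sleep(0.05)                 # simulate a slow database query
        return f"record-for-{key}"

    cache = SingleFlightCache(slow_lookup)
    results = []
    workers = [
        threading.Thread(target=lambda: results.append(cache.get("user-1")))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return len(loads), results
```

In `demo`, all threads receive the same record but the simulated database is queried only once. Without the per-key lock, every thread that arrived during the 50 ms load would have issued its own query.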

Solution Recommendations

Product Suggestions

Isolate business modules with high coupling into independent services.

System Suggestions

Short‑term

Page optimization with friendly waiting messages.

Implement request debouncing and caching (e.g., 24‑hour cache for nucleic‑acid results).

Separate critical rendering data from non‑critical, using aggregated APIs.

Merge requests, reduce concurrent calls.

Compress transferred data to lower latency.

Decouple asynchronous requests.
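
The 24‑hour cache for nucleic‑acid results suggested in the list above could look roughly like this. It is a sketch under assumptions: `TTLCache`, `get_test_report`, and `fetch_report` are illustrative names, and the cache is an in-process dictionary rather than a shared store such as Redis.

```python
import time

DAY_SECONDS = 24 * 60 * 60


class TTLCache:
    """Stores values with a fixed time to live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock           # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]   # expired: treat as a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)


def get_test_report(user_id, cache, fetch_report):
    """Return the cached report, calling the backend only on a miss."""
    report = cache.get(user_id)
    if report is None:
        report = fetch_report(user_id)   # hypothetical backend call
        cache.put(user_id, report)
    return report
```

Repeated refreshes by the same user within 24 hours are then served from the cache. One design caveat: if a newer test result arrives within that window, the entry should be overwritten or dropped, otherwise the user keeps seeing the old report.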

Long‑term

Business abstraction and module isolation for high cohesion, low coupling.

Simplify data models so that responses return quickly.

Interface segregation with single‑responsibility services.

Adopt micro‑frontend architecture for independent deployment.

Component library reuse to shrink project size.

Build a middle platform (a shared service layer) so that clients do not call backend systems directly.

Apply diff algorithm on the presentation layer to avoid unnecessary renders.

Implement crash alerts for rapid response.
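
The diff idea in the list above can be illustrated with a shallow comparison of the previous and current view state, so that only changed fields are re-rendered. This is a simplified sketch: real front-end frameworks diff a component tree rather than a flat dictionary, and the field names are invented.

```python
def changed_fields(previous, current):
    """Return the fields whose values differ between two view states.

    Keys that were removed appear in the result with the value None.
    """
    keys = set(previous) | set(current)
    return {k: current.get(k) for k in keys if previous.get(k) != current.get(k)}
```

If `changed_fields` returns an empty dictionary, the page can skip rendering entirely; otherwise only the listed fields need to be updated.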

System Design Suggestions

Architecture: Move to micro‑services, service mesh, cloud‑native elasticity (K8s), preferably on private cloud.

Middleware: Use TiDB for distributed SQL, Redis Cluster for high‑availability caching.

Tiered Management: Prioritize critical services on better hardware, isolate them from less critical ones.

CDN Caching: Cache static resources to reduce backend load.

Security: Harden data‑center according to banking standards, close unused ports.

Alerting: Monitor availability, disk, CPU, memory with timely alerts.

Network Availability: Monitor multi‑region connectivity.
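
As a small illustration of the alerting item above, the check below compares collected metrics against thresholds. The metric names and threshold values are assumptions for the example, not recommended settings; a real deployment would rely on a monitoring system rather than hand-written checks.

```python
# Hypothetical thresholds, in percent of capacity.
THRESHOLDS = {"cpu_percent": 85.0, "memory_percent": 90.0, "disk_percent": 80.0}


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return one alert message per metric that is missing or over its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that stopped reporting is itself worth an alert.
            alerts.append(f"{name}: no data")
        elif value >= limit:
            alerts.append(f"{name}: {value:.1f} >= {limit:.1f}")
    return alerts
```

Treating a missing metric as an alert matters during an outage, because an overloaded host often stops reporting before it stops serving.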

High‑Availability Design

Service and data redundancy (read/write separation, possible ClickHouse for read‑heavy queries).

Robust load balancing (LVS + nginx + dynamic DNS, or hardware LB like F5).

Hot data caching with consistency strategies.

Graceful degradation and throttling for non‑critical paths.

Asynchronous processing and fast‑fail mechanisms.

DNS‑level load balancing, with keepalived providing virtual‑IP failover for the load balancers themselves.
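
Graceful degradation and fast‑fail from the list above are often combined in a circuit breaker: after repeated failures, calls stop reaching the struggling backend and a fallback is returned immediately. Below is a minimal sketch; the class name, the threshold, and the reset interval are assumptions for illustration.

```python
import time


class CircuitBreaker:
    """Fails fast with a fallback once a dependency keeps failing."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # open: skip the backend entirely
            # Otherwise half-open: let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

For a non‑critical path such as the test report panel, `fallback` might return a cached value or a "temporarily unavailable" placeholder, so the health code itself still renders while the backend recovers.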

Testing Suggestions

Add automated high‑concurrency stress tests before each release.

Conduct regular disaster‑recovery drills.
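
To make the stress-testing suggestion concrete, here is a small harness that calls a handler concurrently and reports the error rate and 95th-percentile latency. It is only a sketch: `run_load` and the `handler` argument are hypothetical, and a real test would drive the deployed service over HTTP with a dedicated tool such as JMeter or Locust.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(handler, total_requests, concurrency):
    """Call handler(i) for each request index using `concurrency` worker threads."""

    def one_request(i):
        start = time.perf_counter()
        try:
            handler(i)
            ok = True
        except Exception:
            ok = False                   # count any exception as a failed request
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))

    latencies = sorted(latency for _, latency in results)
    errors = sum(1 for ok, _ in results if not ok)
    # Index of the 95th percentile, clamped to the last element.
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "requests": total_requests,
        "errors": errors,
        "error_rate": errors / total_requests,
        "p95_seconds": latencies[p95_index],
    }
```

A release gate could then fail the build when `error_rate` or `p95_seconds` exceeds an agreed limit at the expected peak concurrency.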

These analyses and recommendations aim to improve the stability and scalability of the “Xi’an Yima Tong” health‑code service.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
