Why Xi'an’s One‑Code Pass Crashed: Analyzing System Overload and Scaling Fixes
On December 20, the Xi'an health‑code app "One‑Code Pass" suffered a major outage when a sudden traffic surge overwhelmed the query‑heavy service, exposing network bottlenecks and a lack of scaling mechanisms. This article analyzes the incident and proposes architectural remedies.
1
On December 20, Xi'an’s health‑code system "One‑Code Pass" collapsed, leaving users unable to scan codes for over 15 hours and causing massive disruption for commuters and travelers.
2
Product analysis
The original version displayed personal name, ID, and a green/yellow/red health code after a single query. This simple design required only one SQL query, but later revisions added vaccination and nucleic‑acid test information, increasing the number of queries to at least three.
The service is read‑heavy: over 90% of requests are queries. With Xi'an’s 13 million residents, even a 10% simultaneous scan rate would generate about 1.3 million concurrent requests, each fanning out into at least three backend queries.
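The estimate above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the population, the 10% scan rate, and the three queries per scan come from the text, while the function name `estimate_load` is made up for illustration.

```python
# Back-of-envelope load estimate using the figures quoted in the article.
RESIDENTS = 13_000_000   # Xi'an population, as stated above
SCAN_PERCENT = 10        # share of residents scanning at the same moment
QUERIES_PER_SCAN = 3     # "at least three" queries after the later revisions


def estimate_load(residents: int, scan_percent: int, queries_per_scan: int):
    """Return (concurrent requests, backend queries those requests trigger)."""
    concurrent = residents * scan_percent // 100
    return concurrent, concurrent * queries_per_scan


concurrent_requests, backend_queries = estimate_load(
    RESIDENTS, SCAN_PERCENT, QUERIES_PER_SCAN
)
print(concurrent_requests)  # 1300000
print(backend_queries)      # 3900000
```

So the "one million" figure is, if anything, on the low side, and the query fan-out multiplies whatever reaches the database.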
3
Technical analysis
At around 07:40 on December 20, the "One‑Code Pass" user traffic surged to more than ten times the usual peak, causing network congestion and making the application unusable. The backend and database were normal; the problem was identified on the network interface side.
The official response suggested users avoid unnecessary scans, but the root cause was not a DNS error or simple bandwidth saturation.
4
Personal analysis
The system experienced classic overload: request volume exceeded the server’s processing capacity. There are two primary remedies: rate limiting and scaling.
Rate limiting blocks excess traffic (e.g., via Nginx) while scaling adds servers or expands database capacity. The incident showed that the team chose a rollback rather than immediate scaling, indicating a lack of proper capacity planning.
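To make the rate-limiting idea concrete, here is a minimal token-bucket sketch in Python. It is not what One‑Code Pass or Nginx actually runs; the class name `TokenBucket` and its parameters are assumptions chosen for illustration. The bucket refills at a fixed rate, and a request is admitted only if a token is available, so bursts beyond the capacity are rejected instead of reaching the backend.

```python
import time


class TokenBucket:
    """Illustrative token-bucket limiter (hypothetical, single-threaded)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be shed."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never beyond the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A rejected request would typically get a fast "busy, try again" response, which is far cheaper to serve than a full health-code lookup.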
5
Ideal solution
1. Read‑write separation & caching: Split the system into a read‑only service handling the bulk of queries and a separate service for updates (vaccination, test results). Cache frequently accessed data to protect the database.
2. Sharding & service decomposition: Partition data by user ID (e.g., 64 tables or micro‑services) and distribute traffic across them, reducing load on any single instance.
3. Big‑data & disaster recovery: Use an asynchronous pipeline to sync data into a NoSQL table for fast reads, and deploy services across multiple data centers with failover mechanisms.
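The sharding step in item 2 can be sketched as a routing function that maps a user ID to one of 64 partitions. This is an assumption-laden illustration: the helper names `shard_for` and `table_for` and the `health_code_NN` table naming are hypothetical, and a stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings.

```python
import hashlib

NUM_SHARDS = 64  # matches the "64 tables" example in the list above


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def table_for(user_id: str) -> str:
    """Return the (hypothetical) table name holding this user's record."""
    return f"health_code_{shard_for(user_id):02d}"
```

Because the mapping depends only on the ID, every read and write for a given user lands on the same partition, and no single table has to absorb the whole city's traffic.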
These steps would allow the system to handle sudden spikes without collapsing.
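As one illustration of how caching protects the database on the read path, the sketch below keeps recently loaded records in memory for a short time. It assumes an in-process dictionary and a caller-supplied `loader` function standing in for the database query; a real deployment would more likely use a shared cache, and the class name `ReadThroughCache` is hypothetical.

```python
import time


class ReadThroughCache:
    """Illustrative TTL cache placed in front of a slow lookup."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader      # stand-in for the database query
        self.ttl = ttl_seconds    # how long a cached record stays valid
        self.clock = clock        # injectable so expiry can be tested
        self._entries = {}        # key -> (expires_at, value)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit: no database access
        value = self.loader(key)                 # miss or expired: load once
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Drop a record, e.g. after the update service writes a new test result."""
        self._entries.pop(key, None)
```

With even a short TTL, repeated scans by the same user within that window are served from memory, and the update service can call `invalidate` when a record changes.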
6
Conclusion
The crash was a human‑error issue: the system was put into production without rigorous testing, and its architecture did not anticipate rapid scaling. Proper capacity planning, modular design, and robust disaster‑recovery strategies are essential to prevent similar failures.
21CTO
