
Why Xi'an’s One‑Code Pass Crashed: Analyzing System Overload and Scaling Fixes

On December 20, the Xi'an health‑code app "One‑Code Pass" suffered a major outage when a sudden traffic surge overwhelmed its query‑heavy backend. The incident exposed network bottlenecks and a lack of scaling mechanisms; this article analyzes the failure and proposes architectural remedies.

21CTO

1

On December 20, Xi'an’s health‑code system "One‑Code Pass" collapsed, leaving users unable to scan codes for over 15 hours and causing massive disruption for commuters and travelers.

2

Product analysis

The original version displayed personal name, ID, and a green/yellow/red health code after a single query. This simple design required only one SQL query, but later revisions added vaccination and nucleic‑acid test information, increasing the number of queries to at least three.

The service is read‑heavy: over 90% of requests are queries. With Xi'an’s 13 million residents, even a 10% simultaneous scan rate would generate roughly 1.3 million concurrent requests, each fanning out into at least three database queries.
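The estimate can be reproduced with simple arithmetic. The sketch below uses the population and scan rate quoted above and assumes three queries per scan, the minimum given in the product analysis. The function name `estimate_load` is illustrative only.

```python
# Back-of-envelope load estimate using the figures quoted in the article.
RESIDENTS = 13_000_000     # Xi'an population cited above
SCAN_FRACTION = 0.10       # share of residents scanning at the same moment
QUERIES_PER_SCAN = 3       # assumed: health code + vaccination + nucleic-acid test


def estimate_load(residents: int, scan_fraction: float, queries_per_scan: int) -> tuple[int, int]:
    """Return (concurrent requests, resulting database queries)."""
    requests = round(residents * scan_fraction)
    return requests, requests * queries_per_scan


requests, db_queries = estimate_load(RESIDENTS, SCAN_FRACTION, QUERIES_PER_SCAN)
print(f"{requests:,} concurrent requests -> {db_queries:,} database queries")
# prints: 1,300,000 concurrent requests -> 3,900,000 database queries
```

Under these assumptions the database sees several million queries for a single wave of scans, which is why the later revisions that added extra lookups matter for capacity.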

3

Technical analysis

At around 07:40 on December 20, "One‑Code Pass" user traffic surged to more than ten times the usual peak, causing network congestion and making the application unusable. The backend services and database were reported to be operating normally; the problem was traced to the network interface side.

The official response suggested users avoid unnecessary scans, but the root cause was not a DNS error or simple bandwidth saturation.

4

Personal analysis

The system experienced classic overload: request volume exceeded the servers’ processing capacity. Two primary remedies exist: rate limiting and scaling.

Rate limiting blocks excess traffic (e.g., via Nginx) while scaling adds servers or expands database capacity. The incident showed that the team chose a rollback rather than immediate scaling, indicating a lack of proper capacity planning.
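To make the rate-limiting idea concrete, here is a minimal token-bucket sketch. It is an application-level illustration, not a description of how "One‑Code Pass" or Nginx is actually configured; in practice this logic would more likely sit at the proxy layer. The class name `TokenBucket` and its parameters are assumptions made for the example.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject the request, e.g. with HTTP 429
```

A request handler would call `allow()` first and return an error immediately when it is `False`, so excess traffic is shed cheaply instead of queuing up behind the database.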

5

Ideal solution

1. Read‑write separation & caching: Split the system into a read‑only service handling the bulk of queries and a separate service for updates (vaccination, test results). Cache frequently accessed data to protect the database.

2. Sharding & service decomposition: Partition data by user ID (e.g., 64 tables or micro‑services) and distribute traffic across them, reducing load on any single instance.

3. Big‑data & disaster recovery: Use an asynchronous pipeline to sync data into a NoSQL table for fast reads, and deploy services across multiple data centers with failover mechanisms.
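The caching part of step 1 can be sketched as a read-through cache with a time-to-live. This is a minimal in-process illustration under stated assumptions: a real deployment would more likely use a shared cache service, and the names `ReadThroughCache`, `loader`, and `ttl_seconds` are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serve repeated reads from memory; hit the database only on a miss or expiry."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float, clock=time.monotonic):
        self.loader = loader        # function that queries the database
        self.ttl = ttl_seconds      # how long a cached record stays valid
        self.clock = clock          # injectable for testing
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, user_id: str) -> dict:
        now = self.clock()
        hit = self._entries.get(user_id)
        if hit is not None and hit[0] > now:
            return hit[1]                       # fresh entry: no database access
        value = self.loader(user_id)            # miss or expired: load once
        self._entries[user_id] = (now + self.ttl, value)
        return value

    def invalidate(self, user_id: str) -> None:
        """Called by the write path when a user's record changes."""
        self._entries.pop(user_id, None)
```

The write service would call `invalidate()` after updating a vaccination or test record, which keeps the read path fast without serving stale results for longer than the TTL.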

These steps would allow the system to handle sudden spikes without collapsing.
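As an illustration of the sharding step, the sketch below routes a user ID to one of 64 tables. The shard count follows the example in the list above; the table prefix `health_code_` and the choice of SHA-256 are assumptions for the example, not details from the actual system. A stable hash is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib

SHARD_COUNT = 64  # matches the "64 tables" example; any fixed count works


def shard_for(user_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a user ID to a shard index that is stable across processes and restarts."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def table_for(user_id: str) -> str:
    """Hypothetical table naming scheme: health_code_00 .. health_code_63."""
    return f"health_code_{shard_for(user_id):02d}"
```

Because every instance computes the same shard for the same user, requests can be spread across tables or services without a central lookup. Changing the shard count later would require migrating data, so the count should be chosen with headroom.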

6

Conclusion

The crash ultimately came down to human error: the system went into production without rigorous load testing, and its architecture did not anticipate the need to scale quickly. Proper capacity planning, modular design, and robust disaster‑recovery strategies are essential to prevent similar failures.
