Why Chengdu’s COVID Testing System Crashed and How to Build Resilient Backend Services

The article analyzes the Chengdu COVID‑19 testing system failure, outlining its architecture, estimating traffic, identifying infrastructure and software bottlenecks, and recommending sharding, message‑queue decoupling, comprehensive monitoring, and multi‑vendor coordination to build a more reliable backend platform.

dbaplus Community

System Boundary Definition

The nucleic‑acid testing system serves medical staff and integrates with the public health‑code platform. Core workflow steps are:

1. Medical staff open the mobile app and input tube codes.

2. Staff scan residents' health‑code QR codes.

3. Throat‑swab specimens are collected.

4. Specimens are delivered to the testing centre.

5. Results are submitted to the nucleic‑acid system, which synchronises the status to the health‑code platform.

During the Chengdu outage the process stalled at steps 1 and 2.

Root‑Cause Analysis

The system is a SaaS solution that is both government‑purchased (ToG) and citizen‑facing (ToC). It is hosted in a government‑cloud datacenter, with Neusoft as the primary integrator and multiple subcontractors. The failure could have originated in any of three layers:

Infrastructure layer: hardware faults or load‑balancer overload in the government‑cloud datacenter.

Network layer: insufficient bandwidth or mis‑configured firewall rules causing packet loss.

Application layer: the testing application’s capacity limits (e.g., thread‑pool size, database throughput) could not sustain peak load.

Application‑Layer Design & Capacity Estimation

Two independent methods were used to estimate peak request rates:

Population‑based estimate: Chengdu has roughly 20 million residents. Assuming all are tested within 6 hours, the average load is ≈3.3 million registrations per hour, or roughly 1,000 requests per second (RPS); peaks run 2‑3× higher, i.e. 2,000‑3,000 RPS.

Testing‑site estimate: roughly 15,000 testing sites (similar to Shanghai). Each site runs two parallel lanes, and each lane completes a registration every 10‑15 seconds, which yields a similar peak of 2,000‑3,000 RPS.
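The arithmetic behind both estimates can be checked with a short script. This is only a back‑of‑envelope sketch using the figures quoted above; the function names are illustrative and not part of the original system.

```python
# Back-of-envelope peak-load estimate using the figures quoted in the article.

def population_based_rps(residents: int, window_hours: float) -> float:
    """Average registrations per second if everyone is tested inside the window."""
    return residents / (window_hours * 3600)


def site_based_rps(sites: int, lanes_per_site: int, seconds_per_registration: float) -> float:
    """Aggregate registrations per second when every lane at every site is busy."""
    return sites * lanes_per_site / seconds_per_registration


avg_rps = population_based_rps(20_000_000, 6)
print(f"population-based average: {avg_rps:.0f} RPS")
print(f"population-based peak (2-3x): {2 * avg_rps:.0f}-{3 * avg_rps:.0f} RPS")

slow = site_based_rps(15_000, 2, 15)  # one registration per lane every 15 s
fast = site_based_rps(15_000, 2, 10)  # one registration per lane every 10 s
print(f"site-based peak: {slow:.0f}-{fast:.0f} RPS")
```

Both methods land in the same range, which is why the article treats 2,000‑3,000 RPS as the design target.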

Although the request rate is modest, the data volume is massive: up to 12 million samples per day, on the order of 300 million records per month. This mandates database sharding and partitioning.
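One common way to combine sharding and partitioning is to hash a stable key to pick the shard and to partition tables by month. The sketch below assumes that approach; the shard count, the `resident_id` key, and the `sample_record_YYYYMM` table naming are all hypothetical, since the article does not say how the real system routes data.

```python
import hashlib
from datetime import date

SHARD_COUNT = 16  # assumed value; size it from measured per-shard write capacity


def shard_for(resident_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a resident ID to a shard with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(resident_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def table_for(sample_date: date) -> str:
    """Monthly table partition, so older months can be archived as cold data."""
    return f"sample_record_{sample_date:%Y%m}"


def route(resident_id: str, sample_date: date) -> tuple[int, str]:
    """Return (shard index, table name) for one sample record."""
    return shard_for(resident_id), table_for(sample_date)


shard, table = route("resident-0001", date(2022, 9, 1))
print(shard, table)
```

With 16 shards, 300 million records per month works out to roughly 19 million rows per shard per month, and month‑based tables make it straightforward to move older data to cheaper storage.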

Typical request flow:

1. Health‑code scan triggers an API‑gateway request.

2. The nucleic‑acid service validates the request, caches site/batch metadata, and assembles the persistence payload.

3. Data is written via a sharding middleware to the appropriate database shard.

To improve resilience, the design proposes inserting a message queue (e.g., Kafka) between step 2 and step 3, decoupling registration from downstream processing and providing traffic smoothing.

Message‑Queue Integration

The service publishes the registration event to the queue and immediately returns a success response to the front end.

Consumers read the event, invoke the sharding middleware, and persist the record.

Consumers also push a status update to the health‑code service (asynchronously marking the test as completed).

Key MQ considerations include exactly‑once processing (in practice, at‑least‑once delivery combined with idempotent writes), high availability, and proper back‑pressure handling.

Monitoring Architecture

Effective monitoring is split into three layers:

Basic operations monitoring: CPU, memory, disk, network I/O, load, TCP connections; alerts based on predefined thresholds.

Application‑level monitoring: request latency (TP99, TP999, AVG, MAX), method‑call counts, error rates, JVM health (heap, GC pause, thread count).

Business monitoring: end‑to‑end workflow health (e.g., registration success rate, scheduled job execution); alerts when business‑critical processes fail.

When the government‑cloud experiences hardware or network issues, the basic‑ops layer surfaces the problem first, enabling rapid remediation.
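As a rough illustration of the application and business layers, the sketch below computes the latency figures named above (TP99, TP999, AVG, MAX) from raw samples and raises threshold alerts. The nearest‑rank percentile definition, the 500 ms TP99 limit, and the 99% success‑rate floor are assumptions chosen for the example, not values from the original system.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # round() strips floating-point noise before ceil() picks the rank
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


def latency_report(latencies_ms):
    """Summarise request latencies the way an application-level dashboard would."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "max": max(latencies_ms),
        "tp99": percentile(latencies_ms, 99),
        "tp999": percentile(latencies_ms, 99.9),
    }


def alerts(report, success_rate, tp99_limit_ms=500, min_success_rate=0.99):
    """Return alert messages for breached thresholds (the limits are assumed values)."""
    fired = []
    if report["tp99"] > tp99_limit_ms:
        fired.append(f"TP99 latency {report['tp99']} ms exceeds {tp99_limit_ms} ms")
    if success_rate < min_success_rate:
        fired.append(f"registration success rate {success_rate:.1%} is below {min_success_rate:.0%}")
    return fired


report = latency_report(list(range(1, 1001)))  # synthetic latencies: 1..1000 ms
for message in alerts(report, success_rate=0.95):
    print(message)
```

Tracking the registration success rate alongside latency matters here because a system can respond quickly while rejecting most requests.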

Multi‑Vendor Coordination Challenges

The system involves the government, the primary integrator (Neusoft), and several subcontractors (telecom, security, etc.). Lack of top‑level coordination leads to fragmented performance testing, where each vendor validates only its own component. This siloed approach made it difficult to detect capacity limits before go‑live.

Recommendations

Adopt a message‑queue + sharding architecture to maximise throughput and avoid data loss.

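Ensure the message queue provides high availability and exactly‑once semantics (e.g., Kafka with idempotent, transactional producers and consumers that read with isolation.level=read_committed). Because the records ultimately land in an external database, the consumer's writes should also be idempotent.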

Design sharding with hot‑cold data separation and consider downstream data‑warehouse pipelines for analytics.

Deploy comprehensive monitoring (infrastructure, application, business) within the government‑cloud environment to detect anomalies early.

Establish a joint performance‑testing framework that includes all vendors, enabling end‑to‑end load testing and capacity planning.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
