How Tencent Scaled Health Code: Cloud‑Native Architecture, Monitoring, and Chaos Engineering
This article reviews how Tencent Health Code handled billions of daily scans by adopting cloud‑native architecture, comprehensive observability, capacity stress testing, chaos engineering, and disciplined change control to sustain high availability and resilience under massive, fluctuating demand.
Reading note: As pandemic control models evolve, the daily active users of health code decline, marking the end of its mission; this article, by Tencent R&D engineer Li Xiongzheng, reviews the technical architecture, observability, and operational safeguards used in health code services.
1. Business Background
After three years of the pandemic, the Omicron wave is waning. Tencent Health Code served residents in more than ten provinces, handling billions of scans and hundreds of billions of page views in support of public health. With DAU now falling, the service is winding down.
2. Technical Architecture
A stable architecture is the product of both upfront design and day‑to‑day operations; the team focused on the following aspects.
1) Choosing appropriate cloud‑native products
The health code requires high availability and concurrency, so Tencent Cloud public and private cloud solutions were adopted. Key challenges included:
Bandwidth capacity – limited public‑cloud egress bandwidth and IDC uplink capacity were offset by fronting the service with DDoS protection, WAF, and ECDN.
Development and deployment efficiency – rapid iteration was enabled by Tencent Cloud TCB, a one‑stop cloud‑native platform that streamlines project creation and integration.
Cloud resource cost – serverless functions (SCF) and multi‑AZ designs reduced idle costs while providing cross‑AZ disaster recovery.
2) Three‑dimensional monitoring design
A comprehensive monitoring system provides early warnings and post‑incident analysis, helping SREs intervene before failures expand.
3) Stress testing, chaos engineering, and emergency drills
Regular pressure tests, chaos experiments, and incident drills verify system resilience and team response quality.
3. Observability System
Observability is essential for rapid fault detection and root‑cause analysis. Health code leverages Tencent Cloud WAF, Intelligent Gateway, TKE, and other PaaS components to emit standardized logs, which are cleaned and aggregated into metrics. Front‑end instrumentation and component‑level exporters (Prometheus, Telegraf) provide additional visibility.
1) Base component observability
Public cloud dashboards, alerts, and open‑source exporters (Grafana plugin, Prometheus Qcloud exporter) enable integration with Prometheus/Grafana.
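As an illustrative sketch of that integration (the job name and exporter port below are assumptions, not taken from the article), a Prometheus scrape job pointing at a locally running Qcloud exporter might look like:

```yaml
# prometheus.yml excerpt — scrape a Qcloud exporter exposing
# Tencent Cloud product metrics for Grafana dashboards.
scrape_configs:
  - job_name: "qcloud-exporter"        # hypothetical job name
    scrape_interval: 60s               # cloud-monitor metrics update slowly
    static_configs:
      - targets: ["localhost:9123"]    # assumed exporter listen address
```

The exporter translates cloud‑monitor APIs into the Prometheus exposition format, so the same Grafana dashboards and alert rules can cover both self‑hosted and cloud‑managed components.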
2) Business metric observability
Key indicators such as scan volume, success rate, and latency are monitored with thresholds, dynamic alerts, and percentile calculations, using log analysis tools like Elasticsearch or Tencent Cloud CLS.
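As a minimal sketch of how such indicators can be evaluated (the function names and the 500 ms / 99.9% thresholds are illustrative assumptions, not figures from the article), a nearest‑rank percentile plus threshold check might look like:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile of a list of request latencies (ms)."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))   # nearest-rank method
    return ordered[rank - 1]

def should_alert(latencies_ms, successes, total,
                 p99_threshold_ms=500, min_success_rate=0.999):
    """Fire an alert when P99 latency or success rate crosses its threshold."""
    p99 = percentile(latencies_ms, 99)
    success_rate = successes / total
    return p99 > p99_threshold_ms or success_rate < min_success_rate
```

In practice the log platform (Elasticsearch or CLS) computes these aggregates; the sketch only shows the shape of the threshold logic.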
3) User‑experience observability
Front‑end monitoring uses Tencent Cloud RUM, which requires minimal code injection to track JS errors, white screens, first‑paint time, API success rates, and latency.
<code>import Aegis from 'aegis-mp-sdk';

// Initialize Tencent Cloud RUM (Aegis) in the mini-program front end.
const aegis = new Aegis({
  id: "pGUVFTCZyewxxxxx", // RUM project ID
  uin: 'xxx',             // current user identifier
  reportApiSpeed: true,   // report API latency and success rate
  spa: true,              // treat the app as a single-page application
});
</code>
RUM dashboards give SREs an overview of user‑experience data, API success rates, and error trends.
4. Capacity Stress Testing
During pandemic spikes, traffic can surge to many times the normal level. Anticipating such growth, the team estimates demand, coordinates scaling with service owners, and runs performance tests on both read and write interfaces.
1) Read‑side stress testing
Simulated traffic at multiples of the observed peak is generated, with third‑party API owners notified in advance so their systems are not overloaded.
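A read‑side load generator can be sketched as a worker pool driving an injected request function (the function below is a simplified illustration; a real tool like the team's pressure‑test platform handles pacing, ramp‑up, and reporting):

```python
from concurrent.futures import ThreadPoolExecutor

def run_read_stress(send_request, total_requests, concurrency=10):
    """Fire `total_requests` calls to `send_request` from a worker pool
    and return (success_count, failure_count).

    `send_request` is any zero-argument callable returning True on
    success; in a real test it would issue an HTTP GET against the
    scan-code read API."""
    successes = 0
    failures = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for ok in pool.map(lambda _: send_request(), range(total_requests)):
            if ok:
                successes += 1
            else:
                failures += 1
    return successes, failures
```

Injecting the request function keeps the generator testable offline and lets the same harness target different interfaces.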
2) Write‑side stress testing
Data‑write tests use marked synthetic users or request headers, with post‑test cleanup to avoid polluting production data.
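The marking‑and‑cleanup idea can be sketched as follows (the header name, user‑ID prefix, and record fields are assumptions for illustration, not the team's actual conventions):

```python
STRESS_HEADER = "X-Stress-Test"       # assumed header name
STRESS_USER_PREFIX = "stress_user_"   # assumed synthetic-user ID prefix

def mark_stress_request(headers):
    """Tag an outgoing write request so downstream services and the
    cleanup job can recognize it as synthetic."""
    tagged = dict(headers)
    tagged[STRESS_HEADER] = "1"
    return tagged

def is_stress_record(record):
    """Cleanup predicate: True for rows written by the stress test."""
    return (record.get("user_id", "").startswith(STRESS_USER_PREFIX)
            or record.get("source") == "stress")

def cleanup(records):
    """Drop synthetic rows after the test, keeping real production data."""
    return [r for r in records if not is_stress_record(r)]
```

Marking at write time is what makes a safe post‑test purge possible; without it, synthetic rows are indistinguishable from production data.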
3) Stress‑test troubleshooting
Common bottlenecks include single‑core saturation, high CPU, firewall limits, PAAS constraints, and third‑party latency; identifying these during tests guides capacity improvements.
5. Chaos Engineering and Failure Drills
Chaos experiments such as shutting down instances or disabling network interfaces verify high‑availability design, monitoring coverage, and the effectiveness of emergency plans like auto‑scaling and rate limiting.
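The instance‑kill experiment reduces to a small, auditable core (sketched below with an injected `terminate` action; in practice that callable would be a cloud‑API call or an SSH command, and the experiment is run under supervision with a rollback plan):

```python
import random

def kill_random_instance(instances, terminate):
    """Pick one instance at random and terminate it, returning its ID.

    `instances` is a list of instance IDs; `terminate` is a callable
    that actually stops the instance. After the kill, monitoring should
    show the alert firing and auto-scaling replacing the instance —
    that observation, not the kill itself, is the point of the drill."""
    victim = random.choice(instances)
    terminate(victim)
    return victim
```

Injecting `terminate` keeps the experiment dry‑runnable: a recording stub verifies the selection logic before any real instance is touched.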
6. Architecture Optimization and Flexibility
Techniques like queue‑based write buffering, short‑term caching, front‑end retry limits, and gateway rate limiting protect downstream services and improve resilience.
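Of these techniques, gateway rate limiting is commonly implemented as a token bucket; a minimal sketch (parameters illustrative, not the health code's actual limits) looks like:

```python
import time

class TokenBucket:
    """Token-bucket limiter: admit at most `rate` requests/second on
    average, allowing bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.now = now                   # injectable clock for testing
        self.last = now()

    def allow(self):
        """Return True if the request may proceed, False to reject."""
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Rejected requests get a fast failure at the gateway instead of queueing against an overloaded backend, which is exactly the protection the flexibility measures aim for.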
7. Change Control
Post‑deployment stability relies on rigorous change management: workflows run through Tencent Cloud Coding, and every change requires a detailed change request, review, and a gray‑release (canary) rollout strategy.
Overall, the health code service demonstrates how cloud‑native architecture, integrated observability, and disciplined operations can sustain a high‑traffic public service through a pandemic.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.