How Tencent Scaled Health Code: Cloud‑Native Architecture, Monitoring, and Chaos Engineering
This article reviews how Tencent Health Code handled billions of daily scans by adopting cloud‑native architecture, comprehensive observability, capacity stress testing, chaos engineering, and disciplined change control to sustain high availability and resilience under massive, fluctuating demand.
Reading note: As pandemic control models evolve, the daily active users of health code decline, marking the end of its mission; this article, by Tencent R&D engineer Li Xiongzheng, reviews the technical architecture, observability, and operational safeguards used in health code services.
1. Business Background
After three years of the pandemic, the Omicron wave is waning. Tencent Health Code served residents in more than ten provinces, handling billions of scans and hundreds of billions of page views in support of public health. With DAU now falling, the service is winding down.
2. Technical Architecture
A stable architecture is the product of both upfront design and day‑to‑day operations; the team focused on the following aspects.
1) Choosing appropriate cloud‑native products
The health code requires high availability and concurrency, so Tencent Cloud public and private cloud solutions were adopted. Key challenges included:
Bandwidth capacity – limited public‑cloud egress bandwidth and IDC uplink capacity were offset by fronting the service with DDoS protection, WAF, and ECDN.
Development and deployment efficiency – rapid iteration was enabled by Tencent Cloud TCB, a one‑stop cloud‑native platform that streamlines project creation and integration.
Cloud resource cost – serverless functions (SCF) and multi‑AZ designs reduced idle costs while providing cross‑AZ disaster recovery.
2) Three‑dimensional monitoring design
A comprehensive monitoring system provides early warnings and post‑incident analysis, helping SREs intervene before failures expand.
3) Stress testing, chaos engineering, and emergency drills
Regular pressure tests, chaos experiments, and incident drills verify system resilience and team response quality.
3. Observability System
Observability is essential for rapid fault detection and root‑cause analysis. Health code leverages Tencent Cloud WAF, Intelligent Gateway, TKE, and other PaaS components to emit standardized logs, which are cleaned and aggregated into metrics. Front‑end instrumentation and component‑level exporters (Prometheus, Telegraf) provide additional visibility.
1) Base component observability
Public cloud dashboards, alerts, and open‑source exporters (Grafana plugin, Prometheus Qcloud exporter) enable integration with Prometheus/Grafana.
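As an illustrative sketch of that integration (the job name and exporter port below are assumptions, not taken from the article), a Prometheus scrape job pointing at a locally running Qcloud exporter might look like:

```yaml
# prometheus.yml excerpt — scrape a Qcloud exporter exposing
# Tencent Cloud product metrics for Grafana dashboards.
scrape_configs:
  - job_name: "qcloud-exporter"        # hypothetical job name
    scrape_interval: 60s               # cloud-monitor metrics update slowly
    static_configs:
      - targets: ["localhost:9123"]    # assumed exporter listen address
```

The exporter translates cloud‑monitor APIs into the Prometheus exposition format, so the same Grafana dashboards and alert rules can cover both self‑hosted and cloud‑managed components.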
2) Business metric observability
Key indicators such as scan volume, success rate, and latency are monitored with thresholds, dynamic alerts, and percentile calculations, using log analysis tools like Elasticsearch or Tencent Cloud CLS.
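As a minimal sketch of how such indicators can be evaluated (the function names and the 500 ms / 99.9% thresholds are illustrative assumptions, not figures from the article), a nearest‑rank percentile plus threshold check might look like:

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile of a list of request latencies (ms)."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(p / 100 * len(ordered))   # nearest-rank method
    return ordered[rank - 1]

def should_alert(latencies_ms, successes, total,
                 p99_threshold_ms=500, min_success_rate=0.999):
    """Fire an alert when P99 latency or success rate crosses its threshold."""
    p99 = percentile(latencies_ms, 99)
    success_rate = successes / total
    return p99 > p99_threshold_ms or success_rate < min_success_rate
```

In practice the log platform (Elasticsearch or CLS) computes these aggregates; the sketch only shows the shape of the threshold logic.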
3) User‑experience observability
Front‑end monitoring uses Tencent Cloud RUM, which requires minimal code injection to track JS errors, white screens, first‑paint time, API success rates, and latency.
<code>import Aegis from 'aegis-mp-sdk';

// Initialize Tencent Cloud RUM (Aegis) in the mini-program front end.
const aegis = new Aegis({
  id: "pGUVFTCZyewxxxxx", // RUM project ID
  uin: 'xxx',             // current user identifier
  reportApiSpeed: true,   // report API latency and success rate
  spa: true,              // treat the app as a single-page application
});
</code>
RUM dashboards give SREs an overview of user‑experience data, API success rates, and error trends.
4. Capacity Stress Testing
During pandemic spikes, traffic can surge to many times the normal level. Anticipating such growth, the team estimates demand, coordinates scaling with service owners, and runs performance tests on both read and write interfaces.
1) Read‑side stress testing
Simulated traffic at multiples of the observed peak is generated, with third‑party API owners notified in advance so their systems are not overloaded.
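A read‑side load generator can be sketched as a worker pool driving an injected request function (the function below is a simplified illustration; a real tool like the team's pressure‑test platform handles pacing, ramp‑up, and reporting):

```python
from concurrent.futures import ThreadPoolExecutor

def run_read_stress(send_request, total_requests, concurrency=10):
    """Fire `total_requests` calls to `send_request` from a worker pool
    and return (success_count, failure_count).

    `send_request` is any zero-argument callable returning True on
    success; in a real test it would issue an HTTP GET against the
    scan-code read API."""
    successes = 0
    failures = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for ok in pool.map(lambda _: send_request(), range(total_requests)):
            if ok:
                successes += 1
            else:
                failures += 1
    return successes, failures
```

Injecting the request function keeps the generator testable offline and lets the same harness target different interfaces.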
2) Write‑side stress testing
Data‑write tests use marked synthetic users or request headers, with post‑test cleanup to avoid polluting production data.
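The marking‑and‑cleanup idea can be sketched as follows (the header name, user‑ID prefix, and record fields are assumptions for illustration, not the team's actual conventions):

```python
STRESS_HEADER = "X-Stress-Test"       # assumed header name
STRESS_USER_PREFIX = "stress_user_"   # assumed synthetic-user ID prefix

def mark_stress_request(headers):
    """Tag an outgoing write request so downstream services and the
    cleanup job can recognize it as synthetic."""
    tagged = dict(headers)
    tagged[STRESS_HEADER] = "1"
    return tagged

def is_stress_record(record):
    """Cleanup predicate: True for rows written by the stress test."""
    return (record.get("user_id", "").startswith(STRESS_USER_PREFIX)
            or record.get("source") == "stress")

def cleanup(records):
    """Drop synthetic rows after the test, keeping real production data."""
    return [r for r in records if not is_stress_record(r)]
```

Marking at write time is what makes a safe post‑test purge possible; without it, synthetic rows are indistinguishable from production data.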
3) Stress‑test troubleshooting
Common bottlenecks include single‑core saturation, high CPU, firewall limits, PAAS constraints, and third‑party latency; identifying these during tests guides capacity improvements.
5. Chaos Engineering and Failure Drills
Chaos experiments such as shutting down instances or disabling network interfaces verify high‑availability design, monitoring coverage, and the effectiveness of emergency plans like auto‑scaling and rate limiting.
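The instance‑kill experiment reduces to a small, auditable core (sketched below with an injected `terminate` action; in practice that callable would be a cloud‑API call or an SSH command, and the experiment is run under supervision with a rollback plan):

```python
import random

def kill_random_instance(instances, terminate):
    """Pick one instance at random and terminate it, returning its ID.

    `instances` is a list of instance IDs; `terminate` is a callable
    that actually stops the instance. After the kill, monitoring should
    show the alert firing and auto-scaling replacing the instance —
    that observation, not the kill itself, is the point of the drill."""
    victim = random.choice(instances)
    terminate(victim)
    return victim
```

Injecting `terminate` keeps the experiment dry‑runnable: a recording stub verifies the selection logic before any real instance is touched.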
6. Architecture Optimization and Flexibility
Techniques like queue‑based write buffering, short‑term caching, front‑end retry limits, and gateway rate limiting protect downstream services and improve resilience.
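Of these techniques, gateway rate limiting is commonly implemented as a token bucket; a minimal sketch (parameters illustrative, not the health code's actual limits) looks like:

```python
import time

class TokenBucket:
    """Token-bucket limiter: admit at most `rate` requests/second on
    average, allowing bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = float(rate)          # tokens refilled per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)    # start full
        self.now = now                   # injectable clock for testing
        self.last = now()

    def allow(self):
        """Return True if the request may proceed, False to reject."""
        t = self.now()
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Rejected requests get a fast failure at the gateway instead of queueing against an overloaded backend, which is exactly the protection the flexibility measures aim for.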
7. Change Control
Post‑deployment stability relies on rigorous change management: workflows run through Tencent Cloud Coding, and every change requires a detailed change request, review, and a gray‑release (canary) rollout strategy.
Overall, the health code service demonstrates how cloud‑native architecture, integrated observability, and disciplined operations can sustain a high‑traffic public service through a pandemic.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.