Technical Architecture, Observability, and Operational Practices of Tencent Health Code System
The article details how Tencent’s health‑code platform leveraged a cloud‑native, serverless architecture, extensive observability (Prometheus, Grafana, RUM), rigorous capacity testing, chaos engineering, and ITIL‑based change management to sustain billions of page views, support massive concurrency, and ensure reliable, scalable epidemic‑control services.
As epidemic‑control policies evolved, the daily active users (DAU) of the health‑code service gradually declined, a sign that the system is completing its historical mission. This article, authored by Tencent R&D engineer Li Xiongzheng, reviews the technical architecture, observability framework, and operational safeguards that supported the health‑code business.
Business Background
The health‑code service served residents in more than ten provinces, handling billions of page views and hundreds of billions of code displays. At its peak, the system recorded over 2 trillion page views and more than 1 billion daily active users across the country.
Technical Architecture
The system was built on Tencent Cloud’s public and private cloud solutions to meet high‑availability and high‑concurrency requirements. Key architectural considerations include:
Cloud‑native product selection: DDoS protection, WAF, and ECDN were used to mitigate bandwidth constraints and large‑scale attacks.
Development and deployment efficiency: Tencent Cloud TCB (a one‑stop cloud‑native platform) was adopted to accelerate development and reduce configuration overhead.
Cloud resource cost: Serverless functions (SCF) and multi‑AZ deployments of CLB, TKE, CKafka, and TDSQL were used to scale cost‑effectively.
Observability System
A comprehensive monitoring stack was established to improve SLA compliance and detect faults early. It consists of:
Infrastructure metrics: Logs from the WAF, the intelligent gateway, and TKE are collected, cleaned, and visualized via Prometheus and Grafana.
Front‑end instrumentation: Tencent Cloud RUM (Real User Monitoring) captures JS errors, page load times, API success rates, and latency with minimal code injection.
Component health checks: The health of individual components is monitored through API probes, Telegraf, and exporter integrations.
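The infrastructure‑metrics path can be pictured as a small log‑to‑metric step: parsed log entries are aggregated into counters and rendered in the Prometheus text exposition format so that they can be scraped and charted in Grafana. The following is a minimal sketch under stated assumptions, not the system's actual pipeline; the log field names (`route`, `status`) and the metric name `gateway_requests_total` are hypothetical.

```javascript
// Hypothetical parsed gateway log entries; field names are assumptions.
const sampleLogs = [
  { route: "/code/show", status: 200 },
  { route: "/code/show", status: 200 },
  { route: "/code/show", status: 502 },
  { route: "/report", status: 200 },
];

// Aggregate entries into one counter per (route, status) pair.
function countRequests(entries) {
  const counts = new Map();
  for (const { route, status } of entries) {
    const key = `${route}|${status}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Render the counters in the Prometheus text exposition format.
// Label values are not escaped here, so this assumes routes contain
// no quotes, backslashes, or newlines.
function toExposition(counts, metric = "gateway_requests_total") {
  const lines = [
    `# HELP ${metric} Requests seen in gateway logs.`,
    `# TYPE ${metric} counter`,
  ];
  for (const [key, value] of counts) {
    const [route, status] = key.split("|");
    lines.push(`${metric}{route="${route}",status="${status}"} ${value}`);
  }
  return lines.join("\n");
}

console.log(toExposition(countRequests(sampleLogs)));
```

Keeping the status code as a label lets a dashboard derive a success rate per route from a single counter family.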
Example of RUM integration code:
    import Aegis from 'aegis-mp-sdk';

    const aegis = new Aegis({
      id: 'pGUVFTCZyewxxxxx', // project key
      uin: 'xxx',             // optional user ID
      reportApiSpeed: true,   // enable API speed reporting
      spa: true               // report page view on SPA navigation
    });
Capacity Testing and Chaos Engineering
To ensure the system could handle epidemic‑driven traffic spikes, both read‑side and write‑side load tests were performed. Strategies include:
Simulating traffic at tens of times the observed peak, using randomly sampled users.
Marking test data for safe cleanup after write‑side tests.
Using Tencent Cloud’s global probing service for synthetic monitoring.
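The first two strategies can be sketched together: draw a random sample of users, tag every synthetic write with a run identifier, and remove the tagged records once the run ends. This is an illustration under stated assumptions rather than the team's tooling; the marker field `__loadtest`, the payload shape, and the in‑memory array standing in for the data store are all hypothetical.

```javascript
// Marker attached to every synthetic write so it can be found later.
// The field name is a hypothetical choice, not the system's real schema.
const TEST_MARKER = "__loadtest";

// Pick `n` distinct users uniformly at random (partial Fisher-Yates shuffle).
function sampleUsers(userIds, n, random = Math.random) {
  const pool = userIds.slice();
  const count = Math.min(n, pool.length);
  for (let i = 0; i < count; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, count);
}

// Build a write-side payload tagged with the test run it belongs to.
function buildTestWrite(userId, runId) {
  return { userId, status: "green", [TEST_MARKER]: runId };
}

// After the run, keep only records that did not come from that run.
function removeTestData(records, runId) {
  return records.filter((record) => record[TEST_MARKER] !== runId);
}

// Usage: an in-memory array stands in for the real data store.
const users = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const runId = "run-001";
const store = [{ userId: "real-user", status: "green" }];
for (const id of sampleUsers(users, 50)) {
  store.push(buildTestWrite(id, runId));
}
const cleaned = removeTestData(store, runId);
console.log(store.length, cleaned.length);
```

Tagging by run identifier rather than with a plain boolean lets one run be cleaned up without touching data from another run that may still be in progress.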
Chaos engineering exercises (e.g., instance shutdowns, disabled network interfaces, firewall failures) validated high availability, monitoring coverage, and emergency response plans such as auto‑scaling and rate limiting.
Change Control
All production changes went through a strict change‑request workflow on Tencent Cloud Coding, following ITIL change‑management practices and gray‑release strategies to minimize the impact of each change.
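A gray release is often implemented by routing a fixed share of users to the new version and widening that share step by step. The sketch below shows one common way to do this, hashing the user ID into a stable bucket; it is an assumption for illustration, since the article does not describe how cohorts were selected. The function names and the version labels `candidate` and `stable` are hypothetical.

```javascript
// FNV-1a 32-bit hash over UTF-16 code units: a simple, stable way to
// map a string to an unsigned integer.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Decide whether a user falls inside the gray-release cohort.
// `percent` is the share of users (0 to 100) routed to the new version.
function inGrayRelease(userId, percent) {
  return fnv1a(userId) % 100 < percent;
}

// Choose which version should serve this user.
function pickVersion(userId, percent) {
  return inGrayRelease(userId, percent) ? "candidate" : "stable";
}

console.log(pickVersion("user-42", 10));
```

Because the bucket depends only on the user ID, a user keeps seeing the same version across requests, and raising the percentage only adds users to the cohort without removing anyone already in it.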
Architecture Optimization and Flexibility
Key practices to improve resilience include:
Queue‑based buffering for high‑concurrency writes (e.g., using CKafka).
Short‑term caching (5 minutes) on the front end or back end to reduce load on third‑party APIs.
Front‑end rate‑limiting to prevent request storms during backend failures.
Gateway‑level rate limiting based on capacity testing results.
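Two of the practices above, short‑term caching and rate limiting, can be sketched as a cache with a time‑to‑live placed in front of a token‑bucket limiter. This is a minimal sketch, not the system's implementation: the class names are hypothetical, and the rate of 100 requests per second with a burst of 200 is a placeholder, whereas the article derives real limits from capacity testing.

```javascript
// Cache whose entries expire after a fixed time-to-live.
// `now` is injectable so the expiry logic can be tested without waiting.
class TtlCache {
  constructor(ttlMs, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  // Return the cached value, or undefined if it is missing or expired.
  get(key) {
    const hit = this.entries.get(key);
    if (hit && this.now() - hit.storedAt < this.ttlMs) {
      return hit.value;
    }
    this.entries.delete(key);
    return undefined;
  }

  set(key, value) {
    this.entries.set(key, { value, storedAt: this.now() });
  }
}

// Token bucket: allows bursts up to `burst`, refilled at `ratePerSec`.
class TokenBucket {
  constructor(ratePerSec, burst, now = () => Date.now()) {
    this.ratePerSec = ratePerSec;
    this.burst = burst;
    this.now = now;
    this.tokens = burst;
    this.lastRefill = now();
  }

  // Take one token if available; return false when the caller should back off.
  tryAcquire() {
    const current = this.now();
    const elapsedSec = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = current;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage: serve from cache when possible, otherwise call upstream only
// if the limiter allows it. The limits here are placeholders.
const cache = new TtlCache(5 * 60 * 1000);
const limiter = new TokenBucket(100, 200);

function lookup(userId, callUpstream) {
  const cached = cache.get(userId);
  if (cached !== undefined) return cached;
  if (!limiter.tryAcquire()) return null; // shed load instead of queuing
  const value = callUpstream(userId);
  cache.set(userId, value);
  return value;
}
```

Checking the cache before the limiter means repeated requests for the same user do not consume tokens, so the limit applies only to traffic that would actually reach the third‑party API.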
The article concludes that a well‑designed cloud‑native architecture, comprehensive observability, and disciplined operational processes are essential for maintaining the health‑code service’s reliability and scalability.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.