How Tencent Scales Automated Operations for Massive Services
Tencent’s architecture platform team explains how they monitor, automate, and secure billions of daily operations across storage, CDN, and live services, using multi‑dimensional metrics, real‑time and instant computation, AI‑driven anomaly detection, and a custom control platform for safe changes.
Massive Service Scale
Tencent provides services such as WeChat image sharing, QQ Music, Tencent Games, App Store downloads, COS object storage, video on demand, and live streaming. Together they account for more than 2 EB of storage, over 100 Tbps of bandwidth, and more than 200,000 servers across 1,000+ data centers. The platform carries over 90% of Tencent’s outbound traffic and is run by only 50 operations staff.
Monitoring Challenges
Like a power plant, large‑scale services need strong, real‑time monitoring to detect abnormal metrics. Tencent faces huge monitoring volume, frequent changes, and high security requirements.
Service Architecture
The platform consists of two main parts: Storage, which handles user uploads and downloads through access, logic, and distributed storage layers; and CDN, which is built from multi‑level edge nodes, intermediate sources, and origin servers.
Automated Operations System
The system is divided into three parts: 1) foundational systems (configuration, resource management, billing); 2) generic operations capabilities (monitoring, change management, PaaS operations platform, testing, workflow); 3) business‑specific operations (photo album, COS, VoIP). This article focuses on the generic capabilities.
1. Massive‑Scale Monitoring
Traditional monitoring had blind spots: for example, image failures in WeChat Moments (the "friend circle") raised no alert. Tencent’s solution adds a reporting module that aggregates data at second, minute, and log levels, supporting structured time series, detailed logs, and custom business metrics. Over 600 million minute‑level structured data points are reported per minute.
Metrics are modeled as multi‑dimensional, multi‑indicator data: indicators such as traffic, latency, request count, and failure count are broken down by dimensions such as region, carrier, and image size. A KV store holds (key, value) pairs in which the key is a feature ID and the value is a 2‑hour time series of 120 minute‑level points; after compression the data comes to about 350 GB per day.
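The storage layout described above can be sketched in a few lines. This is a minimal illustration, not Tencent's implementation: the hashing scheme in `feature_id`, the `SeriesStore` class, and the use of the block start time as part of the key are all assumptions made for the example, and a plain dictionary stands in for the KV store.

```python
import hashlib

POINTS_PER_BLOCK = 120                   # 2 hours of minute-level points, per the article
BLOCK_SECONDS = POINTS_PER_BLOCK * 60


def feature_id(metric, dims):
    """Hypothetical feature ID: a stable hash of the metric name and sorted dimensions."""
    canonical = metric + "|" + ",".join(f"{k}={v}" for k, v in sorted(dims.items()))
    return hashlib.sha1(canonical.encode()).hexdigest()[:16]


class SeriesStore:
    """In-memory stand-in for the KV store: (feature ID, block start) -> 120 slots."""

    def __init__(self):
        self._kv = {}

    def _locate(self, metric, dims, ts):
        # Assumption: each 2-hour block is stored under its own key.
        block_start = ts - ts % BLOCK_SECONDS
        return (feature_id(metric, dims), block_start), (ts - block_start) // 60

    def put(self, metric, dims, ts, value):
        key, slot = self._locate(metric, dims, ts)
        self._kv.setdefault(key, [None] * POINTS_PER_BLOCK)[slot] = value

    def get(self, metric, dims, ts):
        key, slot = self._locate(metric, dims, ts)
        slots = self._kv.get(key)
        return None if slots is None else slots[slot]


store = SeriesStore()
store.put("latency_ms", {"region": "gd", "carrier": "ct"}, 7500, 42.0)
print(store.get("latency_ms", {"carrier": "ct", "region": "gd"}, 7530))
```

Sorting the dimensions before hashing means the same dimension combination always maps to the same key, regardless of the order in which a reporter lists them.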
To enable fast queries on massive dimensions, Tencent combines real‑time aggregation with instant computation and an automatic indexing mechanism that creates on‑demand indexes for hot dimensions, achieving sub‑second query responses even for hundreds of thousands of dimension combinations.
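One way to picture on‑demand indexing is a store that answers cold queries by scanning (instant computation) and builds a pre‑aggregated index once a dimension combination has been queried often enough. The sketch below is an assumption‑laden toy: the class name `HotDimensionIndex`, the `hot_threshold` parameter, and the static in‑memory `rows` are all hypothetical, and a real system would also have to keep the index current as new data arrives.

```python
from collections import Counter, defaultdict


class HotDimensionIndex:
    """Scan on cold queries; build a pre-aggregated index once a combination is hot."""

    def __init__(self, rows, hot_threshold=3):
        self.rows = rows                  # list of (dims: dict, value: float)
        self.hot_threshold = hot_threshold
        self.query_counts = Counter()
        self.indexes = {}                 # dim-name tuple -> {dim-value tuple: sum}

    def _build(self, dim_names):
        agg = defaultdict(float)
        for dims, value in self.rows:
            agg[tuple(dims[d] for d in dim_names)] += value
        self.indexes[dim_names] = dict(agg)

    def total(self, **filters):
        dim_names = tuple(sorted(filters))
        wanted = tuple(filters[d] for d in dim_names)
        if dim_names in self.indexes:     # hot path: a single dictionary lookup
            return self.indexes[dim_names].get(wanted, 0.0)
        self.query_counts[dim_names] += 1
        if self.query_counts[dim_names] >= self.hot_threshold:
            self._build(dim_names)        # later queries on these dimensions hit the index
        return sum(v for dims, v in self.rows
                   if all(dims[d] == filters[d] for d in dim_names))


rows = [
    ({"region": "gd", "carrier": "ct"}, 10.0),
    ({"region": "gd", "carrier": "cm"}, 5.0),
    ({"region": "bj", "carrier": "ct"}, 7.0),
]
index = HotDimensionIndex(rows)
for _ in range(4):
    print(index.total(region="gd"))
```

The trade‑off is the usual one: indexes cost memory and build time, so they are only created for combinations that queries actually ask about.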
Long‑tail traffic is handled by a two‑stage process: real‑time analysis writes low‑volume data to HBase, then a Spark job runs every five minutes to promote traffic‑heavy items to priority monitoring, while still detecting anomalies for long‑tail services within five minutes.
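The promotion step of that periodic job can be illustrated without Spark or HBase. In the sketch below, plain Python containers stand in for both, and the function name `promote_heavy_items` and the `threshold` parameter are hypothetical; the article does not say how "traffic‑heavy" is defined.

```python
def promote_heavy_items(long_tail_counts, priority_set, threshold):
    """Move keys whose request count in the last window reaches `threshold`
    from the long-tail table into the priority-monitoring set."""
    promoted = {key for key, count in long_tail_counts.items() if count >= threshold}
    priority_set |= promoted
    for key in promoted:
        del long_tail_counts[key]
    return promoted


# Hypothetical five-minute request counts per monitored item.
counts = {"album/small": 5, "cos/bucket-x": 500, "vod/clip-y": 50}
priority = set()
print(promote_heavy_items(counts, priority, threshold=100))
print(sorted(counts))
```

Items that stay in the long‑tail table are still checked on each run, which is what keeps their detection delay within the five‑minute interval.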
For anomaly detection, Tencent employs a two‑stage AIOps pipeline: first, unsupervised statistical algorithms (Grubbs, EWMA, least squares, First Hour Average) vote on potential outliers; then a supervised model (RF, GBDT, SVM) trained from operator‑labeled alerts removes noise (“spikes”). Alerts are routed to responsible engineers via WeChat for feedback, continuously improving the models.
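The first, unsupervised stage can be sketched as a set of simple detectors that each cast a vote. The example below is illustrative only: a plain z‑score test stands in for the Grubbs test (which needs a t‑distribution critical value), the First Hour Average detector is omitted, and all thresholds (`k`, `alpha`, `tolerance`, `min_votes`) are made‑up defaults. The supervised second stage is not shown.

```python
import statistics


def zscore_vote(history, value, k=3.0):
    """Simplified stand-in for the Grubbs test: flag values more than k sigma from the mean."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    return std > 0 and abs(value - mean) > k * std


def ewma_vote(history, value, alpha=0.3, tolerance=0.5):
    """Flag values that deviate from the exponentially weighted moving average."""
    ewma = history[0]
    for x in history[1:]:
        ewma = alpha * x + (1 - alpha) * ewma
    return abs(value - ewma) > tolerance * abs(ewma)


def least_squares_vote(history, value, tolerance=0.5):
    """Fit a line to the history and flag values far from its next predicted point."""
    n = len(history)                      # needs at least two points
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(history)
    denom = sum((x - x_mean) ** 2 for x in range(n))
    slope = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history)) / denom
    predicted = y_mean + slope * (n - x_mean)
    return abs(value - predicted) > tolerance * abs(predicted)


def is_candidate_anomaly(history, value, min_votes=2):
    """A point becomes a candidate only if at least `min_votes` detectors agree."""
    votes = [zscore_vote(history, value),
             ewma_vote(history, value),
             least_squares_vote(history, value)]
    return sum(votes) >= min_votes


history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_candidate_anomaly(history, 100))
print(is_candidate_anomaly(history, 300))
```

Voting keeps any single noisy detector from raising alerts on its own; candidates that pass would then go to the supervised model for spike filtering.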
Automatic analysis tools are built on a PaaS operations platform, allowing engineers to drag‑and‑drop scripts and create custom analysis pipelines without deep code changes.
2. Secure and Efficient Change Management
For large fleets, traditional SSH/expect or tools like Ansible become impractical. Tencent built a control platform with two core functions—command execution and file transfer—augmented with job workflow, templating, and isolation mechanisms. It uses a P2P‑style file transfer that falls back to optimal relay paths when direct connections fail.
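The relay fallback can be sketched as a shortest‑path search over the links that are currently reachable. This is an assumption about what "optimal" means here: the cost model, the `links` structure, and the function names `best_path` and `plan_transfer` are hypothetical, and Dijkstra's algorithm is used simply because it is the standard choice for non‑negative link costs.

```python
import heapq


def best_path(links, src, dst):
    """Dijkstra over reachable links. `links` maps node -> {neighbor: cost}."""
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for neighbor, link_cost in links.get(node, {}).items():
            if neighbor not in seen:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


def plan_transfer(links, src, dst):
    """Prefer a direct connection; otherwise fall back to the cheapest relay path."""
    if dst in links.get(src, {}):
        return [src, dst]
    found = best_path(links, src, dst)
    return None if found is None else found[1]


# Hypothetical topology: no direct link from "a" to "b", two candidate relays.
links = {
    "a": {"relay1": 5, "relay2": 1},
    "relay1": {"b": 1},
    "relay2": {"b": 2},
}
print(plan_transfer(links, "a", "b"))
```

A `None` result means the target is unreachable through any known relay, which a real job scheduler would surface as a failed transfer rather than retry blindly.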
The platform manages over 300 000 terminals and schedules more than 5 000 jobs daily. It has become the sole sanctioned method for production operations, and direct SSH access is prohibited by default.
Change workflows include automated build, gray‑scale testing, batch rollout, and post‑change monitoring using machine‑learning models to detect abnormal load or error spikes, automatically pausing rollout if issues arise.
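The gray‑scale, batch, and auto‑pause sequence can be sketched as a loop that stops as soon as a health check fails. In this illustration the `deploy` and `is_healthy` callbacks, the batch sizes, and the result dictionary are all hypothetical; in the article the health check is a machine‑learning model watching load and error metrics, which a simple callback stands in for here.

```python
def plan_batches(hosts, gray_size, batch_size):
    """One small gray-scale batch first, then fixed-size batches for the rest."""
    batches = [hosts[:gray_size]] if gray_size > 0 else []
    rest = hosts[gray_size:]
    batches += [rest[i:i + batch_size] for i in range(0, len(rest), batch_size)]
    return [batch for batch in batches if batch]


def run_rollout(hosts, deploy, is_healthy, gray_size=1, batch_size=2):
    """Deploy batch by batch; pause and report pending hosts if a check fails."""
    deployed = []
    batches = plan_batches(list(hosts), gray_size, batch_size)
    for index, batch in enumerate(batches):
        for host in batch:
            deploy(host)
        deployed.extend(batch)
        if not is_healthy(batch):         # stand-in for post-change monitoring
            pending = [h for later in batches[index + 1:] for h in later]
            return {"status": "paused", "deployed": deployed, "pending": pending}
    return {"status": "completed", "deployed": deployed, "pending": []}


hosts = ["h1", "h2", "h3", "h4", "h5"]
result = run_rollout(hosts, deploy=lambda host: None,
                     is_healthy=lambda batch: "h3" not in batch)
print(result)
```

Keeping the first batch small limits the blast radius: a bad build is caught after touching one host rather than a full batch.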
Post‑change verification is performed by lightweight tools hosted on the same PaaS platform, ensuring each business’s specific checks are executed without embedding them into the change system.
3. Production‑Machine Security System
Security is enforced by separating high‑risk operations (template‑based, audited, rate‑limited) from normal tasks, and by restricting direct login privileges (ordinary vs. root). Only 1% of operations involve direct login, typically for urgent fault handling.
The control platform enforces template approval, execution frequency limits, and business‑scope isolation, while the overall system balances efficiency and safety through a three‑layer hierarchy.
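The three checks named above (template approval, frequency limits, and business‑scope isolation) can be combined in a single gate in front of command execution. The sketch below is hypothetical throughout: the `TemplateGate` class, its parameters, and the sliding‑window limiter are illustrative choices, not a description of Tencent's control platform.

```python
import time
from collections import defaultdict, deque


class TemplateGate:
    """Checks approval, business scope, and execution frequency before a job runs."""

    def __init__(self, approved, max_runs, window_seconds, clock=time.monotonic):
        self.approved = approved          # template ID -> set of allowed business scopes
        self.max_runs = max_runs
        self.window = window_seconds
        self.clock = clock                # injectable so the limiter can be tested
        self.history = defaultdict(deque) # template ID -> recent run timestamps

    def authorize(self, template_id, business):
        scopes = self.approved.get(template_id)
        if scopes is None:
            return False, "template not approved"
        if business not in scopes:
            return False, "outside business scope"
        now = self.clock()
        runs = self.history[template_id]
        while runs and now - runs[0] >= self.window:
            runs.popleft()                # drop runs that fell out of the window
        if len(runs) >= self.max_runs:
            return False, "rate limit exceeded"
        runs.append(now)
        return True, "ok"


gate = TemplateGate({"restart-service": {"cos"}}, max_runs=2, window_seconds=60)
print(gate.authorize("restart-service", "cos"))
print(gate.authorize("restart-service", "album"))
print(gate.authorize("format-disk", "cos"))
```

Rejected requests do not consume rate‑limit budget in this sketch, so a misconfigured caller cannot lock out legitimate runs of the same template.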
Conclusion
Tencent’s DevOps practice demonstrates how massive, heterogeneous services can be operated with fine‑grained monitoring, automated anomaly detection, safe change management, and robust security, enabling reliable delivery at internet scale.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
