
How Tencent CDN Achieves Seamless Business Continuity with AI‑Powered SRE

This article details Tencent CDN's challenges and solutions for business continuity, covering bandwidth and device resource constraints, massive request handling, fault‑management lifecycle, automation bottlenecks, and the implementation of AIOps, intelligent alerts, capacity planning, and root‑cause analysis to ensure reliable service.

Efficient Ops
Huang Xiaohua, from Tencent Cloud Architecture Platform, leads the CDN SRE system, evolving it from tooling and standardization to automation and now toward intelligence, with extensive experience in business continuity.

Note: This article is organized from Huang Xiaohua's talk at GOPS Global Operations Conference 2022 Shanghai.

Tencent CDN Business Continuity Challenges

Tencent CDN faces significant business continuity challenges.

Bandwidth reserve: 150 Tbps of bandwidth across 2,000+ global CDN nodes covering major and minor ISPs, creating complex cross‑border and cross‑operator network environments.

Device resources: 5 million CPU cores, 85 device models, and 10 disk types; disks are classified into ten speed tiers to match diverse business needs, which increases operational complexity.

Massive requests: Peak CDN load exceeds 100 million queries per second, spanning video on demand, static content, downloads, live streaming, dynamic acceleration, and security acceleration, over protocols including IPv4/IPv6, HTTP/HTTPS, HTTP/2, and QUIC.

Specific challenges include:

Complex business scenarios: From ultra‑high‑availability financial services to high‑bandwidth gaming (e.g., Honor of Kings, Peace Elite) and highly stable live streaming.

Customer sensitivity: CDN is the first hop for end users; any incident that triggers three customer complaints automatically creates a fault ticket.

Strict fault grading: Any SLO impact over 1% lasting ten minutes is graded and counted toward team OKRs.

High operational complexity: With 5 million cores, even 99.9% availability leaves 0.1% of the fleet affected, which is 5,000+ hardware anomalies every day, not counting network fluctuations, software bugs, or attack traffic.
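The ticketing and grading rules above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the per‑minute impact samples are hypothetical, and "lasting ten minutes" is read here as ten consecutive minutes above the 1% threshold.

```python
from typing import Sequence

COMPLAINT_THRESHOLD = 3        # complaints that open a fault ticket (from the text)
SLO_IMPACT_THRESHOLD = 0.01    # 1% SLO impact (from the text)
GRADING_WINDOW_MINUTES = 10    # sustained duration (from the text)


def should_open_ticket(complaint_count: int) -> bool:
    """Hypothetical helper: three complaints on one incident open a ticket."""
    return complaint_count >= COMPLAINT_THRESHOLD


def is_graded_fault(impact_per_minute: Sequence[float]) -> bool:
    """Return True if impact exceeds 1% for ten consecutive one-minute samples.

    Assumption: impact_per_minute holds one SLO-impact ratio per minute.
    """
    run = 0  # length of the current streak of minutes above the threshold
    for impact in impact_per_minute:
        run = run + 1 if impact > SLO_IMPACT_THRESHOLD else 0
        if run >= GRADING_WINDOW_MINUTES:
            return True
    return False
```

Under this reading, a brief dip below 1% resets the streak, so two nine‑minute bursts separated by a healthy minute would not be graded. A real policy might count cumulative minutes instead.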

Business Continuity Built on Fault Management

How does Tencent CDN ensure continuity?

The core idea is a fault‑management‑centric continuity model that divides the fault lifecycle into three stages: prevention, handling, and root‑cause elimination.

Fault prevention receives the most effort, aiming to reduce, delay, or altogether eliminate faults.

Fault handling is split into discovery, location (pinpointing the fault), and recovery.

Fault root‑cause elimination involves post‑mortem analysis, evolution measures, and cultural/assessment improvements.

Key metrics are MTBF (Mean Time Between Failures) for prevention/root‑cause phases and MTTR (Mean Time To Recovery) for handling phases.
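As a minimal sketch of how these two metrics can be derived from an incident log: the `Incident` record and both function names below are hypothetical, and MTBF is computed with one common definition (fault‑free time in the observation window divided by the number of failures).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record: when the fault began and when service recovered."""
    started: datetime
    recovered: datetime


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean Time To Recovery: average duration from fault start to recovery."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.recovered - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents: List[Incident], window_start: datetime,
         window_end: datetime) -> timedelta:
    """Mean Time Between Failures: fault-free time in the window per failure."""
    if not incidents:
        raise ValueError("need at least one incident")
    downtime = sum((i.recovered - i.started for i in incidents), timedelta())
    return ((window_end - window_start) - downtime) / len(incidents)
```

Prevention and root‑cause work aim to raise the value returned by `mtbf`, while faster discovery, location, and recovery lower the value returned by `mttr`.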

Tencent CDN Business Continuity System Overview

The overview illustrates the defensive‑stability approach in the prevention stage, assuming all hidden risks will eventually manifest and designing mitigation for each risk point.

Defensive measures include comprehensive monitoring with early‑warning, tiered architecture with worst‑case assumptions, disaster‑recovery designs, and chaos‑engineering validation.

Fault handling emphasizes staying calm under pressure.

Fault handling · Discovery: Aim for 10 s detection at the IaaS layer and 1 min at the PaaS layer; improve alert accuracy to avoid alert fatigue; collaborate with front‑end teams for rapid escalation.

Fault handling · Location: Use the “Lingge” system for one‑click on‑call, meeting, and group creation; maintain the “Lingxi” knowledge base for automated analysis and, where possible, self‑healing.

Fault handling · Recovery: Assign an experienced fault commander to coordinate business notification, problem diagnosis, execution of recovery playbooks, and double‑checks to prevent secondary faults.

In the root‑cause stage, the guidance is culture first, graded severity, and precise quantification.

Continuous vigilance is required even when MTBF is long; culture initiatives (e.g., 100‑day security operations, chaos drills) and precise grading of work orders ensure hidden risks are tracked and mitigated.

Automation‑Era Bottlenecks

Remaining challenges include:

Prevention: Traditional threshold alerts miss subtle anomalies; common issues are hard to extract from diverse customer tickets; capacity forecasting is still manual.

Discovery: Alert silos and information overload during incidents.

Location: Multi‑module, multi‑team environments prolong root‑cause identification.

Recovery: Limited scenario coverage in traditional chaos engineering.

Root‑cause: Need for richer scenario control mechanisms.

Business Continuity Perspective on AIOps

AIOps integrates Observability, Analysis, and Automation to improve quality and efficiency and to lower cost in SRE.

Observability: Full‑stack logging, metric monitoring, and tracing platform.

Analysis: Intelligent analysis of network, business, and performance data combined with expert knowledge.

Automation: Expert‑knowledge‑driven decision making for known scenarios; algorithmic prediction for hidden risks.

Intelligent Alert System

Smart alerts detect patterns such as local spikes, jitter intensity changes, and gradual mean shifts that traditional threshold alerts miss.
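To make two of these patterns concrete, here is a small sketch using textbook techniques: a trailing‑window z‑score for local spikes and a one‑sided CUSUM for gradual upward mean shifts. These are stand‑ins chosen for illustration, not the detectors described in the talk, and all function names and default parameters are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def spike_indices(series: Sequence[float], window: int = 10,
                  z_threshold: float = 4.0) -> List[int]:
    """Flag points that deviate strongly from the trailing window (local spikes)."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        # A perfectly flat history has sigma == 0; this sketch skips such points.
        if sigma > 0 and abs(series[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged


def cusum_shift_index(series: Sequence[float], baseline: float,
                      slack: float, limit: float) -> int:
    """One-sided CUSUM for upward drift.

    Accumulates how far each sample exceeds baseline + slack and returns the
    first index where the accumulated excess passes limit, or -1 if it never does.
    """
    s = 0.0
    for i, x in enumerate(series):
        s = max(0.0, s + (x - baseline - slack))
        if s > limit:
            return i
    return -1
```

The CUSUM accumulates small deviations over time, so a slow drift that stays under a fixed threshold for a long while can still be caught. The z‑score check compares each point only with its recent neighbours, so it adapts to metrics whose normal level differs between nodes.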

Common Issues in Consultation Tickets

AI‑driven semantic analysis extracts keywords from customer tickets, instantly identifying common anomalies and escalating them to the full SRE chain, reducing MTTR.
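The idea of surfacing a shared anomaly from many tickets can be illustrated with plain keyword counting. This is a deliberately simple stand‑in for the semantic analysis mentioned above: the stopword list, function name, and `min_tickets` parameter are all hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Set

# Tiny illustrative stopword list; a real system would use a proper one.
STOPWORDS: Set[str] = {"the", "a", "is", "in", "on", "for", "to", "and",
                       "my", "of", "are", "with"}


def common_issue_keywords(tickets: Iterable[str],
                          min_tickets: int = 3) -> Dict[str, int]:
    """Return keywords that appear in at least min_tickets distinct tickets."""
    counts: Counter = Counter()
    for text in tickets:
        # Use a set so each ticket counts a keyword at most once.
        words = set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS
        counts.update(words)
    return {word: n for word, n in counts.items() if n >= min_tickets}
```

Run over the tickets opened in a short time window, a keyword such as a region or an error type that recurs across several customers hints at a common fault rather than isolated questions.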

Intelligent Capacity Planning

Version 1.0 computes a cost‑first global optimum but cannot respond in real time to traffic spikes. The self‑training system pre‑computes capacity scenarios, so plans can be adjusted in near real time when demand changes.
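One way to picture the pre‑computation idea: solve the cost‑first plan offline for a grid of demand levels, then look up the nearest covering plan at run time instead of re‑solving. The sketch below is an assumption‑laden illustration; the greedy allocator stands in for the real optimizer, and the node tuples, function names, and demand levels are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple

Node = Tuple[str, float, float]          # (name, capacity, unit_cost)
Plan = Dict[str, float]                  # node name -> allocated bandwidth


def solve_plan(demand: float, nodes: Sequence[Node]) -> Plan:
    """Greedy cost-first allocation: fill the cheapest nodes first."""
    plan: Plan = {}
    remaining = demand
    for name, capacity, _cost in sorted(nodes, key=lambda n: n[2]):
        if remaining <= 0:
            break
        share = min(capacity, remaining)
        plan[name] = share
        remaining -= share
    if remaining > 0:
        raise ValueError("demand exceeds total capacity")
    return plan


def precompute(levels: Sequence[float], nodes: Sequence[Node]) -> List[Tuple[float, Plan]]:
    """Offline step: solve a plan for every demand level in the grid."""
    return [(level, solve_plan(level, nodes)) for level in sorted(levels)]


def lookup(table: List[Tuple[float, Plan]], observed_demand: float) -> Plan:
    """Online step: pick the smallest precomputed level that covers the demand."""
    levels = [level for level, _ in table]
    i = bisect_left(levels, observed_demand)
    if i == len(table):
        raise ValueError("observed demand is above every precomputed scenario")
    return table[i][1]
```

The lookup is a binary search, so the online cost is negligible compared with solving the allocation again during a spike. The trade‑off is that the chosen plan is only as fine‑grained as the precomputed grid.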

Root Cause Analysis

A data‑centric platform aggregates network, performance, business, tracing, scheduling, health, and client data, linking them across the business view. Continuous expert knowledge accumulation across domains (video, live, cache, scheduling, cost, resources) fuels fast, accurate root‑cause identification.

Root‑Cause Elimination: Continuous Iteration of Intelligent Automation

Three smart‑automation examples:

SSD lifespan prediction: Predict wear‑out, monitor write rates, proactively migrate workloads, and isolate aging disks before failure, extending SSD life by ~8 months.

Link quality adaptation: Build regional quality maps and apply localized thresholds, so that each region is judged against its own baseline rather than a one‑size‑fits‑all threshold, without over‑ or under‑alerting.

LDNS chunk scheduling: Aggregate top‑domain bandwidth demand, pool resources, and use virtual platforms for fine‑grained load balancing, reducing manual platform splitting.
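The first of these examples, SSD lifespan prediction, can be approximated with a simple linear extrapolation of a wear indicator. This sketch assumes a hypothetical input of (day, wear percent) samples and fits a least‑squares line. The actual prediction model is not described in the talk, and the 30‑day migration lead time is an arbitrary illustrative value.

```python
from typing import Sequence, Tuple


def days_until_worn(samples: Sequence[Tuple[float, float]],
                    wear_limit: float = 100.0) -> float:
    """Estimate days from the last sample until wear reaches wear_limit.

    samples: (day, wear_percent) pairs. Returns infinity if wear is not rising.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    # Least-squares slope: wear percent gained per day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return float("inf")
    _last_day, last_wear = samples[-1]
    return max(0.0, (wear_limit - last_wear) / slope)


def should_migrate(samples: Sequence[Tuple[float, float]],
                   lead_days: float = 30.0) -> bool:
    """Hypothetical policy: migrate workloads once predicted life is within lead_days."""
    return days_until_worn(samples) <= lead_days
```

A disk whose wear climbs quickly crosses the lead‑time boundary early and is drained before it fails, while a lightly written disk is left in service.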

Summary: Intelligent Operations Boost Tencent CDN Business Continuity

The illustrated system demonstrates how AI‑driven operations enhance reliability, efficiency, and cost‑effectiveness for Tencent CDN.

Future work will continue expanding AIOps capabilities to further improve business continuity.

The future is here: embrace intelligence and usher in the AIOps era!
