How Tencent Search Supercharged Reliability: Inside Its Stability Governance Playbook
This article details Tencent Search’s end‑to‑end stability engineering framework, covering a layered reliability architecture, disaster‑recovery mechanisms, fast detection and monitoring, emergency response acceleration, pre‑release interception, automated defense, and collaborative governance that together improve MTTD and MTTR by an order of magnitude.
Overview
Tencent Search (大搜) transformed its reliability posture by treating stability as a product, systematically reducing mean time to detection (MTTD) and mean time to recovery (MTTR) while sustaining availability at a scale of millions of queries.
Reliability Architecture
The architecture emphasizes redundancy at every level: multi‑region, multi‑active, and multi‑instance deployments. Critical services run in at least two data centers; traffic can be shifted instantly between them. The design also separates compute from storage, allowing stateless services to scale independently.
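As a rough illustration of the traffic-shifting idea, the sketch below models two multi-active data centers behind a weighted router; the region names, weights, and API are invented for the example and are not Tencent's actual topology.

import random

class WeightedRouter:
    def __init__(self, weights):
        self.weights = dict(weights)

    def shift_all(self, target):
        # Instant cut-over: route 100% of traffic to one healthy region.
        self.weights = {region: (1.0 if region == target else 0.0)
                        for region in self.weights}

    def pick(self):
        regions = list(self.weights)
        return random.choices(regions,
                              weights=[self.weights[r] for r in regions])[0]

router = WeightedRouter({"dc-a": 0.5, "dc-b": 0.5})  # multi-active pair
router.shift_all("dc-b")   # dc-a degrades: shift everything instantly
assert router.pick() == "dc-b"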
Disaster Recovery
Four primary cut‑over capabilities are built:
DNS‑level cut‑over – redirects an entire domain's traffic to a healthy region within minutes.
NGINX‑level cut‑over – switches traffic at the service granularity in about one minute.
Mid‑platform routing – uses the 北极星 (Polaris) routing plugin to route specific traffic slices between the 搜狗 (Sogou) and kd back‑ends.
Disaster cache (SearchGuard) – a standby cache that serves read requests when the main path is unavailable.
These mechanisms are orchestrated by a unified SearchGuard module that writes continuously under normal operation and switches to read‑only mode during an outage.
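A minimal sketch of the SearchGuard pattern as described, assuming a simple in-memory store: the cache is populated continuously on the normal path and flips to read-only serving during an outage. The class and method names are hypothetical.

class DisasterCache:
    """Continuously written in normal mode; read-only fallback in an outage."""

    def __init__(self):
        self.store = {}
        self.read_only = False

    def on_result(self, query, result):
        if not self.read_only:        # normal path keeps the cache warm
            self.store[query] = result

    def enter_outage(self):
        self.read_only = True         # freeze writes, serve what we have

    def serve(self, query):
        return self.store.get(query)  # read path used while the main path is down

guard = DisasterCache()
guard.on_result("example query", "cached result page")
guard.enter_outage()
print(guard.serve("example query"))   # falls back to the last good answer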
Detection and Monitoring
Monitoring is organized into six metric families:
Black‑box KPI alerts – probe‑based checks that trigger phone alerts on 5XX responses.
Business metrics – core output indicators such as result‑rate, hot‑search count, and relevance.
Functional metrics – good‑case tests covering both interactive and non‑interactive user flows.
Statistical metrics – PV/UV, click‑through, latency, and error‑rate trends.
Engineering metrics – success rates, error ratios, and service‑level health of critical nodes.
Infrastructure metrics – network latency, packet loss, DNS health, and CDN performance.
All alerts are routed to a corporate WeChat group for immediate visibility, enabling sub‑five‑minute detection.
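To make the probe-and-alert loop concrete, here is a small sketch. The corporate WeChat (WeCom) group-robot webhook endpoint and payload follow WeCom's public API; the robot key and probe URL are placeholders.

import requests

WEBHOOK = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<robot-key>"

def alert(message):
    # Post a text message into the on-call corporate WeChat group.
    requests.post(WEBHOOK, json={"msgtype": "text", "text": {"content": message}})

def probe(url):
    # Black-box KPI check: any 5XX from the probe fires an alert.
    resp = requests.get(url, timeout=3)
    if resp.status_code >= 500:
        alert(f"[black-box probe] {url} returned {resp.status_code}")

probe("https://example.com/search?q=ping")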
Emergency Response
The emergency workflow accelerates five steps: fast reporting, rapid intervention, immediate stop‑loss, swift decision making, and quick recovery. A dedicated on‑call commander coordinates the response, while pre‑built playbooks automate actions such as cut‑over, service isolation, and experiment suspension.
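A sketch of how such playbooks might be wired up: a diagnosed incident type dispatches to a pre-built stop-loss action. The incident labels and stub functions are illustrative, not Tencent's actual tooling.

def cut_over():
    print("shifting traffic to the healthy data center")

def isolate_service():
    print("isolating the faulty service instance")

def suspend_experiments():
    print("pausing all running experiments")

# Pre-built playbooks keyed by the diagnosed incident type.
PLAYBOOKS = {
    "region_down": cut_over,
    "bad_instance": isolate_service,
    "experiment_regression": suspend_experiments,
}

def respond(incident_type):
    PLAYBOOKS[incident_type]()   # the commander triggers the matching action

respond("region_down")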
Interception and Defense
Before code reaches production, a multi‑tiered interception process validates changes:
Pre‑release sandbox – runs smoke tests without real traffic.
CD tiered rollout – progresses from single machine to full data‑center exposure.
Automated good‑case verification – generates test cases from templates and runs them automatically.
Quality gates – enforce code review, unit‑test coverage, and diff verification before merge.
These steps catch ~90% of incidents early, reducing downstream impact.
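The tiered rollout can be pictured as a gate loop, sketched below under the assumption of deploy and verify hooks (both hypothetical), where verify stands in for the automated good-case suite run at each tier.

TIERS = ["single_machine", "single_data_center", "all_data_centers"]

def rollout(build, deploy, verify):
    # Progress through exposure tiers; a failed verification halts the rollout.
    for tier in TIERS:
        deploy(build, tier)
        if not verify(tier):
            print(f"{build} intercepted at {tier}; rolling back")
            return False
    return True

rollout("build-1234",
        deploy=lambda build, tier: print(f"deploying {build} to {tier}"),
        verify=lambda tier: True)   # stand-in for the good-case suite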
Automation and Tooling
A command protocol drives degradation actions. Example syntax:
// Command protocol
// key = value, where key is xxDegrade and value is a list of commands
// A command may have up to three levels:
//   commands are separated by '|'
//   level-2 parameters follow ':'; multiple level-2 parameters are joined by '&'
//   level-3 parameters are joined by '#'

// Example
xxDegrade = sgZhiling1|sgXX:1&2&3|kdXX:15

Additional automation includes request hashing and queuing to eliminate duplicate work, per‑IP rate limiting, and circuit‑breaker logic that triggers after a configurable error threshold.
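Reading those protocol rules literally, a minimal Python parser might look like the following; the nested-list shape for level-2/level-3 parameters is my interpretation, not a documented structure.

def parse_degrade(value):
    # value is a '|'-separated list of commands; ':' introduces level-2
    # parameters, '&' joins several of them, '#' joins level-3 parameters.
    commands = []
    for cmd in value.split("|"):
        name, _, rest = cmd.partition(":")
        params = [p.split("#") for p in rest.split("&")] if rest else []
        commands.append({"name": name, "params": params})
    return commands

print(parse_degrade("sgZhiling1|sgXX:1&2&3|kdXX:15"))
# [{'name': 'sgZhiling1', 'params': []},
#  {'name': 'sgXX', 'params': [['1'], ['2'], ['3']]},
#  {'name': 'kdXX', 'params': [['15']]}]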
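And a bare-bones version of the circuit-breaker logic mentioned above, assuming a consecutive-error count with a configurable threshold; the exact trigger condition is not specified in the text.

class CircuitBreaker:
    def __init__(self, threshold=5):
        self.threshold = threshold   # configurable error threshold
        self.errors = 0
        self.open = False            # an open circuit rejects all calls

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.errors += 1
            if self.errors >= self.threshold:
                self.open = True     # trip once the threshold is hit
            raise
        self.errors = 0              # a success resets the error count
        return result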
Governance and Collaboration
Roles are clearly defined (response team, war‑room, commander, operator, decision‑maker). Regular monthly drills, case collection forms, and statistical analysis of MTTD/MTTR, interception rate, and miss‑rate ensure continuous improvement. The "six‑question" crisis model guides post‑mortems to pinpoint problem, impact, control, source, nature, and resolution.
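For the statistical side, those governance metrics reduce to simple aggregates over incident records; the record fields and sample values below are invented purely for illustration.

from statistics import mean

# Hypothetical incident records; times in minutes.
incidents = [
    {"detect": 4, "recover": 18, "intercepted_pre_release": True},
    {"detect": 2, "recover": 9,  "intercepted_pre_release": True},
    {"detect": 7, "recover": 30, "intercepted_pre_release": False},
]

mttd = mean(i["detect"] for i in incidents)
mttr = mean(i["recover"] for i in incidents)
interception_rate = sum(i["intercepted_pre_release"] for i in incidents) / len(incidents)
miss_rate = 1 - interception_rate

print(f"MTTD={mttd:.1f} min  MTTR={mttr:.1f} min  "
      f"interception={interception_rate:.0%}  miss={miss_rate:.0%}")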
Conclusion
By combining layered redundancy, fast detection, automated cut‑over, rigorous pre‑release checks, and disciplined governance, Tencent Search achieved an order‑of‑magnitude improvement in availability metrics. The practices outlined are applicable to any large‑scale service seeking to mature its reliability engineering.