
How Tencent Search Built a Multi‑Layered Stability Architecture to Slash MTTD and MTTR

This article details Tencent Search’s end‑to‑end stability engineering practice: a multi‑layered architecture that combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous drills. It describes how these measures together reduced mean time to detect (MTTD) and mean time to recover (MTTR) by an order of magnitude while keeping service availability high.

Architect

Dec 22, 2023

During serious incidents, the Search team at Tencent found that their services often lacked visibility: problems did not surface quickly and response actions were slow, which prolonged outages. To address this, they designed a comprehensive stability framework that spans seven major dimensions: availability architecture, disaster recovery, detection, emergency handling, interception, defense, and collaboration.

1. Availability Architecture

They introduced multi‑region, multi‑active, and multi‑instance deployment for all critical services, ensuring that a failure in a single data center or link does not cascade to the whole system. For example, the offline index pipeline was split into independent zones, and Kafka redundancy was added to prevent a single‑point failure.

2. Disaster‑Recovery (High‑Availability Pillar)

Key capabilities include:

Redundant deployment across regions ("multi‑active").

Fine‑grained instance scaling (e.g., reducing CPU from 8 cores to 4 cores for low‑traffic services while increasing replica count).

Three‑tier cut‑over mechanisms: DNS‑level cut‑over (5‑minute propagation), Nginx‑level cut‑over (1‑minute propagation), and mid‑platform routing cut‑over using the internal North‑Star router.

Graceful degradation paths (L2 manual control vs. L4 automatic control) were evaluated, and the team chose the L2 approach for its flexibility and lower risk.
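As a rough illustration of how the three cut‑over tiers listed above might be compared, the sketch below picks the usable tier with the shortest propagation delay. The DNS and Nginx delays are the figures quoted above; the router tier's delay is not stated in the article, so it is left unknown. The names `CutoverTier` and `pick_tier`, and the selection policy itself, are assumptions made for illustration, not the team's actual tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CutoverTier:
    name: str
    # Propagation delay in seconds; None means the article gives no figure.
    propagation_seconds: Optional[int]


# DNS and Nginx figures come from the article; the router delay is not stated.
TIERS = (
    CutoverTier("dns", 300),
    CutoverTier("nginx", 60),
    CutoverTier("router", None),
)


def pick_tier(usable: Iterable[str]) -> Optional[CutoverTier]:
    """Return the usable tier with the shortest known propagation delay.

    Tiers with an unknown delay are considered only when no tier with a
    known delay is usable (an illustrative policy, not Tencent's).
    """
    usable_set = set(usable)
    candidates = [t for t in TIERS if t.name in usable_set]
    known = [t for t in candidates if t.propagation_seconds is not None]
    if known:
        return min(known, key=lambda t: t.propagation_seconds)
    return candidates[0] if candidates else None
```

For example, `pick_tier(["dns", "nginx"])` selects the Nginx tier, and if the Nginx layer is itself the failing component, `pick_tier(["dns"])` falls back to DNS.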

3. Detection – Simplifying the Art of Finding Issues

The team built a six‑layer monitoring system (black‑box, business, functional, statistical, engineering, and infrastructure metrics) that reduced MTTD tenfold. One concrete black‑box metric is a KPI probe that continuously issues search queries and triggers a phone alarm when 5XX responses appear; this signal directly drives cut‑over decisions.
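A black‑box probe of this kind could look roughly like the sketch below. It is a minimal sketch under stated assumptions: `fetch_status` (sends one query and returns the HTTP status code) and `alarm` (pages the on‑call engineer) are hypothetical callables supplied by the caller, and counting transport errors as failures is an added assumption rather than something the article specifies.

```python
from typing import Callable, Iterable, List


def find_failed_queries(
    queries: Iterable[str],
    fetch_status: Callable[[str], int],
) -> List[str]:
    """Issue each probe query and return those that came back as 5XX."""
    failed: List[str] = []
    for query in queries:
        try:
            status = fetch_status(query)
        except Exception:
            # Assumption: a timeout or refused connection counts as a failure.
            failed.append(query)
            continue
        if 500 <= status <= 599:
            failed.append(query)
    return failed


def run_probe(
    queries: Iterable[str],
    fetch_status: Callable[[str], int],
    alarm: Callable[[List[str]], None],
) -> bool:
    """Run one probe round; call `alarm` and return True if any query failed."""
    failed = find_failed_queries(queries, fetch_status)
    if failed:
        alarm(failed)
        return True
    return False
```

In practice `run_probe` would be called on a fixed schedule, with `fetch_status` wrapping a real HTTP client. Injecting both callables keeps the probe logic testable without network access.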

4. Emergency – Efficiency Accelerator

The emergency workflow is broken into five rapid actions: fast reporting (automated alerts to enterprise WeChat groups), fast intervention (pre‑assigned on‑call leads and one‑click meeting rooms), fast stop‑loss (pre‑built cut‑over, experiment pause, and service isolation capabilities), fast decision (SRE‑level authority or delegated commander), and fast recovery (quick rollback, service warm‑up, and experiment emergency stop). The table of “speed‑up actions” is reproduced as a list:

Report fast: Automated alerts eliminate manual notification.

Intervene fast: Consensus on commander, one‑click arena, and phone contact list.

Stop loss fast: Pre‑built disaster‑recovery tools (cut‑over, experiment pause, service isolation) enable barrier‑free execution.

Decide fast: The highest‑level person present or the SRE makes the decision, guided by golden rules (cut over first, at the earliest opportunity).

Recover fast: Dedicated tracing, quick rollback, optimized service start‑up, and experiment emergency mechanisms.

5. Interception – Four Ounces Moving a Thousand Pounds

Historical data shows that 90% of incidents could have been intercepted before release. The interception process includes pre‑release sandbox testing, CD‑level graded rollout (single node → single zone → all zones), and strict checklist enforcement. For a low‑traffic proxy service, the team reduced CPU allocation from 8 cores to 4 cores and increased the replica count, so that each release batch takes a smaller share of instances offline and availability is preserved during batch releases.
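The graded rollout order described above (single node → single zone → all zones) can be sketched as follows. This is an illustrative sketch only: `deploy` and `healthy` are hypothetical callables standing in for the real CD system and health checks, and the choice of the first zone in the mapping as the canary zone is an assumption.

```python
from typing import Callable, Dict, List, Sequence


def rollout_stages(zones: Dict[str, Sequence[str]]) -> List[List[str]]:
    """Build stages: one node, then the rest of its zone, then all other zones."""
    zone_names = list(zones)
    if not zone_names:
        return []
    first_nodes = list(zones[zone_names[0]])
    stages: List[List[str]] = []
    if first_nodes:
        stages.append(first_nodes[:1])          # single node
        if first_nodes[1:]:
            stages.append(first_nodes[1:])      # remainder of the single zone
    rest = [n for z in zone_names[1:] for n in zones[z]]
    if rest:
        stages.append(rest)                     # all remaining zones
    return stages


def graded_rollout(
    zones: Dict[str, Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy stage by stage and stop after the first unhealthy stage.

    Returns the nodes that received the new version, so the caller knows
    what to roll back.
    """
    deployed: List[str] = []
    for stage in rollout_stages(zones):
        for node in stage:
            deploy(node)
            deployed.append(node)
        if not all(healthy(node) for node in stage):
            break
    return deployed
```

Stopping at the first unhealthy stage limits the blast radius of a bad release to the nodes already touched, which is the point of grading the rollout.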

6. Defense – Strengthening the Core

Key defensive measures include:

Separating compute and cache for stateful services like SR, turning them into stateless services with side‑car cache.

Optimizing startup time for SH services (40 % reduction) and establishing fast rollback channels.

Introducing request‑hash queues upstream of the mid‑platform to deduplicate identical queries and avoid diff‑induced cascades.

Deploying distributed caches for high‑traffic, blocking services to provide graceful degradation when downstream failures occur.

Automatic degradation (query‑level throttling, fallback to local models) and circuit‑breaker logic with configurable thresholds (e.g., error rate > N in T seconds triggers circuit‑break and downstream rate‑limiting).
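The circuit‑breaker rule in the last item (error rate above N within T seconds) can be sketched with a sliding window. This is a minimal sketch, not the team's implementation: the class name `ErrorRateBreaker`, the `min_requests` guard, and the injectable `clock` are assumptions added for illustration, and recovery logic such as a half‑open state is deliberately omitted.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class ErrorRateBreaker:
    """Opens when the error rate over the last `window_seconds` exceeds a threshold."""

    def __init__(
        self,
        max_error_rate: float,       # the "N" threshold, as a fraction in [0, 1]
        window_seconds: float,       # the "T" window
        min_requests: int = 1,       # assumption: ignore windows with too few samples
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_error_rate = max_error_rate
        self.window_seconds = window_seconds
        self.min_requests = min_requests
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_error)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the sliding window.
        while self._events and now - self._events[0][0] > self.window_seconds:
            self._events.popleft()

    def record(self, is_error: bool) -> None:
        """Record the outcome of one downstream call."""
        now = self._clock()
        self._events.append((now, is_error))
        self._evict(now)

    def is_open(self) -> bool:
        """Return True when calls should be short-circuited."""
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.max_error_rate
```

A caller would check `is_open()` before each downstream request and serve a degraded response (for example from cache) while the breaker is open. Because old events age out of the window, the breaker closes again on its own once errors stop.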

7. Continuous Improvement – Review, Blue‑Team, and Metrics

Every incident is recorded in a case pool with fields such as discovery time, intervention time, stop‑loss method, and recovery time. The team performs weekly post‑mortems, extracts common failure patterns, and updates the “three‑level” metrics (MTTD, MTTR, interception rate). Automated case‑collection scripts enforce a standardized data model, whose fields are listed below.

// Example of case‑collection fields
creator, department, CS‑level, date, description, impact, module, discoveryMethod, discoveryStage, emergencyTriggered, businessCategory, stopLossMethod, source, sourceTime, occurrenceTime, detectionTime, interventionTime, stopLossTime, recoveryTime, closureTime, caseMaterials, owner, ownerGroup, ownerDirector, interceptionEnabled, autoTested, MTTR, MTTD, interventionDelay, stopLossDelay, recoveryDelay, processViolation, recordTime
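The derived fields at the end of this list (MTTR, MTTD, and the three delays) can presumably be computed from the timestamp fields. The sketch below shows one way to do that. The exact definitions are assumptions, since the article does not give them: here time to detect runs from occurrence to detection, time to recover from occurrence to recovery, and each delay is measured from the preceding step. `CaseTimes`, `derive_delays`, and `mean_delta` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Sequence


@dataclass
class CaseTimes:
    """Subset of the case record: only the timestamps used for derived metrics."""
    occurrence_time: datetime
    detection_time: datetime
    intervention_time: datetime
    stop_loss_time: datetime
    recovery_time: datetime


def derive_delays(case: CaseTimes) -> Dict[str, timedelta]:
    """Compute per-case durations (definitions are assumptions, see lead-in)."""
    return {
        "timeToDetect": case.detection_time - case.occurrence_time,
        "timeToRecover": case.recovery_time - case.occurrence_time,
        "interventionDelay": case.intervention_time - case.detection_time,
        "stopLossDelay": case.stop_loss_time - case.intervention_time,
        "recoveryDelay": case.recovery_time - case.stop_loss_time,
    }


def mean_delta(values: Sequence[timedelta]) -> timedelta:
    """Average a non-empty list of durations, e.g. to obtain MTTD or MTTR."""
    if not values:
        raise ValueError("need at least one duration")
    return sum(values, timedelta()) / len(values)
```

MTTD over a review period is then `mean_delta` applied to the `timeToDetect` values of every case in that period, and MTTR is computed the same way from `timeToRecover`.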

8. Governance Model – From Top‑Down to Bottom‑Up

The initial top‑down model required a central stability team to drive all initiatives, which proved costly and blind to many system nuances. The later bottom‑up model assigns a stability owner in each product team, defines three protection circles (global, team, module), and empowers owners to design, test, and roll out their own reliability solutions while reporting to the central stability office.

9. Automation – Large‑Scale Healing

Automation is woven throughout the workflow: on‑call alert routing, one‑click emergency dashboards, good‑case generation, automated rollback pipelines, and continuous integration checks for degradation rules. The diagram below illustrates the end‑to‑end automated pipeline.

In summary, Tencent Search’s stability engineering combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous learning to achieve an order‑of‑magnitude improvement in incident detection and resolution while maintaining high service availability.
