
From Firefighter to Automation: Tencent’s Ops Veteran Shares 10‑Year Infrastructure Secrets

Veteran Tencent operations leader Zhao Jianchun recounts a decade of managing 100,000 servers, detailing the L5 fault‑tolerant system, unified framework, resource packaging, CMDB virtual imaging, and an automated deployment platform that together cut daily incidents by up to 90% and boosted efficiency tenfold.

dbaplus Community

Fault‑tolerant Service Discovery (L5)

The L5 system works like a DNS‑style service registry combined with an agent layer. Each service module registers its instances (IP + port) together with runtime metrics such as success rate and latency. The L5 agent continuously aggregates these metrics and adjusts a weight for every instance:

If an instance’s failure rate exceeds a threshold, its weight is reduced or the instance is ejected, providing immediate fault isolation.

When an instance is only slightly degraded, its weight is lowered rather than dropped to zero, so it receives a smaller share of traffic, much as in a gray release.

Weight‑based routing automatically balances load among healthy instances.

Because routing, load balancing, gray release and fault tolerance are all driven by the same per-instance data, the system can reduce daily incidents by 80-90% and eliminate the need for frequent IP + port changes.
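The weight adjustment described above can be illustrated with a minimal sketch. It is not L5's actual implementation: the thresholds, the base weight, and the names Instance, adjust_weight and pick are all assumptions made for this example, since the article does not give L5's real values or API.

```python
import random
from dataclasses import dataclass

# Hypothetical values; the article does not state L5's real thresholds.
EJECT_FAILURE_RATE = 0.5     # at or above this, stop routing to the instance
DEGRADED_FAILURE_RATE = 0.1  # at or above this, reduce the weight
BASE_WEIGHT = 100


@dataclass
class Instance:
    ip: str
    port: int
    successes: int = 0
    failures: int = 0
    weight: int = BASE_WEIGHT

    def failure_rate(self) -> float:
        total = self.successes + self.failures
        return self.failures / total if total else 0.0


def adjust_weight(inst: Instance) -> None:
    """Recompute one instance's weight from its reported metrics."""
    rate = inst.failure_rate()
    if rate >= EJECT_FAILURE_RATE:
        inst.weight = 0  # fault isolation: eject the instance
    elif rate >= DEGRADED_FAILURE_RATE:
        inst.weight = round(BASE_WEIGHT * (1 - rate))  # degraded: less traffic
    else:
        inst.weight = BASE_WEIGHT


def pick(instances: list[Instance], rng: random.Random) -> Instance:
    """Weighted random choice among instances with a non-zero weight."""
    healthy = [i for i in instances if i.weight > 0]
    if not healthy:
        raise RuntimeError("no healthy instance available")
    return rng.choices(healthy, weights=[i.weight for i in healthy], k=1)[0]
```

In this sketch a caller asks pick for a target on every request instead of hard-coding an IP + port, so ejecting or down-weighting an instance takes effect without any change on the caller's side.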

L5 fault‑tolerant architecture

Unified Communication Framework

Network communication is abstracted into a standard framework that separates the transport layer from business logic. The ingress layer uses QZHTTP to accept external requests, while the business logic is compiled into shared-object (SO) dynamic libraries (similar to CGI) that run on the SPP and SF frameworks. This clear separation yields:

Lower learning cost for new engineers.

Higher stability because the transport stack and business code evolve independently.

Cross‑business code reuse and a potential ten‑fold increase in operational efficiency.
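The separation of transport from business logic can be sketched as follows. This is an illustration only, not the QZHTTP, SPP or SF interface: the Framework class, the frame layout (2-byte command, 4-byte length, payload) and the handler registration are assumptions, and plain Python callables stand in for the SO libraries the article mentions.

```python
import struct
from typing import Callable

# A business handler sees only a decoded payload and returns a reply payload.
Handler = Callable[[bytes], bytes]


class Framework:
    """Hypothetical transport layer: owns framing, knows no business rules."""

    def __init__(self) -> None:
        self._handlers: dict[int, Handler] = {}

    def register(self, command: int, handler: Handler) -> None:
        self._handlers[command] = handler

    def serve_one(self, frame: bytes) -> bytes:
        # Assumed frame layout: 2-byte command, 4-byte length, then payload.
        command, length = struct.unpack_from("!HI", frame, 0)
        payload = frame[6:6 + length]
        reply = self._handlers[command](payload)
        return struct.pack("!HI", command, len(reply)) + reply


def echo_upper(payload: bytes) -> bytes:
    """Business logic: no sockets and no framing, only the payload."""
    return payload.upper()
```

Because echo_upper never touches the wire format, the framing in serve_one could change without recompiling or retesting the business code, which is the stability benefit the list above describes.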

Unified framework diagram

Standardized Resource Packaging

Every service artifact is packaged into a uniform "box" that provides the same lifecycle commands:

install    # deploy files, create directories, set permissions
uninstall  # clean up all resources
start      # launch the service process
stop       # gracefully terminate the process

This approach makes deployment, rollback and scaling predictable and repeatable across all services.
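One way to see why a uniform lifecycle makes rollback predictable is the sketch below. The Package base class mirrors the four commands listed above; RecordingPackage and the upgrade function are hypothetical helpers written for this example and are not part of Tencent's tooling.

```python
from abc import ABC, abstractmethod


class Package(ABC):
    """The four lifecycle commands every packaged service exposes."""

    @abstractmethod
    def install(self) -> None: ...

    @abstractmethod
    def uninstall(self) -> None: ...

    @abstractmethod
    def start(self) -> None: ...

    @abstractmethod
    def stop(self) -> None: ...


class RecordingPackage(Package):
    """Toy package that only records which commands were called."""

    def __init__(self, name: str, log: list[str], fail_on_start: bool = False):
        self.name, self.log, self.fail_on_start = name, log, fail_on_start

    def install(self) -> None:
        self.log.append(f"{self.name} install")

    def uninstall(self) -> None:
        self.log.append(f"{self.name} uninstall")

    def start(self) -> None:
        self.log.append(f"{self.name} start")
        if self.fail_on_start:
            raise RuntimeError(f"{self.name} failed to start")

    def stop(self) -> None:
        self.log.append(f"{self.name} stop")


def upgrade(old: Package, new: Package) -> None:
    """Replace old with new; restore old if new cannot be installed or started."""
    old.stop()
    old.uninstall()
    try:
        new.install()
        new.start()
    except Exception:
        new.uninstall()
        old.install()
        old.start()
        raise
```

The upgrade function never inspects what is inside a package. It relies only on the four commands, so the same routine works for any service that follows the convention.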

Resource packaging diagram

CMDB‑based Virtual Service Image

All resources required by a module (databases, caches, message queues, configuration files, etc.) are recorded in a second‑level CMDB. The CMDB entry forms a "virtual image" that captures the complete dependency graph of the service. Benefits include:

Instant query of a module’s full operational footprint.

Elimination of separate documentation – the CMDB is the single source of truth.

Faster decision-making and automated scheduling.
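The idea of querying a module's full footprint from its CMDB entry can be sketched with a small graph traversal. The ModuleRecord fields, the resource naming such as "mysql:users", and the assumption that modules can depend on other modules are all hypothetical; the article does not describe the real CMDB schema.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleRecord:
    """One hypothetical CMDB entry: what a module needs directly."""
    name: str
    resources: list[str] = field(default_factory=list)   # e.g. "mysql:users"
    depends_on: list[str] = field(default_factory=list)  # other module names


def footprint(cmdb: dict[str, ModuleRecord], module: str) -> set[str]:
    """Return every resource the module needs, directly or transitively."""
    seen: set[str] = set()
    resources: set[str] = set()
    stack = [module]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # guards against dependency cycles
        seen.add(name)
        record = cmdb[name]
        resources.update(record.resources)
        stack.extend(record.depends_on)
    return resources
```

A scheduler that needs to scale or relocate a module could call footprint once and learn everything that must be provisioned alongside it, without consulting separate documentation.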

CMDB virtual image

Automated Deployment & Decision Scheduling Platform

The internal platform (named “织云”, Zhiyun) orchestrates the full service lifecycle in 23 visualized steps:

Resource request – allocate machines, network, storage.

Package retrieval – pull the boxed artifact from the resource repository.

Pre‑deployment checks – verify dependencies, run static analysis.

Deploy – copy files, register the instance with L5, start the process.

Post‑deployment verification – ensure the process is alive and health checks pass.

Business‑level testing – run functional smoke tests.

Release – mark the deployment as successful (green) or failed (red); pending steps remain gray.

Rollback – if any step fails, the platform automatically reverts to the previous version.

… (additional steps cover monitoring integration, alarm configuration, capacity planning, etc.)

The platform can be triggered manually or by an automated policy, enabling rapid scaling, routine drills and graceful rollbacks.
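The step states and the automatic rollback described above can be sketched as a small pipeline runner. This is an assumption-laden illustration rather than the platform's real engine: the Status enum, the run_pipeline function and the step names used with it are invented for this example, and only the green, red and gray colors come from the article.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    GRAY = "pending"
    GREEN = "succeeded"
    RED = "failed"


def run_pipeline(
    steps: list[tuple[str, Callable[[], None]]],
    rollback: Callable[[], None],
) -> dict[str, Status]:
    """Run steps in order; on the first failure, roll back and stop."""
    status = {name: Status.GRAY for name, _ in steps}
    for name, action in steps:
        try:
            action()
        except Exception:
            status[name] = Status.RED
            rollback()  # revert to the previous version
            break       # later steps stay gray
        status[name] = Status.GREEN
    return status
```

Returning the per-step status map is what would let a dashboard render each step in its color, and the same function can be called from a manual trigger or from an automated policy.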

Automated deployment workflow

Overall Three‑Tier Architecture

Traffic flows through three logical layers:

Ingress layer (TGW) – receives external requests.

Middle layer (L5) – performs service discovery, weight‑based routing and fault isolation.

Storage layer – unified under an Access‑based standard, providing consistent data‑service interfaces.

Three‑tier architecture diagram

Evolution of Operational Standards

Adopting the above practices was a gradual, multi‑stage process that required:

Tooling development (L5 agents, CMDB extensions, deployment engine).

Process redesign (formalizing the 23‑step deployment pipeline).

Cultural shift – moving from reactive “fire‑fighter” mode to proactive reliability engineering.

Ops standards evolution

In summary, the combination of a fault‑tolerant service discovery layer (L5), a unified communication framework, standardized resource packaging, a CMDB‑driven virtual service image, and an automated deployment platform dramatically reduces incidents, simplifies maintenance, and enables efficient scaling of large‑scale online services.
