Automation and Operational Management for Large-Scale Infrastructure at the Architecture Platform
The article explains how the Architecture Platform team builds a comprehensive, automated operations system—including CMDB, cost budgeting, monitoring, permission management, self‑service tools, and mobile access—to safely and efficiently manage tens of thousands of servers and massive storage services.
Before introducing our operation system, it is necessary to briefly describe the business and its characteristics to help readers understand why the operation system is designed this way.
Figure 1 Business characteristics served by the Architecture Platform.
The Architecture Platform (abbreviated "架平" in Chinese) supports Tencent’s massive storage and CDN scenarios, such as WeChat chat images and videos, Moments images and short videos, QZone album media, Tencent Video on‑demand, live streaming, Tencent Cloud, and Weiyun files. To do so it operates tens of thousands of servers, hundreds of data centers, dozens of Tbps of download bandwidth, and exabyte‑level storage.
TFS‑type storage provides file‑system‑like services (e.g., Weiyun files, Moments images). TDB‑type storage offers KV services for QZone feeds, posts, and TFS index data. Details of TFS/TDB will be covered in other articles.
Figure 2 Architecture Platform operation system.
The operation system consists of five major components: basic configuration (CMDB), cost budgeting, reporting, workflow and testing, quality monitoring, and live‑network operations. Together they form a complete operations system that keeps the business running safely, reliably, and efficiently.
CMDB: Basic configuration management covering devices, data centers, and services, providing functions such as pre‑registration, acceptance, high‑risk port management, and device retirement.
Cost budgeting: Since the department handles massive storage and CDN services, a rigorous cost control and accounting system is essential.
Reporting & workflow & testing: A unified reporting system, workflow for change management and incident follow‑up, and automated testing for quality assurance.
Quality monitoring: Real‑time monitoring of hundreds of thousands of servers and thousands of services, with alerts reaching the responsible personnel within seconds.
Live‑network operations: For a few dozen machines, simple scripts (ssh + expect) suffice, but tens of thousands of servers spread across various carriers and overseas data centers (e.g., AWS) require a dedicated system.
Today we focus on live‑network operations in massive device and service scenarios.
Automation Operations: Background
Automation operations cover tasks performed on production machines, such as scaling, changes, and anomaly analysis. All of these involve logging into production machines, modifying files, and executing commands.
When a service has only dozens of machines, an Excel sheet plus ssh is enough to manage changes; the early QQ backend was run this way.
When a service grows to hundreds of machines and the services diversify, a standardized CMDB and tools such as expect or Ansible become necessary.
When a service grows to tens of thousands of machines around the globe, open‑source automation tools are no longer sufficient, and a custom operation management system is required.
Our Business Requirements for Automation Operations
The goal is maximum efficiency without compromising safety. These two aims pull against each other, and we reconcile them by building an automated operation system that reduces manual intervention.
Automation Operations System Construction
Figure 3 Production‑machine permission management system.
We use the TEG security platform’s "Iron General" plus our own hierarchical permission system to control server access.
Iron General provides login authentication and shell command interception. Our permission system manages virtual business platforms, permission groups, staff‑group mappings, CMDB‑module relationships, and command whitelists, pushing this data to Iron General.
When a user logs in via ssh, Iron General checks login permission and, after login, validates each command against the whitelist, blocking unauthorized commands.
This isolation prevents cross‑business access, limits root‑like privileges, and balances flexibility with security.
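The per‑command whitelist check can be illustrated with a minimal sketch. This is not Iron General's implementation, which the article does not describe; the group names, user names, and the `is_command_allowed` helper are all hypothetical, and the sketch takes the conservative route of rejecting any command line that contains shell metacharacters.

```python
import shlex

# Hypothetical data that a permission system might push to the interceptor:
# permission group -> set of allowed command names.
WHITELIST = {
    "storage-ops": {"ls", "df", "tail", "crontab"},
    "cdn-ops": {"ls", "netstat"},
}

# Hypothetical staff -> permission group mapping.
STAFF_GROUPS = {"alice": "storage-ops", "bob": "cdn-ops"}

# Characters that could chain or substitute commands; rejected outright here.
FORBIDDEN_CHARS = set(";|&`$<>\n")


def is_command_allowed(user: str, command_line: str) -> bool:
    """Return True only if the command name is whitelisted for the user's group."""
    group = STAFF_GROUPS.get(user)
    if group is None:
        return False  # unknown user: no login permission at all
    if any(ch in FORBIDDEN_CHARS for ch in command_line):
        return False  # avoid having to parse pipelines or subshells
    try:
        tokens = shlex.split(command_line)
    except ValueError:
        return False  # e.g. unbalanced quotes: reject rather than guess
    if not tokens:
        return False
    return tokens[0] in WHITELIST.get(group, set())
```

With this data, `is_command_allowed("alice", "df -h")` is accepted, while the same user running `ls && rm -rf /` is blocked because of the `&` characters.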
Figure 4 Self‑service operation system.
Directly logging into production machines is inefficient; our self‑service system encapsulates common operations (tools) and business‑specific workflows (processes) with security grading, allowing batch execution across multiple machines.
Security grading defines risk levels. High‑risk operations require approval every time. Low‑risk operations can be executed without approval up to a daily limit; executions beyond that limit require approval.
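The grading rule can be sketched as a small policy object. The class name `ApprovalPolicy`, the default limit of 20, and the choice to count executions per user and per tool are assumptions made for illustration; the article does not say how the limit is scoped.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ApprovalPolicy:
    """Decide whether a tool execution needs approval (hypothetical sketch)."""

    daily_limit: int = 20  # assumed value, not taken from the article
    _counts: dict = field(
        default_factory=lambda: defaultdict(int), init=False, repr=False
    )

    def needs_approval(self, user: str, tool: str, risk: str, today: date) -> bool:
        if risk == "high":
            return True  # high-risk operations are approved every time
        # Low risk: count this execution against the user's quota for the day.
        key = (user, tool, today)
        self._counts[key] += 1
        return self._counts[key] > self.daily_limit
```

Because the date is part of the counter key, the quota resets naturally on the next day without any cleanup job.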
Common operations (e.g., adding or removing crontab entries) are packaged as tools; business‑specific processes combine tools to achieve complex tasks like one‑click module deployment or automated anomaly handling.
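One way to picture the relationship between tools and processes is a registry of small functions that a process runs in order on each machine. The sketch below is illustrative only: the `tool` decorator, the tool names, and `run_process` are hypothetical, and the tools return status strings instead of touching real machines.

```python
from typing import Callable, Dict, List

# A tool takes a host name and returns a status message.
Tool = Callable[[str], str]

TOOLS: Dict[str, Tool] = {}


def tool(name: str) -> Callable[[Tool], Tool]:
    """Register a function as a named, reusable tool."""
    def register(fn: Tool) -> Tool:
        TOOLS[name] = fn
        return fn
    return register


@tool("install_package")
def install_package(host: str) -> str:
    return f"{host}: package installed"


@tool("add_crontab")
def add_crontab(host: str) -> str:
    return f"{host}: crontab entry added"


def run_process(steps: List[str], hosts: List[str]) -> List[str]:
    """Run the named tools in order on every host and collect the results."""
    log: List[str] = []
    for host in hosts:
        for step in steps:
            log.append(TOOLS[step](host))
    return log
```

A deployment process is then just an ordered list of tool names, for example `run_process(["install_package", "add_crontab"], ["host-1", "host-2"])`.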
Figure 5 One‑click deployment of a business module.
Different business modules may require additional steps such as TGW (Tencent’s external load balancer) or CL5 (Tencent’s name service load balancer) provisioning and high‑risk port registration; dedicated processes enable secure, efficient one‑click scaling.
Figure 6 Full workflow for anomaly detection, analysis, and handling.
When operating at massive scale, hardware failures, network anomalies, and software bugs are common. Our end‑to‑end pipeline (monitoring → analysis → automatic handling) triggers analysis automatically and, based on the result, either invokes automated remediation or notifies the responsible personnel. In many cases the handling is fully automated.
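The monitoring → analysis → handling flow can be sketched as a dispatcher that falls back to notifying a person when no automated remediation is known. The metric names, thresholds, and remediation messages below are invented for illustration and do not come from the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    host: str
    metric: str
    value: float


def analyze(alert: Alert) -> str:
    """Map an alert to a diagnosis using hypothetical rules."""
    if alert.metric == "disk_io_errors" and alert.value > 0:
        return "disk_failure"
    if alert.metric == "packet_loss" and alert.value >= 0.05:
        return "network_anomaly"
    return "unknown"


# Diagnosis -> automated remediation (placeholders that only return a message).
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "disk_failure": lambda host: f"started disk replacement workflow on {host}",
    "network_anomaly": lambda host: f"drained traffic from {host}",
}


def handle(alert: Alert) -> str:
    """Run automated remediation when one exists, otherwise notify a person."""
    action = REMEDIATIONS.get(analyze(alert))
    if action is None:
        return f"notified owner of {alert.host}: {alert.metric}={alert.value}"
    return action(alert.host)
```

Keeping the analysis rules separate from the remediation table means a new failure type can first be routed to a person and automated later by adding one table entry.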
Disk failures are frequent; our automated disk‑replacement workflow handles discovery, service pause, data migration, on‑site replacement notification, confirmation, initialization, and reintegration, with only the physical replacement step requiring manual effort.
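The disk‑replacement workflow reads naturally as a linear state machine that runs on its own until it needs a person. The sketch below uses the step names from the paragraph above; the `advance` function and the idea of modelling the manual step as a gate before confirmation are assumptions made for illustration.

```python
from typing import List

# Steps as listed in the text; the physical replacement happens between
# "replacement_notification" and "confirmation".
STEPS: List[str] = [
    "discovery",
    "service_pause",
    "data_migration",
    "replacement_notification",
    "confirmation",
    "initialization",
    "reintegration",
]


def advance(state: int, disk_replaced: bool) -> int:
    """Run automated steps from index `state` and return the new index.

    The workflow stops before "confirmation" until on-site staff report
    that the disk has been physically replaced.
    """
    while state < len(STEPS):
        if STEPS[state] == "confirmation" and not disk_replaced:
            break  # the only point that waits for manual work
        state += 1  # a real system would execute the step here
    return state
```

Calling `advance(0, False)` stops at the confirmation step; calling `advance` again with `disk_replaced=True` runs the remaining steps through reintegration.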
In summary, we introduced the Architecture Platform’s self‑service operation system, production‑machine permission management, and several safety‑oriented, high‑efficiency automation practices.
Mobile Access
In the mobile era, on‑site or off‑site incidents can be addressed via the enterprise‑level mobile app, allowing users to interact with production machines from their phones, eliminating the need to return to a computer.
Tencent Architect
We share insights on storage, computing, networking and explore leading industry technologies together.
