Tencent Cloud Database Massive Operations: Team Building, Automated Operations Platform, and Intelligent Practices
Tencent Cloud Database’s massive‑operation strategy combines a dedicated architect team, a three‑layer automated platform for resource, task and health management, and AI‑driven intelligent services that customize workloads, automate tuning, and enable proactive scaling and self‑healing across more than 100,000 instances.
Author Introduction: Lu Yue, leader of the Tencent Cloud Database Architecture team, responsible for pre‑sale architecture, operation, and tuning of MySQL, Redis, Oracle, etc.
This article shares experience with massive‑scale database operation in three parts: (1) Building the database architect team, (2) Constructing an automated operation platform, (3) Practicing intelligent massive operation.
1. Building the Database Architect Team
Reason for formation: Diverse customer scenarios make it hard for a single pre‑sale architect to master every use case, and post‑sale engineers also lack deep expertise across all products, leading to service quality issues. Hence a dedicated architect team was created.
Division of labor: The service system adopts a three‑layer architecture – the first layer handles platform stability (operations), the second layer consists of architects who drive difficult tasks such as database construction and tool development, and the third layer is frontline service engineers handling consulting and workflow issues.
The team’s work covers four areas: customer operation, solution delivery (basic and industry‑specific), service system (platform and product operation), and platform construction (customer operation platform, solution export, supporting service system).
One concrete product is the CDB WeChat assistant, which provides proactive push and pull capabilities to help frontline staff serve customers better.
2. Construction of the Automated Operations Platform
With over 100,000 instances and 20,000 physical machines, a stable automated platform is essential. The platform provides functions such as resource management, operation tasks (upgrade, quota control), monitoring (performance and availability), and self‑healing (automatic detection and recovery of common issues).
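The self‑healing function described above (automatic detection and recovery of common issues) can be pictured as a table of detection rules paired with remedies. The following is only a minimal sketch under stated assumptions: the rule names, status fields, thresholds, and remedy strings are all hypothetical and are not taken from the Tencent platform.

```python
from typing import Callable

Check = Callable[[dict], bool]

# Hypothetical rules: (issue name, detection predicate, remedy description).
# Field names and thresholds are illustrative assumptions only.
RULES: list[tuple[str, Check, str]] = [
    ("disk_nearly_full", lambda s: s["disk_used_ratio"] > 0.9, "purge expired logs"),
    ("replica_lagging", lambda s: s["replica_lag_s"] > 300, "rebuild replica"),
]


def self_heal(status: dict) -> list[str]:
    """Return the remedies triggered by one instance's status snapshot."""
    actions = []
    for name, detect, remedy in RULES:
        if detect(status):
            actions.append(f"{name}: {remedy}")
    return actions


if __name__ == "__main__":
    snapshot = {"disk_used_ratio": 0.95, "replica_lag_s": 12}
    print(self_heal(snapshot))
```

Keeping detection and remedy in one table makes it easy to add a new common issue without touching the loop; a real system would execute the remedy and verify recovery rather than return a description.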
Platform Architecture
The platform consists of three layers from bottom to top: client (APP entry), middle layer (customer entry), and backend. Users can access the platform via website or API to perform resource management, instance management, data transfer, monitoring, alerts, reporting, and holistic monitoring. All operation data is aggregated into an operational database for further big‑data analysis.
Monitoring Module
The monitoring module has two main branches: performance monitoring and availability monitoring. It uses a DB master and a probing server (拨测Svr). Performance monitoring of CDB instances is handled by the cdb_report module, feeding data into the Apd Netman module.
Challenges in massive‑scale probing include probe server performance and single‑point‑of‑failure risks. The design addresses these by optimizing request latency, scaling probe servers, and providing disaster‑recovery deployment.
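One way to scale probe servers while avoiding a single point of failure is to spread instances across servers with rendezvous (highest‑random‑weight) hashing, so that when a server fails only its own instances move to the survivors. This is an illustrative sketch, not the design the article describes; the server names and instance IDs are hypothetical.

```python
import hashlib


def _weight(server: str, instance_id: str) -> int:
    """Deterministic pseudo-random weight for a (server, instance) pair."""
    digest = hashlib.sha256(f"{server}:{instance_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def assign_probe_server(instance_id: str, healthy_servers: list[str]) -> str:
    """Pick the healthy probe server with the highest weight for this instance."""
    if not healthy_servers:
        raise RuntimeError("no healthy probe servers available")
    return max(healthy_servers, key=lambda s: _weight(s, instance_id))


def plan(instances: list[str], healthy_servers: list[str]) -> dict[str, list[str]]:
    """Group instances by the probe server responsible for them."""
    result: dict[str, list[str]] = {s: [] for s in healthy_servers}
    for inst in instances:
        result[assign_probe_server(inst, healthy_servers)].append(inst)
    return result


if __name__ == "__main__":
    instances = [f"cdb-{i}" for i in range(1000)]
    before = plan(instances, ["probe-a", "probe-b", "probe-c"])
    after = plan(instances, ["probe-a", "probe-c"])  # probe-b has failed
    print({s: len(v) for s, v in before.items()})
    print({s: len(v) for s, v in after.items()})
```

Instances that were already on a surviving server keep their assignment after the failure, which limits the churn a failover causes.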
3. Intelligent Massive Operation Practice
Current limitations of the platform include lack of customized services for different industry scenarios and insufficient automatic diagnosis and tuning.
To address this, Tencent is developing an intelligent product that leverages data mining and architect‑customer communication to profile database usage and provide customized services.
Customization Services
Four typical workload types are identified:
Compute‑intensive applications (e.g., BI reports) – enable off‑peak resource over‑provisioning.
Storage‑intensive applications – provide automatic loading and compression engines.
Traffic‑intensive applications – offer self‑service SQL optimization tools.
Hotspot applications (e.g., news, red‑packet services) – support dynamic instance scaling before peak periods.
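The mapping from the four workload types above to their customized services can be sketched as a lookup table plus a simple rule‑based classifier. The article says profiling relies on data mining and architect‑customer communication, so the classifier here is only a stand‑in: the profile fields (`cpu_util`, `disk_util`, `qps_util`, `peak_to_avg_qps`) and the threshold are hypothetical assumptions.

```python
from enum import Enum


class Workload(Enum):
    COMPUTE = "compute-intensive"
    STORAGE = "storage-intensive"
    TRAFFIC = "traffic-intensive"
    HOTSPOT = "hotspot"


# Customized service per workload type, as listed in the article.
CUSTOMIZATION = {
    Workload.COMPUTE: "off-peak resource over-provisioning",
    Workload.STORAGE: "automatic loading and compression engines",
    Workload.TRAFFIC: "self-service SQL optimization tools",
    Workload.HOTSPOT: "dynamic instance scaling before peak periods",
}


def classify(profile: dict[str, float]) -> Workload:
    """Hypothetical rule: bursty traffic means hotspot, else the busiest dimension wins."""
    if profile["peak_to_avg_qps"] >= 5.0:  # assumed burstiness threshold
        return Workload.HOTSPOT
    dims = {
        Workload.COMPUTE: profile["cpu_util"],
        Workload.STORAGE: profile["disk_util"],
        Workload.TRAFFIC: profile["qps_util"],
    }
    return max(dims, key=lambda w: dims[w])


if __name__ == "__main__":
    bi_report = {"cpu_util": 0.9, "disk_util": 0.3, "qps_util": 0.2, "peak_to_avg_qps": 1.5}
    kind = classify(bi_report)
    print(kind.value, "->", CUSTOMIZATION[kind])
```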
Database Automatic Tuning
Real‑time and predictive analysis of each instance's historical data is used to generate quality scores. By adjusting the weightings for CPU, storage, and other dimensions, and applying big‑data/AI models, the system can automatically optimize databases or suggest improvements.
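A weighted quality score of this kind can be sketched as follows. The article does not give the actual formula, so this is an assumption for illustration: each metric is a utilization ratio in [0, 1], higher utilization lowers the score, and the weights express how much each dimension matters for a given instance.

```python
def quality_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Return a 0-100 score; 100 means no load on any weighted dimension.

    Assumed model (not from the source): score = 100 * (1 - weighted mean utilization).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    penalty = sum(
        weights[name] * min(max(metrics[name], 0.0), 1.0)  # clamp to [0, 1]
        for name in weights
    ) / total
    return round(100.0 * (1.0 - penalty), 1)


if __name__ == "__main__":
    metrics = {"cpu": 0.9, "storage": 0.5}
    # Equal weighting versus a CPU-heavy weighting for the same instance.
    print(quality_score(metrics, {"cpu": 0.5, "storage": 0.5}))
    print(quality_score(metrics, {"cpu": 0.8, "storage": 0.2}))
```

Shifting weight toward CPU lowers the score for a CPU‑bound instance, which is the effect adjusting the weightings is meant to have; a production system would presumably learn or tune these weights from historical data.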
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
