
TDSQL Intelligent Operation Platform – Bianque Architecture and Practice

Bianque, TDSQL’s intelligent operation platform, automatically collects and indexes database metrics, applies a knowledge‑base‑driven analysis engine to diagnose availability, performance, and reliability issues, and issues risk warnings and optimization recommendations, substantially cutting DBA effort and support tickets across Tencent’s cloud services.

Tencent Cloud Developer

At the 2019 China Database Conference (DTCC) held from May 8‑10, Tencent Cloud database engineer Lei Hailin delivered a technical talk titled “TDSQL Intelligent Operation Platform – Bianque Architecture and Practice”. The following is a full transcript of the presentation.

1. Introduction to Bianque

Bianque is a TDSQL product for the cloud market that automatically analyzes database performance problems and faults and gives users optimization or remediation advice.

2. Demand background

TDSQL is Tencent’s high‑consistency, distributed database solution for financial scenarios. It now serves more than 90% of Tencent’s payment business and many government, banking, insurance, logistics, and e‑commerce customers on both public and private clouds. As the number of clusters and customers grows, operational challenges such as repetitive DBA work, slow fault diagnosis, and scalability issues become significant.

To address these challenges, an automated fault/performance analysis system is needed to reduce DBA repetitive labor, capture expert knowledge, quickly locate problems, and improve response speed and DBA satisfaction.

The module is named “Bianque” after the ancient Chinese physician, in the hope that it can diagnose database ailments and prescribe remedies.

3. Functions of Bianque

During development, Bianque’s knowledge base was continuously fed with DBA expertise. Most online performance and fault issues can now be analyzed with a single click, which frees DBAs from repetitive work and improves operational efficiency. The core goals are:

Pre‑emptive risk warning

Accurate real‑time analysis and problem resolution

Post‑event historical analysis to discover hidden issues

4. System Architecture

Bianque consists of six layers (see diagram in the original slides):

Resource layer – provides raw information from DB instances and host machines.

Collection layer – gathers performance metrics, SQL logs, table schemas, etc., and forwards them to the storage layer.

Storage layer – persists the collected data for later historical analysis.

Index layer – extracts and classifies data from storage into programmable data structures required by the analysis layer.

Analysis layer – the core logic that combines indexed metadata with TDSQL’s knowledge base to perform root‑cause analysis and risk assessment for common anomalies such as master‑slave switch, replication delay, etc.

Presentation layer – visualizes analysis results as health reports and specific fault/performance/optimization suggestions.
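To make the role of the analysis layer more concrete, here is a minimal sketch of knowledge-base-driven diagnosis: rules are matched against a metric snapshot and each match yields a finding. The metric names, thresholds, and advice strings are assumptions invented for illustration; the talk does not describe Bianque's actual rule format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot produced by the index layer: metric name -> value.
Snapshot = Dict[str, float]


@dataclass
class Rule:
    name: str
    condition: Callable[[Snapshot], bool]
    advice: str


# Illustrative knowledge base. Thresholds are made up for this sketch.
KNOWLEDGE_BASE: List[Rule] = [
    Rule("replication_delay",
         lambda s: s.get("slave_delay_seconds", 0) > 10,
         "check for large transactions on the master"),
    Rule("slow_query_pileup",
         lambda s: s.get("running_slow_queries", 0) >= 20,
         "review the top slow-query fingerprints"),
    Rule("disk_nearly_full",
         lambda s: s.get("disk_used_ratio", 0) > 0.9,
         "expand storage or purge old data"),
]


def analyze(snapshot: Snapshot, rules: List[Rule] = KNOWLEDGE_BASE) -> List[str]:
    """Return one finding per rule whose condition matches the snapshot."""
    return [f"{r.name}: {r.advice}" for r in rules if r.condition(snapshot)]


findings = analyze({"slave_delay_seconds": 42,
                    "running_slow_queries": 3,
                    "disk_used_ratio": 0.5})
```

Keeping the rules as data rather than hard-coded branches is one way expert knowledge could be added over time without touching the engine itself.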

5. Intelligent Diagnosis Principles and Practice

The platform classifies database issues into three categories: availability, performance, and reliability.

5.1 Availability Issues

Availability problems are periods when the DB cannot respond to user requests. TDSQL provides high availability by automatically switching to a new master when the current one fails. Bianque monitors a heartbeat table written by an agent; consecutive heartbeat failures trigger a switch, which causes a brief outage lasting on the order of seconds. Bianque captures host‑level metrics (top, iotop, iostat) and DB snapshots (processlist, innodb_status) from before the switch to identify root causes such as:

Kernel bugs causing DB restart

Disk failures

Resource‑intensive user SQL (slow‑query concurrency, large transactions)

For example, excessive concurrent slow queries can saturate InnoDB threads, leading to heartbeat timeouts and a master‑slave switch. Bianque detects this by analyzing innodb_status and processlist, aggregating slow‑query fingerprints to pinpoint offending SQL.
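The fingerprint aggregation described above can be sketched roughly as follows. This is a simplified illustration, not Bianque's implementation: the `fingerprint` normalizer handles only quoted strings, numeric literals, and IN lists, and the `(sql, seconds)` session tuples stand in for rows of a processlist snapshot.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def fingerprint(sql: str) -> str:
    """Normalize a statement so queries differing only in literals collapse together."""
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)              # quoted string literals
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)              # numeric literals
    s = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?)", s)   # collapse IN (?, ?, ?) lists
    return re.sub(r"\s+", " ", s)


def top_slow_fingerprints(sessions: Iterable[Tuple[str, float]],
                          min_seconds: float = 1.0,
                          limit: int = 5) -> List[Tuple[str, int]]:
    """Count sessions running at least min_seconds, grouped by fingerprint."""
    counts = Counter(fingerprint(sql) for sql, secs in sessions if secs >= min_seconds)
    return counts.most_common(limit)


sessions = [
    ("SELECT * FROM orders WHERE uid = 1", 12),
    ("select * from orders  where uid = 2", 8),
    ("SELECT 1", 0.1),
    ("UPDATE t SET a = 'x' WHERE id = 3", 5),
]
ranking = top_slow_fingerprints(sessions)
```

Grouping by fingerprint rather than by raw text means that hundreds of concurrent sessions running the same query shape with different parameters show up as a single offender.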

Large transactions (for example, a massive DELETE) generate huge binlog writes that block heartbeat writes, so the heartbeat times out and a switch is triggered in this case as well. TDSQL now limits the binlog size per write to 1.5 GB to mitigate this.
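Both failure modes in this section end the same way: several heartbeat writes in a row fail and a switch follows. A minimal sketch of that consecutive-failure check is shown below. The function name and the default of three failures are assumptions for illustration; the talk does not state TDSQL's actual threshold.

```python
from typing import Iterable, Optional


def first_switch_index(heartbeats: Iterable[bool], max_failures: int = 3) -> Optional[int]:
    """Return the index of the heartbeat that would trigger a switch, or None.

    heartbeats is a sequence of results, True for a successful heartbeat write.
    max_failures is a hypothetical threshold of consecutive failures.
    """
    consecutive = 0
    for i, ok in enumerate(heartbeats):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= max_failures:
            return i
    return None


history = [True, False, False, True, False, False, False]
switch_at = first_switch_index(history)
```

Requiring consecutive failures, rather than any single failure, avoids switching on a one-off hiccup while still reacting within a few heartbeat intervals.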

5.2 Performance Issues

Performance degradation is most visible as long‑running SQL statements. Common causes include network latency, inefficient SQL, resource saturation, and lock waiting.

For inefficient SQL, Bianque parses the statement, examines accessed tables and data distribution, and automatically generates index‑optimization recommendations.
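As a rough illustration of how data distribution can feed an index recommendation, the sketch below orders the columns used in equality predicates by selectivity (distinct values divided by row count) and drops columns that barely filter. This is a common rule of thumb, not Bianque's documented algorithm; the function, its parameters, and the 0.01 cutoff are all hypothetical.

```python
from typing import Dict, List, Optional


def recommend_index(table: str,
                    predicate_columns: List[str],
                    distinct_counts: Dict[str, int],
                    row_count: int,
                    min_selectivity: float = 0.01) -> Optional[str]:
    """Suggest a composite index with the most selective columns first."""
    if row_count <= 0:
        return None
    # dict.fromkeys removes duplicate columns while keeping their order.
    scored = [(distinct_counts.get(c, 0) / row_count, c)
              for c in dict.fromkeys(predicate_columns)]
    kept = [c for sel, c in sorted(scored, key=lambda p: -p[0])
            if sel >= min_selectivity]
    if not kept:
        return None
    return f"CREATE INDEX idx_{table}_{'_'.join(kept)} ON {table} ({', '.join(kept)})"


suggestion = recommend_index(
    "orders",
    ["status", "user_id"],
    {"status": 5, "user_id": 200_000},
    row_count=1_000_000,
)
```

Here `status` has only five distinct values in a million rows, so it is left out of the suggestion and only `user_id` is indexed.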

When resources (CPU/IO) are saturated, Bianque’s session analysis aggregates sessions by SQL fingerprint, quickly identifying the queries that consume the most resources and linking them to optimization suggestions.

Lock‑wait problems are diagnosed by inspecting MySQL’s information_schema lock tables. Bianque identifies the “leader” session holding locks that block others and suggests terminating that session. It also handles cases where a transaction remains open for a long time, causing prolonged lock holding.
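Finding the “leader” session amounts to walking a wait graph to its roots. The sketch below assumes the lock tables have already been reduced to `(waiting_session, blocking_session)` pairs; that input shape and the function name are illustrative choices, not Bianque's interface.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_lock_leaders(waits: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map each root blocker to the number of sessions it blocks, directly or transitively.

    A root blocker (leader) blocks at least one session and waits on nobody.
    """
    blocked_by: Dict[int, Set[int]] = defaultdict(set)  # blocker -> direct waiters
    waiters: Set[int] = set()
    for waiter, blocker in waits:
        blocked_by[blocker].add(waiter)
        waiters.add(waiter)

    result: Dict[int, int] = {}
    for leader in [b for b in blocked_by if b not in waiters]:
        seen: Set[int] = set()
        stack = [leader]
        while stack:  # depth-first walk over everything stuck behind the leader
            current = stack.pop()
            for w in blocked_by.get(current, ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result[leader] = len(seen)
    return result


# Session 1 blocks 2 and 3, session 2 in turn blocks 4; session 5 blocks 6.
leaders = find_lock_leaders([(2, 1), (3, 1), (4, 2), (6, 5)])
```

Terminating session 2 in this example would not help session 4 for long, because session 2 is itself waiting on session 1. That is why the recommendation targets the root of the chain.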

5.3 Reliability Issues

Reliability concerns involve hidden risks that may not yet manifest as failures. Bianque combines performance monitoring, schema analysis, historical sessions, slow‑query logs, and machine‑learning models on massive cloud data to assess DB health, issue early warnings, and reduce future incidents.
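The talk does not detail the models behind these early warnings. As a deliberately simple stand-in, the sketch below fits a least-squares line to daily disk-usage ratios and estimates how many days remain before a threshold is crossed. The function name, the one-sample-per-day assumption, and the 0.9 threshold are all hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(daily_usage: Sequence[float],
                         threshold: float = 0.9) -> Optional[float]:
    """Estimate days until usage reaches threshold using a linear trend.

    daily_usage holds one usage ratio per day, oldest first.
    Returns None when there is too little data or usage is not growing.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    fitted_today = intercept + slope * (n - 1)
    return max((threshold - fitted_today) / slope, 0.0)


remaining = days_until_threshold([0.50, 0.52, 0.54, 0.56, 0.58])
```

With usage growing by two percentage points a day from 58%, the estimate comes out at roughly 16 days, which is the kind of lead time an early warning needs to be useful.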

6. Case Studies

Several real‑world examples were presented:

Lock‑timeout case: a session held a row lock for >50 s, blocking another session. Bianque identified the blocking session via lock tables and recommended termination.

Large‑transaction case: a massive DELETE caused binlog blockage and heartbeat timeout; Bianque pinpointed the offending SQL and highlighted the binlog‑size limit mitigation.

Availability‑switch case: Bianque automatically correlated host metrics, innodb_status, and processlist to determine that concurrent slow queries triggered a master‑slave switch.

7. Summary

The presentation concluded that Bianque was born from TDSQL’s operational pain points and offers a layered architecture with automated diagnosis for master‑slave switches, lock waits, and other common problems. Since its deployment, performance‑related support tickets have dropped to near zero, greatly improving DBA productivity. Future work will further integrate AI and machine learning to predict and prevent a broader range of anomalies, with the aim of proactive DB operation.
