How Tencent’s BlueKing Platform Automates Ops: Key Takeaways from the Efficient Operations Talk
This article summarizes a detailed Q&A from the Efficient Operations talk, covering BlueKing’s integration with databases, agent resource management, alarm de‑duplication, automation workflows, development language choices, data handling, and the platform’s suitability for various enterprise environments.
Participants
cloudliu@Tencent Games
Shaoyongqiang@HW‑Shenzhen
Zhang Lei@Zhejiang University‑Hangzhou
Wang Qi@Wall Street Journal‑Shanghai
Editors
Cheng Cheng@iPulse (content collection & article organization)
Xiao Tianguo (review & publishing)
Guest Introduction
Dang Shouhui is Director of the BlueKing Product Center at Tencent Games. He has years of experience managing ops teams and has designed automated operation systems for large‑scale game platforms. He now leads the BlueKing ops support system, focusing on industry‑grade unattended operations and data‑driven value‑added solutions.
Part 1: BlueKing and Database Collaboration
Q1: How do you handle relational database updates? A1: A dedicated DBA team provides a unified system with APIs for updates. Without such a team, you can either encapsulate the updates in scripts that are invoked as atomic calls, or pause the workflow node until the DBA finishes the update.
Q2: How many DBAs and databases are managed? A2: Over 20 DBAs maintain nearly 20,000 DB instances, also handling storage development such as TMySQL and TSpider.
Q3: Does the DBA team use BlueKing monitoring? A3: DBAs use a specialized system called GCS, which offers DB monitoring and operations and exposes APIs to BlueKing.
Part 2: Agent and Data Collection
Q4: How does BlueKing handle resource usage and isolation for agents? A4: Agents typically use less than 1% of system resources. Usage is checked every 5 seconds, and if it exceeds the threshold on three consecutive checks, the agent terminates itself.
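The self‑termination rule in A4 can be sketched as follows. This is a minimal illustration, not BlueKing's agent code (which the talk says is written in C): the function name and the threshold value passed by the caller are hypothetical, and only the 5‑second interval and the three‑consecutive‑breaches rule come from the talk.

```python
from typing import Iterable

CHECK_INTERVAL_SECONDS = 5     # from the talk: usage is checked every 5 seconds
MAX_CONSECUTIVE_BREACHES = 3   # from the talk: three consecutive breaches


def should_self_terminate(samples: Iterable[float], threshold: float) -> bool:
    """Return True once three consecutive samples exceed `threshold`.

    `samples` stands for resource-usage readings taken one per check
    interval; `threshold` is a hypothetical limit chosen by the caller.
    """
    consecutive = 0
    for usage in samples:
        if usage > threshold:
            consecutive += 1
            if consecutive >= MAX_CONSECUTIVE_BREACHES:
                return True
        else:
            # A single healthy reading resets the streak.
            consecutive = 0
    return False
```

Requiring consecutive breaches rather than a single one keeps a short spike from killing the agent, while three checks at 5 seconds bound the reaction time to roughly 15 seconds.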
Q5: How is unstructured data from custom agents automatically analyzed? A5: Reported data follows a fixed prefix; the remaining fields are defined by ops using a YAML‑based syntax. Changes in format require updating the YAML logic.
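A rough sketch of the idea in A5, assuming a pipe‑separated report line. The prefix layout, field names, and separator are all hypothetical, and the field definition is shown as a plain Python list standing in for the ops‑maintained YAML so the example needs no third‑party parser.

```python
# Hypothetical fixed prefix carried by every report line.
PREFIX_FIELDS = ["timestamp", "host", "data_id"]

# Stand-in for the ops-maintained YAML definition of the remaining fields.
FIELD_SPEC = [
    {"name": "online_players", "type": "int"},
    {"name": "latency_ms", "type": "float"},
    {"name": "zone", "type": "str"},
]

CASTS = {"int": int, "float": float, "str": str}


def parse_report(line, spec=FIELD_SPEC, sep="|"):
    """Split a report line into prefix fields plus typed custom fields."""
    parts = line.strip().split(sep)
    expected = len(PREFIX_FIELDS) + len(spec)
    if len(parts) != expected:
        # A format change on the agent side requires updating the spec.
        raise ValueError(f"expected {expected} fields, got {len(parts)}")
    record = dict(zip(PREFIX_FIELDS, parts[:len(PREFIX_FIELDS)]))
    for field, raw in zip(spec, parts[len(PREFIX_FIELDS):]):
        record[field["name"]] = CASTS[field["type"]](raw)
    return record
```

Because parsing is driven entirely by the spec, a change in the reported format only requires editing the definition, which matches the point made in A5.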
Part 3: Alarm and Automated Handling
Q6: How are alarms de‑duplicated, merged, and aggregated? A6: Each alarm carries a timestamp and type; a convergence mechanism matches them against a time‑based rule set tailored to business characteristics.
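One way to picture time‑based convergence is the sketch below. It is an illustration under assumptions, not BlueKing's implementation: the `Alarm` fields beyond timestamp and type, the per‑type window values, and the grouping key are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Alarm:
    timestamp: float   # seconds
    alarm_type: str
    target: str        # hypothetical: the host or module the alarm refers to


# Hypothetical rule set: convergence window in seconds per alarm type.
WINDOWS = {"disk_full": 300, "ping_loss": 60}
DEFAULT_WINDOW = 120


def converge(alarms: List[Alarm]) -> List[List[Alarm]]:
    """Group alarms of the same type and target that fall within the
    type's window, measured from the first alarm of the group."""
    groups: List[List[Alarm]] = []
    open_groups: Dict[Tuple[str, str], List[Alarm]] = {}
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.alarm_type, alarm.target)
        window = WINDOWS.get(alarm.alarm_type, DEFAULT_WINDOW)
        current = open_groups.get(key)
        if current is not None and alarm.timestamp - current[0].timestamp <= window:
            current.append(alarm)
        else:
            current = [alarm]
            open_groups[key] = current
            groups.append(current)
    return groups
```

Each returned group would then be shown or handled as one converged alarm. Keeping the windows in a per‑type table reflects the point that rules are tailored to business characteristics.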
Q7: Does the automation run on false‑positive alerts? A7: Automation logic includes detection nodes; if an alarm is a false positive, the detection node blocks execution.
Q8: How does BlueKing achieve alarm count convergence despite unknown root causes? A8: An event library stores time‑tagged information, and a rule‑matching library filters alarms based on business‑specific configurations.
Q9: How are cascading alarms handled? A9: Within a time window, related alarms are converged; a fault tree determines the primary cause. If more than three identical remediation plans are detected, the system pauses for manual confirmation.
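The manual‑confirmation guard can be sketched as a simple count over the remediation plans queued within one window. The function name and plan identifiers are hypothetical; only the "more than three identical plans" rule comes from the talk, and the fault‑tree step is not modelled here.

```python
from collections import Counter
from typing import Iterable, Set

MAX_IDENTICAL_PLANS = 3  # from the talk: more than three triggers a pause


def plans_needing_confirmation(pending_plans: Iterable[str]) -> Set[str]:
    """Return the plans that appear more than MAX_IDENTICAL_PLANS times
    and should therefore wait for manual confirmation."""
    counts = Counter(pending_plans)
    return {plan for plan, n in counts.items() if n > MAX_IDENTICAL_PLANS}
```

For example, four queued `restart_proxy` plans would be held back, while three `clean_disk` plans would still run automatically. A burst of identical plans usually points to a shared upstream fault, so pausing avoids repeating the same fix across many hosts.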
Q10: Are hundreds of thousands of alerts displayed daily? A10: Daily alerts reach tens of thousands; the UI groups them by category and business, making them manageable.
Q11: How is the alarm logic tree constructed? A11: Ops must understand their architecture and troubleshooting methods; new alarms that lack automated recovery require manual handling.
Q12: Why does the reported alarm accuracy appear high? A12: Accuracy is inflated because many low‑severity alerts are counted as “handled.” The true metric is the self‑healing rate, about 94.25%.
Q13: What are the core principles of the automated recovery system? A13: The pipeline is alarm capture → field completion → convergence filtering → event library analysis → automatic handling. The most complex steps are convergence filtering and event analysis.
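The five stages in A13 can be read as a chain of functions in which any stage may drop the alarm. The sketch below is illustrative only: the dictionary keys, the lookup tables, and the action string are hypothetical placeholders for the real CMDB, rule set, and event library.

```python
from typing import Any, Callable, Dict, List, Optional

Alarm = Dict[str, Any]
Stage = Callable[[Alarm], Optional[Alarm]]


def complete_fields(alarm: Alarm) -> Optional[Alarm]:
    # Field completion: hypothetical lookup of the owning business.
    owners = {"10.0.0.1": "game-a"}
    return {**alarm, "business": owners.get(alarm["host"], "unknown")}


def make_convergence_filter(window: float) -> Stage:
    # Convergence filtering: drop repeats that arrive within `window`
    # seconds of the last alarm that passed for the same host and type.
    last_passed: Dict[tuple, float] = {}

    def stage(alarm: Alarm) -> Optional[Alarm]:
        key = (alarm["host"], alarm["type"])
        previous = last_passed.get(key)
        if previous is not None and alarm["time"] - previous <= window:
            return None
        last_passed[key] = alarm["time"]
        return alarm

    return stage


def analyse_events(alarm: Alarm) -> Optional[Alarm]:
    # Event library analysis: hypothetical rule that hosts under planned
    # maintenance should not trigger automatic handling.
    planned_maintenance = {"10.0.0.2"}
    return None if alarm["host"] in planned_maintenance else alarm


def auto_handle(alarm: Alarm) -> Optional[Alarm]:
    # Automatic handling: record which recovery action would be run.
    return {**alarm, "action": f"run_recovery:{alarm['type']}"}


def run_pipeline(alarm: Alarm, stages: List[Stage]) -> Optional[Alarm]:
    """Pass a captured alarm through each stage; None means it was dropped."""
    for stage in stages:
        result = stage(alarm)
        if result is None:
            return None
        alarm = result
    return alarm
```

A pipeline would be assembled as `[complete_fields, make_convergence_filter(60), analyse_events, auto_handle]`. The two middle stages are the ones A13 calls the most complex, and they are also the only ones here that carry state or business rules.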
Q14: Are alarms turned into work orders and automatically closed? A14: Only a few converged alarms generate work orders; repaired alarms can notify users and are tracked via health analysis.
Part 4: Development Language and Cost
Q15: Are the underlying components written in Python or Go? A15: The control platform core, including the agents, is written in C; most other services are written in Python, and the team is one of the largest Python teams in Shenzhen. The data platform uses Java for Storm extensions.
Q16: What technologies handle TB‑scale data and real‑time analysis? A16: The platform is built on Kafka and Storm with custom extensions developed on top of them. It processes ~1.3 TB per monitoring item daily (≈2.5 billion records), with a total daily volume of around 60 billion records, stored in Elasticsearch and a custom system called Hermes.
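As a back‑of‑the‑envelope check on the figures in A16, the snippet below derives an average record size and ingest rate. It assumes decimal terabytes (10^12 bytes) and a load spread evenly over 24 hours, neither of which the talk states; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_record(terabytes: float, records: float) -> float:
    """Average record size, assuming 1 TB = 10**12 bytes."""
    return terabytes * 10**12 / records


def records_per_second(records_per_day: float) -> float:
    """Average ingest rate, assuming an even load across the day."""
    return records_per_day / SECONDS_PER_DAY


avg_size = bytes_per_record(1.3, 2.5e9)   # roughly 520 bytes per record
avg_rate = records_per_second(60e9)       # roughly 694,000 records per second
```

Under these assumptions a record averages about half a kilobyte and the platform sustains on the order of 700,000 records per second, with real peaks likely higher than the average.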
Q17: How many developers and ops staff were involved? A17: The first phase was prototyped by four ops members, with no dedicated developers. After the PaaS prototype, about ten developers joined for phase two; later, additional staff were added for stability, security, and platform enhancements.
Part 5: Data Entry and Consistency
Q18: How is asset management automated for physical devices? A18: BlueKing Config Platform acts as a cloud‑native CMDB; operations embed API calls to modify configurations, with strict approval workflows for app deployment.
Q19: Is there a transaction mechanism for failed atomic service steps? A19: On failure, the workflow pauses at the failed node and notifies ops; users can retry or skip the step via the BlueKing UI.
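The pause, retry, and skip behaviour described in A19 can be sketched as a small state machine. This is a minimal illustration, not BlueKing's workflow engine: the `Workflow` class and its method names are hypothetical, and the notification to ops is reduced to a comment.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


class Workflow:
    """Runs named steps in order and pauses at the first one that fails."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)
        self.position = 0
        self.paused_at: Optional[str] = None

    def run(self) -> str:
        """Run from the current position; return 'done' or 'paused'."""
        while self.position < len(self.steps):
            name, action = self.steps[self.position]
            try:
                action()
            except Exception:
                # No rollback: stay on the failed node (ops would be notified here).
                self.paused_at = name
                return "paused"
            self.position += 1
        self.paused_at = None
        return "done"

    def retry(self) -> str:
        """Re-run the failed step and continue if it succeeds."""
        return self.run()

    def skip(self) -> str:
        """Move past the failed step and continue with the next one."""
        if self.paused_at is not None:
            self.position += 1
            self.paused_at = None
        return self.run()
```

This models a resume‑from‑failure design rather than a transaction: completed steps are never rolled back, so each atomic step needs to be safe to retry.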
Part 6: Other Topics
Q20: Is BlueKing suitable for traditional ERP systems on AIX or Solaris? A20: Currently, the agent does not support those operating systems.
Q21: How were ops staff affected by BlueKing adoption? A21: Basic ops tasks became automated, allowing teams to shift toward value‑added services; the ops team size actually grew.
Q22: Does the public‑cloud version support call‑center integration? A22: No, that capability is not available at present.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.