
How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch

This article explains how Tencent Game's BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, leveraging big‑data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.

Efficient Ops

1. Fault Auto‑Recovery

Automating fault repair that used to be done by hand has become a basic requirement. BlueKing implements automatic alarm handling based on Fault Tree Analysis (FTA): alerts are classified as either critical alarms or pre‑warnings, and each class has its own processing or analysis logic.

Faults directly affect user experience and revenue; monitoring and automatic recovery are essential.

BlueKing provides a SaaS‑style "fault self‑healing" app that lets operators drag‑and‑drop to create fault logic trees for common alerts, and an IDE for custom complex logic.

During a test, a port alarm was automatically diagnosed as a dead process, and the process was restarted within 1 minute 13 seconds.
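The port-alarm case above can be sketched as a very small fault logic tree. This is only an illustration of the idea, not BlueKing's implementation: the `Check` and `Action` classes, the `handle_alert` function, and the dictionary that stands in for a host are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Action:
    """Leaf of the tree: a recovery step applied to the (simulated) host."""
    name: str
    run: Callable[[dict], None]


@dataclass
class Check:
    """Inner node: a yes/no diagnosis that selects the next node."""
    name: str
    test: Callable[[dict], bool]
    if_true: Union["Check", Action]
    if_false: Union["Check", Action]


def restart_process(host: dict) -> None:
    # Hypothetical stand-in for a real restart command.
    host["process_alive"] = True
    host["port_open"] = True


def escalate(host: dict) -> None:
    # Anything the tree cannot fix is handed to a human operator.
    host["escalated"] = True


# Port alarm: if the process is dead, restart it; otherwise escalate.
PORT_ALARM_TREE = Check(
    name="process_alive",
    test=lambda host: host["process_alive"],
    if_true=Action("escalate", escalate),
    if_false=Action("restart_process", restart_process),
)


def handle_alert(node: Union[Check, Action], host: dict) -> List[str]:
    """Walk the tree from the root and return the path that was taken."""
    trace = []
    while isinstance(node, Check):
        outcome = node.test(host)
        trace.append(f"{node.name}={outcome}")
        node = node.if_true if outcome else node.if_false
    node.run(host)
    trace.append(node.name)
    return trace


host = {"process_alive": False, "port_open": False}
print(handle_alert(PORT_ALARM_TREE, host))
```

Returning the path taken keeps every automatic repair auditable, which matters when the tree rather than a person decides to restart a process.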

In the first half of 2015, BlueKing handled 3.31 million alerts: 3.03 million pre‑warnings with a 100 % success rate and 280,000 alarms with a 94.25 % success rate, saving more than 10,000 man‑hours.
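As a quick consistency check on these figures, the two groups add up to the stated total, and a count-weighted average gives the combined success rate. The inputs are the rounded numbers quoted above, so the result is approximate; the function name is made up for this example.

```python
def overall_success_rate(groups):
    """groups: (alert_count, success_rate) pairs; returns the count-weighted rate."""
    total = sum(count for count, _ in groups)
    succeeded = sum(count * rate for count, rate in groups)
    return succeeded / total


# Rounded figures from the article: pre-warnings and alarms.
groups = [(3_030_000, 1.0), (280_000, 0.9425)]

total_alerts = sum(count for count, _ in groups)
rate = overall_success_rate(groups)
print(total_alerts)      # 3310000, matching the 3.31 million total
print(f"{rate:.2%}")     # roughly 99.5 % across all alerts
```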

2. Automated Game Server Region Launch

New game regions (servers) have to be opened frequently; BlueKing automates the entire workflow in four stages.

Stage 1: Automated Physical Deployment

Operators replace manual scripts with a BlueKing tool that calls atomic components to allocate resources and deploy servers.

Stage 2: Automated Environment Deployment

Additional steps such as time reset, test‑data cleanup, and website updates are scripted and integrated into the same tool.
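Stages 1 and 2 amount to chaining small, reusable steps into one tool. The sketch below shows that shape under stated assumptions: the step functions only mutate an in-memory dictionary, and their names and the `launch_region` runner are hypothetical, not BlueKing's atomic components.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], None]


def allocate_resources(ctx: Dict) -> None:
    # Simulated allocation: invent one host name per requested server.
    ctx["hosts"] = [f"host-{i}" for i in range(ctx["server_count"])]


def deploy_servers(ctx: Dict) -> None:
    ctx["deployed"] = list(ctx["hosts"])


def reset_time(ctx: Dict) -> None:
    ctx["time_reset"] = True


def clean_test_data(ctx: Dict) -> None:
    ctx["test_data"] = []


def update_website(ctx: Dict) -> None:
    ctx["listed_on_website"] = True


# Stage 1 steps first, then the Stage 2 environment steps.
LAUNCH_STEPS: List[Step] = [
    allocate_resources,
    deploy_servers,
    reset_time,
    clean_test_data,
    update_website,
]


def launch_region(ctx: Dict, steps: List[Step]) -> List[str]:
    """Run the steps in order; an exception in any step stops the launch."""
    completed = []
    for step in steps:
        step(ctx)
        completed.append(step.__name__)
    return completed


ctx = {"region": "test-region", "server_count": 2, "test_data": ["dummy"]}
print(launch_region(ctx, LAUNCH_STEPS))
```

Because each step only reads and writes the shared context, a step can be reused in another tool or reordered without touching the others.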

Stage 3: Automated Decision Support

Product planners define opening rules; BlueKing’s data platform pulls real‑time metrics from the IDC, computes whether a new region should be opened, and triggers the process automatically, with an optional manual confirmation step during testing.
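A planner-defined rule with an optional confirmation gate might look like the following sketch. The article does not say which metrics the rules use, so the fill ratio, the new-player rate, and every name below are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OpeningRule:
    # Both thresholds are hypothetical examples of a planner-defined rule.
    max_fill_ratio: float           # newest region's online users / capacity
    min_new_players_per_hour: int   # sign-up rate that justifies a new region


def should_open(rule: OpeningRule, metrics: dict) -> bool:
    """Evaluate the rule against one snapshot of real-time metrics."""
    fill = metrics["online"] / metrics["capacity"]
    return (fill >= rule.max_fill_ratio
            and metrics["new_players_per_hour"] >= rule.min_new_players_per_hour)


def decide(rule: OpeningRule, metrics: dict,
           confirm: Optional[Callable[[dict], bool]] = None) -> str:
    """Return 'wait', 'held', or 'launch'; confirm is the optional manual gate."""
    if not should_open(rule, metrics):
        return "wait"
    if confirm is not None and not confirm(metrics):
        return "held"
    return "launch"


rule = OpeningRule(max_fill_ratio=0.8, min_new_players_per_hour=500)
snapshot = {"online": 9000, "capacity": 10000, "new_players_per_hour": 800}
print(decide(rule, snapshot))                           # fully automatic
print(decide(rule, snapshot, confirm=lambda m: False))  # operator declines
```

Passing `confirm=None` corresponds to the fully automatic mode, while supplying a callback models the manual confirmation used during testing.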

Stage 4: Generic Launch Tool

After many game‑specific tools were built, experts consolidated common patterns into a universal launch tool that can be adopted across different games, reducing maintenance overhead.
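One way to picture such a universal tool is a shared list of steps plus a small per-game configuration that skips or appends steps. The step names, the configuration keys (`skip`, `extra_steps`), and both example games below are hypothetical; this is a sketch of the consolidation idea, not the actual tool.

```python
from typing import Dict, List

# Steps shared by every game, in execution order (names are illustrative).
COMMON_STEPS: List[str] = [
    "allocate_resources",
    "deploy_servers",
    "reset_time",
    "clean_test_data",
    "update_website",
]


def build_plan(game_config: Dict) -> List[str]:
    """Derive one game's launch plan from the common steps and its overrides."""
    skip = set(game_config.get("skip", []))
    extra = game_config.get("extra_steps", [])
    return [step for step in COMMON_STEPS if step not in skip] + list(extra)


# Two made-up games: one uses the defaults, one customizes the plan.
GAMES: Dict[str, Dict] = {
    "game_a": {},
    "game_b": {"skip": ["update_website"], "extra_steps": ["warm_up_cache"]},
}

for name, config in GAMES.items():
    print(name, build_plan(config))
```

Only the configuration differs between games, so a fix to a common step reaches every game at once, which is where the reduced maintenance overhead comes from.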

3. Self‑Service Change Release

Similar automation applies to scaling, configuration changes, and deployments; any operation that can be expressed as Linux/Windows commands can be wrapped in BlueKing apps, allowing operators to provide solutions rather than manual labor.
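Wrapping a command as a reusable app can be sketched with the standard `subprocess` module. The `CommandApp` class and its fields are hypothetical; the example runs the current Python interpreter as the wrapped command so that it works on both Linux and Windows.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CommandApp:
    name: str
    command: List[str]                                     # fixed part of the command line
    param_names: List[str] = field(default_factory=list)   # appended in this order

    def run(self, **params: str) -> Dict[str, object]:
        missing = [p for p in self.param_names if p not in params]
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        argv = self.command + [str(params[p]) for p in self.param_names]
        # No shell is involved, so parameter values are never re-parsed.
        done = subprocess.run(argv, capture_output=True, text=True)
        return {
            "app": self.name,
            "returncode": done.returncode,
            "stdout": done.stdout.strip(),
        }


# The wrapped command simply echoes its first argument.
announce = CommandApp(
    name="announce",
    command=[sys.executable, "-c", "import sys; print('announce:', sys.argv[1])"],
    param_names=["message"],
)
result = announce.run(message="region 12 open")
print(result)
```

Declaring the parameters up front is what turns a one-off script into a self-service tool: the caller fills in values without ever editing the command itself.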

Early BlueKing tools accounted for 90 % of operations; after standardization, they now represent less than 40 %.

4. Big‑Data‑Assisted Operation

BlueKing’s data platform streams real‑time metrics via Kafka and Storm, enabling multi‑dimensional monitoring, automated capacity expansion decisions, and product‑level user behavior analysis.

Trigger an alert when the number of online users rises while logins drop at the same time.

Calculate required server count based on CPU load and network traffic thresholds.

Segment users by download speed to target retention incentives.
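The three examples above reduce to simple rules once the metrics are available. The sketch below expresses them as plain Python functions rather than Kafka or Storm code; all thresholds, field names, and sample values are assumptions made for illustration.

```python
import math
from typing import Dict, List


def online_up_logins_down(prev: Dict[str, int], curr: Dict[str, int]) -> bool:
    """True when online users rose while logins fell between two samples."""
    return curr["online"] > prev["online"] and curr["logins"] < prev["logins"]


def servers_needed(cpu_demand_cores: float, traffic_mbps: float,
                   cores_per_server: float, mbps_per_server: float) -> int:
    """Size the fleet for whichever resource hits its threshold first."""
    by_cpu = math.ceil(cpu_demand_cores / cores_per_server)
    by_network = math.ceil(traffic_mbps / mbps_per_server)
    return max(by_cpu, by_network)


def segment_by_download_speed(users: List[Dict], slow_kbps: float) -> Dict[str, List[str]]:
    """Split user ids into 'slow' and 'normal' groups by measured download speed."""
    segments: Dict[str, List[str]] = {"slow": [], "normal": []}
    for user in users:
        key = "slow" if user["kbps"] < slow_kbps else "normal"
        segments[key].append(user["id"])
    return segments


alert = online_up_logins_down({"online": 1000, "logins": 300},
                              {"online": 1200, "logins": 180})
count = servers_needed(cpu_demand_cores=52.0, traffic_mbps=4200.0,
                       cores_per_server=8.0, mbps_per_server=700.0)
groups = segment_by_download_speed(
    [{"id": "u1", "kbps": 120.0}, {"id": "u2", "kbps": 900.0}], slow_kbps=256.0)
print(alert, count, groups)
```

In the capacity rule, taking the maximum of the two estimates means the fleet is sized for the scarcer resource, so neither CPU nor bandwidth exceeds its per-server threshold.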

5. Open Source and Hybrid‑Cloud Plans

BlueKing’s core components (the configuration, job, and control platforms) have been deployed for gaming, finance, e‑commerce, and media customers on both public and private clouds. An open‑source release of the configuration platform is planned for year‑end, pending internal review.

Current focus remains on internal deployments; large‑scale commercial private‑cloud offerings are not planned within the next six months.

Tags: Big Data, automation, operations, DevOps, fault recovery, game server launch
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
