How We Cut Game Server Downtime from 1.5 Hours to 0.3 Hours
This article details how a Tencent game operations team reduced a major online game's scheduled maintenance window from 1.5 hours to just 0.3 hours by redesigning the checklist, separating pre‑ and post‑maintenance tasks, and switching to a rename‑based update method across thousands of servers.
1. Background
Game server maintenance is traditionally a dull, time‑consuming process. By continuously reviewing checklists and asking whether each step truly belongs in the downtime window, the team iteratively reduced maintenance time from 1.5 hours to 0.3 hours. The lessons apply to PC games, mobile games, web services, and ERP systems.
2. Downtime Optimization (1.5 h → 0.3 h)
2.1 Process Optimization
The team examined the maintenance checklist and asked two key questions:
Can this step be completed in the 10 minutes before the outage?
Can it be postponed until after the servers are back online?
A checklist is the list of tasks drawn up ahead of a maintenance. The simplest form is an Excel sheet; a more advanced version is an online system that automatically triggers each action at the appropriate stage.
2.1.1 Analyzing the Critical Path
By separating tasks that can be done before or after the outage, the team shortened the critical path. The diagram below shows the before/after comparison.
Animated versions illustrate how steps moved from the downtime critical path to pre‑ or post‑maintenance phases.
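The two questions can be applied mechanically to each checklist entry. The sketch below is a minimal illustration, not the team's actual system: the step names, durations, and the `Step`/`Phase` types are hypothetical, and it assumes the steps inside the window run one after another, so the downtime is simply the sum of the steps that could not be moved.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PRE = "before the outage"
    DOWNTIME = "during the outage"
    POST = "after servers are back online"


@dataclass
class Step:
    name: str
    minutes: int
    can_run_before: bool = False  # answer to question 1
    can_run_after: bool = False   # answer to question 2


def assign_phase(step: Step) -> Phase:
    """Move a step out of the downtime window whenever either question is a yes."""
    if step.can_run_before:
        return Phase.PRE
    if step.can_run_after:
        return Phase.POST
    return Phase.DOWNTIME


def downtime_minutes(steps: list[Step]) -> int:
    """Sum only the steps that must stay on the downtime critical path."""
    return sum(s.minutes for s in steps if assign_phase(s) is Phase.DOWNTIME)


# Hypothetical checklist; names and durations are illustrative only.
checklist = [
    Step("distribute patch package to servers", 20, can_run_before=True),
    Step("stop game processes", 5),
    Step("apply patch", 15),
    Step("start game processes", 5),
    Step("archive old logs", 10, can_run_after=True),
]

all_in_window = sum(s.minutes for s in checklist)
optimized = downtime_minutes(checklist)
print(f"everything inside the window: {all_in_window} min")
print(f"after moving steps out:       {optimized} min")
```

Keeping the two answers as explicit fields makes the review repeatable: each time the checklist changes, the same two questions are asked of every new step.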
2.1.2 Time Savings
The table below shows the time saved for each sub‑step.
Overall, process optimization cut the maintenance window from 1.5 hours to 0.5 hours, prompting the team to seek further gains.
2.2 Rename‑Based Update
Originally, patches were applied by copying large game assets (tens of gigabytes) to each server, causing severe I/O bottlenecks when thousands of machines updated concurrently.
2.2.1 Why Not Use mv?
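On Linux, moving a directory (mv) within the same filesystem merely changes the name in the filesystem metadata without touching data blocks, while copying (cp) allocates new inodes and rewrites all the data, consuming far more I/O. Note that mv across filesystems falls back to copy-and-delete, so the source and target directories must live on the same filesystem for the rename to be cheap.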
Benchmarks showed a 30 GB copy taking roughly 30,000 times longer than a rename.
A Linux directory stores the mapping from filenames to inodes; mv updates only the name entry, leaving the inode and its data blocks unchanged.
Additional diagrams illustrate inode behavior and hard‑link counts.
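The inode behavior can be checked with a few lines of Python. This is a small sketch on a throwaway temporary directory, assuming a typical Linux filesystem where both paths sit on the same mount; the directory and file names are made up for the demonstration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny stand-in for a game asset directory.
    src = os.path.join(root, "assets")
    os.mkdir(src)
    data_file = os.path.join(src, "data.bin")
    with open(data_file, "wb") as f:
        f.write(b"x" * 1024)

    dir_inode = os.stat(src).st_ino
    file_inode = os.stat(data_file).st_ino

    # Copy: every file gets a new inode and its data is written again.
    copied = os.path.join(root, "assets_copy")
    shutil.copytree(src, copied)
    copied_file_inode = os.stat(os.path.join(copied, "data.bin")).st_ino
    copy_allocated_new_inode = copied_file_inode != file_inode

    # Rename: only the name entry changes; the inodes stay the same.
    renamed = os.path.join(root, "assets_renamed")
    os.rename(src, renamed)
    rename_kept_dir_inode = os.stat(renamed).st_ino == dir_inode
    rename_kept_file_inode = (
        os.stat(os.path.join(renamed, "data.bin")).st_ino == file_inode
    )

print("copy allocated a new inode:", copy_allocated_new_inode)
print("rename kept directory inode:", rename_kept_dir_inode)
print("rename kept file inode:", rename_kept_file_inode)
```

Because the rename never reads or writes file contents, its cost does not grow with the size of the directory, which is what makes it suitable for tens of gigabytes of assets.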
On Windows/NTFS, renaming a directory also updates only the MFT entry, making it a near‑instant operation.
2.2.2 Preparation Before Maintenance
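Before the outage, while the game is still running, synchronize the live business directory (CURRENT) to a temporary directory (OLD) via rsync, apply the patch to that copy, then rename it to NEW (the version to be released). All of the heavy I/O happens in this phase, outside the downtime window.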
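The preparation phase can be sketched as follows. This is a hedged illustration rather than the team's tooling: `shutil.copytree` stands in for rsync, "applying the patch" is simplified to overlaying files from a patch directory, and the function name `prepare_new_version` and the file names in the demo are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_new_version(base: Path, patch_dir: Path) -> Path:
    """Stage the patched version next to the live directory, before the outage."""
    current = base / "CURRENT"
    staging = base / "OLD"  # temporary copy, named as in the text
    new = base / "NEW"

    if new.exists():
        raise FileExistsError(f"{new} already exists; clean it up first")

    # 1. Copy the live directory (stand-in for rsync). Fails if staging exists.
    shutil.copytree(current, staging)
    # 2. Apply the patch: overlay the patch files onto the copy (assumption).
    shutil.copytree(patch_dir, staging, dirs_exist_ok=True)
    # 3. Rename the patched copy to NEW; the live directory is untouched.
    staging.rename(new)
    return new


# Demo on a throwaway directory tree with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "game"
    (base / "CURRENT").mkdir(parents=True)
    (base / "CURRENT" / "version.txt").write_text("1.0")
    (base / "CURRENT" / "map.dat").write_text("unchanged asset")

    patch = Path(tmp) / "patch"
    patch.mkdir()
    (patch / "version.txt").write_text("1.1")

    new_dir = prepare_new_version(base, patch)
    new_version = (new_dir / "version.txt").read_text()
    live_version = (base / "CURRENT" / "version.txt").read_text()
    carried_over = (new_dir / "map.dat").read_text()

print("NEW version:", new_version)
print("CURRENT version (still live):", live_version)
```

The live directory is only read, never modified, so this step can run while players are still online.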
2.2.3 Rename Operation During Maintenance
During the brief outage, only two renames are needed: CURRENT becomes OLD, then NEW becomes CURRENT.
So easy! An animated GIF further clarifies the steps.
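The swap itself can be sketched with two `os.rename` calls. The function name `swap_in_new_version`, the safety checks, and the rollback on failure are assumptions added for illustration; the source only describes the two renames.

```python
import os
import tempfile
from pathlib import Path


def swap_in_new_version(base: Path) -> None:
    """During the outage: CURRENT -> OLD, then NEW -> CURRENT."""
    current, old, new = base / "CURRENT", base / "OLD", base / "NEW"

    if not new.is_dir():
        raise FileNotFoundError(f"{new} was not prepared before the outage")
    if old.exists():
        raise FileExistsError(f"{old} already exists; refusing to overwrite it")

    os.rename(current, old)
    try:
        os.rename(new, current)
    except OSError:
        # Assumed safety net: put the previous version back if step two fails.
        os.rename(old, current)
        raise


# Demo on a throwaway directory tree with made-up version files.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name, version in (("CURRENT", "1.0"), ("NEW", "1.1")):
        (base / name).mkdir()
        (base / name / "version.txt").write_text(version)

    swap_in_new_version(base)

    live_version = (base / "CURRENT" / "version.txt").read_text()
    previous_version = (base / "OLD" / "version.txt").read_text()
    new_still_exists = (base / "NEW").exists()

print("live:", live_version, "| kept as OLD:", previous_version)
```

Since the previous version remains on disk under OLD, reverting would be the same pair of renames in the opposite direction.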
2.2.4 Eight‑Fold Efficiency Gain
Patch deployment per server dropped from 20 minutes to 1 second. Across thousands of servers, total patch time fell from 25 minutes to 3 minutes, a roughly eight‑fold improvement.
3. Methodology Consolidation
After solving the problem, document the methodology:
Process Optimization: Include checklist analysis in the game operations manual, encouraging teams to move non‑essential steps out of the downtime window.
Rename‑Based Update: Choose the cheapest update method available and stage the new version before the outage, so that the update inside the window is only a rename.
By applying both strategies, the maintenance window shrank from 1.5 hours to 0.3 hours. Take action and achieve similar results.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.