How We Cut Game Server Downtime from 1.5 Hours to 0.3 Hours

This article details how a Tencent game operations team reduced a major online game's scheduled maintenance window from 1.5 hours to just 0.3 hours by redesigning the checklist, separating pre‑ and post‑maintenance tasks, and switching to a rename‑based update method across thousands of servers.

Efficient Ops

Jul 10, 2016

1. Background

Game server maintenance is traditionally a dull, time‑consuming process. By continuously reviewing checklists and asking whether each step truly belongs in the downtime window, the team iteratively reduced maintenance time from 1.5 hours to 0.3 hours. The lessons apply to PC games, mobile games, web services, and ERP systems.

2. Downtime Optimization (1.5 h → 0.3 h)

2.1 Process Optimization

The team examined the maintenance checklist and asked two key questions:

Can this step be completed in the 10 minutes before the outage?

Can it be postponed until after the servers are back online?

A checklist is the list of tasks, drawn up before maintenance, that the team works through during the operation. The simplest form is an Excel sheet; a more advanced version is an online system that automatically triggers each action at the appropriate stage.
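The two questions above can be sketched as a small classification step. This is only an illustration: the task names, durations, and the `can_run_before` / `can_run_after` flags are hypothetical and do not come from the team's actual checklist.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PRE = "before the outage"
    DOWNTIME = "during the outage"
    POST = "after servers are back online"


@dataclass
class Task:
    name: str
    minutes: int
    can_run_before: bool  # question 1: can it be done before the outage?
    can_run_after: bool   # question 2: can it wait until servers are back?


def classify(task: Task) -> Phase:
    """Apply the two checklist questions to a single task."""
    if task.can_run_before:
        return Phase.PRE
    if task.can_run_after:
        return Phase.POST
    return Phase.DOWNTIME


def downtime_minutes(tasks) -> int:
    """Sum only the tasks that must stay in the window (assumes they run one after another)."""
    return sum(t.minutes for t in tasks if classify(t) is Phase.DOWNTIME)


# Hypothetical checklist; every name and duration is made up for illustration.
CHECKLIST = [
    Task("upload patch package to servers", 30, True, False),
    Task("stop game processes", 5, False, False),
    Task("switch to the new version", 5, False, False),
    Task("start game processes and verify", 10, False, False),
    Task("archive old logs", 20, False, True),
]

for task in CHECKLIST:
    print(f"{task.name}: {classify(task).value}")
print("minutes left in the downtime window:", downtime_minutes(CHECKLIST))
```

Only the tasks that answer "no" to both questions stay in the outage, which is the part of the checklist worth optimizing further.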

2.1.1 Analyzing the Critical Path

By separating tasks that can be done before or after the outage, the team shortened the critical path. The diagram below shows the before/after comparison.

Animated versions illustrate how steps moved from the downtime critical path to pre‑ or post‑maintenance phases.

2.1.2 Time Savings

The table below shows the time saved for each sub‑step.

Overall, process optimization cut the maintenance window from 1.5 hours to 0.5 hours, prompting the team to seek further gains.

2.2 Rename‑Based Update

Originally, patches were applied by copying large game assets (tens of gigabytes) to each server, causing severe I/O bottlenecks when thousands of machines updated concurrently.

2.2.1 Why Not Use mv?

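On Linux, moving a directory with mv within the same filesystem only rewrites directory entries in the filesystem metadata and does not touch the data blocks. Copying with cp creates new inodes and writes every data block again, which consumes far more I/O. (If source and destination sit on different filesystems, mv falls back to copy‑and‑delete and the advantage disappears.)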

Benchmarks showed a 30 GB copy taking roughly 30,000 times longer than a rename.

A Linux directory stores the mapping from filenames to inodes. mv updates only that name entry; the inode, and the data it points to, stay unchanged.
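The inode behavior can be checked with a short experiment. The sketch below uses a throwaway temporary directory and hypothetical file names; it compares the inode number of a file before a directory rename, after the rename, and after a copy.

```python
import os
import shutil
import tempfile


def inode_after_rename_and_copy():
    """Return (original inode, inode after rename, inode of a copy) for one file."""
    with tempfile.TemporaryDirectory() as root:
        src_dir = os.path.join(root, "assets")
        os.mkdir(src_dir)
        src_file = os.path.join(src_dir, "data.bin")
        with open(src_file, "wb") as f:
            f.write(b"\0" * 1024)
        original = os.stat(src_file).st_ino

        # Rename the directory: only a directory entry changes.
        renamed_dir = os.path.join(root, "assets_renamed")
        os.rename(src_dir, renamed_dir)
        renamed = os.stat(os.path.join(renamed_dir, "data.bin")).st_ino

        # Copy the directory: new inodes are allocated and the data is rewritten.
        copied_dir = os.path.join(root, "assets_copy")
        shutil.copytree(renamed_dir, copied_dir)
        copied = os.stat(os.path.join(copied_dir, "data.bin")).st_ino

        return original, renamed, copied


original, renamed, copied = inode_after_rename_and_copy()
print("same inode after rename:", original == renamed)
print("same inode after copy:  ", renamed == copied)
```

The file keeps its inode across the rename, while the copy gets a new one, which is why the cost of a rename does not grow with the size of the directory.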

Additional diagrams illustrate inode behavior and hard‑link counts.

On Windows/NTFS, renaming a directory also updates only the MFT entry, making it a near‑instant operation.

2.2.2 Preparation Before Maintenance

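While the game is still running, synchronize the live business directory (CURRENT) to a temporary directory (OLD) via rsync, apply the patch to that copy, then rename it to NEW (the version to be released). The slow copy and patch work therefore happens outside the downtime window, and the name OLD is free again for the swap during maintenance.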
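A minimal sketch of this staging step, under stated assumptions: `shutil.copytree` stands in for rsync so the example runs anywhere, and the "patch" is simply a dictionary of files to overwrite. The function name `stage_new_version` and the patch format are hypothetical, not the team's tooling.

```python
import os
import shutil
import tempfile


def stage_new_version(base, patch_files):
    """Build base/NEW from base/CURRENT while CURRENT keeps serving players."""
    current = os.path.join(base, "CURRENT")
    staging = os.path.join(base, "OLD")  # temporary name, as in the article
    new = os.path.join(base, "NEW")

    # Stand-in for rsync: a full copy of the live directory.
    shutil.copytree(current, staging)

    # Apply the patch to the copy (hypothetical format: relative path -> bytes).
    for rel_path, content in patch_files.items():
        target = os.path.join(staging, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            f.write(content)

    # The patched copy becomes NEW; the name OLD is free again afterwards.
    os.rename(staging, new)
    return new


def demo():
    with tempfile.TemporaryDirectory() as base:
        os.mkdir(os.path.join(base, "CURRENT"))
        with open(os.path.join(base, "CURRENT", "version.txt"), "wb") as f:
            f.write(b"1")
        stage_new_version(base, {"version.txt": b"2"})
        return sorted(os.listdir(base))


print(demo())  # ['CURRENT', 'NEW']
```

In a real deployment an incremental tool such as rsync would keep the copy cheap on repeated runs; the point here is only that CURRENT is never modified during preparation.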

2.2.3 Rename Operation During Maintenance

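During the brief outage, perform two renames in sequence: first CURRENT to OLD, then NEW to CURRENT. OLD keeps the previous version on disk, so rolling back is just the same two renames in reverse.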

So easy! An animated GIF further clarifies the steps.
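The swap itself might look like the sketch below. The directory names follow the article; the function name, the pre-checks, and the rollback on failure are assumptions added for illustration. All three directories must live on the same filesystem for the renames to stay metadata-only, and the game processes are assumed to be stopped already.

```python
import os
import tempfile


def swap_in_new_version(base):
    """Rename CURRENT -> OLD, then NEW -> CURRENT, restoring CURRENT on failure."""
    current = os.path.join(base, "CURRENT")
    old = os.path.join(base, "OLD")
    new = os.path.join(base, "NEW")

    if not os.path.isdir(new):
        raise FileNotFoundError("NEW has not been staged")
    if os.path.exists(old):
        raise FileExistsError("OLD already exists; archive or remove it first")

    os.rename(current, old)
    try:
        os.rename(new, current)
    except OSError:
        os.rename(old, current)  # put the previous version back
        raise


def demo():
    with tempfile.TemporaryDirectory() as base:
        for name, version in (("CURRENT", b"1"), ("NEW", b"2")):
            os.mkdir(os.path.join(base, name))
            with open(os.path.join(base, name, "version.txt"), "wb") as f:
                f.write(version)
        swap_in_new_version(base)
        with open(os.path.join(base, "CURRENT", "version.txt"), "rb") as f:
            live = f.read()
        with open(os.path.join(base, "OLD", "version.txt"), "rb") as f:
            previous = f.read()
        return live, previous


print(demo())  # (b'2', b'1')
```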

2.2.4 Eight‑Fold Efficiency Gain

Patch deployment on a single server dropped from 20 minutes to about 1 second. Across thousands of servers, total patch time fell from 25 minutes to 3 minutes, a roughly eight‑fold improvement.
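The figures quoted above can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones stated in this article.

```python
# Per-server patch time, in seconds.
per_server_before_s = 20 * 60
per_server_after_s = 1

# Fleet-wide patch time, in minutes.
fleet_before_min = 25
fleet_after_min = 3

per_server_speedup = per_server_before_s / per_server_after_s
fleet_speedup = fleet_before_min / fleet_after_min

print(f"per-server speedup: {per_server_speedup:.0f}x")
print(f"fleet-wide speedup: {fleet_speedup:.1f}x")
```

The "eight‑fold" in the heading refers to the fleet-wide figure (25 / 3 ≈ 8.3), which is much smaller than the per-server ratio.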

3. Methodology Consolidation

After solving the problem, document the methodology:

Process Optimization: Include checklist analysis in the game operations manual, encouraging teams to move non‑essential steps out of the downtime window.

Rename‑Based Update: Stage the new version before the outage so that the only update work left inside the downtime window is the rename itself.

By applying both strategies, the maintenance window shrank from 1.5 hours to 0.3 hours. Take action and achieve similar results.


