
Inside Tencent's Game Cloud: Architecture, Challenges, and Solutions

Tencent's internal game cloud, the company's largest IaaS platform, faces six major challenges: massive device counts, frequent game launches, utilization fluctuations, high availability, performance demands, and fault handling. It addresses them through a three‑layer architecture, five key technical capabilities, and a robust operational system.

21CTO

Tencent's internal game cloud is the largest part of the company's cloud, so the most challenging problems, the latest technologies, and the most efficient operational systems tend to appear there first. Feng Liang, director of the TEG Architecture Platform Department, explains the underlying cloud technology.


Characteristics and Challenges of the Internal Game Cloud

The internal game cloud accounts for about 70‑80% of Tencent's devices, so most of the platform's hardest problems originate there. It faces six major challenges:

Large device count : It accounts for roughly 70% of all devices, including physical machines, parent machines (virtualization hosts), and child machines (virtual machines), which creates challenges for operations, fault handling, and performance.

Frequent game launches and merges : Opening new game servers and merging existing ones are routine tasks, so both must be automated and efficient.

Utilization fluctuations : CPU and NIC utilization rise and fall with player activity; ideally, machines wind down when players do.

High availability : Both parent/child machines and infrastructure must stay highly available; low availability harms service quality and revenue.

Performance : The server architecture must extract as much performance as possible from each parent machine.

Fault handling : Hardware failures require fast detection and resolution, with critical services handled at higher priority.

Underlying Technical Architecture

The architecture consists of three layers: the application/virtualization portal layer, the virtualization control layer (subdivided into VM provisioning, VM operation, and VM management/query), and the physical resources layer (servers, network, storage).

[Figure: architecture diagram]

The key technical capabilities are divided into five areas: access, provisioning, scheduling, performance assurance, and stability assurance.

Provisioning capability : Each child machine is provisioned within 10 seconds. How many can be created is limited by the physical parent's resources, and the speed relies on KVM/XEN optimizations.

Scheduling capability : Supports scaling individual machines up or down and managing resource utilization across the cluster, including cold and hot (live) migration.

Performance assurance : Produces a batch of 500 child machines in as little as 30 seconds (within minutes on average), introduces SR‑IOV to improve NIC throughput and reduce parent CPU load, and modifies XEN code for better VHD disk I/O.

Stability assurance : The control platform uses stateless multi‑instance design for 99.9% availability, applies dozens of patches to XEN 4.2 (both community and internal), and addresses Windows VM stability on KVM with PV driver patches.
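
The provisioning limit described above (child machines bounded by the physical parent's resources) can be sketched as a simple admission check. This is a minimal illustration, not Tencent's scheduler: the `Parent` class, its fields, the `place_child` function, and the first-fit policy are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parent:
    """A physical parent machine and its remaining capacity (hypothetical fields)."""
    name: str
    free_cpus: int
    free_mem_gb: int
    children: List[str] = field(default_factory=list)


def place_child(parents: List[Parent], child: str, cpus: int, mem_gb: int) -> Optional[str]:
    """First-fit placement: return the chosen parent's name, or None if nothing fits."""
    for parent in parents:
        if parent.free_cpus >= cpus and parent.free_mem_gb >= mem_gb:
            # Reserve the resources so later requests see the reduced capacity.
            parent.free_cpus -= cpus
            parent.free_mem_gb -= mem_gb
            parent.children.append(child)
            return parent.name
    return None  # request exceeds every parent's remaining resources


parents = [
    Parent("parent-a", free_cpus=4, free_mem_gb=8),
    Parent("parent-b", free_cpus=16, free_mem_gb=64),
]
first = place_child(parents, "child-1", cpus=4, mem_gb=8)    # fits on parent-a
second = place_child(parents, "child-2", cpus=4, mem_gb=8)   # parent-a is now full
third = place_child(parents, "child-3", cpus=32, mem_gb=8)   # larger than any parent
```

A production scheduler would presumably also weigh utilization, migration cost, and failure domains; first-fit is used here only because it is the shortest policy that shows the resource limit at work.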

Operational System

Without effective operations, the technology cannot be put to use. An efficient operational system closes the loop around the technology, covering monitoring and alerting, change management, quality management, resource management, and reporting. It is built on three platforms: network management, job scheduling, and operation management.

Key operational capabilities include resource operation, fault detection, and fault handling.

Resource operation : Automated management, monitoring, and alerting of parent machines.

Fault detection : Five‑level (user, platform, module, process, system) monitoring ensures rapid fault discovery and alerts via email, SMS, WeChat, or phone, with differentiated severity handling.

Fault handling : An industry‑first hot‑patch mechanism for the XEN VMM allows kernel faults to be patched online without a reboot.
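
The differentiated severity handling in fault detection can be sketched as a routing table from severity to alert channels. The five monitoring levels and the four channels come from the description above; the `Severity` tiers, the `CHANNELS` policy, and the `route_alert` function are hypothetical and only illustrate the idea.

```python
from enum import Enum
from typing import Dict, List


class Level(Enum):
    """The five monitoring levels named in the article."""
    USER = "user"
    PLATFORM = "platform"
    MODULE = "module"
    PROCESS = "process"
    SYSTEM = "system"


class Severity(Enum):
    """Hypothetical severity tiers; the article does not name them."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Assumed escalation policy: more severe faults reach more intrusive channels.
CHANNELS: Dict[Severity, List[str]] = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat", "sms"],
    Severity.CRITICAL: ["email", "wechat", "sms", "phone"],
}


def route_alert(level: Level, severity: Severity, message: str) -> List[str]:
    """Return one formatted notification per channel selected for this severity."""
    return [
        f"[{channel}] {severity.name} at {level.value} level: {message}"
        for channel in CHANNELS[severity]
    ]


notifications = route_alert(Level.SYSTEM, Severity.CRITICAL, "parent machine unreachable")
```

Keeping the policy in a single table makes it easy to review which faults can trigger a phone call, which is the most disruptive channel.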

Problem Sharing

The issues encountered were mainly in the virtualization layer (KVM/XEN):

XEN's xen_spin_unlock fails to wake a waiting VCPU, causing a deadlock in domain0.

Under the CFQ I/O scheduler, a heavy stream of sync requests starves async write requests.

Child VMs sending IGMP Query packets trigger a Cisco switch bug, causing packet loss for Windows VMs.

The Blktap driver does not clean up pending I/O when the tapdisk process exits abnormally, leaving the process stuck in the D (uninterruptible sleep) state.
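
The CFQ starvation issue can be illustrated with a toy model. The sketch below is not CFQ's actual algorithm; it is a hypothetical dispatcher (`dispatch`, with an assumed `reserve_every` parameter) that serves one request per tick and prefers sync requests. It shows how a steady sync stream can keep async writes waiting indefinitely, and how reserving occasional slots relieves that.

```python
from collections import deque


def dispatch(ticks: int, sync_per_tick: int, reserve_every: int = 0) -> int:
    """Serve one request per tick and return the number of async writes completed.

    reserve_every=0 means strict sync priority; N > 0 hands every N-th slot to async.
    """
    sync_queue = deque()
    async_queue = deque(range(10))  # ten async writes are already waiting
    async_done = 0
    for tick in range(1, ticks + 1):
        sync_queue.extend([tick] * sync_per_tick)  # sync requests keep arriving
        reserved = reserve_every > 0 and tick % reserve_every == 0
        if async_queue and (reserved or not sync_queue):
            # Async is served only on a reserved slot or when no sync work exists.
            async_queue.popleft()
            async_done += 1
        elif sync_queue:
            sync_queue.popleft()
    return async_done


starved = dispatch(ticks=100, sync_per_tick=1)                      # strict sync priority
relieved = dispatch(ticks=100, sync_per_tick=1, reserve_every=10)   # every 10th slot for async
```

With one sync request arriving per tick, the strict-priority run never reaches the async queue, while the reserved-slot run drains all ten waiting writes over the same 100 ticks.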

