Cloud‑Native Migration of Tencent Happy Game Studio Backend Using Istio Service Mesh
The article details how Tencent's Happy Game Studio transformed its large‑scale, low‑utilization backend from a legacy distributed architecture to a cloud‑native, Istio‑enabled service‑mesh platform, achieving significant resource savings, smoother deployments, and improved observability across game, CGI, and storage services.
Chen Zhiwei, a senior backend expert at Tencent, leads the public backend R&D and team management for the Happy Game Studio, which operates a distributed micro‑service platform serving tens of millions of daily active users.
The legacy on‑premise architecture, inherited from QQGame, consists of dozens of self‑developed frameworks and hundreds of micro‑services, but suffers from low CPU utilization (average < 20%), fragmented service governance, cumbersome deployment, and high operational overhead.
To address these challenges, the team embraced a cloud‑native strategy, tightly integrating Kubernetes (K8s) and Istio. They introduced gRPC support, built a MeshGate bridge to connect cloud‑side mesh services with on‑premise services, and gradually migrated workloads without downtime.
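To give a hedged sense of what calling a mesh-hosted gRPC service looks like once workloads sit behind Istio, the Go sketch below dials a hypothetical in-cluster address and issues a standard gRPC health check. The service name, namespace, and port are assumptions for illustration; the studio's actual interfaces are not published in the article.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Hypothetical in-cluster service address; with Istio, the Envoy sidecar
	// intercepts this plaintext call and handles mTLS, load balancing, and telemetry.
	conn, err := grpc.Dial(
		"gamesvr.games.svc.cluster.local:8000",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Standard gRPC health-checking protocol; a real client would call the
	// service's own generated stubs instead.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx, &healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatalf("health check: %v", err)
	}
	log.Printf("serving status: %s", resp.GetStatus())
}
```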
Key outcomes of the migration include:
CPU utilization improved by 60‑70% due to pod‑level resource granularity and auto‑scaling.
Helm‑based declarative deployment and one‑click roll‑outs reduced operational effort.
Istio provided powerful service‑governance, observability, and traffic‑management capabilities.
For private‑protocol services, the team developed MeshGate, a bidirectional proxy that converts between gRPC and the original private protocol. Deployed alongside Envoy, MeshGate lets these services leverage Istio's control plane while preserving the existing authentication, encryption, and connection management.
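The article does not show MeshGate's internals. As a rough sketch of only the relay half of such a bridge (protocol conversion, authentication, encryption, and session management omitted), the Go snippet below accepts private-protocol TCP connections and pipes them toward a mesh service address, letting the Istio sidecar intercept the upstream connection. The listen port and upstream address are assumptions.

```go
package main

import (
	"io"
	"log"
	"net"
)

// relay pipes bytes in both directions between the inbound connection and the
// upstream, closing both ends when either side finishes. The real MeshGate
// logic (conversion to gRPC, auth, encryption, connection management) is
// intentionally left out of this sketch.
func relay(client net.Conn, upstreamAddr string) {
	defer client.Close()
	upstream, err := net.Dial("tcp", upstreamAddr)
	if err != nil {
		log.Printf("dial upstream: %v", err)
		return
	}
	defer upstream.Close()

	done := make(chan struct{}, 2)
	go func() { io.Copy(upstream, client); done <- struct{}{} }()
	go func() { io.Copy(client, upstream); done <- struct{}{} }()
	<-done // stop as soon as one direction closes
}

func main() {
	// Assumed ports/addresses for illustration only. Dialing the normal mesh
	// service address lets the Envoy sidecar transparently take over routing.
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			continue
		}
		go relay(conn, "gamesvr.games.svc.cluster.local:8000")
	}
}
```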
Performance tests showed that after integrating Envoy, private‑protocol forwarding latency remained comparable to on‑premise direct connections (average 0.62 ms vs. 0.38 ms), while pure gRPC over Istio incurred higher latency (average 6.23 ms), confirming the suitability of the hybrid approach for latency‑sensitive game traffic.
| Scenario | Average Latency | P95 Latency |
| --- | --- | --- |
| On‑premise direct | 0.38 ms | 0.67 ms |
| K8s pod‑to‑pod | 0.52 ms | 0.90 ms |
| Istio + TCP (private protocol) | 0.62 ms | 1.26 ms |
| Istio + gRPC | 6.23 ms | 14.62 ms |
The GameSvr service, previously a monolithic game‑room server, was re‑architected to run on K8s with Istio mesh, achieving near‑zero downtime migration, a two‑thirds reduction in CPU and memory usage, and automated scaling based on load.
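The article does not publish the GameSvr manifests, but load-based scaling of this kind is typically expressed as a Kubernetes HorizontalPodAutoscaler. Below is a hedged client-go sketch that scales a hypothetical gamesvr Deployment on average CPU utilization; all names, replica bounds, and thresholds are assumptions, not the studio's actual configuration.

```go
package main

import (
	"context"
	"log"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Assumes the program runs inside the cluster with RBAC to manage HPAs.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("config: %v", err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("client: %v", err)
	}

	hpa := &autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "gamesvr", Namespace: "games"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1", Kind: "Deployment", Name: "gamesvr",
			},
			MinReplicas: int32Ptr(2),
			MaxReplicas: 50,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: "cpu",
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: int32Ptr(60), // illustrative target
					},
				},
			}},
		},
	}

	if _, err := client.AutoscalingV2().HorizontalPodAutoscalers("games").
		Create(context.Background(), hpa, metav1.CreateOptions{}); err != nil {
		log.Fatalf("create hpa: %v", err)
	}
	log.Println("created HorizontalPodAutoscaler games/gamesvr")
}
```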
For the massive CGI services (≈350 instances), the team applied two strategies: high‑traffic CGIs were refactored to use coroutine‑based asynchronous handling with http‑parser and libco, while low‑traffic CGIs were containerized together with Apache and migrated in bulk, achieving up to 85% CPU and 70% memory savings.
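The high-traffic CGIs were rebuilt in C++ on libco coroutines with http-parser. As a loose Go analogue only, not the studio's implementation, the sketch below shows the same pattern: each request runs in its own lightweight coroutine and makes a bounded, cancellable backend call before responding, so a slow upstream does not tie up a worker process the way it would under the old Apache CGI model.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"
)

// queryBackend stands in for the game-platform RPC a real CGI would make;
// it is a placeholder, not an actual Happy Game Studio interface.
func queryBackend(ctx context.Context, uid string) (string, error) {
	select {
	case <-time.After(10 * time.Millisecond): // simulated backend latency
		return "profile-for-" + uid, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	// net/http runs each request in its own goroutine, playing the role libco
	// coroutines play in the C++ version: the handler can block on the backend
	// call without consuming an OS thread or worker process per request.
	http.HandleFunc("/profile", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 200*time.Millisecond)
		defer cancel()

		profile, err := queryBackend(ctx, r.URL.Query().Get("uid"))
		if err != nil {
			http.Error(w, "backend timeout", http.StatusGatewayTimeout)
			return
		}
		fmt.Fprintln(w, profile)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```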
The in‑house CubeDB storage, holding tens of terabytes across hundreds of MySQL tables, was migrated to Tencent's TcaplusDB via a Cube2TcaplusProxy that adapts the private protocol, enabling seamless data sync and lossless cut‑over.
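The article does not describe Cube2TcaplusProxy's internals, so the Go sketch below is only a hedged guess at what a lossless cut-over layer can look like: behind hypothetical Store interfaces, writes go to both backends and reads prefer TcaplusDB with a CubeDB fallback. The real proxy adapts the CubeDB private protocol to TcaplusDB calls directly; every name here is illustrative.

```go
package cutover

import "context"

// Store is a hypothetical abstraction over either storage backend.
type Store interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, value []byte) error
}

// MigratingStore illustrates one possible cut-over strategy (an assumption,
// not the documented design): dual-write during migration, read from the new
// backend first, and fall back to the old one on a miss.
type MigratingStore struct {
	Old Store // CubeDB
	New Store // TcaplusDB
}

func (m *MigratingStore) Put(ctx context.Context, key string, value []byte) error {
	if err := m.New.Put(ctx, key, value); err != nil {
		return err
	}
	// Keep the legacy store consistent until the cut-over completes.
	return m.Old.Put(ctx, key, value)
}

func (m *MigratingStore) Get(ctx context.Context, key string) ([]byte, error) {
	if v, err := m.New.Get(ctx, key); err == nil {
		return v, nil
	}
	return m.Old.Get(ctx, key)
}
```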
Multi‑cluster deployment was realized by assigning different business teams to separate K8s clusters while sharing common services in a public cluster, with Istio control‑plane federation provided by Tencent Cloud Mesh (TCM) to enable low‑cost cross‑cluster communication.
In summary, through systematic analysis and cloud‑native refactoring, the Happy Game Studio achieved a smooth, high‑quality migration to a Kubernetes‑Istio mesh, gaining automated deployment, service discovery, elastic scaling, robust governance, and comprehensive observability, while dramatically improving reliability, maintainability, and operational efficiency.
Source: the High Availability Architecture official account.