
How Tencent’s Public Gateway Overcomes Extreme Availability Challenges

This article describes the architecture of Tencent's Public Gateway (TGW), including its forwarding and control planes, and presents two real-world extreme failure cases: a bug in one batch of NICs and a special IPv6 packet that caused core dumps. It also covers the multi-level disaster-recovery design and the mitigation strategies used to keep the service highly available.

Efficient Ops

GOPS 2023 Conference Overview

On October 26-27, 2023, the 21st GOPS Global Operations Conference was held in Shanghai. It gathered over 80 experts from institutions such as the China Academy of Information and Communications Technology, major banks, securities firms, and the telecom and insurance industries to discuss DevOps, AIOps, SRE, CT, security, and related topics.

Tencent Public Gateway (TGW) Overview

TGW (Tencent Gateway) is Tencent's self-developed IaaS platform that provides external network egress and load-balancing capabilities for its Internet data centers (IDCs).

Products built on TGW include Elastic IP and Load Balancer, serving major Tencent services such as WeChat, QQ, Tencent Video, Honor of Kings, Peace Elite, and most Tencent Cloud customers.

Overall Architecture

TGW architecture is divided into a forwarding plane and a control plane.

Forwarding Plane

The forwarding plane processes incoming packets: it decapsulates and encapsulates them and dispatches them to clients or to real servers (RS). Some functions are used only by CLB: scheduling algorithms (e.g., WR, WLC) dynamically distribute requests across RS, and abnormal RS instances are automatically detected and excluded.
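To make the scheduling idea concrete, here is a minimal sketch of a weighted least-connections (the usual reading of WLC) pick that skips unhealthy RS. The `RealServer` class, its fields, and `pick_wlc` are hypothetical names for illustration; the source does not describe how TGW implements this.

```python
from dataclasses import dataclass


@dataclass
class RealServer:
    # Hypothetical model of one RS; field names are assumptions.
    name: str
    weight: int
    active_conns: int = 0
    healthy: bool = True


def pick_wlc(servers):
    """Pick the healthy server with the lowest active_conns / weight ratio.

    Returns None when no healthy server with a positive weight exists.
    """
    candidates = [s for s in servers if s.healthy and s.weight > 0]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.active_conns / s.weight)


servers = [
    RealServer("rs-a", weight=3, active_conns=6),                 # ratio 2.0
    RealServer("rs-b", weight=1, active_conns=1),                 # ratio 1.0
    RealServer("rs-c", weight=2, active_conns=1, healthy=False),  # excluded
]
chosen = pick_wlc(servers)
chosen.active_conns += 1  # account for the newly assigned connection
print(chosen.name)
```

Here `rs-c` has the lowest ratio but is never chosen because it is marked unhealthy, which mirrors the "detect and exclude abnormal RS" behaviour described above.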

Control Plane

The control plane manages user rules and operational systems, including:

Console: purchasing instances, binding and unbinding RS, and adjusting weights.

Monitoring Platform: reporting bandwidth, QPS, and connection counts for billing and monitoring.

Rate-Limit Center: real-time throttling for bandwidth packages and performance-guaranteed instances.

Scheduling System: capacity management and automatic load balancing to improve cluster utilization.

Cluster Management: cluster scaling, bringing clusters online and offline, and LD expansion.
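As an illustration of the kind of throttling a rate-limit component performs, the sketch below uses a token bucket. This is a common technique chosen here as an assumption; the source does not say which algorithm the Rate-Limit Center uses, and the class and parameter names are hypothetical. Time is passed in explicitly so the behaviour is deterministic.

```python
class TokenBucket:
    """Token-bucket limiter: tokens refill at `rate_bps` up to `burst_bits`."""

    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps          # refill rate, bits per second
        self.capacity = burst_bits    # maximum burst size, bits
        self.tokens = burst_bits      # start with a full bucket
        self.last = 0.0               # timestamp of the previous call

    def allow(self, size_bits, now):
        """Return True if a packet of `size_bits` may pass at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if size_bits <= self.tokens:
            self.tokens -= size_bits
            return True
        return False


bucket = TokenBucket(rate_bps=1000, burst_bits=1000)
first = bucket.allow(800, now=0.0)   # fits in the initial burst
second = bucket.allow(800, now=0.1)  # only 300 tokens available: throttled
third = bucket.allow(800, now=1.0)   # bucket has refilled to capacity
print(first, second, third)
```

A bandwidth package would map to one bucket per customer, with the rate set to the purchased bandwidth.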

Extreme Scenario Cases

Case 1: NIC Batch Bug

In 2019, a malformed-packet attack on one cluster caused servers to lose network connectivity, resulting in a 46-minute outage. The root cause was a specific batch of NICs that failed under the attack. Out-of-band management was also ineffective, because NC-SI shared the same NIC and went down with it.

Solution: introduce hardware heterogeneity by sourcing the same component from different vendors across machines (e.g., CPUs from vendor A on some machines and from vendor B on others), so that a defect in one vendor's batch still leaves at least 50% of the cluster operational. In addition, newer server models provide a completely independent NIC for out-of-band management.
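The 50% goal can be checked mechanically. The sketch below computes the fraction of a cluster that survives if the most common vendor of a given component fails entirely. The inventory format and the function name are assumptions for illustration.

```python
from collections import Counter


def worst_case_surviving_fraction(machines, component):
    """Fraction of machines left if the most common vendor of `component` fails."""
    counts = Counter(m[component] for m in machines)
    largest_group = max(counts.values())
    return 1 - largest_group / len(machines)


# Hypothetical inventory: NICs are split evenly between two vendors.
cluster = [
    {"host": "tgw-1", "nic": "vendor-a"},
    {"host": "tgw-2", "nic": "vendor-b"},
    {"host": "tgw-3", "nic": "vendor-a"},
    {"host": "tgw-4", "nic": "vendor-b"},
]
fraction = worst_case_surviving_fraction(cluster, "nic")
print(fraction)
```

A homogeneous cluster scores 0.0 on this measure, which is the situation that made the 2019 incident a full outage.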

Case 2: Special Packet Triggering Core Dump

In September 2022, an IPv6 packet with an extension header triggered a core dump in the forwarding program, causing a 50-minute incident. Monitoring quickly detected the anomaly, but the only available action was to restart devices, which did not resolve the underlying issue. The recurrence stopped only after the offending rule was identified from the core file and removed.

Mitigation strategy (four steps): perception → loss control → anomaly localization → blocking capability. Once the anomaly is detected, the system can degrade service and drop the special packets, which account for only about 5% of total traffic.
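The degrade-and-drop step might look like the following sketch: after repeated crashes the guard switches to a degraded mode that drops IPv6 packets whose first Next Header field names an extension header, while ordinary traffic keeps flowing. The `DegradeGuard` class, the crash threshold, and the choice of header types (a subset of the standard IPv6 extension headers) are assumptions, not TGW's actual logic.

```python
import struct

# Subset of IPv6 extension header types, by Next Header value:
# 0 = Hop-by-Hop Options, 43 = Routing, 44 = Fragment, 60 = Destination Options.
EXTENSION_HEADERS = {0, 43, 44, 60}


def has_extension_header(packet: bytes) -> bool:
    """True if `packet` is IPv6 and its first Next Header is an extension header."""
    if len(packet) < 40 or packet[0] >> 4 != 6:
        return False
    return packet[6] in EXTENSION_HEADERS  # Next Header is byte 6 of the fixed header


class DegradeGuard:
    """Hypothetical guard: enter degraded mode after `crash_threshold` crashes."""

    def __init__(self, crash_threshold):
        self.crash_threshold = crash_threshold
        self.crashes = 0
        self.degraded = False

    def report_crash(self):
        self.crashes += 1
        if self.crashes >= self.crash_threshold:
            self.degraded = True

    def should_forward(self, packet):
        # In degraded mode, drop only the suspicious class of packets.
        return not (self.degraded and has_extension_header(packet))


def make_ipv6(next_header):
    """Build a bare 40-byte IPv6 header with zeroed addresses, for testing."""
    return struct.pack("!IHBB", 6 << 28, 0, next_header, 64) + bytes(32)


guard = DegradeGuard(crash_threshold=2)
tcp_packet = make_ipv6(6)      # Next Header 6 = TCP, no extension header
special_packet = make_ipv6(0)  # Next Header 0 = Hop-by-Hop Options

before = guard.should_forward(special_packet)
guard.report_crash()
guard.report_crash()
after = guard.should_forward(special_packet)
print(before, after, guard.should_forward(tcp_packet))
```

The trade-off is deliberate: a small class of traffic is sacrificed so that the remaining majority is no longer exposed to the crash.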

Disaster‑Recovery Architecture of the Forwarding Plane

TGW clusters are evenly deployed across Zone A and Zone B. Within each zone, devices are placed on two switches and two racks, achieving multi‑level redundancy:

Single‑machine redundancy via ECMP using OSPF/BGP routes.

Switch and rack redundancy through multiple supporting units.

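AZ redundancy: normal traffic flows through Zone A, which uses 25-bit routes; if Zone A fails, traffic automatically switches to Zone B, which uses 24-bit routes. Because routers prefer the longest matching prefix, the 25-bit routes win for as long as they are present.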

Unicast synchronization between devices preserves long-lived connections during failover or scaling.
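The AZ failover above relies on longest-prefix matching, which can be sketched with Python's standard `ipaddress` module. The prefixes come from the 203.0.113.0/24 documentation range and the zone labels are placeholders; the real TGW prefixes are not given in the source.

```python
import ipaddress


def next_hop(dest, routes):
    """Return the zone of the longest prefix that contains `dest`, or None."""
    addr = ipaddress.ip_address(dest)
    best = None
    for prefix, zone in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, zone)
    return best[1] if best else None


# Zone A announces two /25 routes; Zone B announces the covering /24.
routes = [
    ("203.0.113.0/25", "zone-a"),
    ("203.0.113.128/25", "zone-a"),
    ("203.0.113.0/24", "zone-b"),
]
normal = next_hop("203.0.113.10", routes)

# Zone A fails and its routes are withdrawn; only the /24 remains.
routes_after_failure = [r for r in routes if r[1] != "zone-b"] and [
    r for r in routes if r[1] == "zone-b"
]
failover = next_hop("203.0.113.10", routes_after_failure)
print(normal, failover)
```

No reconfiguration is needed at failover time: withdrawing the more specific routes is enough for traffic to fall back to the less specific one.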

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
