How Tencent’s TGW Achieves Seamless Fast Migration and Self‑Healing Fault Recovery
The paper presents Tencent’s TGW cloud gateway architecture, highlighting a 2.9× forwarding performance boost, lossless state migration within 4 seconds, sub‑minute fault detection, multi‑level fault‑tolerance mechanisms, and operational best practices that enable 100 % availability for massive online services.
Background and Goals
Large‑scale cloud data centers are the backbone of the Internet. TGW (Tencent Gateway) integrates elastic public‑network access and intelligent load balancing to meet the rapid traffic growth and stringent latency requirements of online games, live streaming, and real‑time media. The system aims for ultra‑high performance forwarding, seconds‑level elastic scaling, intelligent high‑availability, and sub‑minute fault detection.
Architecture and Workflow
TGW follows a hierarchical modular design consisting of three parts:
Forwarding plane: Stateless TGW‑EIP (elastic public‑IP) and stateful TGW‑CLB (cloud load balancer).
Control plane: Global orchestrator, local operator, and distributed load distributor (LD).
Auxiliary components: BGP + ECMP routing, probing for fault detection, and logging agents.
Deployment places TGW‑EIP at the region entry and TGW‑CLB inside each availability zone (AZ). Inbound traffic is first routed via BGP to TGW‑EIP clusters, where NAT and tunnel encapsulation occur, then handed to TGW‑CLB for stateful load‑balanced forwarding.
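To make the inbound path concrete, here is a minimal C sketch of the two TGW‑EIP steps, NAT followed by tunnel encapsulation. The header layouts, NAT mapping, and cluster addressing are illustrative stand‑ins; TGW's actual tunnel format is not given in the summary.

#include <stdint.h>
#include <stdio.h>

/* Illustrative header layouts; TGW's real tunnel format is not public. */
struct ip_hdr  { uint32_t src, dst; };
struct tun_pkt { struct ip_hdr outer; uint32_t tun_id; struct ip_hdr inner; };

/* Toy stand-ins for the EIP->VIP NAT table and the AZ routing table. */
static uint32_t eip_to_vip(uint32_t eip)     { return eip ^ 0xff000000u; }
static uint32_t clb_cluster_ip(uint32_t vip) { return 0x0a800000u | (vip & 0xffu); }

/* Inbound path at TGW-EIP: NAT the public destination to the internal
 * VIP, then encapsulate toward the TGW-CLB cluster in the owning AZ. */
static struct tun_pkt encap_inbound(struct ip_hdr pkt)
{
    pkt.dst = eip_to_vip(pkt.dst);                     /* NAT: EIP -> VIP */
    struct tun_pkt t = {
        .outer  = { .src = 0x0a000001u,                /* this EIP node   */
                    .dst = clb_cluster_ip(pkt.dst) },  /* CLB in the AZ   */
        .tun_id = pkt.dst,
        .inner  = pkt,
    };
    return t;
}

int main(void)
{
    struct ip_hdr in = { .src = 0x01020304u, .dst = 0xb4650001u };
    struct tun_pkt t = encap_inbound(in);
    printf("tunnel outer dst %08x, inner dst %08x\n",
           (unsigned)t.outer.dst, (unsigned)t.inner.dst);
    return 0;
}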
Key Technologies
1. Efficient Forwarding Plane
Two forwarding models: TGW‑EIP uses a Run‑to‑Completion (RTC) model; TGW‑CLB combines Pipeline with RTC for stateful processing.
Core optimizations for TGW‑EIP: single‑core batch processing, hash‑lookup prefetching with a sliding window (cutting the cache‑miss rate from 20 % to 5 %), and minimal hash‑bucket prefetching (sketched after this list).
Performance gains: single‑node throughput rises 53 % and latency holds steady at 66–105 µs, reaching 2.9× the throughput of the baseline Tripod solution for 512‑byte packets.
Core optimizations for TGW‑CLB: dynamic dispatch keyed on service identifiers (IP 5‑tuple, QUIC connection ID), lock‑free ring buffers, and a 1:2 dispatch‑to‑processing core ratio (see the dispatch sketch below), likewise reaching 2.9× Tripod's throughput.
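The batch‑processing and sliding‑window prefetch ideas can be shown in a minimal C sketch of a run‑to‑completion burst loop. The flow table, hash values, and burst size here are toy stand‑ins, not TGW's implementation:

#include <stdint.h>
#include <stdio.h>

#define BURST   32      /* packets pulled per poll; illustrative        */
#define WINDOW   4      /* prefetch distance (the sliding window)       */
#define BUCKETS 1024

struct pkt    { uint32_t flow_hash; };          /* headers elided       */
struct bucket { uint32_t flow_hash; uint32_t next_hop; };

static struct bucket table[BUCKETS];            /* toy flow table       */

static struct bucket *flow_bucket(uint32_t h) { return &table[h % BUCKETS]; }

static void forward_one(struct pkt *p, struct bucket *b)
{
    (void)p; (void)b;   /* real code would rewrite headers and TX here  */
}

/* Run-to-completion worker: one core pulls a burst and finishes all
 * processing itself.  Each iteration prefetches the hash bucket WINDOW
 * packets ahead, so by the time packet i is handled its bucket is
 * usually already in cache. */
static void rtc_process_burst(struct pkt *burst, int n)
{
    for (int i = 0; i < WINDOW && i < n; i++)           /* warm-up      */
        __builtin_prefetch(flow_bucket(burst[i].flow_hash));

    for (int i = 0; i < n; i++) {
        if (i + WINDOW < n)                             /* slide window */
            __builtin_prefetch(flow_bucket(burst[i + WINDOW].flow_hash));
        forward_one(&burst[i], flow_bucket(burst[i].flow_hash));
    }
}

int main(void)
{
    struct pkt burst[BURST];
    for (int i = 0; i < BURST; i++)
        burst[i].flow_hash = (uint32_t)i * 2654435761u;
    rtc_process_burst(burst, BURST);
    puts("burst processed");
    return 0;
}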
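And a sketch of the TGW‑CLB dispatch side, again with illustrative names: the dispatch core hashes each packet's service identifier and pushes it onto a single‑producer/single‑consumer ring owned by one processing core, so per‑connection state stays core‑local and lock‑free. The consumer side and QUIC connection‑ID handling are omitted:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SZ 256u            /* slots per ring; power of two          */
#define NPROC   2u              /* 1 dispatch core : 2 processing cores  */

struct ring {                   /* single-producer/single-consumer ring  */
    _Atomic uint32_t head, tail;
    void *slot[RING_SZ];
};

static int ring_push(struct ring *r, void *p)   /* producer side only    */
{
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
    if (t - h == RING_SZ) return -1;            /* full: drop or back off */
    r->slot[t % RING_SZ] = p;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return 0;
}

struct tuple5 { uint32_t sip, dip; uint16_t sport, dport; uint8_t proto; };

static uint32_t hash5(const struct tuple5 *t)   /* toy hash, illustrative */
{
    return (t->sip ^ t->dip ^ t->sport ^ ((uint32_t)t->dport << 16)
            ^ t->proto) * 2654435761u;
}

static struct ring rings[NPROC];

/* The same flow always hashes to the same processing core, so
 * per-connection state stays core-local and needs no lock. */
static void dispatch(struct tuple5 *t)
{
    ring_push(&rings[hash5(t) % NPROC], t);
}

int main(void)
{
    struct tuple5 t = { 0x0a000001u, 0x0a000002u, 12345, 443, 6 };
    dispatch(&t);
    printf("queued on ring %u\n", (unsigned)(hash5(&t) % NPROC));
    return 0;
}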
2. State Migration Mechanism
TGW supports lossless hot migration, enabling seamless service continuity for latency‑sensitive workloads.
Hot vs. cold migration: Hot migration copies connection state without disrupting traffic; cold migration rebuilds connections, incurring latency.
Hot migration flow: the controller first copies stateless configuration (VIP‑DIP mappings), then dynamic connection state. Once 90 % of the state has been transferred, the new cluster announces routes via BGP, and packets it does not recognize are proxied back to the old cluster. The entire process completes within 4 seconds (a driver sketch follows this list).
State‑aggregation optimizations include VIP‑granular migration (avoiding per‑connection moves) and dedicated migration threads that decouple migration work from data‑plane processing.
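The flow condenses into a small driver sketch. Every hook below is a hypothetical stand‑in (the controller's real interfaces are not public), but the ordering, static configuration first, then state, with an early BGP flip backed by a residual proxy, follows the description above:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical hooks; the controller's real interfaces are not public. */
static void copy_vip_dip_mappings(void) { puts("static config copied"); }
static int  copy_state_batch(void)      { static int pct; return pct += 30; }
static void announce_bgp_routes(void)   { puts("new cluster announced via BGP"); }
static void enable_residual_proxy(void) { puts("unmatched packets -> old cluster"); }

/* Hot-migration driver: stateless config first, then connection state;
 * routes flip once ~90% of the state has landed, with a proxy catching
 * packets for connections that have not been copied yet. */
static void hot_migrate(void)
{
    copy_vip_dip_mappings();

    bool announced = false;
    for (int pct = 0; pct < 100; ) {
        pct = copy_state_batch();          /* returns % transferred     */
        if (!announced && pct >= 90) {
            announce_bgp_routes();         /* traffic shifts early...   */
            enable_residual_proxy();       /* ...with a safety net      */
            announced = true;
        }
    }
    puts("migration complete");
}

int main(void) { hot_migrate(); return 0; }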
3. Fault Recovery Mechanism
TGW employs a multi‑level fault‑tolerance model and dispersed migration to achieve second‑level recovery.
In‑cluster link synchronization: only connections alive for more than 3 seconds are synchronized (cutting sync traffic by 70–80 %); batch export aggregates state records until a packet reaches the MTU or 2 seconds elapse before sending (sketched after this list).
Performance: a single forwarding node can synchronize 130 M connections with a peak bandwidth of 350 Mbps.
Redundancy models: Active‑Active within an AZ, Active‑Standby across AZs, and DNS‑based failover across regions.
Dispersed migration shards the affected VIPs across k = 10 buffer clusters; parallel migration reduces the fault impact to 1/k and can shrink the failure radius exponentially (sketch below).
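The batch‑export rule from the synchronization item above fits in a few lines: flush when the next record would overflow the MTU, or when the oldest queued record is 2 seconds old. Record format and send path are illustrative:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MTU        1500     /* flush when the batch would exceed this    */
#define FLUSH_SECS 2        /* ...or when the oldest record is this old  */

struct exporter {
    uint8_t buf[MTU];
    size_t  used;
    time_t  first;          /* arrival time of the oldest queued record  */
};

static void send_batch(struct exporter *e)
{
    if (e->used == 0) return;
    printf("flush %zu bytes\n", e->used);   /* real code: one UDP send   */
    e->used = 0;
}

/* Queue one connection-state record, flushing on size or age. */
static void export_record(struct exporter *e, const void *rec, size_t len)
{
    time_t now = time(NULL);
    if (e->used + len > MTU || (e->used && now - e->first >= FLUSH_SECS))
        send_batch(e);
    if (e->used == 0) e->first = now;
    memcpy(e->buf + e->used, rec, len);
    e->used += len;
}

int main(void)
{
    struct exporter e = { .used = 0 };
    uint8_t rec[64] = { 0 };
    for (int i = 0; i < 40; i++) export_record(&e, rec, sizeof rec);
    send_batch(&e);                         /* final flush               */
    return 0;
}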
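And the dispersed‑migration idea as a sketch: hashing each affected VIP across k buffer clusters means any single cluster absorbs roughly 1/k of the failed cluster's load, and the k partial migrations run in parallel. The hash is a toy placeholder:

#include <stdint.h>
#include <stdio.h>

#define K 10u   /* number of buffer clusters, as in the text */

/* Each affected VIP lands on one of K buffer clusters, so any single
 * cluster absorbs ~1/K of the failed cluster's load, and the K partial
 * migrations run in parallel. */
static unsigned buffer_cluster_for(uint32_t vip)
{
    return (unsigned)((vip * 2654435761u) % K);   /* toy hash */
}

int main(void)
{
    uint32_t vips[] = { 0x0a010001u, 0x0a010002u, 0x0a010003u };
    for (int i = 0; i < 3; i++)
        printf("VIP %08x -> buffer cluster %u\n",
               (unsigned)vips[i], buffer_cluster_for(vips[i]));
    return 0;
}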
4. Fault Detection and Localization
TGW uses a colored‑mark probing system to locate faults within one minute.
TCP half‑open probes embed the markers; any response is immediately reset so probes consume no backend resources.
Probes run every 5 seconds with varying source ports.
A Trace Point (TP) records the packet's path (e.g., forwarding‑node ID); a Drop Point (DP) records the drop reason, such as FLOW_LIMIT or TUNNEL_ENCAP_FAIL (record sketch below).
Case studies: a single node crash causes a sharp TP drop; node jitter triggers a DP surge.
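Sketched as C structs under assumed field names (the summary names the record contents, not their layout), the two record types look like this; correlating TPs reconstructs the path a colored probe took, while DPs attach a machine‑readable reason to each loss:

#include <stdint.h>
#include <stdio.h>

/* Field names and layouts are guesses for illustration; the paper
 * describes what TP/DP records contain, not their wire format. */
enum drop_reason { FLOW_LIMIT, TUNNEL_ENCAP_FAIL /* , ... */ };

struct trace_point {        /* TP: "the colored probe passed through here" */
    uint32_t probe_mark;    /* the color embedded in the probe packet      */
    uint32_t node_id;       /* forwarding node that observed it            */
    uint64_t ts_ns;
};

struct drop_point {         /* DP: "the probe was dropped here, and why"   */
    uint32_t probe_mark;
    uint32_t node_id;
    enum drop_reason reason;
    uint64_t ts_ns;
};

int main(void)
{
    struct trace_point tp = { .probe_mark = 0x42, .node_id = 7, .ts_ns = 0 };
    struct drop_point  dp = { .probe_mark = 0x42, .node_id = 7,
                              .reason = FLOW_LIMIT, .ts_ns = 0 };
    printf("TP node %u, DP reason %d\n", (unsigned)tp.node_id, (int)dp.reason);
    return 0;
}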
5. Operational Experience
Eight years of production operation have yielded best practices across five dimensions.
Fault‑domain isolation: Hierarchical design limits impact—region‑level independent TGW instances, AZ‑level active‑standby, and cluster‑level VIP segmentation.
Redundancy principle (50 % redundancy): AZ pairing, dual‑rack deployment at the rack level, and machine‑level over‑provisioning so the failure of half the nodes can be tolerated.
Cluster management: static checks of hardware and firmware, progressive traffic ramp‑up starting from a few test VIPs, automated scaling triggered at CPU > 70 % or connection‑rate thresholds (trigger sketch after this list), and graceful shutdown via BGP withdrawal.
Configuration management: Version gray‑release (5 % rollout, 24 h monitoring) and one‑minute rollback using retained binaries.
Protocol optimization: Migration from multicast to UDP unicast improves reliability to 99.999 % at a 15 % bandwidth cost; odd‑even routing ensures flow affinity.
Security: Layered DDoS cleaning (edge gateway, TGW rate‑limiting per VIP), dynamic isolation of attacked VIPs, blacklist learning for repeated DP events, and protocol compliance checks that drop malformed GRE or QUIC frames.
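The scale‑out trigger from the cluster‑management item reduces to a threshold check; the CPU figure comes from the text, while the connection‑rate limit is a hypothetical placeholder:

#include <stdbool.h>
#include <stdio.h>

#define CPU_SCALE_PCT   70.0      /* from the text                      */
#define CPS_SCALE_LIMIT 100000u   /* new connections/sec; hypothetical  */

static bool should_scale_out(double cpu_pct, unsigned conn_per_sec)
{
    return cpu_pct > CPU_SCALE_PCT || conn_per_sec > CPS_SCALE_LIMIT;
}

int main(void)
{
    printf("cpu 75%%, 1k cps: %s\n",
           should_scale_out(75.0, 1000) ? "scale out" : "hold");
    printf("cpu 40%%, 1k cps: %s\n",
           should_scale_out(40.0, 1000) ? "scale out" : "hold");
    return 0;
}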
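Per‑VIP rate limiting is commonly implemented as a token bucket; here is a minimal sketch under assumed rates and units (the source does not give the actual policy or enforcement point):

#include <stdbool.h>
#include <stdio.h>

/* One bucket per VIP; rates and units are illustrative. */
struct vip_bucket {
    double tokens;      /* current allowance                 */
    double rate;        /* refill, packets per second        */
    double burst;       /* bucket depth (max burst)          */
    double last;        /* time of last refill, seconds      */
};

static bool vip_admit(struct vip_bucket *b, double now)
{
    b->tokens += (now - b->last) * b->rate;       /* refill            */
    if (b->tokens > b->burst) b->tokens = b->burst;
    b->last = now;
    if (b->tokens < 1.0) return false;            /* over limit: drop  */
    b->tokens -= 1.0;
    return true;
}

int main(void)
{
    struct vip_bucket b = { .tokens = 2, .rate = 1, .burst = 2, .last = 0 };
    for (int i = 0; i < 4; i++)
        printf("pkt %d at t=0: %s\n", i, vip_admit(&b, 0.0) ? "pass" : "drop");
    return 0;
}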
Conclusion and Outlook
The TGW paper details core technologies and operational insights that can serve as a reference for future cloud‑gateway designs. Ongoing work will integrate hardware offload and programmable forwarding to further boost performance and reliability, driving the next generation of intelligent network infrastructure.