How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network
DeepSeek engineers highlighted Tencent's open‑source contributions to DeepEP, built on its TRMT technology. The work boosts GPU‑to‑GPU communication efficiency by up to 300%, doubles performance on RoCE, and adds a further 30% gain on InfiniBand, tackling lane‑utilization and CPU‑control‑plane bottlenecks through three targeted optimizations.
DeepSeek engineers recently highlighted a code contribution from Tencent on GitHub, describing it as a "huge speedup" for their large‑model network.
The core of the contribution is Tencent's TRMT (data‑center and GPU communication) technology, delivered through the open‑source DeepEP framework. DeepEP breaks through NCCL's performance ceiling, delivering up to a 300% improvement in communication efficiency and freeing many Mixture‑of‑Experts (MoE) models from dependence on NVIDIA's NCCL.
DeepEP, however, performs excellently on high‑cost InfiniBand (IB) networks but struggles in the more common RoCE environments. The gap stems from two issues: low lane utilization on dual‑port RoCE NICs, and a CPU‑controlled data path that adds latency and energy overhead.
Dual‑Lane Full Utilization: Topology‑Aware Multi‑QP Linking
Tencent applied a dynamic allocation algorithm that balances traffic across both NIC ports. By assigning UDP source ports intelligently, each GPU pair establishes multiple Queue Pairs (QPs) that map to distinct physical lanes, achieving near‑theoretical peak bandwidth.
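The balancing idea can be sketched as follows. This is a minimal illustration, not DeepEP's actual code: the function name, the port base, and the assumption that the NIC/switch hash reduces to `sport % NUM_PORTS` are all hypothetical simplifications of how UDP source‑port entropy steers RoCE flows onto distinct physical lanes.

```python
# Hypothetical sketch of topology-aware multi-QP lane assignment.
# Assumption: the ECMP/LAG hash effectively maps a flow's UDP source
# port to a physical lane via `sport % NUM_PORTS` (illustrative only).

NUM_PORTS = 2          # dual-port RoCE NIC
QPS_PER_PAIR = 4       # queue pairs established per GPU pair

def assign_udp_sports(gpu_pair_id, base_sport=49152):
    """Pick UDP source ports so consecutive QPs land on alternating
    physical lanes, keeping both ports evenly loaded."""
    sports = []
    for qp in range(QPS_PER_PAIR):
        lane = qp % NUM_PORTS                  # target lane for this QP
        sport = base_sport + gpu_pair_id * QPS_PER_PAIR + qp
        while sport % NUM_PORTS != lane:       # nudge onto the intended lane
            sport += 1
        sports.append(sport)
    return sports

ports = assign_udp_sports(gpu_pair_id=3)
lanes = [p % NUM_PORTS for p in ports]
assert sorted(lanes) == [0, 0, 1, 1]           # each lane gets half the QPs
```

Because each QP is pinned to a different lane, traffic between one GPU pair no longer collapses onto a single port, which is what lets the NIC approach its theoretical aggregate bandwidth.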
Bypassing the CPU Control Plane with IBGDA
Leveraging InfiniBand GPU Direct Accelerator (IBGDA), Tencent eliminated CPU involvement in the control plane, pushing control‑latency down to the hardware limit.
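The control‑plane difference can be modeled conceptually. The sketch below is illustrative only (the `FakeNIC` class and function names are invented for this example, not IBGDA's API): without IBGDA, a CPU proxy thread must relay every GPU send request to the NIC; with IBGDA, the GPU thread writes the work queue entry (WQE) and rings the NIC doorbell itself.

```python
# Conceptual model only -- FakeNIC and these helpers are hypothetical,
# not the IBGDA interface.

class FakeNIC:
    def __init__(self):
        self.posted = []
        self.doorbells = 0
    def post(self, wqe):
        self.posted.append(wqe)
    def ring_doorbell(self):
        self.doorbells += 1

def send_via_cpu_proxy(nic, request_queue, req):
    """Classic path: GPU hands the request to host memory, a CPU
    proxy thread polls it and posts on the GPU's behalf."""
    request_queue.append(req)      # GPU -> host handoff (extra hop)
    wqe = request_queue.pop(0)     # CPU proxy picks it up
    nic.post(wqe)
    nic.ring_doorbell()

def send_via_ibgda(nic, req):
    """IBGDA path: the GPU thread builds the WQE and rings the
    doorbell directly, with no CPU in the loop."""
    nic.post(req)
    nic.ring_doorbell()
```

Removing the host round‑trip is what pushes per‑transfer control latency down toward the hardware limit, and it also frees the CPU cores that previously ran proxy threads.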
Atomic Signalling Coordination: QP Internal Sequencing Lock
A hardware‑generated digital fingerprint (the "QP internal sequencing lock") ensures that data arriving at a destination GPU is processed in the correct order, even when thousands of transfers occur concurrently. This eliminates the "first‑sent, later‑arrived" chaos and guarantees orderly processing.
The three‑pronged optimization—dual‑lane utilization, IBGDA‑based control‑plane bypass, and atomic signalling—doubles DeepEP’s performance on RoCE and adds an additional 30% boost when the same techniques are applied back to InfiniBand networks.
All these advances have been fully open‑sourced in the DeepEP community and are already deployed in Tencent's Hunyuan large‑model training and inference pipelines, on high‑performance clusters built with Tencent's Xingmai (星脉) networking and H20 servers.
