How Tencent’s TRMT Tech Delivered a Huge Speedup to DeepSeek’s Large‑Model Network
DeepSeek engineers highlighted Tencent's open‑source contributions to DeepEP, built on its TRMT technology. The work boosts GPU‑to‑GPU communication efficiency by up to 300%, doubles performance on RoCE, and adds a further 30% gain on InfiniBand, tackling lane‑utilization and CPU‑control‑plane bottlenecks through three targeted optimizations.
DeepSeek engineers recently highlighted a code contribution from Tencent on GitHub, describing it as a "huge speedup" for their large‑model network.
The core of the contribution is Tencent's TRMT (data‑center and GPU communication) technology, delivered through the open‑source DeepEP framework. DeepEP breaks through NCCL's performance ceiling, delivering up to a 300% improvement in communication efficiency and freeing many Mixture‑of‑Experts (MoE) models from dependence on NVIDIA's NCCL.
DeepEP, however, performs excellently on high‑cost InfiniBand (IB) networks but struggles in the more common RoCE environments. The gap stems from two issues: low lane utilization on dual‑port RoCE NICs, and a CPU‑controlled data path that adds latency and energy overhead.
Dual‑Lane Full Utilization: Topology‑Aware Multi‑QP Linking
Tencent applied a dynamic allocation algorithm that balances traffic across both NIC ports. By assigning UDP source ports intelligently, each GPU pair establishes multiple Queue Pairs (QPs) that map to distinct physical lanes, achieving near‑theoretical peak bandwidth.
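The balancing idea can be sketched as follows. This is a minimal illustration, not DeepEP's actual code: the function name, the port base, and the assumption that the NIC/switch hash reduces to `sport % NUM_PORTS` are all hypothetical simplifications of how UDP source‑port entropy steers RoCE flows onto distinct physical lanes.

```python
# Hypothetical sketch of topology-aware multi-QP lane assignment.
# Assumption: the ECMP/LAG hash effectively maps a flow's UDP source
# port to a physical lane via `sport % NUM_PORTS` (illustrative only).

NUM_PORTS = 2          # dual-port RoCE NIC
QPS_PER_PAIR = 4       # queue pairs established per GPU pair

def assign_udp_sports(gpu_pair_id, base_sport=49152):
    """Pick UDP source ports so consecutive QPs land on alternating
    physical lanes, keeping both ports evenly loaded."""
    sports = []
    for qp in range(QPS_PER_PAIR):
        lane = qp % NUM_PORTS                  # target lane for this QP
        sport = base_sport + gpu_pair_id * QPS_PER_PAIR + qp
        while sport % NUM_PORTS != lane:       # nudge onto the intended lane
            sport += 1
        sports.append(sport)
    return sports

ports = assign_udp_sports(gpu_pair_id=3)
lanes = [p % NUM_PORTS for p in ports]
assert sorted(lanes) == [0, 0, 1, 1]           # each lane gets half the QPs
```

Because each QP is pinned to a different lane, traffic between one GPU pair no longer collapses onto a single port, which is what lets the NIC approach its theoretical aggregate bandwidth.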
Bypassing the CPU Control Plane with IBGDA
Leveraging InfiniBand GPU Direct Accelerator (IBGDA), Tencent eliminated CPU involvement in the control plane, pushing control‑latency down to the hardware limit.
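The control‑plane difference can be modeled conceptually. The sketch below is illustrative only (the `FakeNIC` class and function names are invented for this example, not IBGDA's API): without IBGDA, a CPU proxy thread must relay every GPU send request to the NIC; with IBGDA, the GPU thread writes the work queue entry (WQE) and rings the NIC doorbell itself.

```python
# Conceptual model only -- FakeNIC and these helpers are hypothetical,
# not the IBGDA interface.

class FakeNIC:
    def __init__(self):
        self.posted = []
        self.doorbells = 0
    def post(self, wqe):
        self.posted.append(wqe)
    def ring_doorbell(self):
        self.doorbells += 1

def send_via_cpu_proxy(nic, request_queue, req):
    """Classic path: GPU hands the request to host memory, a CPU
    proxy thread polls it and posts on the GPU's behalf."""
    request_queue.append(req)      # GPU -> host handoff (extra hop)
    wqe = request_queue.pop(0)     # CPU proxy picks it up
    nic.post(wqe)
    nic.ring_doorbell()

def send_via_ibgda(nic, req):
    """IBGDA path: the GPU thread builds the WQE and rings the
    doorbell directly, with no CPU in the loop."""
    nic.post(req)
    nic.ring_doorbell()
```

Removing the host round‑trip is what pushes per‑transfer control latency down toward the hardware limit, and it also frees the CPU cores that previously ran proxy threads.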
Atomic Signalling Coordination: QP Internal Sequencing Lock
A hardware‑generated digital fingerprint (the "QP internal sequencing lock") ensures that data arriving at a destination GPU is processed in the correct order, even when thousands of transfers occur concurrently. This eliminates the "first‑sent, later‑arrived" chaos and guarantees orderly processing.
The three‑pronged optimization—dual‑lane utilization, IBGDA‑based control‑plane bypass, and atomic signalling—doubles DeepEP’s performance on RoCE and adds an additional 30% boost when the same techniques are applied back to InfiniBand networks.
All these advances have been fully open‑sourced in the DeepEP community and are already deployed in Tencent's Hunyuan large‑model training and inference pipelines, on high‑performance clusters built with Tencent's Xingmai (星脉) networking and H20 servers.
