How Tencent’s TRMT Boosted DeepSeek’s Communication: A Chinese Open‑Source Success

Tencent’s Star‑Network team partnered with DeepSeek to open‑source the DeepEP communication library, then used its self‑developed TRMT stack to overcome RoCE limitations, achieving up to 100% speedup on RoCEv2 and 30% on InfiniBand, cutting training costs and inference latency for large MoE models.

Smart Era Software Development

In February 2025, DeepSeek announced the open‑source release of five core codebases, including DeepEP, a communication library built specifically for Mixture‑of‑Experts (MoE) model training and inference. DeepEP targets the key bottleneck of MoE scalability, the communication needed to route tokens to experts on other GPUs and gather the results back, with the aim of reducing latency and improving GPU resource utilization.
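To make the communication pattern concrete, here is a minimal sketch of the two phases an MoE layer needs: dispatching each token to the rank that hosts its chosen expert, and combining the expert outputs back into token order. It is plain Python with no networking and is not DeepEP's API; the function names `plan_dispatch` and `combine`, and the layout in which expert `e` lives on rank `e // experts_per_rank`, are assumptions made for illustration.

```python
from collections import defaultdict


def plan_dispatch(token_experts, experts_per_rank):
    """Group token indices by the rank that hosts each token's chosen expert.

    token_experts[i] is the expert id selected for token i (top-1 routing).
    Hypothetical layout: expert e is hosted on rank e // experts_per_rank.
    """
    send_lists = defaultdict(list)
    for token_idx, expert_id in enumerate(token_experts):
        dest_rank = expert_id // experts_per_rank
        send_lists[dest_rank].append(token_idx)
    return dict(send_lists)


def combine(expert_outputs, send_lists, num_tokens):
    """Place per-rank expert outputs back into the original token order."""
    result = [None] * num_tokens
    for rank, token_ids in send_lists.items():
        for token_idx, value in zip(token_ids, expert_outputs[rank]):
            result[token_idx] = value
    return result


# Five tokens, four experts, two experts per rank.
send_lists = plan_dispatch([0, 3, 1, 2, 3], experts_per_rank=2)
outputs = {0: ["a", "c"], 1: ["b", "d", "e"]}  # stand-ins for expert results
restored = combine(outputs, send_lists, num_tokens=5)
```

In a real system each entry of `send_lists` becomes a network transfer, so every MoE layer performs an all‑to‑all exchange twice per forward pass. That is why the speed of this exchange limits how far MoE models scale.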

MoE architectures are credited with substantially lowering the training and inference costs of massive models such as DeepSeek and, reportedly, GPT‑4. In early 2024, Tencent’s self‑developed HunYuan large model became one of the first domestic models to adopt MoE. Historically, MoE training relied on Nvidia’s NCCL collective‑communication library; DeepEP offers an alternative communication stack built specifically for MoE.

DeepEP performs exceptionally on high‑end InfiniBand (IB) networks but struggles on the more prevalent RoCE (RDMA over Converged Ethernet) networks used by most Chinese internet companies, leading to noticeable performance degradation.

Tencent’s Star‑Network team leveraged a decade of experience building high‑concurrency services (QQ, WeChat, games, cloud) to design a dedicated AI network. In 2022, recognizing the divergent network demands of AI workloads, Tencent created the Star‑Network and later the self‑developed TRMT (Tencent Remote Memory Transport) library to address RoCE‑related shortcomings.

Using TRMT, the team applied two key techniques: traffic pre‑planning to maximize dual‑port NIC bandwidth, and GPU‑direct RDMA to bypass CPU control, eliminating control‑plane overhead and pushing latency toward hardware limits. These changes yielded a 100% performance increase (a doubling) on RoCEv2 and a 30% increase on IB networks.
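The article does not describe how TRMT plans traffic, so the following is only a sketch of the general idea behind pre‑planning: when flow sizes are known in advance, they can be assigned to NIC ports statically so that both ports carry a similar load, rather than relying on hashing that may pile several large flows onto one port. The function name `plan_ports` and the greedy largest‑first heuristic are assumptions, not Tencent's algorithm.

```python
def plan_ports(flow_bytes, num_ports=2):
    """Statically assign flows to NIC ports, largest flow first.

    flow_bytes maps a flow id to the number of bytes it will send.
    Each flow goes to the port with the smallest load so far (greedy heuristic).
    Returns (assignment, loads): flow id -> port index, and bytes per port.
    """
    loads = [0] * num_ports
    assignment = {}
    # Sort by descending size; break ties by flow id for a deterministic plan.
    ordered = sorted(flow_bytes.items(), key=lambda kv: (-kv[1], kv[0]))
    for flow_id, size in ordered:
        port = loads.index(min(loads))
        assignment[flow_id] = port
        loads[port] += size
    return assignment, loads


# Hypothetical flow sizes in megabytes.
assignment, loads = plan_ports({"a": 8, "b": 6, "c": 4, "d": 2})
```

With these hypothetical sizes the two ports end up carrying equal loads, so neither sits idle while the other is saturated. A production planner would also have to account for switch paths and congestion, which this sketch ignores.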

According to communication‑library architect Huang Xiaojie, a 10% training speedup translates to a 10% cost reduction, while inference latency dropped from roughly ten seconds to nine seconds, directly benefiting end‑users.
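The arithmetic behind this claim depends on what "speedup" means, and a short sketch makes the distinction explicit. It assumes cost is proportional to GPU time; the function names are illustrative. If a job takes 10% less time, cost falls by exactly 10%, which matches the ten‑to‑nine‑second example. If instead throughput rises by 10%, the time saved is slightly smaller.

```python
def time_saved_fraction(throughput_gain):
    """Fraction of wall-clock time saved when throughput rises by throughput_gain.

    A gain of 0.10 means 10% more work per second, so the same job
    takes 1 / 1.10 of the original time.
    """
    return 1.0 - 1.0 / (1.0 + throughput_gain)


def latency_after(latency_s, reduction_fraction):
    """Latency after cutting it by reduction_fraction (0.10 means 10% shorter)."""
    return latency_s * (1.0 - reduction_fraction)


saved_at_10_percent = time_saved_fraction(0.10)   # about 0.091, not 0.10
saved_at_100_percent = time_saved_fraction(1.00)  # doubling throughput halves time
nine_seconds = latency_after(10.0, 0.10)
```

Under the throughput reading, the 100% gain reported on RoCEv2 would halve communication time, while a 10% gain saves roughly 9.1% of it.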

AI networks fall into two categories: IB, which offers ultra‑low latency but is dominated by Nvidia hardware and incurs high costs, and RoCE, which runs over Ethernet and avoids supply‑chain risks. Tencent chose RoCE as the baseline and evolved its own TCCL library into the newer TRMT to meet AI workload requirements.

The overarching goal is to minimize the communication fraction of GPU time. Tencent’s approach combines multiple GPUs into a “super‑GPU,” allowing direct peer‑to‑peer memory access and reducing reliance on CPUs, a crucial strategy given the relatively lower compute density of domestic GPUs.
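A simple model shows why the communication fraction matters. It assumes a training or inference step consists of compute time plus communication time that is not overlapped with compute; the numbers used are hypothetical and are not measurements from Tencent or DeepSeek.

```python
def comm_fraction(compute_s, comm_s):
    """Share of a step spent on non-overlapped communication."""
    return comm_s / (compute_s + comm_s)


def end_to_end_speedup(compute_s, comm_s, comm_speedup):
    """Overall speedup when only the communication part gets faster.

    comm_speedup is the factor by which communication time shrinks
    (2.0 means communication takes half as long).
    """
    old_step = compute_s + comm_s
    new_step = compute_s + comm_s / comm_speedup
    return old_step / new_step


# Hypothetical step: 8 s of compute and 2 s of exposed communication.
fraction = comm_fraction(8.0, 2.0)
overall = end_to_end_speedup(8.0, 2.0, comm_speedup=2.0)
```

In this hypothetical step, communication is 20% of the time, so doubling communication speed shortens the step from 10 s to 9 s. The larger the communication fraction, the more a faster library pays off, and the same reasoning explains why shrinking that fraction is treated as the overarching goal.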

All optimized code has been fully open‑sourced to the DeepEP community. It is already deployed for Tencent’s HunYuan model and at other leading domestic internet firms, and it is receiving feedback and contributions from the broader ecosystem.

Both Tencent and DeepSeek view open‑source as a strategic lever for technology democratization and industry trust, positioning the collaboration as a milestone for China’s AI open‑source ecosystem.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
