Tsinghua Special Award Winner Yuxian Gu Joins DeepSeek
Yuxian Gu, a Tsinghua University PhD student (2021 cohort) and 2025 Special Scholarship recipient, has joined DeepSeek. He brings expertise in pre‑training data selection, knowledge distillation for model compression, and efficient model architectures such as Jet‑Nemotron, which outperforms leading open‑source LLMs while delivering up to a 53.6× throughput speedup on H100 GPUs.
DeepSeek is rapidly expanding its talent pool across algorithm, R&D, product, operations, and data engineering roles, and the company announced that DeepSeek V4 will be released mid‑month. Gu's name appears in the V4 paper's author list, and it is now confirmed that he has officially joined DeepSeek.
Gu earned the 2025 Apple Doctoral Scholarship and the Ant In‑Tech Scholarship. He studied in the Interactive AI group (Conversational AI, CoAI) at Tsinghua under Prof. Huang Minlie, and his personal homepage is https://t1101675.github.io/.
His research focuses on improving efficiency throughout the full lifecycle of large language models, covering pre‑training, downstream adaptation, and inference. He pursues three main directions:
Pre‑training data selection: developing theory and algorithms to optimize data choice for training stronger, more efficient models (e.g., PDS, Instruction Pre‑training, Learning Law).
Knowledge distillation for model compression: creating methods to transfer knowledge from large models to smaller, deployable ones, with representative works MiniLLM and MiniPLM.
Efficient model architecture: designing new architectures that lower computational cost while boosting performance, exemplified by Jet‑Nemotron.
According to his Google Scholar profile, Gu’s papers have been cited nearly 5,000 times, with over 1,000 citations for two works: "Pre‑trained models: Past, present and future" and "MiniLLM: Knowledge distillation of large language models". He has been first author on multiple papers at top AI conferences such as NeurIPS, ICLR, and ACL.
Jet‑Nemotron, a novel hybrid‑architecture language model series, matches or exceeds the accuracy of state‑of‑the‑art full‑attention models while delivering much higher generation throughput. Its core innovations are:
Post Neural Architecture Search (PostNAS): an efficient post‑training architecture exploration and adaptation pipeline applicable to any pre‑trained Transformer.
JetBlock: a new linear‑attention module that outperforms prior designs such as Mamba2.
The 2B‑parameter version of Jet‑Nemotron already surpasses leading open‑source full‑attention models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2, delivering up to 53.6× throughput acceleration on H100 GPUs (context length 256K, maximum batch size). On the MMLU and MMLU‑Pro benchmarks, Jet‑Nemotron’s accuracy exceeds several larger MoE full‑attention models, including DeepSeek‑V3‑Small and Moonlight.
In 2024, Gu and collaborators introduced a knowledge‑distillation method that replaces the standard forward Kullback‑Leibler divergence (KLD) objective with a reverse KLD objective, and derived an effective optimization strategy for it. In extensive instruction‑following experiments, the resulting student models, named MiniLLM, produced more precise responses, with lower exposure bias, better calibration, and stronger long‑text generation. The approach has been adopted by leading open‑source communities and industry platforms such as Google, Alibaba, and Nvidia.
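To make the forward versus reverse KLD distinction concrete, here is a minimal sketch on a toy four-token vocabulary. The teacher distribution and the two candidate students are hypothetical values chosen for illustration, not figures from the MiniLLM paper, and the snippet does not reproduce the paper's optimization strategy. It only shows that forward KLD favors a student that spreads probability over everything the teacher might say, while reverse KLD favors a student that commits to one of the teacher's high-probability modes.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical teacher: two strong modes (tokens 0 and 3), little mass elsewhere.
teacher = [0.48, 0.02, 0.02, 0.48]

# Two hypothetical low-capacity students that cannot match the teacher exactly.
spread = [0.25, 0.25, 0.25, 0.25]   # covers every token equally
focused = [0.94, 0.02, 0.02, 0.02]  # commits to a single teacher mode

# Forward KLD, KL(teacher || student): the standard distillation objective.
forward_spread = kl(teacher, spread)
forward_focused = kl(teacher, focused)

# Reverse KLD, KL(student || teacher): the direction the article describes.
reverse_spread = kl(spread, teacher)
reverse_focused = kl(focused, teacher)

print(f"forward KLD  spread={forward_spread:.3f}  focused={forward_focused:.3f}")
print(f"reverse KLD  spread={reverse_spread:.3f}  focused={reverse_focused:.3f}")
```

In this toy setting the forward objective scores the spread student as better, even though it puts half of its probability on tokens the teacher almost never produces, whereas the reverse objective scores the focused student as better. That preference for staying inside the teacher's high-probability regions is the intuition usually given for using reverse KLD with small generative students.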
We look forward to Gu’s next contributions at DeepSeek and anticipate further breakthroughs.
