How Wenxin X1.1 Tops China’s LLMs on the New SuperCLUE-CPIF Benchmark

The recent release of the SuperCLUE-CPIF benchmark shows Baidu's Wenxin X1.1 achieving the highest score among Chinese large language models, surpassing competitors such as DeepSeek-V3.2-Exp-Thinking and Hunyuan-T1, with notable advantages in precise instruction following and complex task handling.

Baidu Tech Salon

SuperCLUE-CPIF Benchmark Launch

Recently, the Chinese Precise Instruction Following benchmark (SuperCLUE-CPIF) was officially released. Wenxin X1.1 scored 75.51 points, ranking first among domestic large models and leading in both the task-type and instruction-count categories.

The evaluation covered ten models, including GPT-5 (high), DeepSeek-V3.2-Exp-Thinking, Claude-Sonnet-4.5-Reasoning, Gemini-2.5-Pro, and others. SuperCLUE-CPIF focuses on assessing large language models' ability to precisely follow complex, multi-constraint Chinese instructions and to transform natural-language commands into outputs that satisfy all requirements.
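To make "multi-constraint instruction following" concrete, here is a minimal sketch of the kind of check such a benchmark performs: a response is scored against every explicit constraint in the instruction, and only passing all of them counts as precise compliance. This is an illustration, not the actual SuperCLUE-CPIF harness; the constraint names and example instruction are invented.

```python
# Illustrative sketch (not the real SuperCLUE-CPIF code): score a model
# response against several constraints at once. A precise instruction
# follower must satisfy every constraint, not just most of them.

def check_constraints(response, constraints):
    """Return a dict mapping constraint name -> pass/fail for one response."""
    return {name: predicate(response) for name, predicate in constraints}

# Hypothetical instruction: "Answer in under 20 words, mention 'Beijing',
# and end with a period."
constraints = [
    ("max_20_words", lambda r: len(r.split()) <= 20),
    ("mentions_beijing", lambda r: "Beijing" in r),
    ("ends_with_period", lambda r: r.rstrip().endswith(".")),
]

response = "Beijing is the capital of China."
scores = check_constraints(response, constraints)
all_satisfied = all(scores.values())  # strict: one failed constraint fails the item
```

Real benchmark items layer many such constraints (format, length, tone, content) on a single prompt, which is why overall scores stay well below 100 even for frontier models.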

Results show Wenxin X1.1 at the top with 75.51 points, followed by DeepSeek-V3.2-Exp-Thinking (73.98) and Hunyuan-T1-20250822 (65.82) among domestic models.

Wenxin X1.1 is built on the Wenxin 4.5 model and uses an iterative mixed-reinforcement-learning framework. This approach simultaneously improves general-task performance and agent-task performance, while iterative self-distillation data generation continuously enhances overall model quality.

At the WAVE SUMMIT 2025 demo, X1.1 demonstrated strong capabilities in complex writing tasks, combining internal knowledge, web-search tools, and deep reasoning to produce fact-accurate, well-structured, logical, and elegant content. In multi-step service scenarios such as handling diverse user issues on a bike-sharing platform, the model followed business processes, invoked tools autonomously, and adapted to user emotions.

As one of the earliest Chinese companies to invest in large-model R&D, Baidu leverages a full-stack self-developed ecosystem of chips, frameworks, models, and applications. Thanks to PaddlePaddle-Wenxin joint optimization, X1.1 improves factuality by 34.8%, instruction compliance by 12.5%, and agent performance by 9.6% compared with the original X1.

Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
