What Do the Latest AIIA FactTesting Benchmarks Reveal About China’s Large Language Models?
At the AIIA’s 14th plenary meeting in Nanjing, the Q1 2025 FactTesting benchmark results were released. The benchmark program has evaluated more than 200 large models to date. This round placed Baidu’s Wenxin 4.5 first in basic capabilities and Wenxin X1 first in reasoning capabilities, and outlined the expanded multimodal and agent testing roadmap for the year.
The China Artificial Intelligence Industry Development Alliance (AIIA) continuously tracks large‑model and intelligent‑agent advancements. Since 2024 it has built the “FactTesting” benchmark, completing six monitoring rounds and testing more than 200 open‑source and closed‑source models. In 2025 the scope expands to multimodal understanding, text‑to‑image, text‑to‑video, and early autonomous‑agent evaluation.
Q1 2025 Benchmark Release
On 9 April 2025, at the 14th AIIA plenary meeting in Nanjing, the Q1 2025 FactTesting results were announced. Wei Kai, head of the overall group, presented the findings.
Basic Capability Rankings
Wenxin 4.5 from Baidu topped the basic‑capability scores.
Reasoning Capability Rankings
Wenxin X1 from Baidu achieved the highest reasoning scores.
Model Details
Wenxin 4.5 is Baidu’s next‑generation native multimodal foundation model. By jointly modeling multiple modalities it delivers strong multimodal comprehension, improved language abilities, reduced hallucinations, and enhanced logical reasoning and code generation.
Wenxin X1 offers stronger understanding, planning, reflection, and evolution capabilities and supports multimodal input. Baidu describes it as the first deep‑thinking model that can use tools autonomously. It excels in Chinese knowledge Q&A, literary creation, document writing, everyday dialogue, logical reasoning, complex calculations, and tool invocation.
Both models are freely available on the Wenxin Yiyan website (https://yiyan.baidu.com).
Future Outlook
2025 is positioned as a year of comprehensive iteration for large‑model technology. Baidu plans to increase investments in AI, data centers, and cloud infrastructure to build the next generation of smarter models.
