Inside Baidu’s New Wenxin 4.5 & X1: Multimodal Breakthroughs and Tool‑Enabled AI

Baidu officially launched the Wenxin 4.5 and X1 large language models, showcasing native multimodal foundations, advanced attention masks, heterogeneous expert extensions, and tool‑calling capabilities, while offering low‑cost API access on the Qianfan platform and outlining the underlying technical innovations that drive their performance gains.

Wenxin 4.5: A Native Multimodal Foundation Model

On March 16, Baidu released Wenxin 4.5 and Wenxin X1. Wenxin 4.5 is a next-generation native multimodal foundation model that jointly models text, images, audio, and video, delivering stronger language understanding, generation, and logical reasoning with fewer hallucinations.

Key technical advances include:

FlashMask Dynamic Attention Mask: accelerates computation of flexible attention masks, improving efficiency for long-sequence modeling and multi-turn interaction (see the mask sketch after this list).

Multimodal Heterogeneous Expert Extension: builds modality-specific experts with an adaptive loss that balances gradient contributions across modalities (see the expert sketch after this list).

Spatio-Temporal Representation Compression: efficiently compresses semantic representations of images and videos, boosting training speed for long video inputs.

Knowledge-Centric Large-Scale Data Construction: uses hierarchical sampling, data compression, and targeted synthesis of scarce knowledge points to increase pre-training data density and lower hallucination.

Self-Feedback Post-Training: integrates multiple evaluation signals in an iterative feedback loop, enhancing alignment with human intent.
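
To make the flexible-mask idea concrete, here is a minimal PyTorch sketch that assumes each query's visibility can be described by per-token lower/upper bounds. It illustrates dynamic attention masking in general, not Baidu's FlashMask kernel, which fuses a compact mask representation of this kind into the attention computation so fully masked blocks are skipped entirely.

```python
# Minimal sketch (not Baidu's FlashMask implementation): describe a flexible
# attention mask with per-query visibility bounds instead of a dense
# [seq_len, seq_len] boolean matrix, then apply it in ordinary attention.
import torch
import torch.nn.functional as F

def attention_with_bounds(q, k, v, lower, upper):
    """q, k, v: [seq_len, head_dim]; query i may attend to keys j with
    lower[i] <= j < upper[i]. Plain causal masking is the special case
    lower = 0, upper = arange(seq_len) + 1; multi-turn or document masks
    only change the bounds, not the attention code."""
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)  # [seq_len, seq_len]
    cols = torch.arange(q.shape[0])
    visible = (cols[None, :] >= lower[:, None]) & (cols[None, :] < upper[:, None])
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: a 6-token prefix-LM style mask. Tokens 0-2 are a shared prompt that
# attends bidirectionally within itself; tokens 3-5 are generated causally.
q = k = v = torch.randn(6, 8)
lower = torch.zeros(6, dtype=torch.long)
upper = torch.tensor([3, 3, 3, 4, 5, 6])
print(attention_with_bounds(q, k, v, lower, upper).shape)  # torch.Size([6, 8])
```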
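
And a rough, hypothetical sketch of modality-specific experts with an adaptive loss: the routing-by-modality-tag scheme and the inverse-magnitude weighting below are assumptions made for illustration, not Wenxin 4.5's actual design.

```python
# Hypothetical sketch of "heterogeneous" modality-specific experts: tokens are
# routed to a text or vision expert by their modality tag, and per-modality
# losses are re-weighted so that no single modality dominates shared gradients.
import torch
import torch.nn as nn

class ModalityExperts(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.experts = nn.ModuleDict({
            "text": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
            "vision": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)),
        })

    def forward(self, x, modality):
        # x: [num_tokens, dim]; modality[i] names the expert for token i.
        out = torch.empty_like(x)
        for name, expert in self.experts.items():
            idx = [i for i, m in enumerate(modality) if m == name]
            if idx:
                out[idx] = expert(x[idx])
        return out

def balanced_loss(losses):
    # Adaptive weighting sketch: scale each modality's loss by the inverse of
    # its detached magnitude so gradient contributions stay comparable.
    weights = {m: 1.0 / (l.detach() + 1e-8) for m, l in losses.items()}
    total = sum(weights.values())
    return sum(weights[m] / total * losses[m] for m in losses)
```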

Wenxin 4.5 demonstrates superior multimodal understanding, correctly interpreting combined text‑image questions, extracting key information, and providing detailed solution steps. It also grasps internet memes and satirical cartoons, explaining underlying concepts and humor.

Multimodal capability illustration

Wenxin X1: A Deep‑Thinking Model with Tool Use

Wenxin X1 builds on the multimodal foundation of 4.5 but adds stronger comprehension, planning, reflection, and evolution abilities. It is the first Baidu model capable of autonomously invoking external tools, making it a deep‑thinking system.

Supported tools include advanced search, document Q&A, image understanding, AI drawing, code interpretation, web page reading, TreeMind graph generation, Baidu Scholar lookup, commercial information queries, and franchise information retrieval.
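
Conceptually, tool use in a deep-thinking model follows an emit-action, execute, observe loop. The sketch below is a generic illustration of that loop; the `advanced_search` stub and the JSON schema are invented for the example and do not reflect Qianfan's actual tool-calling API.

```python
# Hypothetical tool-calling loop: the model emits a structured action, the
# runtime executes the named tool, and the observation is fed back into the
# next reasoning step. Tool names and schema are invented for this example.
import json

def advanced_search(query: str) -> str:
    # Stand-in for a real search tool.
    return f"(search results for: {query})"

TOOLS = {"advanced_search": advanced_search}

model_action = json.dumps({"tool": "advanced_search",
                           "arguments": {"query": "Tang dynasty poets"}})

action = json.loads(model_action)
observation = TOOLS[action["tool"]](**action["arguments"])
print(observation)  # (search results for: Tang dynasty poets)
```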

Example: X1 rewrites the classic poem "Han Yao Fu" by inserting historical figures from various Chinese dynasties, first analyzing the original style, then selecting appropriate references, and finally generating text that matches the original intent and tone.

Tool-calling example illustration

Key technologies powering X1 include:

Progressive Reinforcement Learning: applies a staged RL approach to improve performance across creation, search, tool use, and reasoning tasks.

End-to-End Training with Thought and Action Chains: trains the model directly on feedback from tool-driven outcomes, enhancing overall effectiveness.

Unified Reward System: merges multiple reward signals into a single robust feedback mechanism (a minimal sketch follows this list).
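
As a rough illustration of the unified-reward idea, the sketch below combines several reward signals into one scalar via a weighted average; the signal names and the weighting scheme are assumptions, not Baidu's implementation.

```python
# Hypothetical unified reward: merge several reward signals (e.g. correctness,
# helpfulness, format compliance) into a single scalar for RL training.
def unified_reward(signals, weights=None):
    """signals: dict of reward-name -> score; weights: optional dict with the
    same keys -> relative weight. Returns a weighted-average scalar reward."""
    weights = weights or {name: 1.0 for name in signals}
    total = sum(weights.values())
    return sum(weights[name] * score for name, score in signals.items()) / total

print(unified_reward({"correctness": 1.0, "helpfulness": 0.7, "format": 0.9}))
# ~0.867
```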

Platform Access and Pricing

Both models are available on Baidu’s Qianfan large‑model platform. API pricing is low: Wenxin 4.5 costs ¥0.004 per 1,000 input tokens and ¥0.016 per 1,000 output tokens; Wenxin X1 will launch soon with ¥0.002 per 1,000 input tokens and ¥0.008 per 1,000 output tokens.
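
For a sense of scale, a hypothetical Wenxin 4.5 request with 2,000 input tokens and 500 output tokens would cost about ¥0.016 at the listed rates:

```python
# Worked example of the listed Qianfan prices (token counts are illustrative).
INPUT_PRICE_PER_1K = 0.004   # yuan per 1,000 input tokens (Wenxin 4.5)
OUTPUT_PRICE_PER_1K = 0.016  # yuan per 1,000 output tokens (Wenxin 4.5)

cost = 2000 / 1000 * INPUT_PRICE_PER_1K + 500 / 1000 * OUTPUT_PRICE_PER_1K
print(f"{cost:.4f} yuan")  # 0.0160 yuan
```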

The Qianfan platform emphasizes an open, easy‑to‑use, low‑cost environment that accelerates AI application development from concept to production, supporting a wide range of industries.

Future Outlook

2025 is projected to be a year of comprehensive iteration for large‑model technology, with increased investment in AI chips, data centers, and cloud infrastructure to build the next generation of smarter models.
