How Baidu Is Shaping Text‑to‑Image AI: Trends, Challenges, and Future Outlook
In this interview, Baidu's search architect Tianbao explains the evolution of text‑to‑image generation since 2022, discusses data preparation, model quality, prompt engineering, multi‑style support, evaluation methods, and predicts when fully AI‑generated video and movies might become mainstream.
Background
Since 2022, AIGC (AI‑generated content) has driven a new wave of artificial‑intelligence applications, with text‑to‑image (often called AI painting) achieving major breakthroughs. Baidu’s search architect Tianbao shares the technology’s development, its practical usage in Baidu Search, and future directions.
Key Highlights
The workflow has shifted from pure image search to a combined "search + generation" approach, encouraging users to express more precise visual needs.
Improving Chinese language understanding requires careful collection and cleaning of Chinese‑semantic corpora.
Removing low‑quality samples and constructing high‑value image‑text pairs are essential for effective alignment.
Baidu Search now supports thousands of distinct visual styles to satisfy diverse user demands.
Adhering to aesthetic standards guides both model architecture and algorithmic optimization.
Technical Development Timeline
2022 is widely regarded as the "year of text‑to‑image". Open‑source models such as Stable Diffusion 1.5 and Disco Diffusion (the latter known for landscape‑style output) gained rapid traction, while Midjourney released v1 (impressive overall quality) and later v3 (improved human portrait generation). The open‑source surge sparked rapid ecosystem growth and downstream applications.
Baidu’s Approach
Baidu integrates text‑to‑image generation into its search product, allowing users to type prompts like "draw an angry cat" and receive generated images directly in the app. The system supports editing operations such as inpainting and outpainting, enabling users to replace or extend elements (e.g., swapping a hat‑wearing cat for a dog).
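The editing flow can be pictured as a mask over the canvas: pixels inside the mask are regenerated (inpainting), pixels outside are preserved, and outpainting extends the canvas and masks the new border. A minimal sketch in Python; the helper names are hypothetical illustrations, not Baidu's API:

```python
def make_outpaint_mask(width, height, pad):
    """Extend a width x height canvas by `pad` pixels on each side.
    Returns (new_width, new_height, mask), where mask[y][x] is True
    for pixels the generator must fill (the new border region)."""
    new_w, new_h = width + 2 * pad, height + 2 * pad
    mask = [[not (pad <= x < pad + width and pad <= y < pad + height)
             for x in range(new_w)] for y in range(new_h)]
    return new_w, new_h, mask

def inpaint_region(mask, x0, y0, x1, y1):
    """Mark a rectangle of an existing all-False mask for regeneration,
    e.g. the bounding box of the cat the user wants replaced by a dog."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = True
    return mask
```

In both cases the prompt plus the mask fully specify the edit: the model only synthesizes masked pixels, which is what keeps the untouched parts of the image stable.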
To handle Chinese semantics, Baidu leverages its massive web‑wide Chinese corpus, performs extensive data cleaning, balances sample quality, and builds specialized operators for de‑duplication and aesthetic assessment. High‑quality datasets are further refined using relevance‑model scores (e.g., CLIPScore) to filter mismatched image‑text pairs.
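The filtering stage described above can be sketched as a small pipeline: de‑duplicate images, then drop pairs whose relevance score falls below a threshold. This is an illustrative sketch, assuming each pair already carries a precomputed CLIP‑style relevance score; the field names and threshold are assumptions, not Baidu's production operators:

```python
import hashlib

def clean_pairs(pairs, min_score=0.25):
    """Filter image-text pairs for training.
    Each pair is a dict: {"image_bytes": ..., "caption": ..., "score": ...},
    where `score` is a precomputed relevance score (e.g., CLIPScore).
    Drops exact-duplicate images and low-relevance (mismatched) captions."""
    seen = set()
    kept = []
    for p in pairs:
        digest = hashlib.sha256(p["image_bytes"]).hexdigest()  # de-dup key
        if digest in seen:
            continue               # duplicate image, skip
        if p["score"] < min_score:
            continue               # caption likely mismatched, drop
        seen.add(digest)
        kept.append(p)
    return kept
```

A real pipeline would hash perceptually (near‑duplicates) and layer in aesthetic scores, but the shape is the same: cheap, scalable operators that decide keep/drop per pair.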
Prompt Engineering and Style Control
The platform offers over a thousand predefined style options (comic, watercolor, metal, etc.) and supports multi‑style prompts. Users can combine style descriptors (e.g., "ink‑wash + fisheye") but must manage controllability, as prompt order can bias results. Baidu’s research focuses on making prompts more expressive while preserving content consistency.
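One way to picture multi‑style prompting: the product composes the user's subject with style descriptors, and because many text encoders weight earlier tokens more heavily, descriptor order can bias the result. A hypothetical composer, not Baidu's actual prompt template:

```python
def compose_prompt(subject, styles, style_first=False):
    """Join a subject with style descriptors into one prompt string.
    Leading tokens tend to dominate in many text encoders, so the
    default keeps the subject first to preserve content consistency."""
    if not styles:
        return subject
    style_clause = ", ".join(styles)
    if style_first:
        return f"{style_clause}, {subject}"   # biases toward style
    return f"{subject}, {style_clause}"       # biases toward content

print(compose_prompt("an angry cat", ["ink-wash", "fisheye lens"]))
# an angry cat, ink-wash, fisheye lens
```

Exposing `style_first` as a knob is one simple way to trade style fidelity against content fidelity without changing the model.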
Evaluation and Feedback Loop
Feedback is collected through user actions such as image clicks, enlargements, downloads, likes, and comments. While Reinforcement Learning from Human Feedback (RLHF) is valuable, human judgments are noisy; therefore, Baidu combines behavioral signals with automated metrics (e.g., relevance and aesthetic scores) to prioritize high‑quality outputs and reduce manual evaluation workload.
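The prioritization step can be sketched as a weighted blend of behavioral and automated signals. The weights, signal names, and normalization below are illustrative assumptions, not Baidu's production formula:

```python
# Illustrative weights: strong actions (download, like) count more than clicks.
BEHAVIOR_WEIGHTS = {"click": 0.5, "enlarge": 1.0, "download": 2.0, "like": 2.0}

def quality_score(events, relevance, aesthetic, impressions):
    """Blend noisy behavioral feedback with automated metric scores.
    events: counts per action, e.g. {"click": 10, "download": 2};
    relevance, aesthetic: automated scores in [0, 1];
    impressions: how often the image was shown (normalizes engagement
    so heavily exposed images don't dominate)."""
    engagement = sum(BEHAVIOR_WEIGHTS.get(k, 0.0) * n for k, n in events.items())
    engagement_rate = engagement / max(impressions, 1)
    # Human feedback is noisy, so automated metrics keep half the weight.
    return 0.5 * min(engagement_rate, 1.0) + 0.25 * relevance + 0.25 * aesthetic
```

Ranking outputs by such a score lets automated metrics do the bulk triage, with human evaluation reserved for the contested middle of the distribution.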
Future Outlook
Beyond static images, Baidu is exploring video generation. Current challenges include maintaining temporal consistency and higher computational demands. Experts anticipate that within one to two years, breakthroughs comparable to Stable Diffusion will enable longer, coherent AI‑generated video clips, potentially leading to fully AI‑generated movies.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.