How Baidu Is Shaping Text‑to‑Image AI: Trends, Challenges, and Future Outlook
In this interview, Baidu's search architect Tianbao explains the evolution of text‑to‑image generation since 2022, discusses data preparation, model quality, prompt engineering, multi‑style support, evaluation methods, and predicts when fully AI‑generated video and movies might become mainstream.
Background
Since 2022, AIGC (AI‑generated content) has driven a new wave of artificial‑intelligence applications, with text‑to‑image (often called AI painting) achieving major breakthroughs. Baidu’s search architect Tianbao shares the technology’s development, its practical usage in Baidu Search, and future directions.
Key Highlights
The workflow has shifted from pure image search to a combined "search + generation" approach, encouraging users to express more precise visual needs.
Improving Chinese language understanding requires careful collection and cleaning of Chinese‑semantic corpora.
Removing low‑quality samples and constructing high‑value image‑text pairs are essential for effective alignment.
Baidu Search now supports thousands of distinct visual styles to satisfy diverse user demands.
Adhering to aesthetic standards guides both model architecture and algorithmic optimization.
Technical Development Timeline
2022 is widely regarded as the "year of text‑to‑image". Open‑source models such as Stable Diffusion 1.5 and Disco Diffusion (the latter known for landscape‑style output) gained rapid traction, while Midjourney released v1 (impressive overall quality) and later v3 (improved human portrait generation). The open‑source surge sparked rapid ecosystem growth and downstream applications.
Baidu’s Approach
Baidu integrates text‑to‑image generation into its search product, allowing users to type prompts like "draw an angry cat" and receive generated images directly in the app. The system supports editing operations such as inpainting and outpainting, enabling users to replace or extend elements (e.g., swapping a hat‑wearing cat for a dog).
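The editing flow can be pictured as a mask over the canvas: pixels inside the mask are regenerated (inpainting), pixels outside are preserved, and outpainting extends the canvas and masks the new border. A minimal sketch in Python; the helper names are hypothetical illustrations, not Baidu's API:

```python
def make_outpaint_mask(width, height, pad):
    """Extend a width x height canvas by `pad` pixels on each side.
    Returns (new_width, new_height, mask), where mask[y][x] is True
    for pixels the generator must fill (the new border region)."""
    new_w, new_h = width + 2 * pad, height + 2 * pad
    mask = [[not (pad <= x < pad + width and pad <= y < pad + height)
             for x in range(new_w)] for y in range(new_h)]
    return new_w, new_h, mask

def inpaint_region(mask, x0, y0, x1, y1):
    """Mark a rectangle of an existing all-False mask for regeneration,
    e.g. the bounding box of the cat the user wants replaced by a dog."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            mask[y][x] = True
    return mask
```

In both cases the prompt plus the mask fully specify the edit: the model only synthesizes masked pixels, which is what keeps the untouched parts of the image stable.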
To handle Chinese semantics, Baidu leverages its massive web‑wide Chinese corpus, performs extensive data cleaning, balances sample quality, and builds specialized operators for de‑duplication and aesthetic assessment. High‑quality datasets are further refined using relevance‑model scores (e.g., CLIPScore) to filter mismatched image‑text pairs.
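The filtering stage described above can be sketched as a small pipeline: de‑duplicate images, then drop pairs whose relevance score falls below a threshold. This is an illustrative sketch, assuming each pair already carries a precomputed CLIP‑style relevance score; the field names and threshold are assumptions, not Baidu's production operators:

```python
import hashlib

def clean_pairs(pairs, min_score=0.25):
    """Filter image-text pairs for training.
    Each pair is a dict: {"image_bytes": ..., "caption": ..., "score": ...},
    where `score` is a precomputed relevance score (e.g., CLIPScore).
    Drops exact-duplicate images and low-relevance (mismatched) captions."""
    seen = set()
    kept = []
    for p in pairs:
        digest = hashlib.sha256(p["image_bytes"]).hexdigest()  # de-dup key
        if digest in seen:
            continue               # duplicate image, skip
        if p["score"] < min_score:
            continue               # caption likely mismatched, drop
        seen.add(digest)
        kept.append(p)
    return kept
```

A real pipeline would hash perceptually (near‑duplicates) and layer in aesthetic scores, but the shape is the same: cheap, scalable operators that decide keep/drop per pair.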
Prompt Engineering and Style Control
The platform offers over a thousand predefined style options (comic, watercolor, metal, etc.) and supports multi‑style prompts. Users can combine style descriptors (e.g., "ink‑wash + fisheye") but must manage controllability, as prompt order can bias results. Baidu’s research focuses on making prompts more expressive while preserving content consistency.
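One way to picture multi‑style prompting: the product composes the user's subject with style descriptors, and because many text encoders weight earlier tokens more heavily, descriptor order can bias the result. A hypothetical composer, not Baidu's actual prompt template:

```python
def compose_prompt(subject, styles, style_first=False):
    """Join a subject with style descriptors into one prompt string.
    Leading tokens tend to dominate in many text encoders, so the
    default keeps the subject first to preserve content consistency."""
    if not styles:
        return subject
    style_clause = ", ".join(styles)
    if style_first:
        return f"{style_clause}, {subject}"   # biases toward style
    return f"{subject}, {style_clause}"       # biases toward content

print(compose_prompt("an angry cat", ["ink-wash", "fisheye lens"]))
# an angry cat, ink-wash, fisheye lens
```

Exposing `style_first` as a knob is one simple way to trade style fidelity against content fidelity without changing the model.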
Evaluation and Feedback Loop
Feedback is collected through user actions such as image clicks, enlargements, downloads, likes, and comments. While Reinforcement Learning from Human Feedback (RLHF) is valuable, human judgments are noisy; therefore, Baidu combines behavioral signals with automated metrics (e.g., relevance and aesthetic scores) to prioritize high‑quality outputs and reduce manual evaluation workload.
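The prioritization step can be sketched as a weighted blend of behavioral and automated signals. The weights, signal names, and normalization below are illustrative assumptions, not Baidu's production formula:

```python
# Illustrative weights: strong actions (download, like) count more than clicks.
BEHAVIOR_WEIGHTS = {"click": 0.5, "enlarge": 1.0, "download": 2.0, "like": 2.0}

def quality_score(events, relevance, aesthetic, impressions):
    """Blend noisy behavioral feedback with automated metric scores.
    events: counts per action, e.g. {"click": 10, "download": 2};
    relevance, aesthetic: automated scores in [0, 1];
    impressions: how often the image was shown (normalizes engagement
    so heavily exposed images don't dominate)."""
    engagement = sum(BEHAVIOR_WEIGHTS.get(k, 0.0) * n for k, n in events.items())
    engagement_rate = engagement / max(impressions, 1)
    # Human feedback is noisy, so automated metrics keep half the weight.
    return 0.5 * min(engagement_rate, 1.0) + 0.25 * relevance + 0.25 * aesthetic
```

Ranking outputs by such a score lets automated metrics do the bulk triage, with human evaluation reserved for the contested middle of the distribution.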
Future Outlook
Beyond static images, Baidu is exploring video generation. Current challenges include maintaining temporal consistency and higher computational demands. Experts anticipate that within one to two years, breakthroughs comparable to Stable Diffusion will enable longer, coherent AI‑generated video clips, potentially leading to fully AI‑generated movies.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.