Is Google Gemini Echoing Baidu? A Deep Dive into Model Contamination

The article investigates recent tests showing that Google Gemini sometimes claims to be Baidu's AI, reproduces Baidu‑related responses, and appears to have its Chinese and English corpora contaminated with competitor data, highlighting the challenges of data provenance in large language models.

Java High-Performance Architecture

Gemini Pro vs. Baidu Wenxin Dialogue Test

Recent tests show that, when prompted in Chinese, Gemini sometimes insists that it is "Baidu." When given trigger words such as "Xiaodu" or "Xiaoai," it responds as if it were those assistants and even offers help.

These observations suggest that Gemini's Chinese corpus may have been cleaned using outputs from Baidu's Wenxin model, and its English corpus may similarly contain cleaned OpenAI outputs.
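One way to make the identity claims described above easier to count is to tag each reply with the vendor names it mentions. The following is a minimal sketch, not the author's method: the `IDENTITY_MARKERS` table and the `claimed_identities` helper are hypothetical, and the keyword lists are illustrative assumptions that a real test would need to extend.

```python
from __future__ import annotations

# Hypothetical marker table: keywords a reply might use when naming itself.
# These lists are illustrative assumptions, not data from the original tests.
IDENTITY_MARKERS = {
    "Baidu": ["baidu", "wenxin", "百度", "文心"],
    "OpenAI": ["openai", "chatgpt"],
    "Google": ["google", "gemini"],
}


def claimed_identities(reply: str) -> set[str]:
    """Return the vendors whose markers appear anywhere in a model reply."""
    text = reply.lower()
    return {
        vendor
        for vendor, markers in IDENTITY_MARKERS.items()
        if any(marker in text for marker in markers)
    }
```

Plain substring matching is crude: it cannot tell a reply that claims to be Baidu's model from one that merely mentions Baidu, so flagged replies would still need to be read by hand.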

Gemini API Test (Google AI Studio) – December 16

With the safety level set to low and the temperature set to 0.5, the model was asked to introduce itself and was then questioned about its identity. It consistently produced the same responses.
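The repeated-query check can be sketched as a small harness that sends one prompt several times and counts the distinct replies. This is an illustration under stated assumptions: `ask` is a hypothetical stand-in for whatever function calls the model, and the `SETTINGS` dictionary only records the values reported in the article rather than the parameter names of any real client.

```python
from collections import Counter
from typing import Callable

# Settings reported in the article. How they map onto a real client is left
# to the caller, since `ask` below is a hypothetical stand-in for an API call.
SETTINGS = {"temperature": 0.5, "safety_level": "low"}


def tally_replies(ask: Callable[[str], str], prompt: str, runs: int = 5) -> Counter:
    """Send the same prompt `runs` times and count each distinct reply."""
    counts: Counter = Counter()
    for _ in range(runs):
        reply = ask(prompt)
        # Collapse whitespace so trivial formatting differences are ignored.
        counts[" ".join(reply.split())] += 1
    return counts


def is_consistent(counts: Counter) -> bool:
    """True when every run produced the same normalised reply."""
    return len(counts) == 1


# Usage with a canned stub instead of a live model:
def stub(prompt: str) -> str:
    return "I am a large language model."


counts = tally_replies(stub, "Please introduce yourself.", runs=3)
print(is_consistent(counts))  # True
```

Passing the model call in as a function keeps the harness independent of any particular SDK, so the same tally can be run against a live endpoint or against saved transcripts.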

When asked about Baidu and its founder Li Yanhong (Robin Li), Gemini gave largely positive remarks. The next day, however, the same queries no longer reproduced those answers, and the model began inserting negative information about Baidu and its founder. This suggests that a fix had been applied but was incomplete.
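A day-over-day comparison like this one can be made less anecdotal by saving the answers from each run and flagging the prompts whose answers changed. The sketch below assumes the answers are stored as plain dictionaries keyed by prompt; the `changed_prompts` helper and the 0.8 similarity threshold are hypothetical choices, not part of the original tests.

```python
from difflib import SequenceMatcher


def changed_prompts(day1: dict, day2: dict, threshold: float = 0.8) -> list:
    """Return prompts whose answers differ noticeably between two runs."""
    changed = []
    for prompt, old_answer in day1.items():
        new_answer = day2.get(prompt)
        if new_answer is None:
            changed.append(prompt)  # no answer recorded on the second day
            continue
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        similarity = SequenceMatcher(None, old_answer, new_answer).ratio()
        if similarity < threshold:
            changed.append(prompt)
    return changed
```

Character-level similarity only shows that the wording changed. Deciding whether the tone shifted from positive to negative, as reported here, would still require reading the answers.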

Gemini API Test (Google AI Studio) – December 17

Further testing showed that Gemini still echoed the earlier positive statements about Baidu when asked about Google, suggesting residual contamination.

Additional screenshots suggest that Gemini has also incorporated English-language data from OpenAI, since similar patterns emerge in its responses.

Additional Observations

Attempts to probe the model with more obscure prompts resulted in cryptic or blocked outputs, reinforcing the notion that the model’s training data has been heavily influenced by competitor content.

The author concludes that AI‑generated content is beginning to pollute the internet, with large language models inadvertently reproducing competitor material due to contaminated training corpora.

Written by

Java High-Performance Architecture

Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
