How AI Is Polluting the Internet: Real Cases and Emerging Risks
The article examines how AI-generated content is flooding the Chinese internet with unverified information, traces concrete cases across Bing, Zhihu, Stack Overflow, and Reddit, and discusses research showing that training models on such data can degrade future AI performance.
AI has become a major source of misinformation on the Chinese internet, with AI-run accounts rapidly producing unverified answers that mislead users.
One example is a Bing response that confidently answered a question about a cable car on Elephant Trunk Mountain, then linked to a reference that turned out to be generated by an AI user named “百变人生”. The account answered questions within one to two minutes, often providing unverified information, and was eventually silenced on Zhihu.
AI pollution sources are not limited to one platform
Similar AI‑generated fake news appears elsewhere: a fabricated story about a murder at a chicken‑steak shop in Zhengzhou, and a false report of a train accident in Gansu, both produced with AI tools for clickbait traffic and profit; authorities have pursued criminal cases against the people behind them.
Internationally, the problem is evident on Stack Overflow, which temporarily disabled AI‑generated answers because the error rate of ChatGPT responses was too high for the community to verify. Reddit also hosts many AI‑driven Q&A bots with uncertain answer quality.
Researchers from Cambridge and Edinburgh highlighted the danger of “data pollution” in an arXiv paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget”. They warned that models trained on AI‑generated content develop irreversible defects, making future high‑quality data scarce.
Just as we fill the oceans with plastic and the atmosphere with CO₂, we are about to fill the internet with garbage.
Experts like Daphne Ippolito of Google Brain note that finding untainted data for future training will become increasingly difficult, and that a feedback loop of low‑quality AI output could cripple AI development.
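The collapse dynamic the paper warns about can be illustrated with a toy experiment (a sketch for intuition only, not the paper's actual method): fit a simple model to data, sample new "data" from that model, refit, and repeat. Sampling error compounds across generations, and the fitted distribution drifts away from the original, losing its tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of recursive training on generated data:
# each generation fits a Gaussian to samples drawn from the
# previous generation's fitted Gaussian. Estimation error
# compounds, and the fitted spread tends to shrink toward zero,
# so later "models" forget the tails of the original distribution.
mu, sigma = 0.0, 1.0          # generation 0: the "real data" distribution
n_samples, n_generations = 50, 500

stds = [sigma]
for _ in range(n_generations):
    samples = rng.normal(mu, sigma, n_samples)  # "train" on generated data
    mu, sigma = samples.mean(), samples.std()   # refit the "model"
    stds.append(sigma)

print(f"initial std: {stds[0]:.3f}, final std: {stds[-1]:.3f}")
```

Running this, the fitted standard deviation collapses over generations: a small-scale analogue of why models trained mostly on AI output gradually lose the rare, diverse content that made the original data valuable.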
Some platforms are beginning to address the issue by implementing policies to limit AI‑generated low‑quality content and developing detection tools to identify AI‑created text.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
