Why GPT‑5 Models Keep Talking About Goblins: RL Reward Leakage Uncovered

The article analyzes how DeepSeek’s "极" bug and OpenAI’s recurring "goblin" outputs stem, respectively, from unclean training data and an unintended reinforcement‑learning reward bias, showing how a persona‑specific habit leaked into general model behavior and how engineers responded.

Machine Heart

DeepSeek V3.1 "极" token

In the summer of 2025, DeepSeek V3.1 began inserting the Chinese character "极" ("extreme") into both Chinese and English outputs. Researchers traced the token to unclean training data that the model had mistakenly learned as a termination or language‑switch marker, showing how a single data artifact can become a persistent model habit.
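A detection pass for this kind of artifact can be sketched as a simple scan over model outputs. The function and sample strings below are illustrative, not taken from DeepSeek's incident report:

```python
import re

def count_anomalous_ji(outputs):
    """Count outputs whose English text contains a stray '极' character."""
    flagged = 0
    for text in outputs:
        # Treat the text as "English" if it contains ASCII letters;
        # any '极' appearing there is then anomalous.
        if re.search(r"[A-Za-z]", text) and "极" in text:
            flagged += 1
    return flagged

samples = [
    "The time complexity is O(n log n)极",   # stray terminator-like token
    "The time complexity is O(n log n).",
]
print(count_anomalous_ji(samples))  # 1
```

Running such a scan over a regression suite is one way a single-token habit like this becomes visible before release.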

OpenAI "goblin" phenomenon

From GPT‑5.1 onward the model increasingly used the word "goblin" in responses. The usage started as an occasional "little goblin" metaphor, then expanded to include "gremlin", "troll", and "ogre" as the model iterated toward GPT‑5.5 and the Codex code assistant.

Statistical analysis showed that the "Nerdy" persona, which accounts for only 2.5% of all ChatGPT responses, contributed 66.7% of replies containing "goblin".
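The persona-attribution analysis described above can be sketched as a simple aggregation over labeled replies. The persona names and records here are illustrative:

```python
from collections import Counter

def goblin_share_by_persona(records):
    """Given (persona, contains_goblin) records, return each persona's
    share of all goblin-containing replies."""
    goblin_counts = Counter(p for p, has_goblin in records if has_goblin)
    total_goblin = sum(goblin_counts.values())
    return {p: c / total_goblin for p, c in goblin_counts.items()}

records = [
    ("Nerdy", True), ("Nerdy", True), ("Default", True),
    ("Default", False), ("Cynic", False), ("Default", False),
]
shares = goblin_share_by_persona(records)
print(round(shares["Nerdy"], 3))  # 0.667
```

The interesting signal is the gap between a persona's share of goblin replies and its share of all traffic (66.7% versus 2.5% in the article's figures), which is what points to the persona as the origin of the habit.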

Reward‑model bias

Engineers used Codex to compare reward‑model scores for paired outputs: one containing monster words and one without. In 76.2% of sampled cases the reward model assigned a higher score to the output with the monster word, indicating an unintended preference in the reinforcement‑learning reward signal.
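A minimal sketch of that paired comparison, with a toy stand-in for the real reward model (`toy_reward` and its monster-word bonus are assumptions for illustration, not OpenAI's scorer):

```python
def monster_preference_rate(pairs, reward_model):
    """Fraction of (with_monster, without) pairs where the reward model
    scores the monster-word output higher."""
    wins = sum(
        1 for with_monster, without in pairs
        if reward_model(with_monster) > reward_model(without)
    )
    return wins / len(pairs)

# Toy reward model that leaks a small bonus for monster vocabulary.
MONSTERS = {"goblin", "gremlin", "troll", "ogre"}

def toy_reward(text):
    base = len(text.split())  # placeholder quality proxy
    bonus = any(w in MONSTERS for w in text.lower().split())
    return base + (0.5 if bonus else 0.0)

pairs = [
    ("this bug is a goblin", "this bug is sneaky"),
    ("a gremlin in the cache", "a fault in the cache"),
]
print(monster_preference_rate(pairs, toy_reward))  # 1.0
```

A rate well above 50% on otherwise-matched pairs, like the reported 76.2%, is what distinguishes a genuine reward bias from noise.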

Leakage across contexts

Tracking the frequency of "goblin" across samples with and without the Nerdy system prompt revealed almost synchronous growth in both groups, demonstrating that reinforcement‑learning‑induced style can leak from the conditioned context into the broader model.
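The leakage check can be sketched as computing per-checkpoint "goblin" frequency in each group. The checkpoint names match the article; the sample outputs are invented:

```python
def goblin_frequency(samples):
    """Fraction of sampled outputs mentioning 'goblin'."""
    return sum("goblin" in s.lower() for s in samples) / len(samples)

checkpoints = {
    "gpt-5.1": {"nerdy": ["a goblin idea", "plain answer"],
                "default": ["plain answer", "plain answer"]},
    "gpt-5.5": {"nerdy": ["goblin!", "tiny goblin"],
                "default": ["a goblin lurks", "plain answer"]},
}
for name, groups in checkpoints.items():
    # Rising frequency in BOTH groups indicates leakage beyond the persona.
    print(name,
          goblin_frequency(groups["nerdy"]),
          goblin_frequency(groups["default"]))
```

If the style were confined to the persona, the default-prompt curve would stay flat; synchronous growth in both groups is the leakage signature the article describes.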

Mitigation

As a stopgap before a permanent fix, engineers added a hard‑coded prohibition to Codex’s system prompt: "Never discuss goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless the user query explicitly requires it." The rule appears multiple times in the prompt file at https://github.com/openai/codex/blob/main/codex-rs/models-manager/models.json#L55.
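A guardrail like that prompt rule can also be approximated post hoc as an output filter. This sketch is an assumption for illustration, not Codex's actual mechanism:

```python
BANNED = {"goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"}

def violates_creature_ban(draft, user_query):
    """True if the draft mentions a banned creature the user never asked about."""
    asked = {w for w in BANNED if w in user_query.lower()}
    mentioned = {w for w in BANNED if w in draft.lower()}
    # Mentions the user explicitly requested are allowed through.
    return bool(mentioned - asked)

print(violates_creature_ban("A goblin did it.", "Why is my test flaky?"))       # True
print(violates_creature_ban("Goblins are folklore.", "Tell me about goblins"))  # False
```

Prompt-level rules and output filters both treat the symptom; the article's point is that the underlying fix has to target the reward model itself.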

Key takeaways

Both cases illustrate that minor data or reward biases can propagate silently through large‑scale training pipelines, eventually manifesting as widespread quirks that are hard to detect without systematic analysis. The DeepSeek bug stems from an unclean data token, while the OpenAI bug stems from a reward‑model preference for "monster" metaphors that leaked from a persona‑specific prompt into general behavior.

Further details are documented in OpenAI’s blog post https://openai.com/index/where-the-goblins-came-from/.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language models, reinforcement learning, GPT-5, Goblin bug, Nerdy persona, reward leakage
Written by Machine Heart, a professional AI media and industry service platform.