Why GPT‑5 Models Keep Talking About Goblins: RL Reward Leakage Uncovered
The article analyzes two quirks with a common theme: DeepSeek’s "极" bug, traced to unclean training data, and OpenAI’s recurring "goblin" outputs, traced to an unintended reinforcement‑learning reward bias. It shows how a persona‑specific habit leaked into general model behavior and how engineers responded.
DeepSeek V3.1 "极" token
In summer 2025 DeepSeek V3.1 began inserting the Chinese character "极" ("extreme") into both Chinese and English outputs. Researchers traced the token to a stretch of unclean training data that the model had mistakenly learned as a termination or language‑switch marker, showing how a single data artifact can become a persistent model habit.
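To make the diagnosis concrete, here is a minimal sketch of how one might surface such an artifact: it flags tokens that end sequences far more often than their overall frequency predicts. The corpus format, the `min_count` cutoff, and the 0.5 ratio are assumptions for illustration, not details from the incident.

```python
# Sketch: flag tokens over-represented at sequence boundaries, the kind
# of artifact a model could mislearn as a termination marker.
from collections import Counter

def boundary_suspects(corpus: list[list[str]], min_count: int = 100) -> list[str]:
    total = Counter()   # how often each token appears anywhere
    at_end = Counter()  # how often each token ends a sequence
    for tokens in corpus:
        total.update(tokens)
        if tokens:
            at_end[tokens[-1]] += 1
    # A frequent token that mostly appears at sequence ends is a
    # candidate data artifact, in the spirit of "极".
    return [
        tok for tok, n in total.items()
        if n >= min_count and at_end[tok] / n > 0.5
    ]
```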
OpenAI "goblin" phenomenon
From GPT‑5.1 onward the model increasingly used the word "goblin" in responses. The usage began as an occasional "little goblin" metaphor, then expanded to "gremlin", "troll", and "ogre" as the model iterated toward GPT‑5.5 and the Codex code assistant.
Statistical analysis showed that the "Nerdy" persona, which accounts for only 2.5% of all ChatGPT responses, contributed 66.7% of replies containing "goblin".
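A breakdown like that reduces to two simple counts per persona. The sketch below assumes a hypothetical list of (persona, reply) pairs rather than OpenAI’s actual logging format; with the article’s numbers, the "Nerdy" row would show roughly 0.025 traffic share and 0.667 goblin share.

```python
# Sketch: how much does each persona contribute to "goblin" replies?
from collections import Counter

def persona_share(samples: list[tuple[str, str]], word: str = "goblin") -> dict:
    total = Counter()      # replies per persona
    with_word = Counter()  # replies per persona containing the word
    for persona, text in samples:
        total[persona] += 1
        if word in text.lower():
            with_word[persona] += 1
    n_total = sum(total.values())
    n_word = sum(with_word.values()) or 1  # avoid division by zero
    return {
        p: {
            "share_of_traffic": total[p] / n_total,
            "share_of_goblin_replies": with_word[p] / n_word,
        }
        for p in total
    }
```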
Reward‑model bias
Engineers used Codex to compare reward‑model scores for paired outputs: one containing a monster word and one without. In 76.2% of sampled cases the reward model assigned a higher score to the output with the monster word, indicating an unintended preference baked into the reinforcement‑learning reward signal.
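The comparison boils down to a paired win rate. A minimal sketch, assuming a hypothetical `reward_model.score` interface (the article does not describe the actual scoring API):

```python
# Sketch: fraction of pairs where the monster-word variant scores higher.
def monster_win_rate(reward_model, pairs: list[tuple[str, str]]) -> float:
    """pairs: (with_monster, without_monster) output texts."""
    wins = sum(
        reward_model.score(with_m) > reward_model.score(without_m)
        for with_m, without_m in pairs
    )
    return wins / len(pairs)  # the article reports 76.2% on sampled cases
```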
Leakage across contexts
Tracking the frequency of "goblin" across samples with and without the Nerdy system prompt revealed almost synchronous growth in both groups, demonstrating that reinforcement‑learning‑induced style can leak from the conditioned context into the broader model.
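One way to reproduce such a tracking analysis is to compute the "goblin" rate per model version in both groups and compare the curves; near‑parallel growth is the leakage signal. The checkpoint/group data structure below is an assumption for illustration.

```python
# Sketch: "goblin" frequency per checkpoint, with vs. without the
# Nerdy system prompt. Parallel growth across groups suggests the
# RL-induced style has leaked beyond the conditioned context.
import re

GOBLIN = re.compile(r"\bgoblins?\b", re.IGNORECASE)

def goblin_rate(replies: list[str]) -> float:
    return sum(bool(GOBLIN.search(r)) for r in replies) / len(replies)

def leakage_curve(checkpoints: dict[str, dict[str, list[str]]]) -> dict:
    # checkpoints: {"gpt-5.1": {"nerdy": [...], "default": [...]}, ...}
    return {
        name: {group: goblin_rate(replies) for group, replies in groups.items()}
        for name, groups in checkpoints.items()
    }
```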
Mitigation
Before a permanent fix landed, engineers added a hard‑coded prohibition to Codex’s system prompt: "Never discuss goblins, gremlins, raccoons, trolls, ogres, pigeons, or other creatures unless the user query explicitly requires it." The rule appears multiple times in the prompt file at https://github.com/openai/codex/blob/main/codex-rs/models-manager/models.json#L55.
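A regression check mirroring that rule could look like the sketch below. The creature list comes from the quoted prohibition; the function and its matching logic are assumptions, not OpenAI’s actual enforcement code.

```python
# Sketch: flag replies that mention a listed creature the user never
# asked about, per the system-prompt rule quoted above.
import re

CREATURES = ["goblin", "gremlin", "raccoon", "troll", "ogre", "pigeon"]
PATTERN = re.compile(r"\b(" + "|".join(CREATURES) + r")s?\b", re.IGNORECASE)

def violates_rule(user_query: str, reply: str) -> bool:
    asked_for = {m.lower() for m in PATTERN.findall(user_query)}
    mentioned = {m.lower() for m in PATTERN.findall(reply)}
    # A violation is any creature in the reply that the query
    # did not explicitly mention.
    return bool(mentioned - asked_for)
```

A check like this is cheap to run over sampled traffic, which matters here: the article’s point is that these quirks stay invisible without systematic measurement.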
Key takeaways
Both cases illustrate that minor data or reward biases can propagate silently through large‑scale training pipelines, eventually manifesting as widespread quirks that are hard to detect without systematic analysis. The DeepSeek bug stems from an unclean data token, while the OpenAI bug stems from a reward‑model preference for "monster" metaphors that leaked from a persona‑specific prompt into general behavior.
Further details are documented in OpenAI’s blog post https://openai.com/index/where-the-goblins-came-from/.