Why DeepSeek V3.1 Randomly Inserts the Chinese Character “极” – Token Bug Explained
DeepSeek’s latest V3.1 model unexpectedly injects the Chinese character “极” into generated text. The stray character breaks code compilation and JSON parsing and undermines academic writing. Users have traced the issue to a pair of adjacent token IDs and offer two main hypotheses: dataset contamination or a shortcut learned by the model.
Several users have reported that DeepSeek V3.1 suddenly inserts the Chinese character “极”, in either its simplified form (“极”) or its traditional form (“極”), into generated text without warning. This stray character causes compilation failures and JSON format errors, and it undermines the rigor of academic writing.
The issue also appears, albeit less frequently, in DeepSeek’s official Playground, indicating it is not limited to third‑party API platforms.
Root Cause Hypotheses
Users’ analysis of the vocabulary indicates that the token ID for “极” is 2577, while the token ID for the commonly used ellipsis “…” is 2576. The two tokens are therefore adjacent in the model’s vocabulary, which is the starting point for both hypotheses below.
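The adjacency claim is simple arithmetic on the two IDs. As a minimal sketch, the check below uses a hand-written dictionary holding only the two IDs reported above; the dictionary and the helper name are illustrative assumptions, not DeepSeek's actual tokenizer, and the IDs have not been verified here.

```python
# Token IDs as reported in this article; NOT verified against the real tokenizer.
REPORTED_IDS = {"…": 2576, "极": 2577}


def are_adjacent(vocab, a, b):
    """Return True when the IDs of tokens a and b differ by exactly one."""
    return abs(vocab[a] - vocab[b]) == 1


print(are_adjacent(REPORTED_IDS, "…", "极"))  # True for the reported IDs
```

With access to the real tokenizer, the same comparison could be run against its vocabulary instead of the hard-coded dictionary.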
1. Dataset Contamination
During data cleaning, some entries containing special or abnormal characters may not have been fully filtered out, so stray occurrences of “极” remained in the training set and the model may have learned to reproduce them.
2. Model Shortcut
The model may have learned a shortcut during training, mistakenly selecting the neighboring token in certain contexts. Once triggered, the bug seems “addictive,” with the frequency of “极” increasing in subsequent interactions.
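One way to check the "addictive" observation on your own logs is to count the stray character per response across a conversation. This is only a sketch: the function name is hypothetical and the sample responses are made-up strings for illustration, not real model output.

```python
# Simplified and traditional forms of the character described in this article.
STRAY_CHARS = ("极", "極")


def stray_counts_per_turn(responses, stray=STRAY_CHARS):
    """Return, for each response in order, how many stray characters it contains."""
    return [sum(text.count(ch) for ch in stray) for text in responses]


# Made-up example responses, for illustration only.
sample = ["def f(): pass", "x = 1极", "极 y = 極2 极"]
print(stray_counts_per_turn(sample))  # [0, 1, 3]
```

A rising sequence of counts within one session would be consistent with the reported behavior, though it would not by itself confirm the shortcut hypothesis.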
Impact Scope
Code generation: random Chinese characters cause compilation failures.
API calls: JSON and other structured outputs break.
Academic writing: precision and professionalism are compromised.
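Until a fix ships, callers that expect structured output can guard against the stray character before using a response. The sketch below is an assumed client-side workaround, not something DeepSeek or CodeBuddy recommends; the function names are hypothetical. It checks both JSON validity and the presence of the character, because a stray “极” inside a string value still parses as valid JSON.

```python
import json

# Simplified and traditional forms of the character described in this article.
STRAY_CHARS = ("极", "極")


def find_stray(text, stray=STRAY_CHARS):
    """Return (index, char) pairs for every stray character found in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in stray]


def check_json_output(raw):
    """Parse raw model output as JSON and report stray characters.

    Returns (parsed, hits): parsed is None when the text is not valid JSON,
    and hits lists the positions of any stray characters.
    """
    hits = find_stray(raw)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None
    return parsed, hits


parsed, hits = check_json_output('{"a": 极1}')
print(parsed, hits)  # None [(6, '极')]
```

This check produces false positives for output that legitimately contains Chinese text using “极”, so it suits code and ASCII-only JSON better than prose.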
Tencent Cloud CodeBuddy has contacted the DeepSeek team and plans to include a fix in the next version.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
