Why Large Language Models Believe 9.11 > 9.9: Prompting, Tokenizer Effects, and Recent Findings
The article examines why leading large language models such as GPT‑4o, Gemini Advanced, and Claude 3.5 incorrectly claim that 9.11 is larger than 9.9, analyzes tokenization and prompting strategies that cause the error, and discusses recent research and OpenAI model updates.
Prompt engineer Riley Goodside discovered that several mainstream large language models (GPT‑4o, Google Gemini Advanced, Claude 3.5 Sonnet) consistently answer that 9.11 is larger than 9.9, a simple numeric comparison that most humans get right.
Goodside, a senior prompt engineer at Scale AI, reproduced the mistake by asking the question directly in English and then in Chinese, finding that most models failed unless the options were placed before the question or the wording was altered.
Various Chinese models were tested: Kimi and ChatGLM gave wrong answers, while Tencent Yuanbao and ByteDance Doubao produced correct results, with Doubao even explaining the comparison method.
The root cause is traced to tokenization: each decimal number is split into separate tokens (e.g., "9", ".", "11"), and the fragment "11" is treated as a larger value than "9" (its token also carries a higher ID), leading the model to infer that 11 > 9 and therefore that 9.11 > 9.9.
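The failure mode described above can be reproduced without any model at all. The sketch below (an illustration of the article's explanation, not any model's actual internals) compares two decimal strings the way the split tokens suggest, treating the digits after the dot as standalone integers, and contrasts it with a correct numeric comparison:

```python
def tokenwise_compare(a: str, b: str) -> str:
    """Flawed comparison: split on the dot and compare the fragments
    as whole numbers, so "11" beats "9"."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b
    # Bug by design: fractional parts compared as integers, ignoring
    # place value (0.11 vs 0.9).
    return a if int(a_frac) > int(b_frac) else b


def numeric_compare(a: str, b: str) -> str:
    """Correct comparison: parse both strings as floats."""
    return a if float(a) > float(b) else b


print(tokenwise_compare("9.11", "9.9"))  # the flawed path picks 9.11
print(numeric_compare("9.11", "9.9"))    # the correct path picks 9.9
```

The flawed path mirrors the tokenized view of the numbers; once place value is restored by parsing the whole string, the answer flips.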
Providing explicit context that the numbers are double‑precision floating‑point values or rearranging the prompt (placing options first) helps the models arrive at the correct answer.
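The two workarounds mentioned here, placing the candidate values before the question and declaring an explicit numeric type, can be expressed as simple prompt templates. The exact wording below is an illustrative assumption, not Goodside's verbatim prompts:

```python
def options_first_prompt(a: str, b: str) -> str:
    """Rearranged prompt: the candidate values appear before the question."""
    return f"{a}, {b}: which of these two numbers is larger?"


def typed_context_prompt(a: str, b: str) -> str:
    """Explicit-context prompt: state that the values are floating-point."""
    return (f"Treat {a} and {b} as double-precision floating-point values. "
            f"Which one is larger?")


print(options_first_prompt("9.11", "9.9"))
print(typed_context_prompt("9.11", "9.9"))
```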
Prompting techniques such as Zero‑shot Chain‑of‑Thought (CoT) can solve the problem, whereas role‑playing prompts are less effective, a finding supported by a study of over 1,500 papers showing diminishing returns for role‑play prompting.
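Zero-shot CoT amounts to appending a reasoning cue to the question; the canonical cue is "Let's think step by step" from Kojima et al., though the wrapper function here is an illustrative sketch rather than a standard API:

```python
def zero_shot_cot(question: str) -> str:
    """Append the zero-shot Chain-of-Thought cue to a question."""
    return f"{question}\nLet's think step by step."


prompt = zero_shot_cot("Which is larger, 9.11 or 9.9?")
print(prompt)
```

Sending the augmented prompt encourages the model to write out the place-value comparison before committing to an answer, which is where the unprompted models go wrong.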
The article also notes a Reuters report that OpenAI is testing a new internal model (codenamed “Strawberry”) that scores over 90% on the MATH benchmark, though it remains unclear whether this model can correctly handle the 9.11 vs 9.9 comparison without additional prompting.
IT Services Circle