Why Large Language Models Believe 9.11 > 9.9: Prompting, Tokenizer Effects, and Recent Findings
The article examines why leading large language models such as GPT‑4o, Gemini Advanced, and Claude 3.5 incorrectly claim that 9.11 is larger than 9.9, analyzes tokenization and prompting strategies that cause the error, and discusses recent research and OpenAI model updates.
Prompt engineer Riley Goodside discovered that several mainstream large language models (GPT‑4o, Google Gemini Advanced, Claude 3.5 Sonnet) consistently answer that 9.11 is larger than 9.9, a simple numeric comparison that most humans get right.
Goodside, a senior prompt engineer at Scale AI, reproduced the mistake by asking the question directly in English and then in Chinese, finding that most models failed unless the options were placed before the question or the wording was altered.
Various Chinese models were tested: Kimi and ChatGLM gave wrong answers, while Tencent Yuanbao and ByteDance Doubao produced correct results, with Doubao even explaining the comparison method.
The root cause is traced to tokenization: each decimal number is split into separate tokens (e.g., "9", ".", "11"), and the fragment "11" is treated as a larger value than "9" (its token also carries a higher ID), leading the model to infer that 11 > 9 and therefore that 9.11 > 9.9.
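The failure mode described above can be reproduced without any model at all. The sketch below (an illustration of the article's explanation, not any model's actual internals) compares two decimal strings the way the split tokens suggest, treating the digits after the dot as standalone integers, and contrasts it with a correct numeric comparison:

```python
def tokenwise_compare(a: str, b: str) -> str:
    """Flawed comparison: split on the dot and compare the fragments
    as whole numbers, so "11" beats "9"."""
    a_int, a_frac = a.split(".")
    b_int, b_frac = b.split(".")
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b
    # Bug by design: fractional parts compared as integers, ignoring
    # place value (0.11 vs 0.9).
    return a if int(a_frac) > int(b_frac) else b


def numeric_compare(a: str, b: str) -> str:
    """Correct comparison: parse both strings as floats."""
    return a if float(a) > float(b) else b


print(tokenwise_compare("9.11", "9.9"))  # the flawed path picks 9.11
print(numeric_compare("9.11", "9.9"))    # the correct path picks 9.9
```

The flawed path mirrors the tokenized view of the numbers; once place value is restored by parsing the whole string, the answer flips.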
Providing explicit context that the numbers are double‑precision floating‑point values or rearranging the prompt (placing options first) helps the models arrive at the correct answer.
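The two workarounds mentioned here, placing the candidate values before the question and declaring an explicit numeric type, can be expressed as simple prompt templates. The exact wording below is an illustrative assumption, not Goodside's verbatim prompts:

```python
def options_first_prompt(a: str, b: str) -> str:
    """Rearranged prompt: the candidate values appear before the question."""
    return f"{a}, {b}: which of these two numbers is larger?"


def typed_context_prompt(a: str, b: str) -> str:
    """Explicit-context prompt: state that the values are floating-point."""
    return (f"Treat {a} and {b} as double-precision floating-point values. "
            f"Which one is larger?")


print(options_first_prompt("9.11", "9.9"))
print(typed_context_prompt("9.11", "9.9"))
```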
Prompting techniques such as Zero‑shot Chain‑of‑Thought (CoT) can solve the problem, whereas role‑playing prompts are less effective, a finding supported by a study of over 1,500 papers showing diminishing returns for role‑play prompting.
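Zero-shot CoT amounts to appending a reasoning cue to the question; the canonical cue is "Let's think step by step" from Kojima et al., though the wrapper function here is an illustrative sketch rather than a standard API:

```python
def zero_shot_cot(question: str) -> str:
    """Append the zero-shot Chain-of-Thought cue to a question."""
    return f"{question}\nLet's think step by step."


prompt = zero_shot_cot("Which is larger, 9.11 or 9.9?")
print(prompt)
```

Sending the augmented prompt encourages the model to write out the place-value comparison before committing to an answer, which is where the unprompted models go wrong.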
The article also notes a Reuters report that OpenAI is testing a new internal model (codenamed “Strawberry”) that scores over 90% on the MATH benchmark, though it remains unclear whether this model can correctly handle the 9.11 vs 9.9 comparison without additional prompting.
IT Services Circle