Analyzing LLM Failure Cases: Tokenization, Next‑Token Prediction, and Chain‑of‑Thought Prompting
The article explains how tokenization mismatches and biased next‑token prediction cause LLMs to miscount letters in “Strawberry” and incorrectly compare 9.9 versus 9.11, and shows that step‑by‑step Chain‑of‑Thought prompting with reason‑first output dramatically improves accuracy.
This article examines two common large‑language‑model (LLM) failure cases: miscounting the letter “r” in “Strawberry” and incorrectly comparing 9.9 with 9.11. It then explains the underlying technical reasons for each.
First, it shows that the errors stem from the LLM’s tokenization process. LLMs split input text into tokens that do not always correspond to human words; for example, “Strawberry” is tokenized as ["str", "aw", "berry"], so the model sees three tokens instead of the individual letters, leading to a miscount.
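A minimal sketch of the mismatch, using the token split quoted in the article (a real tokenizer's split may differ by model and vocabulary):

```python
# The token split for "Strawberry" given in the article.
tokens = ["str", "aw", "berry"]

# The model "sees" three opaque token IDs, not ten characters:
print(len(tokens))        # 3 tokens

# Counting letters requires the character-level view the model never gets:
word = "".join(tokens)
print(word.count("r"))    # 3 occurrences of "r" in "strawberry"
```

The point is that the letter count lives at a level of granularity below the model's input representation, so "count the r's" is not a lookup the model can do directly.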
Second, the article discusses the next‑token prediction mechanism. Because LLMs generate text by predicting the most probable next token based on training data, biases in the data (e.g., many version‑number examples) cause the model to treat 9.11 as larger than 9.9, even when the numeric comparison should be the opposite.
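The two readings of "9.9 vs 9.11" can be contrasted directly; `as_version` below is an illustrative helper, not something from the article:

```python
# Numeric reading: 9.9 is larger.
print(9.9 > 9.11)   # True

# Version-number reading, common in training data (software releases,
# section numbers): the components are compared as integers, so 9.11
# comes "after" 9.9.
def as_version(s):
    return tuple(int(part) for part in s.split("."))

print(as_version("9.11") > as_version("9.9"))  # True: (9, 11) > (9, 9)
```

Both comparisons are internally consistent; the model's error is picking the version-number convention when the numeric one was intended.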
To mitigate these issues, the author introduces Chain‑of‑Thought (CoT) prompting, which guides the model to reason step‑by‑step before producing an answer. Sample prompts are provided:
How many r’s are in “Strawberry”? First split the word into individual letters, then mark the positions in the letter list with 0 for non‑r letters and 1 for r, and count how many r’s there are in total.
Think step by step, working from simple to progressively more complex, and only give the answer at the end. Which is larger, 9.9 or 9.11?
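The letter‑marking procedure the first prompt asks the model to carry out can be written out in plain Python, which shows why it is easy once the word is expanded into letters:

```python
word = "Strawberry"

letters = list(word.lower())                        # step 1: split into letters
marks = [1 if ch == "r" else 0 for ch in letters]   # step 2: mark r as 1, others as 0
count = sum(marks)                                  # step 3: count the 1s

print(letters)  # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(marks)    # [0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
print(count)    # 3
```

The prompt effectively forces the model to rebuild the character‑level view that tokenization discarded, then reduces the count to a trivial sum.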
Experiments show that CoT prompts significantly improve answer accuracy. The article also highlights a “reason‑first” output style, in which the model is asked to explain its reasoning before giving the final answer; because LLMs generate autoregressively, the reasoning tokens condition the final answer, further improving reliability.
Supporting evidence includes a recent research paper ("A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization") that found asking LLMs to provide reasoning before a score yields higher quality evaluations.
Finally, the author shares a practical logistics‑domain prompt that enforces the reason‑first style and demonstrates its effectiveness in real‑world applications.
DaTaobao Tech
Official account of DaTaobao Technology