Analyzing LLM Failure Cases: Tokenization, Next‑Token Prediction, and Chain‑of‑Thought Prompting
The article explains how tokenization mismatches and biased next‑token prediction cause LLMs to miscount letters in “Strawberry” and incorrectly compare 9.9 versus 9.11, and shows that step‑by‑step Chain‑of‑Thought prompting with reason‑first output dramatically improves accuracy.
This article examines two common large‑language‑model (LLM) failure cases: miscounting the letter “r” in “Strawberry” and incorrectly comparing 9.9 with 9.11. It then explains the underlying technical reasons for each.
First, it shows that the errors stem from the LLM’s tokenization process. LLMs split input text into tokens that do not always correspond to human words; for example, “Strawberry” is tokenized as ["str", "aw", "berry"], so the model sees three tokens instead of the individual letters, leading to a miscount.
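A minimal sketch of the mismatch, using the token split quoted in the article (a real tokenizer's split may differ by model and vocabulary):

```python
# The token split for "Strawberry" given in the article.
tokens = ["str", "aw", "berry"]

# The model "sees" three opaque token IDs, not ten characters:
print(len(tokens))        # 3 tokens

# Counting letters requires the character-level view the model never gets:
word = "".join(tokens)
print(word.count("r"))    # 3 occurrences of "r" in "strawberry"
```

The point is that the letter count lives at a level of granularity below the model's input representation, so "count the r's" is not a lookup the model can do directly.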
Second, the article discusses the next‑token prediction mechanism. Because LLMs generate text by predicting the most probable next token based on training data, biases in the data (e.g., many version‑number examples) cause the model to treat 9.11 as larger than 9.9, even when the numeric comparison should be the opposite.
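The two readings of "9.9 vs 9.11" can be contrasted directly; `as_version` below is an illustrative helper, not something from the article:

```python
# Numeric reading: 9.9 is larger.
print(9.9 > 9.11)   # True

# Version-number reading, common in training data (software releases,
# section numbers): the components are compared as integers, so 9.11
# comes "after" 9.9.
def as_version(s):
    return tuple(int(part) for part in s.split("."))

print(as_version("9.11") > as_version("9.9"))  # True: (9, 11) > (9, 9)
```

Both comparisons are internally consistent; the model's error is picking the version-number convention when the numeric one was intended.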
To mitigate these issues, the author introduces Chain‑of‑Thought (CoT) prompting, which guides the model to reason step‑by‑step before producing an answer. Sample prompts are provided:
How many r’s are in “Strawberry”? First split the word into individual letters, then mark the positions in the letter list with 0 for non‑r letters and 1 for r, and count how many r’s there are in total.
Think step by step, working from simple to progressively more complex, and only give the answer at the end. Which is larger, 9.9 or 9.11?
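The letter‑marking procedure the first prompt asks the model to carry out can be written out in plain Python, which shows why it is easy once the word is expanded into letters:

```python
word = "Strawberry"

letters = list(word.lower())                        # step 1: split into letters
marks = [1 if ch == "r" else 0 for ch in letters]   # step 2: mark r as 1, others as 0
count = sum(marks)                                  # step 3: count the 1s

print(letters)  # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(marks)    # [0, 0, 1, 0, 0, 0, 0, 1, 1, 0]
print(count)    # 3
```

The prompt effectively forces the model to rebuild the character‑level view that tokenization discarded, then reduces the count to a trivial sum.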
Experiments show that CoT prompts significantly improve answer accuracy. The article also highlights a “reason‑first” output style, in which the model is asked to explain its reasoning before giving the final answer; because LLMs generate autoregressively, the reasoning tokens condition the final answer, further improving reliability.
Supporting evidence includes a recent research paper ("A Better LLM Evaluator for Text Generation: The Impact of Prompt Output Sequencing and Optimization") that found asking LLMs to provide reasoning before a score yields higher quality evaluations.
Finally, the author shares a practical logistics‑domain prompt that enforces the reason‑first style and demonstrates its effectiveness in real‑world applications.
DaTaobao Tech
Official account of DaTaobao Technology