Uncovering LLM Blind Spots in AI Coding: Common Pitfalls and Solutions

Large language models often struggle with coding tasks, failing to stop when encountering obstacles, ignoring black‑box testing principles, and making unnecessary refactors; this article examines those blind spots, offers practical examples, and suggests strategies such as preparatory refactoring, stateless tools, and careful prompting to improve AI‑assisted development.

ELab Team

Stop Digging

Outside of very tactical situations, current models do not know how to stop digging when they get into trouble. Suppose you want to implement feature X, and midway through you realize you should have done Y first. A human would pause the task, implement Y, and then return to X; an LLM will keep digging, stubbornly completing the original instruction. This property can also be desirable: the LLM follows the exact instruction rather than guessing at your intent.

Black Box Testing

Black box testing means testing a component's functionality without knowledge of its internal structure. By default, LLMs have difficulty abiding by this principle, because the implementation file is often loaded into the context, or the agent pulls up the implementation to understand how to interface with it. Sonnet 3.7 in Cursor also tends to eliminate redundancy in test files, even though black box testing argues for keeping that redundancy, so that bugs in the implementation are not reflected directly in the tests.
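As a sketch in Python (the `Stack` class here is a stand-in for any component under test): a black-box test exercises only the public API and never inspects internals.

```python
class Stack:
    """Toy component under test; the internal list is an implementation detail."""
    def __init__(self):
        self._items = []
    def push(self, x):
        self._items.append(x)
    def pop(self):
        return self._items.pop()
    def __len__(self):
        return len(self._items)

def test_push_then_pop_returns_last_item():
    s = Stack()
    s.push(1)
    s.push(2)
    assert s.pop() == 2   # observable behavior only
    assert len(s) == 1    # never reach into s._items

test_push_then_pop_returns_last_item()
```

An LLM that has seen the implementation will be tempted to assert against `s._items` directly; that coupling is exactly what black box testing forbids.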

Preparatory Refactoring

Preparatory Refactoring says you should first refactor to make a change easy, and then make the change. The refactor may be involved, but because it preserves semantics, it is easier to evaluate than the change itself. Without a plan that tells them to refactor first, current LLMs try to do everything at once, and they often over-clean unrelated code along the way. Reviewing LLM changes is therefore important, and to make review tractable, all refactors should be proposed as separate changes ahead of time.
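A minimal sketch of the two-step discipline, with hypothetical pricing functions. The point is that step 1 can be reviewed as a pure refactor before step 2 changes any behavior.

```python
# Step 1 (refactor, semantics preserved): extract the discount rule so it
# has a single home. This diff is easy to review on its own.
def discount(price: float) -> float:
    return price * 0.9

def checkout(prices: list[float]) -> float:
    return sum(discount(p) for p in prices)

# Step 2 (the actual change): only the extracted rule needs to move.
def discount_v2(price: float) -> float:
    # new behavior: a bigger discount above a threshold
    return price * 0.8 if price > 100 else price * 0.9
```

Asking the LLM for step 1 and step 2 as separate proposals keeps each diff evaluable; asking for both at once invites a tangle of refactoring and behavior change.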

Stateless Tools

Your tools should be stateless: every invocation is independent, and there should be no persistent state that must be accounted for on the next invocation. Unfortunately, the shell, a very popular tool, has a particularly pernicious form of local state: the current working directory. Sonnet 3.7 is very bad at tracking the current working directory, so set up your project so that all commands can be run from a single directory.
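One way to enforce this, sketched as a Python wrapper around the shell (`run` and `REPO_ROOT` are illustrative names): pass the working directory explicitly on every call instead of relying on a persistent `cd`.

```python
# Sketch: keep shell invocations stateless by supplying an explicit working
# directory each time, so no invocation depends on a previous `cd`.
import subprocess
from pathlib import Path

REPO_ROOT = Path.cwd()  # hypothetical single project root

def run(*cmd: str) -> str:
    """Every call is independent: cwd is passed explicitly, never inherited."""
    result = subprocess.run(
        cmd, cwd=REPO_ROOT, capture_output=True, text=True, check=True
    )
    return result.stdout

print(run("echo", "stateless"))
```

The same idea applies to agent tool definitions: a `run_command` tool that takes a path argument is far easier for the model to use correctly than one that remembers where the last command left it.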

Bulldozer Method

The Bulldozer Method, popularized by Dan Luu, says that you can sometimes achieve seemingly superhuman results simply by sitting down and doing the brute-force work, then using what you learn along the way to get more efficient. AI coding is brute-force work par excellence: if you are willing to spend enough tokens, you can brute-force a large-scale refactor, or have the LLM build the workflow you will then use to brute-force the problem. Look for opportunities in problems that were previously dismissed as "too much work". But inspect what the LLM is actually doing, because it will happily repeat the same thing over and over, unlike a human, who would get bored and look for a better way.
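A sketch of what such a brute-force workflow might look like, assuming a hypothetical `old_api` to `new_api` rename across a repository: one small mechanical rewrite, applied to every file without getting bored.

```python
# Bulldozer-style sketch: mechanically apply the same small rewrite to
# every file under a root, instead of doing each occurrence by hand.
import re
from pathlib import Path

def bulldoze(root: Path, pattern: str, replacement: str) -> int:
    """Apply one regex rewrite to every .py file under root; return hit count."""
    hits = 0
    for path in root.rglob("*.py"):
        text = path.read_text()
        new_text, n = re.subn(pattern, replacement, text)
        if n:
            path.write_text(new_text)
            hits += n
    return hits
```

The returned hit count is the kind of thing worth inspecting: if the number looks wrong, the LLM (or your regex) is probably repeating a mistake at scale.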

Requirements, not Solutions

In human software engineering, a common antipattern when solving problems is to jump straight to proposing solutions, without forcing everyone to articulate all of the requirements first. Often, once all of the requirements are written down, the problem space is constrained enough that the solution is uniquely determined; without them, the discussion drifts into vague arguments over particular solutions. An LLM knows nothing about your requirements. If you ask it to do something without specifying all of your constraints, it will fill in every blank with the most probable answers from its training set. That may be fine, but it can lead to hallucinations when you need custom behavior.

Walking Skeleton

A Walking Skeleton is the minimal, crude, end-to-end implementation of a system that contains all of the necessary pieces. The point is to get the whole system working first, and then improve each piece. In the era of LLM coding, it has never been easier to get an entire system up and running.
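A sketch using a hypothetical log-analysis tool: every stage is a stub, but the pipeline runs end to end from day one, so each stub can be replaced independently later.

```python
# Walking skeleton sketch: the crudest version of every stage, wired together.

def ingest(raw: str) -> list[str]:
    return raw.splitlines()                      # stub: no real parsing yet

def analyze(lines: list[str]) -> dict[str, int]:
    return {"lines": len(lines)}                 # stub: one trivial metric

def report(stats: dict[str, int]) -> str:
    return f"processed {stats['lines']} lines"   # stub: plain text output

def pipeline(raw: str) -> str:
    return report(analyze(ingest(raw)))          # the skeleton walks end to end

print(pipeline("a\nb\nc"))
```

With the skeleton in place, you can hand an LLM one stage at a time ("make `analyze` compute error rates") while the rest of the system keeps working.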

Use Static Types

The eternal debate between dynamic and static type systems is about the trade-off between ease of prototyping and long-term maintainability. LLMs greatly reduce the pressure to pick a prototyping-friendly language, because they can take care of the boilerplate and the refactoring. Choose accordingly, and make sure the LLM is informed about type errors after it makes changes, so that when it performs a refactor it can easily tell which other files need updating.
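As a sketch of that feedback loop, assuming a Python project checked with mypy (the `User` and `greeting` names are hypothetical): annotations turn a refactor's fallout into checker errors the LLM can be shown.

```python
# Sketch: type annotations let a checker such as mypy flag every call site
# that a refactor breaks, so the errors can be fed straight back to the LLM.
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int

def greeting(user: User) -> str:
    return f"Hello, {user.name}!"

# If `name` were renamed to `full_name`, running `mypy .` would report an
# error inside greeting() instead of letting it fail at runtime.
print(greeting(User(name="Ada", age=36)))
```

Wiring the checker into the agent loop (run it after every edit, paste the errors back) is what turns static types from documentation into a steering signal.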

Use MCP Servers

Model Context Protocol (MCP) servers give LLMs a standard interface for interacting with their environment, and tools like Cursor's Agent mode and Claude Code use them extensively. For example, instead of relying on a separate RAG system to find and load relevant context files, the LLM can call an MCP server to look up the files it needs, run the tests, or build the project, and then act on the results.
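Protocol details aside, the core shape of an MCP server is a registry of named tools that the model invokes with structured arguments. A conceptual sketch only (this is not the real MCP SDK; `read_file`, `tool`, and `handle_call` are made up for illustration):

```python
# Conceptual sketch (not the real MCP SDK): an MCP server is essentially a
# registry of named tools the model can call with JSON-like arguments.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Register a function as a tool the model may invoke by name."""
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return wrap

@tool("read_file")
def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

def handle_call(name: str, args: dict) -> str:
    """What the server does when the model issues a tool call."""
    return TOOLS[name](**args)
```

The real protocol adds discovery, schemas, and transport, but the mental model holds: each call is a self-contained request, which pairs naturally with the stateless-tools advice above.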

Mise en Place

In cooking, mise en place is the practice of laying out all of your ingredients before you start work. For LLMs, it means making sure all of the rules, MCP servers, and development environment a model might need are correctly set up before a task begins. In my experience, Sonnet 3.7 is not good at fixing a broken environment: patching things live as they break is like copy-pasting StackOverflow commands at random and hoping for a miracle, and the more likely outcome is a permanently broken environment. A misconfigured environment also sends the model down rabbit holes.

The Tail Wagging the Dog

The tail wagging the dog refers to a small, unimportant thing ending up controlling a larger, more important one. LLMs are especially susceptible: in the usual conversational mode, everything they generate and read is added to the context. The model has some ability to judge importance, but if the context is stuffed with irrelevant information, it becomes ever harder for it to remember its actual goal. Careful initial prompting and good context hygiene mitigate this.

Know Your Limits

It is important to know when you are out of your depth or missing the tools you need, so that you can ask for help. Sonnet 3.7 is not good at recognizing its own limits: if you want it to admit it cannot do something, you must at minimum prompt for it explicitly (for example, Sonnet's system prompt instructs it to warn the user about possible hallucinations when asked about very niche topics). When an LLM is acting as an agent, it is critical to give it only tasks it can actually carry out.

Culture Eats Strategy

Even a perfect strategy fails if the team's culture cannot execute it. LLMs have a "culture" too, defined by their fine-tuning, their prompts, and the codebase they see. By default, a model sits in a particular region of latent space: when you ask it to generate code, the style it produces comes from the fine-tuning plus whatever is in the context window so far (the system prompt and the files it has read). This style is self-reinforcing; if the context contains lots of code using a particular library, the LLM will keep using that library, and if the library is never mentioned and the model was not fine-tuned to reach for it, it won't. If the model keeps generating patterns you don't want, you have to change its culture: adjust the prompts or refactor the codebase.

Rule of Three

The Rule of Three in software says you should be willing to copy a piece of code once, but on the third occurrence you should refactor it out. It is a refinement of DRY (Don't Repeat Yourself) that accounts for the fact that the right abstraction may not be obvious, and waiting for the third occurrence often makes the direction of the refactor much clearer. LLMs love to duplicate code: they will happily produce a whole new copy of the program with your requested changes. To make them refactor, you must explicitly ask them to reduce duplication.
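A hypothetical illustration of the third occurrence triggering the refactor: once a third copy-pasted report formatter appears, the shared shape is obvious and can be factored out.

```python
# After the third near-identical formatter, extract the shared shape.
def format_report(title: str, rows: list[str]) -> str:
    header = f"== {title} =="
    return "\n".join([header, *rows])

# Previously three copy-pasted variants; now three one-line calls:
def daily_report(rows: list[str]) -> str:
    return format_report("Daily", rows)

def weekly_report(rows: list[str]) -> str:
    return format_report("Weekly", rows)

def monthly_report(rows: list[str]) -> str:
    return format_report("Monthly", rows)
```

An LLM left to its own devices would happily emit a fourth standalone copy; the refactor only happens if you ask for it, e.g. "these three functions duplicate logic, extract the common part."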

Reflections

1. At present, code editors alone cannot directly improve code quality; they mainly speed up writing. AI editors like Cursor cannot guarantee higher quality either, but the speed gain frees up effort you can reinvest in quality, so compared with purely manual development you can end up with both more code and better code.

2. Cursor will not generate code beyond your own understanding; quality remains bounded by your knowledge. In roughly the first 5,000 lines of a project you may enjoy a "honeymoon" of surprisingly good implementations, but as the codebase grows the model's recall drops, and more manual oversight is needed.

Tags: debugging, LLM, AI coding, software engineering, best practices