Why Claude Code Is Failing Complex Engineering Tasks: AMD’s Deep Dive Reveals Four Critical Flaws
An AMD AI director’s GitHub issue sparked a data‑driven investigation that uncovered four major shortcomings, a 67% drop in thinking depth, a surge in API usage costs, and concrete recommendations to restore trust in Claude Code’s ability to handle complex engineering workloads.
Claude Code, once hailed as a leading AI coding assistant, has entered a credibility crisis after the head of AMD’s AI division publicly criticized its recent updates for becoming "lazy" and unreliable on complex engineering tasks.
The controversy began with a GitHub issue filed on April 2 by user stellaraccident , who highlighted that the February update rendered Claude Code unusable for intricate projects. The issue title explicitly stated the problem, prompting a deep dive by the AMD team.
Four Critical Flaws Identified
Ignoring user instructions.
Providing “simple” fixes that are actually incorrect.
Executing actions opposite to the requested ones.
Claiming task completion without meeting the requirements.
To substantiate these claims, the team analyzed 6,852 sessions comprising 234,760 tool calls and 17,871 thought blocks, revealing a clear degradation trend after February.
Quantitative Findings
The analysis showed a dramatic reduction in thinking depth: average thought‑token length fell from ~2,200 characters in January to ~720 characters by late February—a 67% decrease. This coincided with the rollout of the redact‑thinking‑2026‑02‑12 feature, which progressively hid 50% of the model’s reasoning.
Additional metrics highlighted a sharp decline in code‑modification behavior. In January, Claude Code read an average of 6.6 files before editing; by March’s end, it read only ~2 files, leading to context‑blind changes, duplicated logic, and a surge in bugs.
Cost and Quality Impact
API request volume jumped 80‑fold and token output increased 64‑fold, inflating monthly costs from a few hundred dollars to over $40,000. The team also deployed a stop‑phrase‑guard.sh hook that detected evasive or incomplete actions, which was triggered 173 times within 17 days after March 8, whereas it had never fired before.
Recommendations
Increase transparency of thinking‑token allocation; expose any reductions or caps to users.
Introduce tiered “max thinking” plans to differentiate lightweight from heavyweight inference needs.
Include a thinking_tokens field in API responses so users can monitor reasoning depth.
Monitor stop‑phrase violation rates as an early indicator of quality regression.
The community response was overwhelmingly negative, with developers on Reddit, Hacker News, and other forums echoing the same frustrations, reporting buggy outputs, and questioning Claude Code’s suitability as a reliable coding partner.
In summary, the data‑driven investigation demonstrates that Claude Code’s recent updates have significantly eroded its reasoning depth and code‑generation quality, leading to higher costs and diminished trust among professional developers.
Java Tech Enthusiast
Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
