How Context Engineering Turns AI Agents from ‘Usable’ to ‘Highly Effective’

This article explains how the organization of the prompt, tool schemas, dialogue history, and retrieved documents (collectively, the context window) shapes an AI agent's decisions. It introduces Lost in the Middle, thinking tokens, tool-response caching, and compaction versus SubAgent strategies, then walks through a step-by-step evolution that raised a labeling agent's accuracy from 60% to over 95%.

AI Tech Publishing

1. Effective Context Window

Although modern models support context windows of 128 K to 1 M tokens, agents still miss instructions, pick the wrong tools, or hallucinate, because relevant information gets lost in the middle of the context. This "Lost in the Middle" problem produces a U-shaped performance curve: the model attends strongly to the beginning (primacy effect) and the end (recency effect) of the context, but only weakly to the middle. In practice, the author estimates, a 1 M-token window often yields only about 10 K effectively used tokens.
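
A practical mitigation is to place the most important material at the edges of the window. Below is a minimal sketch of that idea; the function name and the two-document cutoff for "critical" items are illustrative assumptions, not from the article:

def assemble_context(system_rules: str, documents: list[str], task: str) -> str:
    """Order context so high-priority content sits at the edges,
    where attention is strongest, and background sits in the middle."""
    critical, background = documents[:2], documents[2:]  # illustrative split
    parts = [
        system_rules,                  # start of window: primacy effect
        *critical,
        *background,                   # middle: weakest attention
        f"Task (restated): {task}",    # end of window: recency effect
    ]
    return "\n\n".join(parts)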

2. Context Engineering Principles

Without deliberate design, context quickly inflates: a single turn may carry the system prompt, 20+ tool schemas, the dialogue history, tool results, and retrieved documents, most of which is irrelevant noise. This leads to "context bloat" and pattern pollution, where the model imitates stray patterns it finds in the context (for example, reproducing a leftover git commit && git push command). Proper context engineering decides what to include, where to place it, and when to discard it; a sketch of the "what to include" step follows.
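
This is a hedged sketch of the inclusion decision: the keyword-overlap scorer stands in for whatever relevance signal a real system would use (embeddings, a router model), and select_tools is a hypothetical helper, not an API from the article:

def select_tools(all_tools: list[dict], task: str, limit: int = 5) -> list[dict]:
    """Send only the tool schemas plausibly relevant to this task,
    rather than all 20+ of them."""
    task_words = set(task.lower().split())

    def score(tool: dict) -> int:
        # crude relevance: count task words appearing in name/description
        text = (tool["name"] + " " + tool.get("description", "")).lower()
        return sum(word in text for word in task_words)

    return sorted(all_tools, key=score, reverse=True)[:limit]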

3. Thinking Tokens: An Overlooked Blind Spot

Thinking tokens are the model's internal reasoning steps, generated before the final answer. The article argues they receive a roughly uniform attention distribution, creating an "attention sink" that dilutes focus on key information. They also reduce the KV-cache hit rate because they are lengthy, nondeterministic, and not part of the final output. Some systems therefore replace free-form thinking with a structured prompt such as:

<thinking>
1. What does the user want?
2. What information do I have?
3. What are the key constraints?
4. What recommendation should I give?
</thinking>

This "structured thinking" keeps the reasoning focused instead of letting it dissolve into a fog of free-form tokens.
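
One way to apply the template is to pin it into the system prompt so every turn answers the same four questions. The exact prompt wording and helper name below are illustrative assumptions:

STRUCTURED_THINKING = (
    "Before answering, reason inside <thinking> tags by answering "
    "exactly these four questions, one line each:\n"
    "1. What does the user want?\n"
    "2. What information do I have?\n"
    "3. What are the key constraints?\n"
    "4. What recommendation should I give?"
)

def build_system_prompt(base_instructions: str) -> str:
    # A fixed scaffold keeps the reasoning short and predictable
    # instead of free-form and cache-hostile.
    return base_instructions + "\n\n" + STRUCTURED_THINKING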

4. Tool Responses: Cache Killers

Within a single conversation, tool responses simply extend the prefix and do not hurt the KV-cache, but across requests they can break caching because tool outputs are often nondeterministic (timestamps, UUIDs, variable result ordering). When two users ask the same question, differing tool responses cause prefix mismatches and invalidate the cache. The recommended fix is to make tool responses deterministic: remove volatile fields, sort by stable keys, and discard irrelevant metadata.
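
A normalization pass along the lines the article recommends might look like this; the volatile-field list and function name are illustrative, not from the source:

import json

VOLATILE_FIELDS = {"timestamp", "request_id", "uuid", "trace_id"}

def normalize_tool_response(raw: dict) -> str:
    """Make tool output byte-stable across requests so identical
    questions share an identical prefix and keep the KV-cache warm."""
    cleaned = {k: v for k, v in raw.items() if k not in VOLATILE_FIELDS}
    # sort_keys fixes field order regardless of how the backend
    # happened to serialize the result
    return json.dumps(cleaned, sort_keys=True, ensure_ascii=False)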

5. Compaction and SubAgent

When a task overwhelms the context window—e.g., exploring large codebases or deep debugging—two strategies are possible.

Compaction: Summarize the current context and pass the summary into a fresh window, much like packing files into an archive.

SubAgent: Split the task across independent agents, each with a clean context, invoked like a CLI command, e.g.:

claude --prompt "investigate the authentication logic of this module"

Both approaches risk information loss: compaction may omit details in its summary, while a SubAgent sees only its slice of the task and may lose global context. Passive compaction (used by Claude Code) triggers automatically when the context exceeds a threshold and places the summary at the start of a new window, which benefits from the primacy effect but invalidates the previous KV-cache. Active compaction (used by Tape Systems) lets the model decide when to compress, preserving "anchor" points for later retrieval but risking over-compression.

The author prefers passive compaction for most scenarios, since active compaction demands more judgment from the model. A sketch of the passive trigger follows.
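
This is a minimal sketch, assuming a stand-in summarize() helper and an illustrative token threshold; neither is specified by the article:

def summarize(messages: list[dict]) -> str:
    # Stand-in for an LLM summarization call; a real system would
    # prompt the model to compress the transcript.
    return " | ".join(m["content"][:80] for m in messages[-10:])

def maybe_compact(messages: list[dict], token_count: int,
                  threshold: int = 100_000) -> list[dict]:
    """When the context crosses the threshold, start a new window whose
    first message is the summary: the primacy effect works in its favor,
    but the old KV-cache prefix is gone."""
    if token_count < threshold:
        return messages
    return [{"role": "user",
             "content": "Summary of prior work:\n" + summarize(messages)}]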

6. Case Study: Raising Accuracy from 60 % to 95 %+

The author iterated through six versions of an enterprise data‑labeling agent:

V1: Simple prompt + search tool. Accuracy: 60%.

V2: Added detailed examples. Accuracy: 70%.

V3: Structured output + guardrails. No accuracy gain.

V4: Added web search. Accuracy: 75%.

V5: Split into three focused steps. Accuracy: 85%.

V6: Full context engineering with on-demand rules. Accuracy: above 95%.

Key observations:

V1 suffered from insufficient context despite having the correct tools.

V2’s longer prompt (~4 K tokens) caused occasional “unruly” behavior and hallucinations.

V3’s structured output eliminated format errors but did not improve accuracy.

V4 improved coverage with web search but retained the “unruly” issue.

V5’s decomposition into small, focused contexts allowed the model to concentrate, boosting accuracy.

V6’s dynamic context injection (similar to skills/hooks) finally pushed accuracy above 95 %.

The trade-off is ongoing maintenance of the domain knowledge base: new problem types require new rules. A sketch of this on-demand injection pattern follows.
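
Here is what V6-style dynamic injection can look like; the trigger keywords and rule texts are invented placeholders, not the author's actual rulebook:

RULEBOOK = {
    # trigger keyword -> domain rule injected only when it applies
    "invoice": "Label amounts in the original currency; never convert.",
    "contract": "Flag any clause mentioning auto-renewal as HIGH priority.",
}

def inject_rules(base_prompt: str, item_text: str) -> str:
    """Keep the base prompt small; pull in only the rules this item
    actually triggers (a skills/hooks-like pattern)."""
    hits = [rule for trigger, rule in RULEBOOK.items()
            if trigger in item_text.lower()]
    if not hits:
        return base_prompt
    return base_prompt + "\n\nApplicable rules:\n- " + "\n- ".join(hits)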

Failed attempts included over‑aggressive input cleaning (removing UUIDs, hashes, etc.), shrinking JSON field names (which reduced performance by 5 %), swapping to larger models (which did not help), and training a dedicated model (which matched but did not exceed the base model).

Tags: AI agents, LLM, compaction, Context Engineering, subagent, thinking tokens, tool caching