Why Perfect Prompts Crash After Days: Uncovering the Limits of Context Engineering

An AI‑driven customer‑service bot that answered perfectly in its demo started hallucinating within three days, because single‑turn prompt engineering ignores the continuous, stateful nature of real‑world conversations. The failure reveals hidden token, memory, and retrieval challenges that demand a new context‑engineering approach.

Case Study: Prompt Failure After Three Days

A smart‑customer‑service prototype built with Claude Code, a Milvus vector store, and a three‑layer retrieval pipeline performed flawlessly during a demo, but within three days the model began giving incorrect or fabricated answers, exposing a "memory loss" problem in multi‑turn dialogs.

Why Single‑Turn Prompt Engineering Hits a Wall

Traditional prompt engineering treats each model call as an independent, static instruction set. In production, however, AI agents operate in a dynamic, multi‑turn environment where context accumulates, attention dilutes, and token budgets are limited. The model can only retain the most recent one or two turns; earlier critical facts are pushed out of the context window or polluted by irrelevant retrieval results.

Introducing Context Engineering

Context engineering expands the focus from a single prompt to a full‑stack system that continuously curates, formats, and injects the right information at the right time. It is analogous to providing a scholar with a well‑organized library, an assistant, and a running log of past discussions rather than just a one‑off question note.

Three Practical Dimensions

1. Token Economics

Sliding Window & Summarization: Keep the raw content of the most recent N turns and compress older turns into a concise background summary to save tokens.
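
A minimal sketch of this pattern, assuming a summarize() helper that stands in for whatever cheap summarizer you use (a small‑model call or simple extractive rules):

def summarize(text: str) -> str:
    # Stand-in for a cheap summarizer (a small-model call or extractive rules).
    return text[:200] + "..." if len(text) > 200 else text

def compress_history(turns: list[str], keep_last: int = 2) -> list[str]:
    """Sliding window: keep the last `keep_last` turns verbatim and fold
    everything older into a single background summary."""
    recent = turns[-keep_last:]
    older = turns[:-keep_last]
    if not older:
        return recent
    return [f"Background summary: {summarize(' '.join(older))}"] + recent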

Heuristic Compression Rules: Discard verbose tool‑call logs or intermediate reasoning while preserving user‑stated facts and key decision points.
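
A sketch of such rules, assuming each turn is a dict with a role field and an optional intermediate flag (both names are illustrative, not from the original system):

def apply_compression_rules(history: list[dict]) -> list[dict]:
    """Discard verbose tool logs and intermediate reasoning;
    keep user-stated facts and key decision points."""
    kept = []
    for turn in history:
        if turn.get("role") == "tool":
            continue  # raw tool output is dropped; summarize it instead (see section 3)
        if turn.get("intermediate"):
            continue  # scratch reasoning adds tokens but little grounding value
        kept.append(turn)
    return kept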

Cost/Accuracy/Latency Trade‑off: Test different compression strategies on a validation set to find the sweet spot between token cost and answer quality.
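
One way to run that test, sketched with assumed pieces: count_tokens wraps your tokenizer, answer_fn calls the model, and each validation example carries a history, a question, and an expected answer:

def evaluate_strategy(compress, validation_set, answer_fn, count_tokens):
    """Return (average token cost, accuracy) for one compression strategy."""
    total_tokens, correct = 0, 0
    for ex in validation_set:
        context = compress(ex["history"])
        total_tokens += count_tokens("\n".join(context))
        correct += int(answer_fn(context, ex["question"]) == ex["expected"])
    n = len(validation_set)
    return total_tokens / n, correct / n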

2. Memory Hierarchy

Short‑Term Memory (Working Area): Store the current session’s state machine, recent turns, and a short summary. Example state flow:

Identify intent → Query knowledge base → Confirm details → Propose solution
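
A minimal sketch of that flow as an explicit transition table; the phase names mirror the flow above, while the reference design later in this article uses its own set (greeting, qa, troubleshooting, closed):

PHASE_FLOW = {
    "identify_intent": "query_knowledge_base",
    "query_knowledge_base": "confirm_details",
    "confirm_details": "propose_solution",
}

def advance_phase(current: str, step_succeeded: bool) -> str:
    """Advance only when the current step succeeded; otherwise stay in place."""
    return PHASE_FLOW.get(current, current) if step_succeeded else current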

Long‑Term Memory (Disk): Persist user profiles, preferences, and cross‑session conclusions in a vector database; retrieve only the most relevant entries during a conversation.

External Knowledge Mounting: Retrieve documents from product manuals or knowledge bases, format them into a concise list, and inject the formatted snippet rather than raw text.
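
A sketch of that formatting step, assuming each retrieved document is a dict with text and source fields (illustrative names):

def format_snippets(docs: list[dict], max_chars: int = 300) -> str:
    """Render retrieved documents as a compact numbered list instead of raw dumps."""
    lines = []
    for i, doc in enumerate(docs, 1):
        excerpt = doc["text"][:max_chars].strip()
        lines.append(f"{i}. [{doc.get('source', 'knowledge base')}] {excerpt}")
    return "\n".join(lines)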

3. RAG + Context Coordination

Result Re‑ranking & Filtering: After a vector search, apply keyword, metadata, or business‑rule filters to promote domain‑relevant documents and suppress noise.
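
A sketch of that post‑processing, assuming each hit carries text, a similarity score, and a category metadata field (names are illustrative):

def rerank_and_filter(hits: list[dict], required_category: str = None,
                      boost_keywords: tuple = ()) -> list[dict]:
    """Hard-filter on metadata first, then boost hits that match business keywords."""
    if required_category:
        hits = [h for h in hits if h.get("category") == required_category]

    def score(hit: dict) -> float:
        bonus = sum(1 for kw in boost_keywords if kw in hit["text"])
        return hit["similarity"] + 0.1 * bonus  # the 0.1 weight is a tunable guess

    return sorted(hits, key=score, reverse=True)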

Retrieve‑Compress‑Inject Pipeline: Retrieve relevant passages, summarize them with a small model, and inject only the distilled conclusions into the main context.
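
The three stages in one function, sketched against assumed interfaces (a rag_engine.retrieve that returns dicts with a text field, and a summarizer callable backed by a small model):

def retrieve_compress_inject(query: str, rag_engine, summarizer, top_k: int = 5) -> str:
    passages = rag_engine.retrieve(query, top_k=top_k)           # 1. retrieve
    digest = summarizer("\n".join(p["text"] for p in passages))  # 2. compress
    return f"Relevant knowledge:\n{digest}"                      # 3. inject only the distillate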

Tool‑Call Feedback Handling: Parse large JSON or log outputs from tools into short, human‑readable summaries before feeding them to the LLM.
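
A sketch of one such parser; the exact fields are illustrative, and the point is that the LLM sees a short digest instead of kilobytes of JSON:

import json

def summarize_tool_output(raw: str, max_items: int = 3) -> str:
    """Turn a large JSON tool response into a short, human-readable digest."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw[:200]  # not JSON: fall back to hard truncation
    if isinstance(data, list):
        head = ", ".join(str(item) for item in data[:max_items])
        return f"{len(data)} results; first {min(max_items, len(data))}: {head}"
    if isinstance(data, dict):
        return "Object with fields: " + ", ".join(list(data)[:max_items])
    return str(data)[:200]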

Reference Design: Conversation State Machine + Context Manager

The following simplified Python example demonstrates how to combine a short‑term state machine with a context manager that assembles system prompts, short‑term memory, long‑term preferences, and RAG knowledge before each model call.

class ConversationState:
    """Manage short‑term memory for a dialogue session"""
    def __init__(self, session_id):
        self.session_id = session_id
        self.current_phase = "greeting"  # greeting, qa, troubleshooting, closed
        self.entities = {}  # e.g., {"order_id": "123456", "issue_type": "delayed"}
        self.last_summary = ""

class ContextManager:
    """Assemble the final context for the LLM"""
    def __init__(self, state_manager, rag_engine, long_term_memory_db):
        self.state_manager = state_manager
        self.rag_engine = rag_engine
        self.long_term_memory_db = long_term_memory_db

    def assemble_context(self, user_query, session_id):
        # 1. Update short‑term state
        state = self.state_manager.get_and_update_state(session_id, user_query)
        # 2. Build context parts
        context_parts = []
        system_prompt = self._get_system_prompt_for_phase(state.current_phase)
        context_parts.append(system_prompt)
        short_term_mem = f"对话背景:{state.last_summary}
最近对话:{self._get_recent_turns(session_id, turns=2)}"
        context_parts.append(short_term_mem)
        user_prefs = self.long_term_memory_db.retrieve_relevant_prefs(session_id, user_query, limit=2)
        if user_prefs:
            context_parts.append(f"用户偏好提示:{user_prefs}")
        if state.current_phase == "qa":
            docs = self.rag_engine.retrieve(user_query, filters={"category": state.entities.get("topic")})
            compressed_knowledge = self._summarize_for_task(docs, task="answer_question")
            context_parts.append(f"相关知识:{compressed_knowledge}")
        context_parts.append(f"用户最新问题:{user_query}")
        # 3. Token‑aware truncation
        final_context = self._truncate_by_tokens(context_parts, max_tokens=8000)
        return final_context, state

    def _truncate_by_tokens(self, parts, max_tokens):
        """Simple token‑aware truncation: keep latest question and system prompt, compress older history"""
        # Implementation omitted for brevity
        ...

Engineering Trade‑offs

turns=2: Keep the raw content of the last two turns; earlier turns are summarized because their marginal value drops sharply after the second turn.

limit=2: Retrieve at most two long‑term preference entries to avoid over‑personalization noise.

max_tokens=8000: Reserve enough tokens for the model's generation window (e.g., on a 128K‑context model) while staying within input limits.

Prioritize Intent: The current user intent is never trimmed; all other parts are compressed first.
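
A sketch of what _truncate_by_tokens could look like under that priority rule (the article deliberately omits the real implementation above; count_tokens stands in for your tokenizer):

def truncate_by_tokens(parts: list[str], max_tokens: int, count_tokens) -> str:
    """Keep the first part (system prompt) and the last part (latest user question)
    untouched; drop middle parts, oldest first, until the budget is met."""
    assert len(parts) >= 2, "expects at least a system prompt and the latest question"
    budget = max_tokens - count_tokens(parts[0]) - count_tokens(parts[-1])
    middle = parts[1:-1]
    while middle and sum(count_tokens(p) for p in middle) > budget:
        middle.pop(0)  # oldest/background material is sacrificed first
    return "\n\n".join([parts[0], *middle, parts[-1]])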

Conclusion

Prompt engineering alone cannot guarantee reliable behavior for LLM‑powered applications that run for weeks or months. By treating the model’s context as a scarce resource and building a layered memory system—short‑term state, long‑term preference store, and RAG‑enhanced knowledge—developers can create robust, cost‑effective pipelines that keep the model “aware” of the entire conversation history.

Tags: LLM, prompt engineering, RAG, token management, Context Engineering, Conversation State
Written by Big Data and Microservices

Focused on big data architecture, AI applications, and cloud‑native microservice practices, we dissect the business logic and implementation paths behind cutting‑edge technologies. No obscure theory—only battle‑tested methodologies: from data platform construction to AI engineering deployment, and from distributed system design to enterprise digital transformation.
