How Short‑Term vs Long‑Term Memory Works in LLM‑Powered Autonomous Agents

This article demystifies short‑term and long‑term memory in LLM‑driven autonomous agents, explaining their mechanisms, limitations, and practical implementations such as sliding windows, summarization, and vector‑based retrieval, while illustrating each concept with concrete Cherry Studio examples and relevant research references.


Short‑Term Memory

Short‑term memory holds the immediate context of a single conversation by feeding the previous dialogue turns and any intermediate reasoning (e.g., Chain‑of‑Thought) into the LLM prompt. The LLM sees the concatenated history as part of its input parameters, so the size of the memory is bounded by the model’s context window.

Working mechanism: The agent appends each new turn to the request payload; the LLM processes the whole prompt.

Limitation: When the accumulated dialogue exceeds the context window, the oldest turns are dropped, causing “forgetting”.

Typical mitigation strategies:

Sliding window: Keep only the most recent N turns.

Summarization: Replace older turns with a concise summary that preserves essential information.

Demonstration (Cherry Studio): Four sequential queries – “Hello”, “…”, “Great”, “No need” – are sent. After the fourth query, the first turn disappears from the prompt, illustrating the sliding‑window effect. The “context count” setting in the UI controls how many turns are retained.
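The sliding‑window behaviour can be sketched in a few lines of Python. The function name and message format below are illustrative stand‑ins, not Cherry Studio's actual code, and the turn contents are placeholders for the demonstration's queries.

```python
# Sliding-window short-term memory: only the most recent N turns are sent.
# Illustrative sketch, not Cherry Studio's implementation.

def build_prompt(history, window=3):
    """Return only the most recent `window` turns for the next request."""
    return history[-window:]

history = []
for text in ["turn 1", "turn 2", "turn 3", "turn 4"]:  # stand-ins for real queries
    history.append({"role": "user", "content": text})

prompt = build_prompt(history, window=3)
# "turn 1" has slid out of the prompt, mirroring the demonstration above.
print([m["content"] for m in prompt])
```

With a window of 3, the fourth turn pushes the first one out of the prompt, which is exactly the "forgetting" described above.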

Because most LLM providers charge per input and output token, longer contexts increase cost (see the Deepseek pricing page: https://api-docs.deepseek.com/quick_start/pricing). Moreover, research on “context rot” shows that extending the context can degrade model performance (Chroma, https://research.trychroma.com/context-rot).
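Since billing is per token, the cost effect of a longer retained context is simple arithmetic. The per‑million‑token prices below are placeholders, not Deepseek's actual rates; consult the pricing page for real numbers.

```python
# Back-of-the-envelope cost sketch: longer contexts mean more input tokens per call.
# Prices are hypothetical placeholders, NOT Deepseek's actual rates.

PRICE_PER_1M_INPUT = 0.50   # hypothetical USD per million input tokens
PRICE_PER_1M_OUTPUT = 1.50  # hypothetical USD per million output tokens

def call_cost(input_tokens, output_tokens):
    """Cost of one request at the placeholder rates above."""
    return (input_tokens * PRICE_PER_1M_INPUT
            + output_tokens * PRICE_PER_1M_OUTPUT) / 1_000_000

# Doubling the retained history roughly doubles the input side of the bill,
# while the output side stays the same.
cost_short = call_cost(2_000, 500)
cost_long = call_cost(4_000, 500)
```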

Long‑Term Memory

Long‑term memory enables an agent to store, retrieve, and reuse information across days, months, or years. The common implementation uses Retrieval‑Augmented Generation (RAG): important facts are embedded into vectors and persisted in a vector database; at query time, semantic similarity retrieves the most relevant chunks.

Technical implementation: Convert salient information into embeddings, store them in a vector store, and query the store with the current context.

Memory categories:
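The embed–store–retrieve cycle can be illustrated in pure Python. The `embed` function below is a toy bag‑of‑words embedding and `MemoryStore` a toy in‑memory index; a real agent would call an embedding model and a vector database instead.

```python
# Illustrative long-term memory store using vector similarity, in pure Python.
# `embed` is a toy bag-of-words embedding, not a real embedding model.

import math
from collections import Counter

def embed(text):
    """Toy embedding: word-count vector of the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self):
        self.entries = []  # list of (embedding, original text)

    def add(self, text):
        self.entries.append((embed(text), text))

    def search(self, query, top_k=1):
        qv = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(qv, e[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]

store = MemoryStore()
store.add("User is allergic to peanuts")
store.add("User liked the coffee in Shanghai")
print(store.search("is the user allergic to any food"))
```

At query time, semantic similarity (here, crude word overlap) surfaces the allergy fact rather than the travel anecdote, which is the essence of RAG‑style retrieval.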

Episodic memory – records specific experiences, e.g., “User was on a business trip to Shanghai last Tuesday and liked the coffee there”.

Semantic memory – stores abstract facts, e.g., “User is allergic to peanuts”.

Procedural memory – captures skills or SOPs, e.g., “Agent knows how to call a particular API”.
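One simple way to represent these three categories in code is to tag each stored entry with its kind. The class and field names below are illustrative, not a standard schema.

```python
# Tagging long-term memories by category; names are illustrative.
from dataclasses import dataclass

@dataclass
class Memory:
    kind: str      # "episodic" | "semantic" | "procedural"
    content: str

memories = [
    Memory("episodic", "User was on a business trip to Shanghai last Tuesday"),
    Memory("semantic", "User is allergic to peanuts"),
    Memory("procedural", "To fetch weather, call the weather API with a city name"),
]

# Filtering by kind lets the agent, e.g., load all stable facts about the user.
semantic_facts = [m.content for m in memories if m.kind == "semantic"]
```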

Cherry Studio implementation: The platform provides a “global memory” feature.

Users may manually add entries or enable automatic memory creation, where the agent decides what to persist.

When global memory is on, each new user turn triggers the Memory_Search tool, which retrieves relevant stored snippets and injects them into the LLM prompt.

After the LLM generates a response, an asynchronous post‑processing step extracts new facts from the conversation and updates the memory store via dedicated tools (add, modify, delete).
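The retrieve → respond → post‑process loop described above can be sketched as follows. All function names here (`memory_search`, `llm`, `extract_facts`) are illustrative stand‑ins for Cherry Studio's internals, and the fact‑extraction heuristic is deliberately naive.

```python
# Sketch of the retrieve -> respond -> post-process memory loop.
# Function names are stand-ins, not Cherry Studio's actual internals.

store = ["User is allergic to peanuts"]  # persisted long-term memories

def memory_search(memories, query):
    """Toy retrieval: return memories sharing any word with the query."""
    words = query.lower().split()
    return [m for m in memories if any(w in m.lower().split() for w in words)]

def llm(prompt):
    """Placeholder for a real model call."""
    return f"(response to: {prompt!r})"

def extract_facts(user_turn, response):
    """Naive post-processing: persist turns that state a preference."""
    return [user_turn] if "i like" in user_turn.lower() else []

def handle_turn(user_turn):
    retrieved = memory_search(store, user_turn)       # Memory_Search step
    prompt = f"memories: {retrieved}\nuser: {user_turn}"
    response = llm(prompt)                            # generate with injected memories
    store.extend(extract_facts(user_turn, response))  # asynchronous in a real agent
    return response

handle_turn("I like green tea")
print(store)  # the stated preference has been persisted
```

A real implementation would run the extraction step asynchronously and route updates through dedicated add/modify/delete tools, as the article describes.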

Practical considerations

Automatic memory decisions can introduce incorrect facts if the agent misinterprets user intent.

Even with retrieved memories, a less capable model may over‑rely on them, producing answers that are irrelevant or misleading.

Full source code is available at https://github.com/CherryHQ/cherry-studio.

In summary, short‑term memory is realized by passing dialogue history as prompt parameters and works only within a single session, while long‑term memory relies on persistent vector stores accessed via RAG, enabling cross‑session knowledge retention. Each approach has distinct trade‑offs in cost, context limits, and reliability.

Tags: memory management, LLM, prompt engineering, RAG, Cherry Studio, autonomous agents
Written by Wuming AI

Practical AI for solving real problems and creating value