How CASCADE Enables LLM Agents to Learn from Experience During Live Deployment

The paper introduces CASCADE, a deployment‑time learning framework that lets LLM agents continuously select and reuse past cases via a contextual‑bandit approach, achieving higher long‑term success rates across diverse online tasks without updating the base model.

Machine Heart

Research Background

Existing work on agent experience learning falls into two categories: (1) traditional machine-learning pipelines that train on static datasets before testing, and (2) runtime learning that iteratively improves on the same data. Real deployments add a crucial temporal dimension: tasks arrive sequentially, agents cannot see future queries, and each interaction provides feedback that can be leveraged for improvement.

Deployment‑Time Learning (DTL) and CASCADE

DTL formalizes the deployment phase as an online learning problem where, at step t, the agent receives a query, generates an answer or action trajectory, and receives binary success/failure feedback. The goal is to maximize long‑term success (minimize regret) over the entire deployment sequence.
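
The interaction protocol described above can be sketched as a simple loop. This is only an illustration: `run_deployment`, the toy agent, and the toy feedback function are hypothetical names invented here, not code from the paper.

```python
from typing import Callable, Dict, List, Tuple


def run_deployment(
    queries: List[str],
    agent: Callable[[str], str],
    feedback: Callable[[str, str], bool],
) -> Tuple[List[int], float]:
    """Process a query stream once and record binary feedback per step."""
    rewards: List[int] = []
    for query in queries:        # queries arrive one at a time, with no lookahead
        answer = agent(query)    # the agent can rely only on what it has seen so far
        rewards.append(int(feedback(query, answer)))  # 1 = success, 0 = failure
    success_rate = sum(rewards) / len(rewards) if rewards else 0.0
    return rewards, success_rate


# Toy stream (assumed data): the hypothetical agent gets the last query wrong.
truth: Dict[str, str] = {"2+2": "4", "3+3": "6", "10-4": "6"}
guesses: Dict[str, str] = {"2+2": "4", "3+3": "6", "10-4": "5"}
rewards, success_rate = run_deployment(
    list(truth),
    agent=lambda q: guesses[q],
    feedback=lambda q, a: a == truth[q],
)
print(rewards, round(success_rate, 3))  # [1, 1, 0] 0.667
```

Maximizing the sum of these per-step rewards over the whole stream is the same objective as minimizing regret against the best achievable sequence of outcomes.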

CASCADE implements DTL using a case-based reasoning (CBR) framework built around a four-step loop, the 4R cycle:

Retrieve: fetch candidate cases from a growing case library.

Reuse: provide the selected case as context to the LLM.

Revise: generate the final answer or action.

Retain: if feedback is positive, store the new interaction as a case.
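
The four steps above can be sketched in a few lines. Everything here is assumed for illustration: the `Case` and `CaseLibrary` classes, the word-overlap retrieval rule, and the toy LLM are stand-ins, and the word-overlap rule in particular is not the learned retrieval policy that CASCADE uses.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Case:
    query: str
    answer: str


@dataclass
class CaseLibrary:
    cases: List[Case] = field(default_factory=list)

    def retrieve(self, query: str) -> Optional[Case]:
        """Placeholder rule: return the stored case sharing the most words with the query."""
        if not self.cases:
            return None
        words = set(query.lower().split())
        return max(self.cases, key=lambda c: len(words & set(c.query.lower().split())))


def four_r_step(
    query: str,
    library: CaseLibrary,
    llm: Callable[[str], str],
    feedback: Callable[[str, str], bool],
) -> bool:
    case = library.retrieve(query)                      # Retrieve
    if case is None:
        prompt = query
    else:                                               # Reuse: the case becomes context
        prompt = f"Example:\nQ: {case.query}\nA: {case.answer}\n\nQ: {query}"
    answer = llm(prompt)                                # Revise: produce the final answer
    success = feedback(query, answer)
    if success:
        library.cases.append(Case(query, answer))       # Retain only successful cases
    return success


# Toy LLM (assumed): succeeds on "easy" inputs or whenever an example is supplied.
def toy_llm(prompt: str) -> str:
    return "ok" if ("Example:" in prompt or "easy" in prompt) else "fail"


library = CaseLibrary()
stream = ["hard parsing task", "easy sorting task", "hard sorting task"]
results = [four_r_step(q, library, toy_llm, lambda q, a: a == "ok") for q in stream]
print(results, len(library.cases))  # [False, True, True] 2
```

In this toy run the first hard query fails because the library is empty, while the later hard query succeeds once a retained case is available as context.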

The core challenge, choosing which case to retrieve, is modeled as a contextual bandit problem: the query is the context, each candidate case is an arm, and the binary feedback is the reward. CASCADE proposes the Neural-LinLogUCB algorithm, which encodes query-case interactions with a Transformer and estimates uncertainty via a linear head, enabling exploration-exploitation trade-offs under binary feedback.
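
To make the exploration-exploitation idea concrete, here is a minimal sketch of a generic logistic UCB-style selector over fixed feature vectors. It is not the paper's Neural-LinLogUCB: the Transformer encoder is replaced by hand-written features, and the class name, the `alpha` and `lr` parameters, and the single-gradient-step update are all assumptions made for this example.

```python
import math
from typing import List

Vector = List[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class LogisticUCBSelector:
    """Scores each candidate as predicted success probability plus an uncertainty bonus."""

    def __init__(self, dim: int, alpha: float = 1.0, lr: float = 0.5) -> None:
        self.alpha = alpha                  # exploration strength (assumed hyperparameter)
        self.lr = lr                        # step size for the logistic update
        self.theta = [0.0] * dim            # weights of the linear head
        # Inverse of the design matrix A = I + sum(phi phi^T), starting from the identity.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

    def bonus(self, phi: Vector) -> float:
        a_phi = [_dot(row, phi) for row in self.a_inv]
        return self.alpha * math.sqrt(max(_dot(phi, a_phi), 0.0))

    def score(self, phi: Vector) -> float:
        return _sigmoid(_dot(self.theta, phi)) + self.bonus(phi)

    def select(self, candidates: List[Vector]) -> int:
        return max(range(len(candidates)), key=lambda i: self.score(candidates[i]))

    def update(self, phi: Vector, reward: float) -> None:
        # One gradient step on the logistic loss for the observed binary reward.
        err = reward - _sigmoid(_dot(self.theta, phi))
        self.theta = [t + self.lr * err * x for t, x in zip(self.theta, phi)]
        # Sherman-Morrison rank-one update of the inverse design matrix.
        a_phi = [_dot(row, phi) for row in self.a_inv]
        denom = 1.0 + _dot(phi, a_phi)
        n = len(phi)
        self.a_inv = [
            [self.a_inv[i][j] - a_phi[i] * a_phi[j] / denom for j in range(n)]
            for i in range(n)
        ]


# Two hypothetical query-case feature vectors.
candidates = [[1.0, 0.0], [0.0, 1.0]]
selector = LogisticUCBSelector(dim=2)
first = selector.select(candidates)       # tie at the start, so index 0 is chosen
selector.update(candidates[first], 0.0)   # the chosen case led to a failure
second = selector.select(candidates)      # the untried case now scores higher
```

After one failure on the first candidate, both its predicted success and its uncertainty bonus drop, so the selector moves to the case it has not tried yet.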

Theoretical Insight

Regret is decomposed into (1) a coverage gap (whether the case library contains sufficiently relevant experiences) and (2) retrieval regret (whether the retrieval policy picks the most useful case). As deployment proceeds, successful cases enrich the library, shrinking the coverage gap, while the bandit updates lower retrieval regret, yielding a no-regret guarantee under reasonable assumptions.
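
One way to write this decomposition, using notation assumed here rather than taken from the paper: let \mu_t(c) be the expected success on the query at step t when case c is supplied, \mu_t^{\star} the success achievable with an ideal case, c_t^{\star} the best case currently in the library, and c_t the case the policy actually retrieves. Adding and subtracting \mu_t(c_t^{\star}) splits the total regret into the two terms:

```latex
\[
\underbrace{\sum_{t=1}^{T}\bigl(\mu_t^{\star}-\mu_t(c_t)\bigr)}_{\text{total regret}}
=
\underbrace{\sum_{t=1}^{T}\bigl(\mu_t^{\star}-\mu_t(c_t^{\star})\bigr)}_{\text{coverage gap}}
+
\underbrace{\sum_{t=1}^{T}\bigl(\mu_t(c_t^{\star})-\mu_t(c_t)\bigr)}_{\text{retrieval regret}}
\]
```

The first sum depends only on what the library contains, and the second only on how well the policy chooses among what is there. The paper's exact statement and assumptions may differ from this sketch.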

DTLBench Benchmark

To evaluate DTL, the authors construct DTLBench, comprising 16 tasks across domains such as medical diagnosis, legal prediction, finance, IT operations, programming, embodied decision‑making, and information retrieval. Each task is presented as an online query stream, making per‑step success the primary metric.
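
Because each task is a stream, a natural way to inspect per-step success is a running success rate over the outcomes seen so far. The helper below is a hypothetical illustration with made-up outcomes, not DTLBench's evaluation code.

```python
from itertools import accumulate
from typing import List


def running_success_rate(outcomes: List[int]) -> List[float]:
    """Return the cumulative success rate after each step of the stream."""
    return [total / (step + 1) for step, total in enumerate(accumulate(outcomes))]


# Made-up outcomes: 1 = success, 0 = failure, in arrival order.
outcomes = [0, 1, 1, 0, 1, 1, 1, 1]
curve = running_success_rate(outcomes)
print(curve[-1])  # 0.75
```

A curve that rises over the stream is what one would expect from an agent that benefits from accumulated cases.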

Experimental Results

Using Qwen3-32B as the base model, zero-shot prompting achieves 48.33% average success on 12 single-round tasks. The non-parametric baseline NP-CBR improves this to 63.76% (+15.43 points), and CASCADE further raises it to 66.68% (+2.92 points over NP-CBR), demonstrating that case reuse plus online retrieval learning yields tangible gains.

Compared with the parameter‑updating baseline REINFORCE+LoRA, CASCADE outperforms on 9 of the 12 tasks and matches the rest, while requiring less than 4 GB of GPU memory because the base LLM remains frozen.

Across model scales (Qwen3‑4B/8B/14B/32B) and even on the black‑box Gemini‑2.0‑flash model, CASCADE consistently boosts average success rates (e.g., from 56.58% to 72.58% on Gemini). In multi‑turn environments such as ALFWorld and ScienceWorld, CASCADE improves success from 62.01% to 67.43% and from 59.36% to 66.84%, respectively, and also enhances performance in web‑search and EHR table‑reasoning tasks.

Conclusion

CASCADE answers a growing question in LLM‑agent deployment: how can agents continuously learn from ongoing feedback while keeping the base model fixed? By framing deployment as an online learning problem, providing a principled case‑based retrieval strategy, and releasing the DTLBench suite, the work opens a research direction for stable, long‑term adaptation of large‑model agents in real‑world task streams.
