Enhancing Interactive Agents with Large Language Models: The SwiftSage Framework
This article reviews recent advances in using large language models for embodied interactive agents, introduces the dual‑modality SwiftSage architecture that combines a fast T5‑based small model with a powerful large model for planning and grounding, and evaluates its performance on benchmarks such as ScienceWorld.
The rise of large language models (LLMs) has shifted research attention from pure text generation to interactive agents that reason and act in dynamic, physical environments. Text-only interaction is inherently limited, motivating agents that can perceive and manipulate real-world objects, exemplified by tasks like moving a teapot onto a bookshelf.
Early benchmarks such as ALFWorld offered relatively simple tasks, while the more challenging ScienceWorld benchmark from the Allen Institute for AI (AI2) provides 30 tasks across 10 domains and thousands of objects with rich physical states, serving as a testbed for complex interactive reasoning.
Baseline approaches rely on reinforcement learning (e.g., DRRN) or behavior cloning, but suffer from large action spaces and poor long-term planning. Knowledge-augmented RL, SayCan, ReAct, and Reflexion improve performance by integrating LLMs for action prediction, self-reflection, and re-ranking, yet they incur high token costs because the full reasoning trace must be carried in every prompt.
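To make the token-cost point concrete, here is a minimal ReAct-style loop, paraphrased as an illustration rather than the paper's exact prompts: the LLM alternates a reasoning line ("Thought") with an environment command ("Action"), and every observation is appended back into the prompt. The names `call_llm`, `react_step`, and `env_observe` are illustrative stand-ins, not an actual API.

```python
def call_llm(prompt):
    # Stub: a real agent would query an LLM API here. We return a
    # canned response so the loop below is runnable as a sketch.
    return "Thought: find the teapot\nAction: look around"

def react_step(prompt, env_observe):
    """One ReAct iteration: ask the LLM, parse its Action, run it,
    and fold the observation back into the growing prompt."""
    reply = call_llm(prompt)
    action = next(line.split("Action:", 1)[1].strip()
                  for line in reply.splitlines() if "Action:" in line)
    observation = env_observe(action)
    # Every thought, action, and observation stays in the prompt,
    # which is why pure-LLM agents accumulate high token costs.
    return prompt + reply + f"\nObservation: {observation}\n", action
```

Because the prompt grows monotonically with each step, token usage scales with episode length, which is the overhead SwiftSage aims to cut.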
SwiftSage introduces a two-module framework: a lightweight T5-large model (the Swift module) predicts actions quickly, and when it encounters difficulty (e.g., a stagnating score or an infeasible action), a large LLM (the Sage module) is invoked for deliberate planning. The Sage stage separates planning (generating a high-level subgoal plan) from grounding (translating that plan into executable actions), reducing token usage while maintaining or improving scores.
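The switching logic described above can be sketched as a control loop. This is a hedged illustration under assumed names (`SwiftModule`, `SageModule`, `should_escalate`, `run_episode` are not the authors' actual API), with the T5 model and the LLM replaced by stubs so the loop is self-contained:

```python
from collections import deque

class SwiftModule:
    """Fast module: stands in for the fine-tuned T5 action predictor."""
    def predict(self, observation):
        # A real implementation would run a seq2seq model here.
        return f"action-for({observation})"

class SageModule:
    """Slow module: stands in for the LLM planner and grounder."""
    def plan_and_ground(self, observation, history):
        # Stage 1 (planning): ask the LLM for a high-level subgoal plan.
        # Stage 2 (grounding): translate the plan into executable actions.
        return [f"grounded-step-{i}" for i in range(3)]

def should_escalate(reward_history, action_feasible, window=3):
    """Hand over to the slow module when the fast module stalls:
    the last action was infeasible, or recent rewards show no progress."""
    if not action_feasible:
        return True
    recent = list(reward_history)[-window:]
    return len(recent) == window and all(r <= 0 for r in recent)

def run_episode(env_step, max_steps=10):
    swift, sage = SwiftModule(), SageModule()
    rewards = deque(maxlen=10)
    buffered = []          # action queue produced by the slow module
    trace = []
    obs, feasible = "start", True
    for _ in range(max_steps):
        if buffered:
            action = buffered.pop(0)      # drain Sage's grounded plan first
        elif should_escalate(rewards, feasible):
            buffered = sage.plan_and_ground(obs, trace)
            action = buffered.pop(0)
        else:
            action = swift.predict(obs)   # cheap default path
        obs, reward, feasible, done = env_step(action)
        rewards.append(reward)
        trace.append((action, reward))
        if done:
            break
    return trace
```

The design point is that the LLM is only queried at escalation moments, and its grounded plan is then executed by the cheap path, which is how the framework keeps token consumption low relative to prompting an LLM at every step.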
Extensive experiments on ScienceWorld show that SwiftSage outperforms the baselines in both effectiveness and efficiency, achieving higher scores at roughly one-third the token consumption of pure-LLM approaches. Limitations include dependence on oracle agents for training data, reliance on commercial LLM APIs, and the difficulty of obtaining fine-grained feedback from real robots.
Future directions propose expanding to more complex, real‑world tasks, distilling planning abilities into open‑source models, and tighter integration with robotic hardware for low‑level action execution.
The accompanying Q&A clarifies that the small model can be RL‑based but seq2seq offers more stability, and contrasts SwiftSage with Google’s multi‑agent social simulation, highlighting SwiftSage’s task‑oriented physical reasoning focus.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.