
Enhancing Interactive Agents with Large Language Models: The SwiftSage Framework

This article reviews recent advances in using large language models for embodied interactive agents, introduces the dual‑modality SwiftSage architecture that combines a fast T5‑based small model with a powerful large model for planning and grounding, and evaluates its performance on benchmarks such as ScienceWorld.


The popularity of large language models (LLMs) has shifted research from pure text generation toward interactive agents that reason and act in dynamic, physical environments. Traditional text‑only interactions are limited, prompting work on agents that can perceive and manipulate real‑world objects, exemplified by tasks like moving a teapot onto a bookshelf.

Early benchmarks such as ALFWorld offered relatively simple tasks, while the more challenging ScienceWorld benchmark from AI2 provides 30 tasks across 10 domains and thousands of objects with rich physical states, serving as a testbed for complex interactive reasoning.

Baseline approaches rely on reinforcement learning (e.g., DRRN) or behavior cloning, but suffer from large action spaces and limited long‑term planning. Knowledge‑augmented RL, SayCan, ReAct, and Reflexion improve performance by integrating LLMs for action prediction, self‑reflection, and re‑ranking, yet they still incur high token costs.

SwiftSage introduces a two‑stage framework: a lightweight T5‑large model (the fast "Swift" module) predicts actions quickly, and when it encounters difficulty (e.g., a stagnant score or an infeasible action), a large LLM (the slow "Sage" module) is invoked for detailed planning. The process separates planning (generating a high‑level plan) from grounding (translating the plan into executable environment actions), reducing token usage while maintaining or improving scores.
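The dual‑process control loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the callables `swift_predict` (fast small model) and `sage_plan` (slow LLM planner/grounder), the `infeasible` flag, and the score‑drop switching heuristic are all assumptions made for clarity.

```python
from collections import deque

def swiftsage_episode(env, swift_predict, sage_plan, max_steps=100):
    """Dual-process control loop in the spirit of SwiftSage (illustrative sketch).

    swift_predict(obs) -> action      # fast, small seq2seq model (hypothetical interface)
    sage_plan(obs) -> list of actions # slow LLM: plans, then grounds into an action buffer
    """
    buffer = deque()                       # actions grounded by the slow (Sage) module
    obs, score, prev_score = env.reset(), 0.0, 0.0
    for _ in range(max_steps):
        if buffer:
            action = buffer.popleft()      # drain Sage's grounded plan first
        else:
            action = swift_predict(obs)    # default path: fast Swift module
        obs, score, done, info = env.step(action)
        # Heuristic switch: invoke the LLM when Swift struggles,
        # e.g. the action was infeasible or the score dropped.
        if info.get("infeasible") or score < prev_score:
            buffer.clear()
            buffer.extend(sage_plan(obs))  # high-level plan -> grounded action sequence
        prev_score = score
        if done:
            break
    return score
```

Because Sage is only consulted at failure points and its plan is cached as a buffer of cheap-to-execute actions, most steps cost no LLM tokens at all, which is where the roughly two‑thirds token saving comes from.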

Extensive experiments on ScienceWorld show that SwiftSage outperforms baselines in both effectiveness and efficiency, achieving higher scores with roughly one‑third the token consumption of pure LLM approaches. Limitations include dependence on oracle agents for training data, reliance on commercial LLM APIs, and challenges in obtaining fine‑grained feedback from real robots.

Future directions propose expanding to more complex, real‑world tasks, distilling planning abilities into open‑source models, and tighter integration with robotic hardware for low‑level action execution.

The accompanying Q&A clarifies that the small model can be RL‑based but seq2seq offers more stability, and contrasts SwiftSage with Google’s multi‑agent social simulation, highlighting SwiftSage’s task‑oriented physical reasoning focus.

large language models · Benchmark · reinforcement learning · planning · AI2 · interactive agents · SwiftSage
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
