Rethinking LLM Agents: Stream Tool Outputs Directly to the Client

The article critiques the conventional LLM‑agent loop that forces every tool output back through the model, proposes a dual‑output architecture where tools stream multimedia events directly to the client while still returning a compact semantic result to the model, and demonstrates the design with Python code examples.


Why the Traditional Agent Model Falls Short

When building an LLM‑based agent, the common mental model is user → model thinking → optional tool call → model reply. This assumes that all tool output must return to the LLM before reaching the user. For short text results this works, but for music, speech, images, video, or any streaming data the model becomes an unnecessary bottleneck.

The LLM should act like a DJ, selecting tracks and orchestrating the flow, not like a singer that must reproduce every byte.

Current Tool‑Calling Workflow

OpenAI’s documented flow works like this: the application sends a prompt and tool definitions, the model decides to call a tool, the application executes the tool and returns the result to the model, and only then does the model continue its answer. Anthropic distinguishes between client‑side and server‑side tools, which highlights that tool execution can live outside the model.
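For reference, that conventional loop looks roughly like the following minimal sketch using the OpenAI Python SDK (the model name and the run_tool helper are illustrative stand-ins, not from the article):

from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Play something upbeat"}]
tools = [{
    "type": "function",
    "function": {
        "name": "play_generate_music",
        "description": "Generate and play music from a text prompt",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
}]

# 1. The model decides to call a tool.
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]

# 2. The application executes the tool and hands the result back to the model.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": run_tool(call)})  # run_tool is hypothetical

# 3. Only now does the model continue its answer; all tool output has passed through it.
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)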

Proposed Dual‑Output Architecture

Instead of a single return path, a tool can emit two outputs:

A compact semantic result that goes back to the LLM for reasoning.

An event stream (audio, video, UI cards, etc.) that is sent directly to the client.

This separates the model’s planning role from the user‑experience delivery, reducing latency and avoiding redundant re‑serialization.
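To make the split concrete, a single music tool call might produce these two outputs (the envelope fields follow the examples later in the article; the values are illustrative):

# Output 1: compact semantic result, returned to the LLM for reasoning.
semantic_result = "Generated music with prompt: 'upbeat synthwave'"

# Output 2: event stream, delivered straight to the client, bypassing the model.
client_events = [
    {"channel": "audio", "type": "MusicGenerationStarted", "message": "upbeat synthwave"},
    {"channel": "audio", "type": "MusicGenerationAudio", "data": "<base64 audio chunk>"},
    {"channel": "audio", "type": "MusicGenerationEnded"},
]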

Concrete Implementation in Python

The article provides runnable examples that use asyncio, contextvars, and AWS Polly to stream audio events to the client while still returning a short status string to the LLM.
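Both snippets assume a small event envelope and a request‑scoped send helper; a minimal sketch of those pieces, reconstructed from how they are used below (the exact definitions are assumptions), could be:

import asyncio
from contextvars import ContextVar
from dataclasses import dataclass

# One streaming queue per client request, carried via contextvars (Python 3.10+).
_queue_var: ContextVar[asyncio.Queue | None] = ContextVar("stream_queue", default=None)

@dataclass
class StreamCmd:
    data: dict  # e.g. {"channel": "audio", "type": "MusicGenerationAudio", "data": ...}

def get_queue_context() -> asyncio.Queue | None:
    return _queue_var.get()

def send(cmd: StreamCmd) -> None:
    # Push an event onto the current request's queue without blocking the tool.
    queue = get_queue_context()
    if queue is None:
        raise RuntimeError("send() requires a request-scoped queue context")
    queue.put_nowait(cmd)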

async def play_generate_music(prompt: str) -> str:
    try:
        # Announce the stream to the client before any audio arrives.
        send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationStarted", "message": prompt}))
        # Forward each audio chunk directly to the client as it is composed.
        async for audio_chunk in tts.compose(prompt):
            send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationAudio", "data": audio_chunk}))
        send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationEnded"}))
    except Exception as e:
        # Only this compact summary goes back to the LLM, never the audio bytes.
        return f"Failed to generate music with prompt: '{prompt}'. Error: {e!s}"
    else:
        return f"Generated music with prompt: '{prompt}'"

A similar speak function streams Polly‑generated speech events (SpeechStarted, SpeechAudio, SpeechEnded) to a request‑scoped queue.

import base64
import boto3

async def speak(text: str) -> str:
    # Requires the request-scoped queue set up by the orchestration layer.
    queue = get_queue_context()
    if not queue:
        raise RuntimeError("This function requires queue context")
    try:
        polly = boto3.client("polly", region_name="us-east-1")
        # Note: synthesize_speech is a blocking call; in production it belongs
        # in a thread executor so it does not stall the event loop.
        response = polly.synthesize_speech(Text=text, OutputFormat="mp3", VoiceId="Matthew", Engine="neural")
        audio_bytes = response["AudioStream"].read()
        queue.put_nowait(StreamCmd(data={"channel": "audio", "type": "SpeechStarted", "text": text}))
        queue.put_nowait(StreamCmd(data={"channel": "audio", "type": "SpeechAudio", "data": base64.b64encode(audio_bytes).decode()}))
        queue.put_nowait(StreamCmd(data={"channel": "audio", "type": "SpeechEnded"}))
    except Exception as e:
        return f"Failed to speak: {e!s}"
    else:
        return f"Spoke aloud: '{text}'"

The orchestration loop sends the user prompt to the LLM, receives tool calls, executes them, returns each compact semantic result to the model, and drains the streamed events from the queue to the client.
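A simplified version of that loop, under the same assumptions as the helper sketch above (llm.chat, TOOLS, TOOL_REGISTRY, and client_ws are all illustrative stand-ins, not names from the article):

async def handle_request(user_prompt: str, client_ws) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    _queue_var.set(queue)  # bind the streaming queue to this request's context

    async def drain_to_client() -> None:
        # Forward tool events to the client as soon as they arrive.
        while True:
            cmd = await queue.get()
            await client_ws.send_json(cmd.data)

    drain_task = asyncio.create_task(drain_to_client())
    try:
        messages = [{"role": "user", "content": user_prompt}]
        response = await llm.chat(messages=messages, tools=TOOLS)
        for call in response.tool_calls or []:
            # The tool streams its events to `queue` itself; only the compact
            # semantic result re-enters the conversation.
            result = await TOOL_REGISTRY[call.name](**call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
        final = await llm.chat(messages=messages, tools=TOOLS)
        await client_ws.send_json({"channel": "text", "type": "Reply", "text": final.content})
    finally:
        drain_task.cancel()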

Design Trade‑offs

Moving streaming to the client raises several system‑design questions:

Which events are visible to end users versus developers?

How should summaries be mirrored back to the LLM?

How should cancellation, retries, and back‑pressure be handled for long‑running binary streams?

What happens if a tool streams partial output then fails?

Answering these requires robust runtime rules, often implemented with contextvars to provide request‑scoped access to the streaming queue.
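For the last question, one concrete runtime rule can be sketched under the same assumptions as above: if a tool fails mid‑stream, emit a terminal failure event so the client can discard the partial output, while the model receives only a compact failure summary (the MusicGenerationFailed event type is an assumption, not from the article):

async def play_generate_music_guarded(prompt: str) -> str:
    send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationStarted", "message": prompt}))
    try:
        async for audio_chunk in tts.compose(prompt):  # may fail partway through
            send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationAudio", "data": audio_chunk}))
    except Exception as e:
        # Tell the client the partial stream is invalid; tell the model why.
        send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationFailed", "reason": str(e)}))
        return f"Music generation failed: {e!s}"
    send(StreamCmd(data={"channel": "audio", "type": "MusicGenerationEnded"}))
    return f"Generated music with prompt: '{prompt}'"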

Conclusion

The proposed pattern expands the agent’s capability from a simple function‑call model to a true multimodal runtime where tools can directly affect the user experience while still participating in the reasoning loop. This approach aligns with emerging frameworks such as MCP, Google ADK, and return_direct optimizations, and it offers a clear path toward lower latency and richer interactions.
