How to Add Real‑Time Speech Recognition and Streaming TTS to Your AI Agent
This guide walks through choosing the right voice‑agent architecture, implementing streaming ASR over WebSocket, triggering TTS sentence by sentence, wiring the three layers together with async generators, pushing latency under one second, and avoiding common pitfalls such as missing VAD and non‑persistent checkpoints.
01 Architecture trade‑off: choose based on your biggest fear
This guide compares two ways to give an agent voice capabilities. End‑to‑end (speech‑to‑speech, S2S) sends audio directly to a multimodal model (e.g., gpt‑4o‑audio‑preview) and receives audio back in one step. The sandwich architecture separates the pipeline into STT → LangGraph Agent → TTS. The trade‑offs:
End‑to‑end: simple, low latency for short interactions, preserves tone, but limited model choices, weak tool use, hard to debug.
Sandwich: each layer can use different providers, full text‑model tool support, transparent debugging, but adds coordination complexity and loses tone information.
A decision matrix (tool calls, emotion preservation, model replaceability, rapid prototyping, production stability, latency) shows that for production‑grade agents the sandwich architecture is preferred.
02 STT core – streaming is the correct approach
Sending a whole recording to Whisper and waiting for the full transcript creates a 3‑4 s serial delay (recording → STT → Agent → TTS). The proper solution is a producer‑consumer streaming model where audio is sent to the STT service chunk by chunk and the transcript is emitted as soon as it is available.
Example using AssemblyAI’s streaming WebSocket API (Solution 1), followed by a chunked Whisper fallback (Solution 2):
import WebSocket from 'ws';
import OpenAI from 'openai';
// ===================== Solution 1: AssemblyAI streaming =====================
interface STTEvent {
type: 'stt_chunk' | 'stt_output';
transcript: string;
ts: number;
}
class AssemblyAIStreaming {
private ws: WebSocket | null = null;
private buffer: STTEvent[] = [];
private resolve: ((event: STTEvent) => void) | null = null;
async connect(apiKey: string, sampleRate = 16000): Promise<void> {
const params = new URLSearchParams({ sample_rate: sampleRate.toString(), format_turns: 'true' });
this.ws = new WebSocket(`wss://streaming.assemblyai.com/v3/ws?${params}`, { headers: { Authorization: apiKey } });
this.ws.on('message', (data: Buffer) => {
const msg = JSON.parse(data.toString());
if (msg.type === 'Turn') {
const event: STTEvent = {
type: msg.turn_is_formatted ? 'stt_output' : 'stt_chunk',
transcript: msg.transcript,
ts: Date.now()
};
if (this.resolve) { this.resolve(event); this.resolve = null; }
else { this.buffer.push(event); }
}
});
await new Promise<void>(resolve => this.ws!.on('open', () => resolve()));
}
sendAudio(chunk: Buffer): void { this.ws?.send(chunk); }
async *receiveEvents(): AsyncGenerator<STTEvent> {
while (true) {
if (this.buffer.length > 0) { yield this.buffer.shift()!; }
else { yield await new Promise<STTEvent>(resolve => this.resolve = resolve); }
}
}
close(): void { this.ws?.close(); }
}
// ===================== Solution 2: OpenAI Whisper chunked =====================
const openai = new OpenAI();
async function* chunkingSTT(audioStream: AsyncIterable<Buffer>): AsyncGenerator<string> {
const CHUNK_INTERVAL_MS = 2000;
let buffer: Buffer[] = [];
let lastFlushTime = Date.now();
for await (const chunk of audioStream) {
buffer.push(chunk);
if (Date.now() - lastFlushTime >= CHUNK_INTERVAL_MS) {
const combined = Buffer.concat(buffer);
buffer = [];
lastFlushTime = Date.now();
const result = await openai.audio.transcriptions.create({
// Note: the transcription endpoint expects a supported container format (e.g. WAV);
// in practice raw PCM chunks need a WAV header added before upload.
file: new File([combined], 'audio.pcm', { type: 'audio/pcm' }),
model: 'whisper-1',
language: 'zh',
response_format: 'text',
});
if (result.trim()) yield result;
}
}
if (buffer.length > 0) {
const result = await openai.audio.transcriptions.create({
file: new File([Buffer.concat(buffer)], 'audio.pcm', { type: 'audio/pcm' }),
model: 'whisper-1',
language: 'zh',
response_format: 'text',
});
if (result.trim()) yield result;
}
}
03 TTS core – sentence‑triggered streaming
Waiting for the whole LLM reply before sending it to TTS adds another round‑trip delay. The correct pattern is to feed each completed sentence to the TTS engine as soon as a sentence‑ending punctuation appears.
import OpenAI from 'openai';
const openai = new OpenAI();
// Split token stream into sentences
async function* sentenceSplitter(tokenStream: AsyncIterable<string>): AsyncGenerator<string> {
const END_CHARS = new Set(['。', '！', '？', '.', '!', '?', '\n']);
let buffer = '';
for await (const token of tokenStream) {
buffer += token;
if (END_CHARS.has(buffer[buffer.length - 1]) && buffer.trim().length > 0) {
yield buffer.trim();
buffer = '';
}
}
if (buffer.trim().length > 0) yield buffer.trim();
}
// Stream TTS for a single sentence (use tts‑1, not tts‑1‑hd)
async function* ttsStream(text: string): AsyncGenerator<Buffer> {
const response = await openai.audio.speech.create({
model: 'tts-1', // streaming‑optimized model
voice: 'alloy',
input: text,
response_format: 'pcm', // PCM can be played directly
});
const reader = response.body!.getReader();
while (true) {
const { done, value } = await reader.read();
if (done) break;
yield Buffer.from(value);
}
}
// Connect agent token stream → sentence splitter → TTS stream
async function* agentToSpeech(agentTokenStream: AsyncIterable<string>): AsyncGenerator<Buffer> {
for await (const sentence of sentenceSplitter(agentTokenStream)) {
for await (const audioChunk of ttsStream(sentence)) {
yield audioChunk;
}
}
}
With this setup the audio for the first sentence starts playing immediately after the LLM emits it.
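As a quick sanity check, the chain can be driven with a mock token stream before the real agent is wired in (a sketch; mockTokens is a hypothetical stand-in for the agent's token output, and the top-level for await assumes an ES module):
// Sketch: exercise agentToSpeech with a hypothetical mock token stream.
async function* mockTokens(): AsyncGenerator<string> {
  for (const t of ['今天', '天气', '很好。', '适合', '出门', '散步。']) yield t;
}
for await (const pcmChunk of agentToSpeech(mockTokens())) {
  // In a real app the chunk would be forwarded to the client or an audio device.
  console.log(`received ${pcmChunk.length} bytes of PCM audio`);
}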
04 Complete sandwich pipeline – chaining three layers with AsyncGenerator
The three layers are wired together using async generators so that each layer’s output becomes the next layer’s input.
import { createReactAgent } from '@langchain/langgraph/prebuilt';
import { ChatOpenAI } from '@langchain/openai';
import { MemorySaver } from '@langchain/langgraph';
import { HumanMessage } from '@langchain/core/messages';
class VoiceAgentPipeline {
private agent: ReturnType<typeof createReactAgent>;
private memory = new MemorySaver();
private threadId: string;
constructor() {
this.agent = createReactAgent({
llm: new ChatOpenAI({ model: 'gpt-4o', streaming: true }),
tools: [], // add your tools here
checkpointSaver: this.memory,
});
this.threadId = `voice-session-${Date.now()}`;
}
// Agent receives transcript and yields token stream
private async *agentStream(transcript: string): AsyncGenerator<string> {
const stream = await this.agent.stream({ messages: [new HumanMessage(transcript)] },
{ configurable: { thread_id: this.threadId }, streamMode: 'messages' });
for await (const [message] of stream) {
if (message.content && typeof message.content === 'string' && message.getType() === 'ai') {
yield message.content;
}
}
}
// Full turn: audio → STT → Agent → TTS → audio
async *processTurn(audioStream: AsyncIterable<Buffer>): AsyncGenerator<Buffer> {
const stt = new AssemblyAIStreaming();
await stt.connect(process.env.ASSEMBLYAI_API_KEY!);
const producer = (async () => {
for await (const chunk of audioStream) { stt.sendAudio(chunk); }
stt.close();
})();
for await (const event of stt.receiveEvents()) {
if (event.type === 'stt_output') {
// Only finalized (formatted) turns trigger the agent; interim chunks are ignored here.
yield* agentToSpeech(this.agentStream(event.transcript));
}
}
// Note: receiveEvents() loops forever; in production, break out once the STT socket closes
// (e.g. after audio_end) so this generator can finish and `await producer` is reached.
await producer;
}
}
The overall data flow is:
Microphone audio → STT WebSocket (real‑time text) → LangGraph Agent (token stream) → Sentence splitter → OpenAI TTS (audio chunks) → Speaker playback.
05 WebSocket service – browser ↔ server real‑time communication
Because the browser must continuously send audio and receive synthesized audio, a bidirectional WebSocket server is required. The Node.js example below receives base64‑encoded audio chunks, runs the pipeline, and streams back base64‑encoded PCM chunks.
import { WebSocketServer, WebSocket } from 'ws';
function createVoiceServer(port = 8765): void {
const wss = new WebSocketServer({ port });
wss.on('connection', ws => {
const pipeline = new VoiceAgentPipeline();
const audioQueue: Buffer[] = [];
let audioResolve: (() => void) | null = null;
let sessionEnded = false;
async function* audioStream(): AsyncGenerator<Buffer> {
while (!sessionEnded || audioQueue.length > 0) {
if (audioQueue.length > 0) { yield audioQueue.shift()!; }
else { await new Promise<void>(r => { audioResolve = r; }); }
}
}
ws.on('message', async data => {
const msg = JSON.parse(data.toString());
if (msg.type === 'audio_chunk' && msg.data) {
audioQueue.push(Buffer.from(msg.data, 'base64'));
audioResolve?.(); audioResolve = null;
} else if (msg.type === 'audio_end') {
sessionEnded = true; audioResolve?.();
}
});
(async () => {
for await (const chunk of pipeline.processTurn(audioStream())) {
if (ws.readyState === WebSocket.OPEN) {
ws.send(JSON.stringify({ type: 'audio_response', data: chunk.toString('base64') }));
}
}
ws.send(JSON.stringify({ type: 'response_end' }));
})().catch(console.error);
});
console.log(`Voice Agent service running at ws://localhost:${port}`);
}
createVoiceServer();
On the client side, the browser captures microphone audio, sends it over the WebSocket, and plays back the received PCM chunks with the Web Audio API (AudioContext). Note that the AssemblyAI stream expects 16 kHz mono PCM, while MediaRecorder produces Opus by default, so the audio must be captured or converted as raw PCM.
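Below is a minimal browser-side sketch of such a client. It is an outline under stated assumptions rather than a reference implementation: it targets the server above at ws://localhost:8765, captures raw 16 kHz PCM through an AudioContext and a ScriptProcessorNode (sidestepping MediaRecorder's Opus output), and assumes the returned TTS PCM is 24 kHz, 16-bit, mono.
// Browser-side sketch: capture 16 kHz PCM, stream it to the voice server, play back TTS PCM.
async function startVoiceClient(): Promise<void> {
  const ws = new WebSocket('ws://localhost:8765');
  const ctx = new AudioContext({ sampleRate: 16000 });              // capture at 16 kHz mono
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);          // deprecated but simple; AudioWorklet is the modern route

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      // Float32 [-1, 1] → 16-bit signed PCM
      int16[i] = Math.max(-32768, Math.min(32767, Math.round(float32[i] * 32767)));
    }
    let binary = '';
    for (const b of new Uint8Array(int16.buffer)) binary += String.fromCharCode(b);
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify({ type: 'audio_chunk', data: btoa(binary) }));
    }
  };
  source.connect(processor);
  processor.connect(ctx.destination);                               // keeps onaudioprocess firing

  // Queue returned PCM chunks back-to-back for gapless playback.
  let playCursor = ctx.currentTime;
  ws.onmessage = (event) => {
    const msg = JSON.parse(String(event.data));
    if (msg.type !== 'audio_response') return;
    const bytes = Uint8Array.from(atob(msg.data), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = Float32Array.from(int16, v => v / 32768);
    const buffer = ctx.createBuffer(1, float32.length, 24000);      // assumed TTS output rate: 24 kHz
    buffer.copyToChannel(float32, 0);
    const node = ctx.createBufferSource();
    node.buffer = buffer;
    node.connect(ctx.destination);
    playCursor = Math.max(playCursor, ctx.currentTime);
    node.start(playCursor);
    playCursor += buffer.duration;
  };
}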
06 Latency optimisation – how sub‑second response is achieved
Unoptimised latency breakdown in production: STT ≈ 800 ms (waiting for the utterance to end), Agent ≈ 1200 ms (waiting for the full reply before TTS starts), TTS ≈ 600 ms (waiting for the full text before synthesis) → total ≈ 2600 ms.
After applying the streaming and model choices described earlier, the numbers become: streaming STT (≈ 0 ms wait), Agent first token ≈ 800 ms, TTS first audio ≈ 200 ms → total perceived latency 800‑1000 ms.
Key optimisation items and their gains:
STT: switch from whole‑file upload to WebSocket streaming (saves 600‑1000 ms).
TTS trigger: emit on first sentence‑ending punctuation instead of waiting for full reply (saves > 400 ms).
TTS model: use tts-1 instead of tts-1‑hd (≈ 50 % lower latency).
Audio format: PCM 16 kHz mono avoids MP3 decoding (saves 50‑100 ms).
Chinese recognition: set language: 'zh' (improves accuracy 15‑20 %).
07 Common pitfalls and remedies
No VAD: the system cannot tell when the user has finished speaking, so nothing happens during long silences. Use hark.js to emit audio_end after roughly 500 ms of silence (see the sketch after this list).
TTS interrupted by tool calls: the token stream pauses mid‑sentence and TTS receives half‑sentences. Add a timeout flush to sentenceSplitter (e.g., 300 ms without a new token forces the buffer out; see the sketch after this list).
Memory not persisted: MemorySaver is in‑memory only, so a restart loses all conversation state. Switch to @langchain/langgraph-checkpoint-postgres for durable checkpoints (see the sketch after this list).
Low accuracy on Chinese with Whisper: usually caused by omitting the language: 'zh' parameter. Always set it; accuracy improves by 15-20 %.
ThreadId collisions in multi‑user scenarios: sharing a single threadId mixes different users' memories. Bind a unique threadId to each user (e.g., `user-${userId}-voice`).
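The first three fixes are easiest to see in code. First, a sketch of the VAD remedy on the browser side, assuming the section‑05 client and hark's 'speaking' / 'stopped_speaking' events (the 500 ms debounce is layered on top, since hark itself only reports state changes):
import hark from 'hark';
// Sketch: voice activity detection with hark; emits audio_end after ~500 ms of silence.
function attachVAD(stream: MediaStream, ws: WebSocket, silenceMs = 500): void {
  const speech = hark(stream, { interval: 100 });
  let silenceTimer: ReturnType<typeof setTimeout> | null = null;
  speech.on('stopped_speaking', () => {
    // Only treat it as end-of-utterance if the user stays silent for silenceMs.
    silenceTimer = setTimeout(() => {
      if (ws.readyState === WebSocket.OPEN) {
        ws.send(JSON.stringify({ type: 'audio_end' }));
      }
    }, silenceMs);
  });
  speech.on('speaking', () => {
    // Speech resumed before the timeout, so cancel the pending audio_end.
    if (silenceTimer) { clearTimeout(silenceTimer); silenceTimer = null; }
  });
}
Second, a timeout‑flush variant of sentenceSplitter (a sketch; 300 ms without a new token forces the buffered fragment out so TTS is not starved during a tool call):
// Sketch: sentenceSplitter with a timeout flush for pauses caused by tool calls.
async function* sentenceSplitterWithFlush(
  tokenStream: AsyncIterable<string>,
  flushMs = 300
): AsyncGenerator<string> {
  const END_CHARS = new Set(['。', '！', '？', '.', '!', '?', '\n']);
  const it = tokenStream[Symbol.asyncIterator]();
  let buffer = '';
  let pending: Promise<IteratorResult<string>> | null = null;
  while (true) {
    pending = pending ?? it.next();                      // reuse the in-flight next() across timeouts
    const timeout = new Promise<'timeout'>(r => setTimeout(() => r('timeout'), flushMs));
    const result = await Promise.race([pending, timeout]);
    if (result === 'timeout') {
      // No token for flushMs; force out the partial sentence and keep waiting.
      if (buffer.trim()) { yield buffer.trim(); buffer = ''; }
      continue;
    }
    pending = null;
    if (result.done) break;
    buffer += result.value;
    if (END_CHARS.has(buffer[buffer.length - 1]) && buffer.trim()) {
      yield buffer.trim();
      buffer = '';
    }
  }
  if (buffer.trim()) yield buffer.trim();
}
Finally, the durable‑checkpoint swap, assuming the PostgresSaver API of @langchain/langgraph-checkpoint-postgres and a DATABASE_URL environment variable:
import { PostgresSaver } from '@langchain/langgraph-checkpoint-postgres';
// Sketch: Postgres-backed checkpointer so conversation state survives restarts.
const checkpointer = PostgresSaver.fromConnString(process.env.DATABASE_URL!);
await checkpointer.setup();   // creates the checkpoint tables on first use
// Pass it to createReactAgent in place of MemorySaver:
// createReactAgent({ llm, tools, checkpointSaver: checkpointer });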
Conclusion
The guide dissects voice‑agent construction from architecture selection to end‑to‑end implementation. Six takeaways:
Pick the sandwich architecture for production stability and sub‑second latency.
Use streaming STT (WebSocket) to avoid waiting for utterance completion.
Trigger TTS on each sentence boundary for immediate playback.
Pay attention to PCM format, tts-1 model, and explicit language: 'zh' parameter.
Include a VAD component to detect speech pauses.
Persist agent checkpoints to a database; MemorySaver is only for demos.
The next article will combine all modules (LangChain, LangGraph, RAG, tools, memory) into a full‑stack production AI assistant.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.