Building a Web Voice Chatbot with Whisper, llama.cpp, and LLM
This article shows how to build a web‑based voice chatbot that combines Whisper speech‑to‑text, llama.cpp LLM inference, and WebSocket communication. It walks through the frontend JavaScript implementation, the Python FastAPI backend, and Docker deployment, with complete example code.
System Overview
Large language models (LLMs) provide powerful text‑based dialogue capabilities. This article extends them to voice dialogue by using Whisper for speech recognition and llama.cpp for LLM inference, constructing a web‑based voice chatbot. The interaction loop has four steps:
1. The user provides voice input.
2. The speech is recognized and converted to text.
3. The text is fed to the LLM, which generates a textual response.
4. The response text is synthesized back to speech and played to the user.
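Before diving into the real components, the loop above can be sketched with stub functions. Everything below is an illustrative placeholder, not an API from Whisper or llama.cpp; the stubs stand in for the three stages wired together later in the article.

```python
# A runnable sketch of the voice-chat loop with stub stages.
# All function names and return values here are placeholders.
def transcribe(audio_bytes: bytes) -> str:
    # Real system: Whisper speech-to-text
    return "hello"

def generate_reply(text: str) -> str:
    # Real system: llama.cpp chat completion
    return f"You said: {text}"

def synthesize(text: str) -> bytes:
    # Real system: the browser's SpeechSynthesis API
    return text.encode("utf-8")

def voice_chat_turn(audio_bytes: bytes) -> bytes:
    text = transcribe(audio_bytes)   # 1. speech -> text
    reply = generate_reply(text)     # 2. text -> LLM reply
    return synthesize(reply)         # 3. reply -> speech

print(voice_chat_turn(b"\x00\x01"))  # -> b'You said: hello'
```

In the real system the first stage runs server‑side (Whisper), the second calls llama.cpp over HTTP, and the third runs back in the browser.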
System Implementation
The choice of client platform (web, desktop, mobile) determines how audio is captured and played back. In this example we use the web client, leveraging the browser's built‑in audio capture and playback APIs for interaction.
Web Frontend
The frontend is implemented with HTML5 and JavaScript. It uses the browser's MediaRecorder API for audio capture and the SpeechSynthesis API for playback. Below is a simplified code example.
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Voice Chat AI</title>
  <style>
    #loading { display: none; font-weight: bold; color: blue; }
    #response { white-space: pre-wrap; }
  </style>
</head>
<body>
  <h1>Voice Chat AI</h1>
  <button id="start">Start Recording</button>
  <button id="stop" disabled>Stop Recording</button>
  <p id="loading">Loading...</p>
  <p>AI Response: <span id="response"></span></p>
  <script>
    let audioContext, mediaRecorder;
    const startButton = document.getElementById("start");
    const stopButton = document.getElementById("stop");
    const responseElement = document.getElementById("response");
    const loadingElement = document.getElementById("loading");

    let socket = new WebSocket("ws://localhost:8765/ws");

    socket.onmessage = (event) => {
      const data = JSON.parse(event.data);
      const inputText = data.input || "No input detected";
      responseElement.textContent += `\nUser said: ${inputText}`;
      const aiResponse = data.response || "No response from AI";
      responseElement.textContent += `\nAI says: ${aiResponse}\n`;
      loadingElement.style.display = "none";
      const utterance = new SpeechSynthesisUtterance(aiResponse);
      speechSynthesis.speak(utterance);
    };

    socket.onerror = (error) => {
      console.error("WebSocket error:", error);
      loadingElement.style.display = "none";
    };

    startButton.addEventListener("click", async () => {
      audioContext = new (window.AudioContext || window.webkitAudioContext)();
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      mediaRecorder = new MediaRecorder(stream);
      const audioChunks = [];
      mediaRecorder.ondataavailable = (event) => {
        audioChunks.push(event.data);
      };
      mediaRecorder.onstop = () => {
        const audioBlob = new Blob(audioChunks, { type: "audio/webm" });
        loadingElement.style.display = "block";
        socket.send(audioBlob);
      };
      mediaRecorder.start();
      startButton.disabled = true;
      stopButton.disabled = false;
    });

    stopButton.addEventListener("click", () => {
      mediaRecorder.stop();
      startButton.disabled = false;
      stopButton.disabled = true;
    });
  </script>
</body>
</html>

The client sends the recorded audio to the backend over a WebSocket, receives the transcribed text and the LLM response, and uses the browser's SpeechSynthesis API to speak the answer aloud.
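The message contract between the two sides is small: one binary frame of audio upstream, one JSON object with `input` and `response` fields downstream. As a sketch, the client's parsing logic (mirroring the JavaScript `socket.onmessage` handler above, here in Python) looks like this; `parse_reply` is an illustrative name, not part of the article's code:

```python
import json

# Mirror of the browser's socket.onmessage handler: parse the backend's
# JSON frame and fall back to defaults, exactly as the JS code does.
def parse_reply(frame: str) -> tuple[str, str]:
    data = json.loads(frame)
    user_text = data.get("input") or "No input detected"
    ai_text = data.get("response") or "No response from AI"
    return user_text, ai_text

print(parse_reply('{"input": "hi", "response": "hello"}'))  # -> ('hi', 'hello')
```

Keeping the protocol this small makes it easy to swap either side out, for example replacing the browser client with a native app.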
WebSocket Server (Backend)
The backend is built with Python and FastAPI, exposing a WebSocket endpoint. Whisper performs speech‑to‑text, and llama.cpp runs the LLM (e.g., Llama 3.2 1B) to generate responses.
from fastapi import FastAPI, WebSocket
import uvicorn
import whisper
import tempfile
import os
import signal

app = FastAPI()

# Load the Whisper model; when WHISPER_MODEL is unset, the model is
# cached in the default location (~/.cache/whisper)
model = whisper.load_model("base", download_root=os.environ.get("WHISPER_MODEL"))

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    try:
        await websocket.accept()
        while True:
            # Receive one binary frame of recorded audio
            audio_data = await websocket.receive_bytes()
            # Save it to a temporary file; Whisper decodes it via ffmpeg,
            # which must be installed for .webm input
            with tempfile.NamedTemporaryFile(delete=False, suffix=".webm") as temp_audio:
                temp_audio.write(audio_data)
                temp_audio_path = temp_audio.name
            # Whisper transcription
            result = model.transcribe(temp_audio_path)
            os.remove(temp_audio_path)
            text = result["text"]
            print("user input:", text)
            # Generate the AI reply (LLMResponse is a placeholder for llama.cpp inference)
            response_text = LLMResponse(text)
            print("AI response:", response_text)
            await websocket.send_json({"input": text, "response": response_text})
    except Exception as e:
        print("Error:", e)

def handle_shutdown(signal_num, frame):
    print(f"Received shutdown signal: {signal_num}")

def setup_signal_handlers():
    signal.signal(signal.SIGTERM, handle_shutdown)
    signal.signal(signal.SIGINT, handle_shutdown)

if __name__ == "__main__":
    setup_signal_handlers()
    config = uvicorn.Config("main:app", port=8765, log_level="info")
    server = uvicorn.Server(config)
    server.run()

llama.cpp can be run inside a Docker container to expose an HTTP LLM service. The following command starts the container with the desired model.
docker run -p 8080:8080 -v ~/ai-models:/models \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/llama3.2-1B.gguf -c 512 \
  --host 0.0.0.0 --port 8080

An example Python client that calls the HTTP endpoint is shown below.
import requests
import json

class LlamaCppClient:
    def __init__(self, host="http://localhost", port=8080):
        self.base_url = f"{host}:{port}"

    def completion(self, prompt):
        url = f"{self.base_url}/v1/chat/completions"
        headers = {"Content-Type": "application/json"}
        payload = {
            "messages": [
                {
                    "role": "system",
                    "content": "You are a friendly conversation partner. "
                               "Be natural, engaging, and helpful in our discussions. "
                               "Respond to questions clearly and follow the conversation flow naturally.",
                },
                {"role": "user", "content": prompt},
            ]
        }
        try:
            response = requests.post(url, headers=headers, data=json.dumps(payload))
            response.raise_for_status()
            # Return just the assistant's text, so completion() can back the
            # LLMResponse() placeholder in the WebSocket server
            return response.json()["choices"][0]["message"]["content"]
        except requests.exceptions.RequestException as e:
            return f"Error: {e}"

Conclusion
By combining the browser's speech capture/playback capabilities, Whisper's speech‑to‑text conversion, and llama.cpp's LLM inference, we successfully built a voice‑driven conversational system. Such a system can be applied to language tutoring, voice‑based Q&A, and other interactive AI scenarios.
References
https://github.com/openai/whisper
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
https://github.com/fastapi/fastapi
https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisUtterance