
Building a Web Voice Chatbot with Whisper, llama.cpp, and LLM

This article shows how to build a web‑based voice chatbot by combining Whisper speech‑to‑text, llama.cpp LLM inference, and WebSocket communication. It walks through the frontend JavaScript implementation, the Python FastAPI backend, Docker deployment of the model server, and complete example code.

System Architect Go

System Overview

Large language models (LLMs) provide powerful text‑based dialogue capabilities. This article shows how to extend them to voice dialogue by using Whisper for speech recognition and llama.cpp for LLM inference, building a web‑based voice chatbot. The flow is:

1. The user provides voice input.

2. The speech is recognized and converted to text.

3. The text is fed to the LLM, which generates a textual response.

4. The response text is synthesized back to speech and played to the user.
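Conceptually, one conversational turn reduces to a single pipeline. Here is a minimal Python sketch; `transcribe` and `generate` are placeholders standing in for the Whisper and llama.cpp calls shown later, and the returned dict matches the `{"input": ..., "response": ...}` message the backend sends to the browser.

```python
# Minimal sketch of one voice-chat turn; transcribe() and generate()
# are placeholders for the Whisper and llama.cpp calls shown later.

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: the real code hands the audio to Whisper
    return "hello"

def generate(text: str) -> str:
    # Placeholder: the real code queries the llama.cpp server
    return f"You said: {text}"

def handle_turn(audio_bytes: bytes) -> dict:
    """One conversational turn: audio in, JSON-ready message out."""
    text = transcribe(audio_bytes)
    reply = generate(text)
    return {"input": text, "response": reply}
```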

System Implementation

The choice of client platform (web, desktop, mobile) determines how audio is captured and played back. In this example we use the web client, leveraging the browser's built‑in audio capture and playback APIs for interaction.

Web Frontend

The frontend is implemented with HTML5 and JavaScript. It uses the MediaRecorder API for audio capture and the SpeechSynthesis API for spoken output. Below is a simplified code example.

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Voice Chat AI</title>
  <style>
    #loading { display: none; font-weight: bold; color: blue; }
    #response { white-space: pre-wrap; }
  </style>
</head>
<body>
  <h1>Voice Chat AI</h1>
  <button id="start">Start Recording</button>
  <button id="stop" disabled>Stop Recording</button>
  <p id="loading">Loading...</p>
  <p>AI Response: <span id="response"></span></p>
  <script>
    let mediaRecorder, mediaStream;
    const startButton = document.getElementById("start");
    const stopButton = document.getElementById("stop");
    const responseElement = document.getElementById("response");
    const loadingElement = document.getElementById("loading");
    let socket = new WebSocket("ws://localhost:8765/ws");
    socket.onmessage = (event) => {
      const data = JSON.parse(event.data);
      const inputText = data.input || "No input detected";
      responseElement.textContent += `\nUser said: ${inputText}`;
      const aiResponse = data.response || "No response from AI";
      responseElement.textContent += `\nAI says: ${aiResponse}\n`;
      loadingElement.style.display = "none";
      const utterance = new SpeechSynthesisUtterance(aiResponse);
      speechSynthesis.speak(utterance);
    };
    socket.onerror = (error) => {
      console.error("WebSocket error:", error);
      loadingElement.style.display = "none";
    };
    startButton.addEventListener("click", async () => {
      // Request microphone access and start recording
      mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
      mediaRecorder = new MediaRecorder(mediaStream);
      const audioChunks = [];
      mediaRecorder.ondataavailable = (event) => {
        audioChunks.push(event.data);
      };
      mediaRecorder.onstop = () => {
        const audioBlob = new Blob(audioChunks, { type: "audio/webm" });
        loadingElement.style.display = "block";
        socket.send(audioBlob);
        // Release the microphone once the clip has been sent
        mediaStream.getTracks().forEach((track) => track.stop());
      };
      mediaRecorder.start();
      startButton.disabled = true;
      stopButton.disabled = false;
    });
    stopButton.addEventListener("click", () => {
      mediaRecorder.stop();
      startButton.disabled = false;
      stopButton.disabled = true;
    });
  </script>
</body>
</html>

The client sends recorded audio via WebSocket to the backend, receives the transcribed text and LLM response, and uses the browser's SpeechSynthesis API to play the answer.

WebSocket Server (Backend)

The backend is built with Python, FastAPI, and WebSockets. Whisper performs speech‑to‑text, and llama.cpp serves the LLM (e.g., Llama 3.2 1B) that generates responses.

from fastapi import FastAPI, WebSocket
import uvicorn
import whisper
import tempfile
import os
import signal

app = FastAPI()

# Load the Whisper "base" model (cached under ~/.cache/whisper by default)
model = whisper.load_model("base")

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    try:
        await websocket.accept()
        while True:
            # Receive audio data
            audio_data = await websocket.receive_bytes()
            # Save temporary audio file
            with tempfile.NamedTemporaryFile(delete=False, suffix=".webm") as temp_audio:
                temp_audio.write(audio_data)
                temp_audio_path = temp_audio.name
            # Whisper transcription
            result = model.transcribe(temp_audio_path)
            os.remove(temp_audio_path)
            text = result["text"]
            print("user input:", text)
            # Generate AI reply (LLMResponse is a placeholder for llama.cpp inference)
            response_text = LLMResponse(text)
            print("AI response:", response_text)
            await websocket.send_json({"input": text, "response": response_text})
    except Exception as e:
        print("Error:", e)

def handle_shutdown(signal_num, frame):
    print(f"Received shutdown signal: {signal_num}")

def setup_signal_handlers():
    signal.signal(signal.SIGTERM, handle_shutdown)
    signal.signal(signal.SIGINT, handle_shutdown)

if __name__ == "__main__":
    setup_signal_handlers()
    # "main:app" assumes this file is saved as main.py
    config = uvicorn.Config("main:app", port=8765, log_level="info")
    server = uvicorn.Server(config)
    server.run()
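The `LLMResponse` placeholder above can be filled in by calling the llama.cpp HTTP server described next. The sketch below uses only the Python standard library; the endpoint path and response shape follow llama.cpp's OpenAI‑compatible chat API, and `extract_reply` is a helper name introduced here for illustration.

```python
import json
import urllib.request

def extract_reply(response_json: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_json["choices"][0]["message"]["content"]

def LLMResponse(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """Query the llama.cpp server's OpenAI-compatible chat endpoint."""
    payload = json.dumps({
        "messages": [{"role": "user", "content": prompt}]
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_reply(json.load(resp))
```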

llama.cpp can be run inside a Docker container to expose an HTTP LLM service. The following command starts the container with the desired model.

docker run -p 8080:8080 -v ~/ai-models:/models \
    ghcr.io/ggerganov/llama.cpp:server \
    -m /models/llama3.2-1B.gguf -c 512 \
    --host 0.0.0.0 --port 8080
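Before wiring the backend to the container, it helps to confirm the server is reachable. The llama.cpp server exposes a `/health` endpoint (see its README); a small stdlib check might look like this, where `server_ready` is a helper name introduced here:

```python
import urllib.request
import urllib.error

def server_ready(base_url: str = "http://localhost:8080",
                 timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```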

An example Python client that calls the HTTP endpoint is shown below.

import requests

class LlamaCppClient:
    def __init__(self, host="http://localhost", port=8080):
        self.base_url = f"{host}:{port}"

    def completion(self, prompt):
        """Send a chat completion request to the llama.cpp server."""
        url = f"{self.base_url}/v1/chat/completions"
        payload = {
            "messages": [
                {
                    "role": "system",
                    "content": (
                        "You are a friendly conversation partner. Be natural, "
                        "engaging, and helpful in our discussions. Respond to "
                        "questions clearly and follow the conversation flow naturally."
                    ),
                },
                {"role": "user", "content": prompt},
            ]
        }
        try:
            # json= sets the Content-Type header and serializes the payload
            response = requests.post(url, json=payload, timeout=60)
            response.raise_for_status()
            return response.json()
        except requests.exceptions.RequestException as e:
            return {"error": str(e)}

Conclusion

By combining the browser's speech capture/playback capabilities, Whisper's speech‑to‑text conversion, and llama.cpp's LLM inference, we successfully built a voice‑driven conversational system. Such a system can be applied to language tutoring, voice‑based Q&A, and other interactive AI scenarios.


References

https://github.com/openai/whisper

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

https://github.com/fastapi/fastapi

https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisUtterance
