Integrating LLMs with Speech: Whisper, Vosk, and Alibaba Cloud in Python and JavaScript

This tutorial walks through five topics: setting up local speech recognition with OpenAI's Whisper and Vosk; leveraging Alibaba Cloud's ASR services; building a WebSocket server/client for real‑time audio streaming; capturing audio in the browser via MediaRecorder or RecordRTC; and performing speech synthesis with pyttsx3 and Alibaba's Sambert model.


5.1.1 Local audio file recognition with Whisper

The Whisper model from OpenAI can be run locally. After installing transformers and accelerate (optionally via the Chinese mirror https://hf-mirror.com), the script audio_recog_whisper.py sets the HF_ENDPOINT environment variable, loads the openai/whisper-tiny model on CPU, and transcribes ./test.wav. The result is converted from Traditional to Simplified Chinese using opencc-python-reimplemented. Sample output shows the original Traditional text and the converted Simplified text.
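The flow described above can be sketched as follows. The mirror endpoint, model name, and file path follow the article; the helper names (`hf_env`, `transcribe_simplified`) are illustrative and not taken from the article's audio_recog_whisper.py.

```python
import os

HF_MIRROR = "https://hf-mirror.com"  # Chinese mirror mentioned in the article


def hf_env(endpoint: str = HF_MIRROR) -> dict:
    """Environment override to apply before any Hugging Face import."""
    return {"HF_ENDPOINT": endpoint}


def transcribe_simplified(path: str = "./test.wav") -> str:
    """Transcribe a local WAV with whisper-tiny, then convert the result
    from Traditional to Simplified Chinese."""
    os.environ.update(hf_env())
    # Third-party: pip install transformers accelerate opencc-python-reimplemented
    from transformers import pipeline
    from opencc import OpenCC

    asr = pipeline("automatic-speech-recognition",
                   model="openai/whisper-tiny", device=-1)  # -1 = CPU
    traditional = asr(path)["text"]
    return OpenCC("t2s").convert(traditional)

# Usage: print(transcribe_simplified("./test.wav"))
```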

5.1.2 Vosk speech‑recognition model

Vosk can be installed with pip install vosk. A small 50 MB model (vosk-model-small-cn-0.22.zip) or the full 1.8 GB model can be downloaded via wget. After extracting the model, the script installs SpeechRecognition, vosk, and pyaudio, sets VOSK_MODEL_PATH, and defines a helper recognize_vosk_fixed that writes the audio to a temporary WAV file, runs Vosk's KaldiRecognizer, and returns the concatenated transcript. Running local_vosk.py on test.wav prints the recognized text.
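A minimal sketch of the recognize-and-concatenate step, assuming the model directory from the small download above; `recognize_vosk` and `collect_text` are illustrative names, not the article's recognize_vosk_fixed.

```python
import json
import wave


def collect_text(result_jsons) -> str:
    """Concatenate the 'text' field of each Vosk result JSON string."""
    return "".join(json.loads(r).get("text", "") for r in result_jsons)


def recognize_vosk(path: str = "test.wav",
                   model_dir: str = "vosk-model-small-cn-0.22") -> str:
    """Feed a WAV file through KaldiRecognizer and return the transcript."""
    from vosk import Model, KaldiRecognizer  # third-party: pip install vosk

    wf = wave.open(path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    results = []
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        if rec.AcceptWaveform(data):       # a complete utterance was decoded
            results.append(rec.Result())
    results.append(rec.FinalResult())      # flush the trailing partial result
    return collect_text(results)
```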

5.1.3 Alibaba Cloud ASR

Two Python examples demonstrate Alibaba Cloud's speech‑to‑text service. aliyun_record_1.py shows a synchronous file‑based call using the paraformer-realtime-v2 model, handling API key loading, result aggregation, and error reporting. aliyun_record_2.py illustrates asynchronous batch transcription of multiple public‑URL audio files, waiting for the task to finish and printing each transcript. A real‑time version (aliyun_real.py) creates a WebSocket‑enabled recognizer, reads audio frames from a microphone via pyaudio, sends them to the recognizer, and prints partial and final results.
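The synchronous file-based call might look like the sketch below, using the dashscope SDK. The Recognition constructor arguments shown here are assumptions based on the model name the article gives, and `join_sentences` is an illustrative helper, not part of the SDK.

```python
import os


def join_sentences(sentences) -> str:
    """Aggregate the 'text' field of each recognized sentence dict."""
    return "".join(s.get("text", "") for s in (sentences or []))


def transcribe_file(path: str = "test.wav") -> str:
    """Synchronous file-based recognition with paraformer-realtime-v2."""
    # Third-party: pip install dashscope; requires DASHSCOPE_API_KEY in the env.
    import dashscope
    from dashscope.audio.asr import Recognition

    dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]
    rec = Recognition(model="paraformer-realtime-v2",
                      format="wav", sample_rate=16000, callback=None)
    result = rec.call(path)
    return join_sentences(result.get_sentence())
```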

5.1.4 WebSocket communication

A simple asynchronous WebSocket server (socket_server.py) echoes messages received from a client. The client HTML (socket_client.html) connects to ws://localhost:8765, sends user input, and displays the server's reply.
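The echo server can be sketched with the third-party websockets package; the handler name and reply prefix here are illustrative, not copied from socket_server.py.

```python
import asyncio


async def echo(websocket):
    """Echo every incoming message back to the connected client."""
    async for message in websocket:
        await websocket.send(f"server received: {message}")


async def main(host: str = "localhost", port: int = 8765) -> None:
    import websockets  # third-party: pip install websockets

    async with websockets.serve(echo, host, port):
        await asyncio.Future()  # run until cancelled

# Usage: asyncio.run(main())
```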

5.1.5 JavaScript speech capture

The browser’s MediaRecorder API is used in media_recorder.html to record audio in audio/webm chunks every 200 ms, store them in recordedBlobs, and play back the result. Because MediaRecorder cannot output WAV, the third‑party RecordRTC.js library is employed in audio_client.html to record WAV audio (48 kHz, mono) in 500 ms slices, sending each blob over the WebSocket to the backend.

5.1.6 Speech synthesis

Local synthesis uses pyttsx3: the engine is initialized, volume set to 1.0, rate reduced by 50, and a Chinese sentence is spoken with pyttsx3.speak. Alibaba Cloud’s Sambert TTS is accessed via SpeechSynthesizer.call (model sambert-zhiqi-v1) in aliyun_sambert_base.py, saving the returned MP3 to jerry.mp3. An alternative version streams audio with pyaudio using a custom callback class.
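The local pyttsx3 settings described above (volume 1.0, rate lowered by 50) can be sketched as follows; `adjusted_rate` and `speak` are illustrative helper names.

```python
def adjusted_rate(base_rate: int, delta: int = -50) -> int:
    """Lower the engine's default speaking rate by 50, never below 1."""
    return max(1, base_rate + delta)


def speak(text: str) -> None:
    """Speak a sentence locally with pyttsx3."""
    import pyttsx3  # third-party: pip install pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("volume", 1.0)
    engine.setProperty("rate", adjusted_rate(engine.getProperty("rate")))
    engine.say(text)
    engine.runAndWait()

# Usage: speak("你好，世界")
```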

5.1.7 Front‑end playback of synthesized audio

A FastAPI service (sambert.py) exposes a /speech endpoint that calls the Sambert model, writes the MP3 to ../static/sambert.mp3, and returns its URL. The front‑end page (sambert.html) lets the user enter text, then POSTs it to the endpoint and plays the returned audio via an Audio object.
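A sketch of that endpoint under the assumptions above; the dashscope SpeechSynthesizer call signature is an assumption, and `audio_url` / `build_app` are illustrative names, not the article's sambert.py.

```python
from pathlib import Path


def audio_url(out_path: str, static_prefix: str = "/static") -> str:
    """Map the saved MP3 path to the URL the front-end will fetch."""
    return f"{static_prefix}/{Path(out_path).name}"


def build_app(out_path: str = "../static/sambert.mp3"):
    # Third-party: pip install fastapi dashscope
    from fastapi import FastAPI
    from dashscope.audio.tts import SpeechSynthesizer

    app = FastAPI()

    @app.post("/speech")
    def speech(payload: dict):
        # Synthesize the posted text, save the MP3, return its URL.
        result = SpeechSynthesizer.call(model="sambert-zhiqi-v1",
                                        text=payload["text"], format="mp3")
        Path(out_path).write_bytes(result.get_audio_data())
        return {"url": audio_url(out_path)}

    return app
```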

Overall, the article provides end‑to‑end code for local and cloud speech‑to‑text, real‑time streaming via WebSocket, browser audio capture, and text‑to‑speech synthesis, enabling developers to build interactive voice‑enabled agents.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: JavaScript, Python, WebSocket, Alibaba Cloud, speech recognition, Whisper, Vosk
Written by

Woodpecker Software Testing

The Woodpecker Software Testing public account shares software testing knowledge, connects testing enthusiasts, founded by Gu Xiang, website: www.3testing.com. Author of five books, including "Mastering JMeter Through Case Studies".
