
Transcribing Audio and Video to Text with OpenAI Whisper and Faster‑Whisper

This article explains how to use OpenAI's Whisper and the Faster-Whisper reimplementation to quickly convert audio or video files into searchable text, covering installation, Python code examples, a Swift client, and a Flask-based server API for practical transcription workflows.

IT Services Circle

Converting audio and video files into text used to be difficult, but today it can be done in minutes with open‑source tools, enabling tasks such as subtitle extraction, searchable transcripts, and content analysis.

Whisper is OpenAI's open-source speech-to-text model, written in Python; after installing a few packages, a short script can produce a transcription, with processing time depending on your machine's performance and the length of the media.
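The basic workflow can be sketched in a few lines. This is a minimal sketch, assuming the openai-whisper package is installed (pip install openai-whisper) and an audio.mp3 file sits next to the script; the format_segment helper is my own, not part of the library:

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one timestamped segment in the article's [start -> end] style."""
    return "[%.2fs -> %.2fs] %s" % (start, end, text.strip())


def main():
    import whisper  # imported here so the formatting helper works without the package

    # "base" is small enough for CPU-only machines; larger models
    # ("small", "medium", "large") are more accurate but slower.
    model = whisper.load_model("base")

    # Whisper decodes the media via ffmpeg internally, so video files work too.
    result = model.transcribe("audio.mp3")

    # result["text"] is the full transcript; result["segments"] carries timestamps.
    for seg in result["segments"]:
        print(format_segment(seg["start"], seg["end"], seg["text"]))


if __name__ == "__main__":
    main()
```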

Faster-Whisper builds on Whisper by re-implementing the model with CTranslate2, a fast inference engine for Transformer models. It claims 4-8× speed improvements over the original, works on both GPU and CPU, and can run on modest hardware such as a Mac.

To use Faster-Whisper you only need two steps:

Install the dependency package: pip install faster-whisper

Write a short Python script, for example:

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

The resulting transcript can be used to quickly locate interesting passages, generate subtitles, or feed text into downstream AI models.
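For subtitle generation in particular, the timestamped segments are already enough to emit a SubRip (.srt) file. A sketch using only the standard library; the segments_to_srt helper and the (start, end, text) tuple shape are my own conventions, not part of either Whisper library:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm notation SubRip expects."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def segments_to_srt(segments) -> str:
    """Turn an iterable of (start, end, text) tuples into one SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            i, srt_timestamp(start), srt_timestamp(end), text.strip()))
    return "\n".join(blocks)


print(segments_to_srt([
    (0.0, 2.5, "Hello there."),
    (2.5, 5.0, "Welcome to the demo."),
]))
```

Writing the returned string to a file named like the video (clip.srt next to clip.mp4) is enough for most players to pick the subtitles up automatically.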

Client side: a simple macOS application written in Swift lets users select a video, click "Extract Text" (which calls the Python backend), view timestamped segments, choose start and end points, and export the selected clip.
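The clip-export step can be approximated from the command line. A sketch assuming ffmpeg is on the PATH; the build_clip_command helper and the file names are illustrative, not taken from the article's Swift code:

```python
import subprocess


def build_clip_command(source: str, start: float, end: float, output: str) -> list:
    """Assemble an ffmpeg invocation that copies [start, end] without re-encoding."""
    return [
        "ffmpeg",
        "-i", source,
        "-ss", "%.2f" % start,  # clip start time, in seconds
        "-to", "%.2f" % end,    # clip end time, in seconds
        "-c", "copy",           # stream copy: fast, no quality loss
        output,
    ]


if __name__ == "__main__":
    cmd = build_clip_command("talk.mp4", 12.5, 47.0, "clip.mp4")
    subprocess.run(cmd, check=True)
```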

Server side: a Flask API wraps Faster-Whisper for remote use. Example code:

from flask import Flask, request, jsonify
from faster_whisper import WhisperModel

app = Flask(__name__)
model_size = "large-v2"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

@app.route('/transcribe', methods=['POST'])
def transcribe():
    file_path = request.json.get('filePath')
    # initial_prompt="简体" ("Simplified") nudges the model toward Simplified Chinese output
    segments, info = model.transcribe(file_path, beam_size=5, initial_prompt="简体")
    segments_txt = []
    for segment in segments:
        # Pipe-delimited line: raw start and end timestamps, a readable range, then the text
        line = "%.2fs|%.2fs|[%.2fs -> %.2fs]|%s" % (segment.start, segment.end, segment.start, segment.end, segment.text)
        segments_txt.append(line)
    response_data = {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": segments_txt
    }
    return jsonify(response_data)

if __name__ == '__main__':
    app.run(debug=False)
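With the server running locally, the endpoint can be exercised from any HTTP client. A minimal sketch using only the standard library; the localhost URL and file path are placeholders, and the build_request helper is my own:

```python
import json
import urllib.request


def build_request(base_url: str, file_path: str) -> urllib.request.Request:
    """Build the JSON POST request the /transcribe endpoint expects."""
    payload = json.dumps({"filePath": file_path}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/transcribe",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("http://127.0.0.1:5000", "/tmp/audio.mp3")
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    print("language:", result["language"])
    for line in result["segments"]:
        print(line)
```

Note that the server reads a path on its own filesystem, so this setup only works when client and server share storage (or run on the same machine); a production API would accept an upload instead.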

The article concludes that this lightweight tool is sufficient for personal use and encourages readers to try it out.

Tags: python, AI, Whisper, speech-to-text, audio transcription, Faster-Whisper
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
