
Transcribing Audio and Video to Text with OpenAI Whisper and Faster‑Whisper

This article explains how to use OpenAI's Whisper and the Faster-Whisper reimplementation to quickly convert audio or video files into searchable text, covering installation, Python code examples, a Swift client, and a Flask-based server API for practical transcription workflows.

IT Services Circle

Converting audio and video files into text used to be difficult, but today it can be done in minutes with open‑source tools, enabling tasks such as subtitle extraction, searchable transcripts, and content analysis.

Whisper is OpenAI's open-source speech-to-text model, written in Python; after installing a few packages, a short script can produce a transcription, with processing time depending on your machine's performance and the length of the media.
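The basic workflow can be sketched in a few lines. This is a minimal sketch, assuming the openai-whisper package is installed (pip install openai-whisper) and an audio.mp3 file sits next to the script; the format_segment helper is my own, not part of the library:

```python
def format_segment(start: float, end: float, text: str) -> str:
    """Render one timestamped segment in the article's [start -> end] style."""
    return "[%.2fs -> %.2fs] %s" % (start, end, text.strip())


def main():
    import whisper  # imported here so the formatting helper works without the package

    # "base" is small enough for CPU-only machines; larger models
    # ("small", "medium", "large") are more accurate but slower.
    model = whisper.load_model("base")

    # Whisper decodes the media via ffmpeg internally, so video files work too.
    result = model.transcribe("audio.mp3")

    # result["text"] is the full transcript; result["segments"] carries timestamps.
    for seg in result["segments"]:
        print(format_segment(seg["start"], seg["end"], seg["text"]))


if __name__ == "__main__":
    main()
```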

Faster-Whisper builds on Whisper by re-implementing the model with CTranslate2, a fast inference engine for Transformer models. It claims 4-8× speed improvements over the original, works on both GPU and CPU, and can run on modest hardware such as a Mac.

To use Faster-Whisper you only need two steps:

Install the dependency package: pip install faster-whisper

Write a short Python script, for example:

from faster_whisper import WhisperModel

model_size = "large-v3"

# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))

The resulting transcript can be used to quickly locate interesting passages, generate subtitles, or feed text into downstream AI models.
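For subtitle generation in particular, the timestamped segments are already enough to emit a SubRip (.srt) file. A sketch using only the standard library; the segments_to_srt helper and the (start, end, text) tuple shape are my own conventions, not part of either Whisper library:

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the HH:MM:SS,mmm notation SubRip expects."""
    millis = int(round(seconds * 1000))
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, secs, millis)


def segments_to_srt(segments) -> str:
    """Turn an iterable of (start, end, text) tuples into one SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            i, srt_timestamp(start), srt_timestamp(end), text.strip()))
    return "\n".join(blocks)


print(segments_to_srt([
    (0.0, 2.5, "Hello there."),
    (2.5, 5.0, "Welcome to the demo."),
]))
```

Writing the returned string to a file named like the video (clip.srt next to clip.mp4) is enough for most players to pick the subtitles up automatically.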

Client side: a simple macOS application written in Swift lets users select a video, click "Extract Text" (which calls the Python backend), view timestamped segments, choose start and end points, and export the selected clip.
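The clip-export step can be approximated from the command line. A sketch assuming ffmpeg is on the PATH; the build_clip_command helper and the file names are illustrative, not taken from the article's Swift code:

```python
import subprocess


def build_clip_command(source: str, start: float, end: float, output: str) -> list:
    """Assemble an ffmpeg invocation that copies [start, end] without re-encoding."""
    return [
        "ffmpeg",
        "-i", source,
        "-ss", "%.2f" % start,  # clip start time, in seconds
        "-to", "%.2f" % end,    # clip end time, in seconds
        "-c", "copy",           # stream copy: fast, no quality loss
        output,
    ]


if __name__ == "__main__":
    cmd = build_clip_command("talk.mp4", 12.5, 47.0, "clip.mp4")
    subprocess.run(cmd, check=True)
```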

Server side: a Flask API wraps Faster-Whisper for remote use. Example code:

from flask import Flask, request, jsonify
from faster_whisper import WhisperModel

app = Flask(__name__)
model_size = "large-v2"
model = WhisperModel(model_size, device="cpu", compute_type="int8")

@app.route('/transcribe', methods=['POST'])
def transcribe():
    file_path = request.json.get('filePath')
    # initial_prompt="简体" ("Simplified") nudges the model toward Simplified Chinese output
    segments, info = model.transcribe(file_path, beam_size=5, initial_prompt="简体")
    segments_txt = []
    for segment in segments:
        # Pipe-delimited line: raw start and end timestamps, a readable range, then the text
        line = "%.2fs|%.2fs|[%.2fs -> %.2fs]|%s" % (segment.start, segment.end, segment.start, segment.end, segment.text)
        segments_txt.append(line)
    response_data = {
        "language": info.language,
        "language_probability": info.language_probability,
        "segments": segments_txt
    }
    return jsonify(response_data)

if __name__ == '__main__':
    app.run(debug=False)
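With the server running locally, the endpoint can be exercised from any HTTP client. A minimal sketch using only the standard library; the localhost URL and file path are placeholders, and the build_request helper is my own:

```python
import json
import urllib.request


def build_request(base_url: str, file_path: str) -> urllib.request.Request:
    """Build the JSON POST request the /transcribe endpoint expects."""
    payload = json.dumps({"filePath": file_path}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/transcribe",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("http://127.0.0.1:5000", "/tmp/audio.mp3")
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    print("language:", result["language"])
    for line in result["segments"]:
        print(line)
```

Note that the server reads a path on its own filesystem, so this setup only works when client and server share storage (or run on the same machine); a production API would accept an upload instead.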

The article concludes that this lightweight tool is sufficient for personal use and encourages readers to try it out.

Tags: python, AI, Whisper, speech-to-text, audio transcription, Faster-Whisper
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
