Easy video transcription and subtitling with Whisper, FFmpeg, and Python

Updated: April 4, 2024

Videos are not just a source of entertainment; they are a crucial tool for content creators, educators, and businesses alike. Adding accurate transcriptions and subtitles to your videos can improve accessibility and viewer engagement. This guide walks you through transcribing a video with WhisperX, a library built on OpenAI's Whisper model, and burning the resulting subtitles into the video with FFmpeg.

Example videos: input.mp4 (original) and output.mp4 (with subtitles).

Required tools:

Before we set sail, let's ensure your toolkit is ready:

Python: Ensure it's installed on your machine for the coding magic.
FFmpeg: A cornerstone for handling video files. If it's missing from your toolbox, it's time for a quick setup.
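
If you are not sure whether both tools are available, a quick check along these lines can help. This is only a sketch: the function name check_prerequisites and the minimum Python version of 3.8 are assumptions made for illustration, not requirements stated by WhisperX.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 8)):
    """Return a list of problems; an empty list means everything was found.

    min_python is an assumed lower bound, not an official requirement.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    # shutil.which returns None when the executable is not on PATH
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing:", problem)
```

If the script prints nothing, both Python and FFmpeg were found.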

Set up your workspace

First, you’ll want to create a dedicated workspace:

mkdir open-ai-whisper-ffmpeg

Navigate into your new project domain and conjure a virtual environment to keep things neat:

cd open-ai-whisper-ffmpeg
python3 -m venv .venv
source .venv/bin/activate

Install WhisperX, which wraps OpenAI’s Whisper model and adds timestamp alignment:

pip install git+https://github.com/m-bain/whisperx.git

Transcribe your video

First, create a new Python file, main.py:

touch main.py

Paste the code below into main.py:

from datetime import timedelta
import os
import whisperx

def transcribe_video(input_video):
    batch_size = 32 
    compute_type = "float32"  
    device = "cpu"

    model = whisperx.load_model("large-v2", device=device, compute_type=compute_type)

    audio = whisperx.load_audio(input_video)
    result = model.transcribe(audio, batch_size=batch_size, language="en")

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

    segments = result["segments"]


    def to_srt_time(seconds):
        # SRT timestamps look like HH:MM:SS,mmm; keep the milliseconds
        # instead of truncating to whole seconds
        whole, millis = divmod(int(round(seconds * 1000)), 1000)
        return f"{str(timedelta(seconds=whole)).zfill(8)},{millis:03d}"

    srtFilename = "subtitles.srt"
    # if srt file exists, delete it so old subtitles are not kept
    if os.path.exists(srtFilename):
        os.remove(srtFilename)

    # Open the file once and append one SRT entry per segment
    with open(srtFilename, 'a', encoding='utf-8') as srtFile:
        for index, segment in enumerate(segments):
            startTime = to_srt_time(segment['start'])
            endTime = to_srt_time(segment['end'])
            text = segment['text'].strip()
            print(text)
            srtFile.write(f"{index + 1}\n{startTime} --> {endTime}\n{text}\n\n")

    return srtFilename



def main():
    input_video_path = "input.mp4"
    transcribe_video(input_video_path)

main()

Let’s examine what the code above does. First, we import the packages we need: whisperx to load the Whisper model and run the transcription, os to manage the subtitles file, and timedelta to format the timestamps:

from datetime import timedelta
import os
import whisperx

Here, we define a function that takes an input video, loads the Whisper "large-v2" model with a float32 compute_type, and runs it on the CPU instead of a GPU.

After that, the function extracts the audio from the video and transcribes it in batches. Finally, it aligns the transcription against the audio to refine the timestamps and keeps the resulting segments, each with its text and its start and end times:

def transcribe_video(input_video):
    batch_size = 32 
    compute_type = "float32"  
    device = "cpu"

    model = whisperx.load_model("large-v2", device=device, compute_type=compute_type)

    audio = whisperx.load_audio(input_video)
    result = model.transcribe(audio, batch_size=batch_size, language="en")

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

    segments = result["segments"]
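
The function above hard-codes CPU settings. If a CUDA GPU is available, the settings could be chosen at run time instead. The sketch below shows one way to do that; the helper names gpu_available and choose_settings are hypothetical, and the GPU values (float16, batch size 16) are assumed starting points rather than tuned recommendations.

```python
def gpu_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # normally installed alongside WhisperX
    except ImportError:
        return False
    return torch.cuda.is_available()


def choose_settings(has_gpu):
    """Pick the device, compute_type, and batch_size for load_model/transcribe."""
    if has_gpu:
        # Assumed GPU defaults; lower batch_size if memory runs out
        return {"device": "cuda", "compute_type": "float16", "batch_size": 16}
    # The CPU values used in this guide
    return {"device": "cpu", "compute_type": "float32", "batch_size": 32}


settings = choose_settings(gpu_available())
print(settings)
```

The resulting values can then be passed to whisperx.load_model and model.transcribe in place of the hard-coded ones.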

Next, the function loops through the aligned segments, converts each one into an .srt entry (a sequence number, a start --> end time range, and the text), and appends it to subtitles.srt:

    def to_srt_time(seconds):
        # SRT timestamps look like HH:MM:SS,mmm; keep the milliseconds
        # instead of truncating to whole seconds
        whole, millis = divmod(int(round(seconds * 1000)), 1000)
        return f"{str(timedelta(seconds=whole)).zfill(8)},{millis:03d}"

    srtFilename = "subtitles.srt"
    # if srt file exists, delete it so old subtitles are not kept
    if os.path.exists(srtFilename):
        os.remove(srtFilename)

    # Open the file once and append one SRT entry per segment
    with open(srtFilename, 'a', encoding='utf-8') as srtFile:
        for index, segment in enumerate(segments):
            startTime = to_srt_time(segment['start'])
            endTime = to_srt_time(segment['end'])
            text = segment['text'].strip()
            print(text)
            srtFile.write(f"{index + 1}\n{startTime} --> {endTime}\n{text}\n\n")

    return srtFilename
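
To see what the .srt format looks like without running the model, the conversion can be tried on a couple of made-up segments. This is a standalone sketch: srt_timestamp and segments_to_srt are illustrative names, and the sample data is invented to match the start, end, and text keys used above.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Turn a list of {'start', 'end', 'text'} dicts into SRT text."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive entries
    return "\n".join(blocks)


# Made-up segments for illustration only
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 5.25, "text": " Let's get started."},
]
print(segments_to_srt(sample))
```

Each entry consists of a sequence number, a time range, and the subtitle text, with a blank line between entries.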

Adding subtitles to a video

Now, update main.py with the code below, which adds an add_srt_to_video function that calls FFmpeg:

from datetime import timedelta
import os
import whisperx
import subprocess

def transcribe_video(input_video):
    batch_size = 32 
    compute_type = "float32"  
    device = "cpu"

    model = whisperx.load_model("large-v2", device=device, compute_type=compute_type)

    audio = whisperx.load_audio(input_video)
    result = model.transcribe(audio, batch_size=batch_size, language="en")

    model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
    result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

    segments = result["segments"]


    def to_srt_time(seconds):
        # SRT timestamps look like HH:MM:SS,mmm; keep the milliseconds
        # instead of truncating to whole seconds
        whole, millis = divmod(int(round(seconds * 1000)), 1000)
        return f"{str(timedelta(seconds=whole)).zfill(8)},{millis:03d}"

    srtFilename = "subtitles.srt"
    # if srt file exists, delete it so old subtitles are not kept
    if os.path.exists(srtFilename):
        os.remove(srtFilename)

    # Open the file once and append one SRT entry per segment
    with open(srtFilename, 'a', encoding='utf-8') as srtFile:
        for index, segment in enumerate(segments):
            startTime = to_srt_time(segment['start'])
            endTime = to_srt_time(segment['end'])
            text = segment['text'].strip()
            print(text)
            srtFile.write(f"{index + 1}\n{startTime} --> {endTime}\n{text}\n\n")

    return srtFilename

def add_srt_to_video(input_video, output_file):
    subtitles_file = 'subtitles.srt'

    # Subtitle appearance, passed to the subtitles filter via force_style
    style = ("FontName=Arial,FontSize=10,PrimaryColour=&HFFFFFF,"
             "OutlineColour=&H000000,BorderStyle=3,Outline=1,Shadow=1,"
             "Alignment=2,MarginV=10")

    # FFmpeg command: burn the subtitles into the video and copy the audio
    ffmpeg_command = [
        "ffmpeg", "-y", "-i", input_video,
        "-vf", f"subtitles={subtitles_file}:force_style='{style}'",
        "-c:a", "copy", output_file,
    ]

    # Run the FFmpeg command; a list avoids shell quoting problems,
    # and check=True raises an error if FFmpeg fails
    subprocess.run(ffmpeg_command, check=True)


def main():
    input_video_path = "input.mp4"
    output_file = "output.mp4"
    transcribe_video(input_video_path)
    add_srt_to_video(input_video_path, output_file)


main()

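Finally, add_srt_to_video uses FFmpeg's subtitles filter to burn subtitles.srt into the video frames. The force_style option sets the font, colors, and position of the text, and -c:a copy keeps the original audio stream without re-encoding it.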
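
The long force_style string is easy to mistype. One way to keep it readable is to build it from a dictionary, as in the sketch below. The helper names build_subtitles_filter and build_ffmpeg_command are hypothetical; the style keys and FFmpeg options are the ones used in the command above.

```python
def build_subtitles_filter(srt_path, style):
    """Build the value passed to FFmpeg's -vf option."""
    style_str = ",".join(f"{key}={value}" for key, value in style.items())
    # The single quotes keep the commas inside force_style from being
    # read as separators between filters
    return f"subtitles={srt_path}:force_style='{style_str}'"


def build_ffmpeg_command(input_video, srt_path, output_file, style):
    """Return the FFmpeg command as a list suitable for subprocess.run."""
    return [
        "ffmpeg", "-y", "-i", input_video,
        "-vf", build_subtitles_filter(srt_path, style),
        "-c:a", "copy", output_file,
    ]


style = {
    "FontName": "Arial",
    "FontSize": 10,
    "PrimaryColour": "&HFFFFFF",
    "OutlineColour": "&H000000",
    "BorderStyle": 3,
    "Outline": 1,
    "Shadow": 1,
    "Alignment": 2,
    "MarginV": 10,
}
print(build_ffmpeg_command("input.mp4", "subtitles.srt", "output.mp4", style))
```

Changing a single value, such as FontSize, then only requires editing one dictionary entry.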

Here’s a sample video of this project:

And there you have it, a step-by-step guide to transforming your video into a masterpiece of clarity and engagement. Whether you're aiming to make your content more accessible or simply looking to add a professional touch, these tools empower you to achieve your goals.