Automating audiovisual production with free software

At the OfiLibre we produce Cafés con OfiLibre every week: informal 15-minute talks about free software and open knowledge. Each episode goes through a pipeline that runs from live recording to multi-platform publication. Previous guides explained how we edit videos with FFmpeg and what the general production process looks like.

This guide focuses on the stage that goes from the edited video to its publication with subtitles, a process we have almost entirely automated with free tools. We will cover how we use Whisper for transcription, Pyannote for speaker identification, FFmpeg for subtitle embedding, and the YouTube API for automated uploads.

Overview of the workflow

The full process can be summarized as follows:

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  Edited     │     │  Whisper     │     │  Pyannote    │     │  FFmpeg     │
│  video      │────▶│ Transcription│────▶│ Diarization  │────▶│ Subtitle    │
│  (.mp4)     │     │  (.srt)      │     │ (who speaks) │     │ embedding   │
└─────────────┘     └──────────────┘     └──────────────┘     └──────┬──────┘
                                                                     │
                                                                     ▼
                                                              ┌─────────────┐
                                                              │  YouTube    │
                                                              │  API        │
                                                              │ (upload)    │
                                                              └─────────────┘

Each step runs from the command line and can be chained in a single script. Let’s look at each tool in detail.

Whisper: automatic speech recognition

Whisper is a speech recognition model developed by OpenAI and released under the MIT license. Once the model weights have been downloaded (this happens automatically the first time a model is used), it runs locally, with no internet connection and no upload of data to external servers, which makes it well suited to university environments where content privacy matters.

Installation

Whisper is installed as a Python package; it also needs the ffmpeg command-line tool on the system to decode audio and video files. Using a virtual environment is recommended:

python3 -m venv whisper-env
source whisper-env/bin/activate
pip install openai-whisper

To leverage a GPU (which speeds up the process 5 to 10 times), you need NVIDIA drivers and CUDA installed. Without a GPU, Whisper still works on CPU, just more slowly.

Basic usage

The most straightforward way to transcribe a file is:

whisper cafe-2026-05-14.mp4 --language es --model medium --output_format srt

This generates an .srt file with synchronized subtitles. The most relevant parameters are:

  • --language es: specifies the audio language. If omitted, Whisper detects it automatically, but specifying it improves accuracy.
  • --model medium: selects the model size. Available options are tiny, base, small, medium, and large. The medium model offers a good balance between accuracy and speed for Spanish.
  • --output_format srt: output format. Also supports vtt, txt, tsv, and json.

Available models

Model    Parameters   VRAM required   Relative speed
tiny     39 M         ~1 GB           ~32x
base     74 M         ~1 GB           ~16x
small    244 M        ~2 GB           ~6x
medium   769 M        ~5 GB           ~2x
large    1550 M       ~10 GB          1x

For Cafés con OfiLibre we use medium because it provides good quality in Spanish without requiring a high-end GPU. If your machine has a card with at least 10 GB of VRAM, large will deliver even better results.
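
The choice of model can be read straight off the table. As a minimal sketch, the helper below picks the largest model that fits a given amount of VRAM. The function name pick_model is our own for this example, and the figures are the approximate ones from the table, not measurements:

```python
# Approximate VRAM needs in GB, copied from the table above.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large": 10,
}

def pick_model(available_vram_gb):
    """Return the largest model whose approximate VRAM need fits, or None."""
    best = None
    # Dicts keep insertion order, so models are visited from smallest to largest.
    for name, needed in MODEL_VRAM_GB.items():
        if needed <= available_vram_gb:
            best = name
    return best

print(pick_model(6))   # a 6 GB card fits medium
print(pick_model(12))  # a 12 GB card fits large
```

A result of None means that, by these figures, no model fits on the GPU and transcription would have to run on the CPU.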

Example output (.srt)

1
00:00:01,000 --> 00:00:04,500
Welcome to a new Café con OfiLibre.

2
00:00:04,500 --> 00:00:08,200
Today we are going to talk about free software licenses.

3
00:00:08,200 --> 00:00:12,800
We have our guest with us, who is going
to share her experience with the project.

Pyannote: knowing who speaks when

The transcription Whisper generates does not distinguish between speakers. To know who says what at each moment, we use Pyannote, a Python library specialized in speaker diarization, released under the MIT license.

Diarization is the process of segmenting audio by the people who speak in it, labeling each fragment with a speaker identifier (SPEAKER_00, SPEAKER_01, etc.).

Installation

pip install pyannote.audio

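Pyannote's pretrained pipelines are gated on Hugging Face: you need to accept the usage terms of pyannote/speaker-diarization-3.1 and of the pyannote/segmentation-3.0 model it depends on, and then create an access token. The example below also assumes that the audio track has already been extracted to a WAV file, for instance with ffmpeg -i cafe-2026-05-14.mp4 -vn -ac 1 -ar 16000 cafe-2026-05-14.wav.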

Usage from Python

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HUGGINGFACE_TOKEN"
)

diarization = pipeline("cafe-2026-05-14.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f} - {turn.end:.1f}: {speaker}")

The output looks like this:

0.0 - 4.5: SPEAKER_00
4.5 - 12.8: SPEAKER_01
12.8 - 15.3: SPEAKER_00
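
A quick sanity check on this output is to add up how long each speaker talks; an unexpected extra speaker with a few seconds of audio usually points to a diarization error. The sketch below is only illustrative: it works on plain (start, end, speaker) tuples as a stand-in for what itertracks yields, and the function name speaking_time is our own.

```python
from collections import defaultdict

def speaking_time(turns):
    """Sum the seconds spoken by each speaker.

    `turns` is a list of (start, end, speaker) tuples in seconds, a plain
    stand-in for the values yielded by diarization.itertracks().
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)

# The three turns from the sample output above.
turns = [
    (0.0, 4.5, "SPEAKER_00"),
    (4.5, 12.8, "SPEAKER_01"),
    (12.8, 15.3, "SPEAKER_00"),
]
for speaker, seconds in sorted(speaking_time(turns).items()):
    print(f"{speaker}: {seconds:.1f} s")
```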

Combining Whisper and Pyannote

The key step is cross-referencing Whisper subtitle timestamps with Pyannote diarization segments. For each subtitle block, we find the speaker with the greatest temporal overlap and assign their label:

import re

def parse_srt(srt_path):
    """Reads an .srt file and returns a list of blocks."""
    # Read as UTF-8 explicitly so accented characters survive on any platform
    with open(srt_path, encoding="utf-8") as f:
        content = f.read()
    pattern = r"(\d+)\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.*?)(?=\n\n|\Z)"
    return re.findall(pattern, content, re.DOTALL)

def time_to_seconds(t):
    """Converts '00:01:23,456' to seconds."""
    h, m, rest = t.split(":")
    s, ms = rest.split(",")
    return int(h)*3600 + int(m)*60 + int(s) + int(ms)/1000

def assign_speakers(srt_blocks, diarization):
    """Assigns a speaker to each subtitle block."""
    result = []
    for idx, start, end, text in srt_blocks:
        s = time_to_seconds(start)
        e = time_to_seconds(end)
        best_speaker = "UNKNOWN"
        best_overlap = 0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(e, turn.end) - max(s, turn.start)
            if overlap > best_overlap:
                best_overlap = overlap
                best_speaker = speaker
        result.append((idx, start, end, f"[{best_speaker}] {text}"))
    return result
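
The functions above stop at a list of tuples, so one more step is needed to write the labelled blocks back to disk. A minimal sketch, assuming blocks shaped like the ones assign_speakers returns (index, start, end, text); format_srt and write_srt are names chosen for this example:

```python
def format_srt(blocks):
    """Serialise (index, start, end, text) tuples into SRT text."""
    parts = []
    for idx, start, end, text in blocks:
        parts.append(f"{idx}\n{start} --> {end}\n{text.strip()}\n")
    # A blank line separates consecutive blocks, as the SRT format expects.
    return "\n".join(parts)

def write_srt(blocks, srt_path):
    """Write the blocks to srt_path as UTF-8."""
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write(format_srt(blocks))

# Sample blocks in the shape produced by assign_speakers.
blocks = [
    ("1", "00:00:01,000", "00:00:04,500",
     "[SPEAKER_00] Welcome to a new Café con OfiLibre."),
    ("2", "00:00:04,500", "00:00:08,200",
     "[SPEAKER_00] Today we are going to talk about licenses."),
]
print(format_srt(blocks))
```

Because the start and end timestamps are kept as the original strings, nothing is lost to rounding when the file is rewritten.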

The result is an .srt file where each block indicates who is speaking:

1
00:00:01,000 --> 00:00:04,500
[SPEAKER_00] Welcome to a new Café con OfiLibre.

2
00:00:04,500 --> 00:00:08,200
[SPEAKER_00] Today we are going to talk about licenses.

3
00:00:08,200 --> 00:00:12,800
[SPEAKER_01] Thank you very much for the invitation.

The generic identifiers (SPEAKER_00, SPEAKER_01) can be manually replaced with the actual participant names using a simple find-and-replace:

sed -i 's/SPEAKER_00/Jesús/g; s/SPEAKER_01/María/g' subtitles.srt
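
The same replacement can be done in Python when the names come from a script rather than being typed by hand. This is a sketch under our own assumptions: rename_speakers is a hypothetical helper, and it only touches the bracketed tags that the previous step adds.

```python
import re

def rename_speakers(srt_text, names):
    """Replace [SPEAKER_XX] tags using the `names` mapping.

    Tags without an entry in the mapping are left unchanged.
    """
    def swap(match):
        label = match.group(1)
        return f"[{names.get(label, label)}]"
    return re.sub(r"\[(SPEAKER_\d+)\]", swap, srt_text)

sample = "[SPEAKER_00] Welcome.\n[SPEAKER_01] Thank you.\n[SPEAKER_02] Hello."
print(rename_speakers(sample, {"SPEAKER_00": "Jesús", "SPEAKER_01": "María"}))
```

Matching the whole bracketed tag means the spoken text itself is never modified, and a label such as SPEAKER_10 is not mistaken for SPEAKER_1.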

FFmpeg: embedding subtitles in the video

Once we have the .srt file with the transcription and identified speakers, we use FFmpeg to embed the subtitles in the video. We have already covered FFmpeg in depth in our video generation guide, so here we focus on subtitling.

There are two ways to add subtitles:

Soft subtitles (soft subs)

They are packaged inside the video container as a separate track. The viewer can turn them on or off:

ffmpeg -i video.mp4 -i subtitles.srt \
  -c copy -c:s mov_text \
  video_with_subs.mp4

Hard subtitles (hard subs)

They are “burned” directly onto the video image. They are always visible and cannot be toggled off, but they guarantee visibility in any player:

ffmpeg -i video.mp4 \
  -vf "subtitles=subtitles.srt:force_style='FontSize=22,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2'" \
  -c:a copy \
  subtitled_video.mp4

The force_style parameters allow customization of subtitle appearance using ASS/SSA syntax. The most useful are:

  • FontSize: font size.
  • PrimaryColour: text color in &HAABBGGRR format.
  • OutlineColour: outline color.
  • Outline: outline thickness in pixels.
  • FontName: typeface (must be installed on the system).
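
The &HAABBGGRR colour format is easy to get wrong because the byte order is reversed with respect to the usual RGB notation, and alpha 00 means fully opaque. The following sketch builds these strings from ordinary RGB values; ass_colour and force_style are helper names of our own, not part of FFmpeg:

```python
def ass_colour(red, green, blue, alpha=0):
    """Return an ASS colour string (&HAABBGGRR) from 0-255 components.

    In ASS, alpha=0 is fully opaque and alpha=255 is fully transparent.
    """
    for value in (red, green, blue, alpha):
        if not 0 <= value <= 255:
            raise ValueError("colour components must be in 0..255")
    return f"&H{alpha:02X}{blue:02X}{green:02X}{red:02X}"

def force_style(**options):
    """Join style overrides into a force_style value such as 'FontSize=22,Outline=2'."""
    return ",".join(f"{key}={value}" for key, value in options.items())

style = force_style(
    FontSize=22,
    PrimaryColour=ass_colour(255, 255, 0),  # yellow text
    OutlineColour=ass_colour(0, 0, 0),      # black outline
    Outline=2,
)
print(style)
```

The resulting string can be placed inside the subtitles filter as force_style='...'.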

For Cafés con OfiLibre we use soft subs for YouTube (the platform handles them natively) and hard subs for versions distributed through other channels.

YouTube API: automated uploads

The last step in the workflow is uploading the edited and subtitled video to YouTube. The YouTube Data API v3 allows automating this task from a script.

Prerequisites

Before using the API you need to create a project in the Google Cloud Console, enable the YouTube Data API v3, and obtain OAuth 2.0 credentials. The process is:

  1. Create a project in Google Cloud Console.
  2. Enable the “YouTube Data API v3”.
  3. Configure the OAuth consent screen.
  4. Create “OAuth client ID” credentials for a desktop application.
  5. Download the client_secrets.json file.
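
A common mistake in step 4 is creating a "web application" client instead of a desktop one. Desktop credentials are stored under a top-level "installed" key in the downloaded file, so the type can be checked before running the upload script. A minimal sketch; check_client_secrets is a hypothetical helper written for this guide:

```python
import json

def check_client_secrets(path="client_secrets.json"):
    """Return the client_id if the file is a desktop OAuth client, else raise."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if "installed" not in data:
        kinds = ", ".join(data) or "none"
        raise ValueError(
            f"expected an 'installed' (desktop) client, found: {kinds}"
        )
    missing = [
        key for key in ("client_id", "client_secret")
        if key not in data["installed"]
    ]
    if missing:
        raise ValueError(f"client secrets file is missing: {', '.join(missing)}")
    return data["installed"]["client_id"]
```

Calling check_client_secrets() at the start of the upload script gives a clear error message before the browser authorization step begins.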

Dependencies

pip install google-api-python-client google-auth-oauthlib

Upload script

import os
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

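# youtube.upload covers videos().insert; captions().insert needs youtube.force-ssl
SCOPES = [
    "https://www.googleapis.com/auth/youtube.upload",
    "https://www.googleapis.com/auth/youtube.force-ssl",
]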

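def authenticate(token_file="token.json"):
    """Authenticates with OAuth 2.0 and returns the service.

    Credentials are cached in token_file (a name chosen for this example)
    so the browser prompt only appears on the first run.
    """
    # Only the token refresh needs this transport, so it is imported locally.
    from google.auth.transport.requests import Request

    credentials = None
    if os.path.exists(token_file):
        credentials = Credentials.from_authorized_user_file(token_file, SCOPES)
    if credentials and credentials.expired and credentials.refresh_token:
        credentials.refresh(Request())
    if not credentials or not credentials.valid:
        flow = InstalledAppFlow.from_client_secrets_file(
            "client_secrets.json", SCOPES
        )
        credentials = flow.run_local_server(port=0)
    # Save the (possibly refreshed) credentials for the next run
    with open(token_file, "w") as f:
        f.write(credentials.to_json())
    return build("youtube", "v3", credentials=credentials)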

def upload_video(youtube, filepath, title, description, tags):
    """Uploads a video to YouTube."""
    body = {
        "snippet": {
            "title": title,
            "description": description,
            "tags": tags,
            "categoryId": "27",  # Education
            "defaultLanguage": "es",
        },
        "status": {
            "privacyStatus": "unlisted",  # Unlisted initially
            "selfDeclaredMadeForKids": False,
        },
    }
    media = MediaFileUpload(filepath, chunksize=-1, resumable=True)
    request = youtube.videos().insert(
        part="snippet,status", body=body, media_body=media
    )
    response = request.execute()
    print(f"Video uploaded: https://www.youtube.com/watch?v={response['id']}")
    return response["id"]

def upload_captions(youtube, video_id, srt_path, language="es"):
    """Uploads an .srt subtitle track to an already uploaded video.

    The credentials must include the youtube.force-ssl scope.
    """
    body = {
        "snippet": {
            "videoId": video_id,
            "language": language,
            "name": f"Subtitles ({language})",
        }
    }
    media = MediaFileUpload(srt_path)
    youtube.captions().insert(
        part="snippet", body=body, media_body=media
    ).execute()
    print(f"Subtitles ({language}) uploaded successfully.")

Usage

youtube = authenticate()

video_id = upload_video(
    youtube,
    filepath="cafe-2026-05-14-edited.mp4",
    title="Free software licenses - Café con OfiLibre",
    description="Talk about the most common free license types.",
    tags=["free software", "OfiLibre", "URJC", "licenses"],
)

upload_captions(youtube, video_id, "subtitles.srt", language="es")

The first time it runs, a browser window will open for authorization. Credentials can be saved to a local file so that subsequent runs require no manual intervention.

Putting it all together: the complete script

In practice, the entire workflow is chained in a bash script that takes the edited video file as a parameter and runs all steps sequentially:

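#!/bin/bash
# automate-cafe.sh - Complete post-production workflow
# Usage: ./automate-cafe.sh edited_video.mp4 "Cafe title" "Description"

set -euo pipefail  # stop at the first failing step

if [ "$#" -ne 3 ]; then
  echo "Usage: $0 edited_video.mp4 \"Cafe title\" \"Description\"" >&2
  exit 1
fi

VIDEO="$1"
TITLE="$2"
DESC="$3"
# Whisper names its output after the file name, without directory or extension
BASENAME="$(basename "${VIDEO%.*}")"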

echo "=== 1. Transcribing with Whisper ==="
whisper "$VIDEO" --language es --model medium --output_format srt \
  --output_dir ./subs/

echo "=== 2. Diarizing with Pyannote ==="
python3 diarize.py "$VIDEO" "./subs/${BASENAME}.srt" \
  --output "./subs/${BASENAME}_diarized.srt"

echo "=== 3. Embedding subtitles with FFmpeg ==="
ffmpeg -i "$VIDEO" -i "./subs/${BASENAME}_diarized.srt" \
  -c copy -c:s mov_text \
  "${BASENAME}_final.mp4"

echo "=== 4. Uploading to YouTube ==="
python3 upload_youtube.py \
  --video "${BASENAME}_final.mp4" \
  --srt "./subs/${BASENAME}_diarized.srt" \
  --title "$TITLE" \
  --description "$DESC"

echo "=== Process complete ==="
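
The script calls diarize.py with two positional arguments and an --output option, but its source is not shown in this guide. As a sketch of what its command-line interface could look like, inferred only from that call (the argument names media and srt are our assumptions, and the real script may differ):

```python
import argparse

def build_parser():
    """Command-line interface assumed from the call in automate-cafe.sh."""
    parser = argparse.ArgumentParser(
        description="Add speaker labels to a Whisper .srt file."
    )
    parser.add_argument("media", help="audio or video file to diarize")
    parser.add_argument("srt", help="Whisper subtitles to label")
    parser.add_argument(
        "--output", required=True, help="path of the labelled .srt file"
    )
    return parser

# Parse the same arguments the bash script would pass for a file named cafe.mp4.
args = build_parser().parse_args(
    ["cafe.mp4", "./subs/cafe.srt", "--output", "./subs/cafe_diarized.srt"]
)
print(args.media, args.srt, args.output)
```

After parsing, such a script would run the Pyannote pipeline on the media file, call parse_srt and assign_speakers as shown earlier, and write the result to the output path.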

Hardware requirements

To run the entire workflow on a local machine, the requirements are moderate:

  • CPU: any modern processor (i5/Ryzen 5 or better). CPU-based transcription takes 3 to 5 times the video duration.
  • GPU (recommended): an NVIDIA card with enough VRAM for the Whisper medium model (about 5 GB, as listed in the models table). With a GPU, transcribing a 15-minute video takes roughly 2-3 minutes.
  • RAM: 8 GB minimum, 16 GB recommended.
  • Storage: each edited episode takes between 200 MB and 1 GB depending on resolution.
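
To put the CPU figure in perspective, the 3x to 5x factor can be applied to an episode length. This is plain arithmetic on the numbers above, not a benchmark, and the function name is our own:

```python
def cpu_transcription_minutes(video_minutes, low_factor=3, high_factor=5):
    """Return the (low, high) CPU transcription estimate in minutes."""
    return video_minutes * low_factor, video_minutes * high_factor

low, high = cpu_transcription_minutes(15)
print(f"15-minute episode on CPU: about {low} to {high} minutes")
```

For a 15-minute Café this gives 45 to 75 minutes on CPU, against the 2-3 minutes quoted for a GPU.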

Licenses of the software used

All software mentioned in this guide is free or open source:

  • Whisper: MIT license.
  • Pyannote: MIT license.
  • FFmpeg: LGPL / GPL licenses.
  • google-api-python-client: Apache 2.0 license.

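This means you can use, modify, and redistribute them for both personal and institutional projects, subject to the terms of each license (for example, FFmpeg builds that include GPL components carry copyleft obligations when redistributed). The YouTube service itself is proprietary; only the client library used to talk to it is free software.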

Conclusion

Automating audiovisual production is not just about efficiency: it is also a way to ensure content accessibility. Automatic subtitles allow people with hearing disabilities to follow the talks, and diarization adds context about who is speaking.

The workflow we have described (Whisper, Pyannote, FFmpeg, YouTube API) is built entirely from free tools. Transcription, diarization, and subtitle embedding all run locally, so processing does not depend on third-party services; only the final upload contacts YouTube. The workflow can be adapted to any type of periodic audiovisual content.

If you want to check the full source code of the scripts we use, you can find them in our repository. And if you produce audiovisual content at your faculty or department, feel free to contact the OfiLibre so we can help you set up a similar workflow.