This is an automatic translation generated by artificial intelligence. May contain errors.
Automating audiovisual production with free software
At OfiLibre we produce Cafés con OfiLibre every week: informal 15-minute talks about free software and open knowledge. Each episode goes through a pipeline that runs from live recording to multi-platform publication. Previous guides explain how we edit videos with FFmpeg and what the general production process looks like.
This guide focuses on the stage that goes from the edited video to its publication with subtitles, a process we have almost entirely automated with free tools. We will cover how we use Whisper for transcription, Pyannote for speaker identification, FFmpeg for subtitle embedding, and the YouTube API for automated uploads.
Overview of the workflow
The full process can be summarized as follows:
┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│   Edited    │     │   Whisper    │     │   Pyannote   │     │   FFmpeg    │
│   video     │────▶│ Transcription│────▶│ Diarization  │────▶│  Subtitle   │
│   (.mp4)    │     │    (.srt)    │     │ (who speaks) │     │  embedding  │
└─────────────┘     └──────────────┘     └──────────────┘     └──────┬──────┘
                                                                     │
                                                                     ▼
                                                              ┌─────────────┐
                                                              │   YouTube   │
                                                              │     API     │
                                                              │  (upload)   │
                                                              └─────────────┘
Each step runs from the command line and can be chained in a single script. Let’s look at each tool in detail.
Whisper: automatic speech recognition
Whisper is a speech recognition model developed by OpenAI and released under the MIT license. Once the model weights have been downloaded, it runs locally: no audio is uploaded to external servers, which makes it well suited to university environments where content privacy matters.
Installation
Whisper is installed as a Python package. Using a virtual environment is recommended:
python3 -m venv whisper-env
source whisper-env/bin/activate
pip install openai-whisper
To leverage a GPU (which speeds up the process 5 to 10 times), you need NVIDIA drivers and CUDA installed. Without a GPU, Whisper still works on CPU, just more slowly.
Basic usage
The most straightforward way to transcribe a file is:
whisper cafe-2026-05-14.mp4 --language es --model medium --output_format srt
This generates an .srt file with synchronized subtitles. The most relevant parameters are:
- --language es: specifies the audio language. If omitted, Whisper detects it automatically, but specifying it improves accuracy.
- --model medium: selects the model size. Available options are tiny, base, small, medium, and large. The medium model offers a good balance between accuracy and speed for Spanish.
- --output_format srt: output format. Also supports vtt, txt, tsv, and json.
Available models
| Model | Parameters | VRAM required | Relative speed (vs. large) |
|---|---|---|---|
| tiny | 39 M | ~1 GB | ~32x |
| base | 74 M | ~1 GB | ~16x |
| small | 244 M | ~2 GB | ~6x |
| medium | 769 M | ~5 GB | ~2x |
| large | 1550 M | ~10 GB | 1x |
For Cafés con OfiLibre we use medium because it provides good quality in Spanish without requiring a high-end GPU. If your machine has a card with at least 10 GB of VRAM, large will deliver even better results.
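The choice described above can be expressed as a small lookup. This is only a sketch: the VRAM figures are the approximate values from the table, and the helper name pick_model is hypothetical, not part of Whisper.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above (smallest to largest).
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(available_vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits, or None."""
    fitting = [name for name, need in MODEL_VRAM_GB.items()
               if need <= available_vram_gb]
    # Dicts keep insertion order, so the last fitting entry is the largest model.
    return fitting[-1] if fitting else None


if __name__ == "__main__":
    print(pick_model(6))   # a 6 GB card fits up to "medium"
```

The result can be passed to the --model option; on a machine without a GPU the table does not apply and the choice becomes a trade-off against transcription time.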
Example output (.srt)
1
00:00:01,000 --> 00:00:04,500
Welcome to a new Café con OfiLibre.

2
00:00:04,500 --> 00:00:08,200
Today we are going to talk about free software licenses.

3
00:00:08,200 --> 00:00:12,800
We have our guest with us, who is going
to share her experience with the project.
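To make the format concrete, here is a minimal sketch that builds the same kind of .srt text from timed segments. It assumes each segment is a dict with start and end in seconds plus a text field, which mirrors the segments returned by Whisper's Python API; the function names and the sample data are illustrative only.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT 'HH:MM:SS,mmm' form."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Serialize segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [  # made-up segments for illustration
        {"start": 1.0, "end": 4.5, "text": " Welcome to a new Café con OfiLibre."},
        {"start": 4.5, "end": 8.2, "text": " Today we talk about licenses."},
    ]
    print(segments_to_srt(sample))
```

The blank line between blocks matters: SRT parsers rely on it to tell where one subtitle ends and the next begins.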
Pyannote: knowing who speaks when
The transcription Whisper generates does not distinguish between speakers. To know who says what at each moment, we use Pyannote, a Python library specialized in speaker diarization, released under the MIT license.
Diarization is the process of segmenting audio by the people who speak in it, labeling each fragment with a speaker identifier (SPEAKER_00, SPEAKER_01, etc.).
Installation
pip install pyannote.audio
Pyannote requires accepting the model usage terms on Hugging Face and obtaining an access token.
Usage from Python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HUGGINGFACE_TOKEN"
)

diarization = pipeline("cafe-2026-05-14.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f} - {turn.end:.1f}: {speaker}")
The output looks like this:
0.0 - 4.5: SPEAKER_00
4.5 - 12.8: SPEAKER_01
12.8 - 15.3: SPEAKER_00
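The example above reads cafe-2026-05-14.wav, whereas the workflow starts from an .mp4 file. One possible way to obtain the audio track, assuming ffmpeg is on the PATH, is sketched below. The choice of 16 kHz mono is an assumption made to keep the file small, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path: str, wav_path: str) -> list:
    """Build an ffmpeg command that writes the audio track as 16 kHz mono WAV."""
    return [
        "ffmpeg", "-y",      # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",               # drop the video stream
        "-ac", "1",          # downmix to one channel
        "-ar", "16000",      # resample to 16 kHz
        wav_path,
    ]


def extract_audio(video_path: str) -> str:
    """Run ffmpeg and return the path of the generated .wav file."""
    wav_path = str(Path(video_path).with_suffix(".wav"))
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
    return wav_path


if __name__ == "__main__":
    print(" ".join(build_extract_command("cafe-2026-05-14.mp4", "cafe-2026-05-14.wav")))
```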
Combining Whisper and Pyannote
The key step is cross-referencing Whisper subtitle timestamps with Pyannote diarization segments. For each subtitle block, we find the speaker with the greatest temporal overlap and assign their label:
import re

def parse_srt(srt_path):
    """Reads an .srt file and returns a list of (index, start, end, text) tuples."""
    with open(srt_path, encoding="utf-8") as f:
        content = f.read()
    pattern = r"(\d+)\n(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\n(.*?)(?=\n\n|\Z)"
    return re.findall(pattern, content, re.DOTALL)

def time_to_seconds(t):
    """Converts '00:01:23,456' to seconds."""
    h, m, rest = t.split(":")
    s, ms = rest.split(",")
    return int(h)*3600 + int(m)*60 + int(s) + int(ms)/1000

def assign_speakers(srt_blocks, diarization):
    """Assigns a speaker to each subtitle block."""
    result = []
    for idx, start, end, text in srt_blocks:
        s = time_to_seconds(start)
        e = time_to_seconds(end)
        best_speaker = "UNKNOWN"
        best_overlap = 0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            # Overlap in seconds; zero or negative when the two intervals do not intersect.
            overlap = min(e, turn.end) - max(s, turn.start)
            if overlap > best_overlap:
                best_overlap = overlap
                best_speaker = speaker
        result.append((idx, start, end, f"[{best_speaker}] {text}"))
    return result
The result is an .srt file where each block indicates who is speaking:
1
00:00:01,000 --> 00:00:04,500
[SPEAKER_00] Welcome to a new Café con OfiLibre.

2
00:00:04,500 --> 00:00:08,200
[SPEAKER_00] Today we are going to talk about licenses.

3
00:00:08,200 --> 00:00:12,800
[SPEAKER_01] Thank you very much for the invitation.
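assign_speakers returns a list of tuples rather than a file, so one more step is needed to obtain an .srt like the one above. A minimal sketch of that step follows; the names blocks_to_srt and write_srt are hypothetical helpers, not part of any library.

```python
def blocks_to_srt(blocks) -> str:
    """Serialize (index, start, end, text) tuples into SRT text."""
    return "\n\n".join(
        f"{idx}\n{start} --> {end}\n{text}" for idx, start, end, text in blocks
    ) + "\n"


def write_srt(blocks, out_path: str) -> None:
    """Write the serialized blocks to out_path as UTF-8."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(blocks_to_srt(blocks))


if __name__ == "__main__":
    labelled = [  # same shape as the output of assign_speakers
        ("1", "00:00:01,000", "00:00:04,500", "[SPEAKER_00] Welcome."),
        ("2", "00:00:04,500", "00:00:08,200", "[SPEAKER_01] Thank you."),
    ]
    print(blocks_to_srt(labelled))
```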
The generic identifiers (SPEAKER_00, SPEAKER_01) can be manually replaced with the actual participant names using a simple find-and-replace:
sed -i 's/SPEAKER_00/Jesús/g; s/SPEAKER_01/María/g' subtitles.srt
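The same substitution can also be done in Python, which is convenient when the names come from a per-episode mapping. The sketch below is one possible approach: it matches only the bracketed tags added by assign_speakers, so a label that is not in the mapping is left untouched. The function name and the example mapping are illustrative.

```python
import re


def rename_speakers(srt_text: str, names: dict) -> str:
    """Replace [SPEAKER_NN] tags with real names; unknown labels stay as they are."""
    def swap(match):
        label = match.group(1)
        return f"[{names.get(label, label)}]"

    return re.sub(r"\[(SPEAKER_\d+)\]", swap, srt_text)


if __name__ == "__main__":
    text = "[SPEAKER_00] Welcome.\n[SPEAKER_01] Thank you."
    print(rename_speakers(text, {"SPEAKER_00": "Jesús", "SPEAKER_01": "María"}))
```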
FFmpeg: embedding subtitles in the video
Once we have the .srt file with the transcription and identified speakers, we use FFmpeg to embed the subtitles in the video. We have already covered FFmpeg in depth in our video generation guide, so here we focus on subtitling.
There are two ways to add subtitles:
Soft subtitles (soft subs)
They are packaged inside the video container as a separate track. The viewer can turn them on or off:
ffmpeg -i video.mp4 -i subtitles.srt \
-c copy -c:s mov_text \
video_with_subs.mp4
Hard subtitles (hard subs)
They are “burned” directly onto the video image. They are always visible and cannot be toggled off, but they guarantee visibility in any player:
ffmpeg -i video.mp4 \
-vf "subtitles=subtitles.srt:force_style='FontSize=22,PrimaryColour=&H00FFFFFF,OutlineColour=&H00000000,Outline=2'" \
-c:a copy \
subtitled_video.mp4
The force_style parameters allow customization of subtitle appearance using ASS/SSA syntax. The most useful are:
- FontSize: font size.
- PrimaryColour: text color in &HAABBGGRR format.
- OutlineColour: outline color.
- Outline: outline thickness in pixels.
- FontName: typeface (must be installed on the system).
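The &HAABBGGRR format is easy to get wrong because the byte order is reversed with respect to the usual #RRGGBB notation, and the leading alpha byte is 00 for fully opaque. A small conversion helper, written here only as an illustration (rgb_to_ass is a hypothetical name), makes the mapping explicit:

```python
def rgb_to_ass(hex_rgb: str, alpha: int = 0) -> str:
    """Convert '#RRGGBB' to the ASS '&HAABBGGRR' form (alpha 0 means opaque)."""
    value = hex_rgb.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected six hex digits, got {hex_rgb!r}")
    int(value, 16)  # raises ValueError if the string is not valid hexadecimal
    if not 0 <= alpha <= 255:
        raise ValueError("alpha must be between 0 and 255")
    rr, gg, bb = value[0:2], value[2:4], value[4:6]
    # ASS stores the channels as alpha, blue, green, red.
    return f"&H{alpha:02X}{bb}{gg}{rr}".upper()


if __name__ == "__main__":
    print(rgb_to_ass("#FFFFFF"))  # white text
    print(rgb_to_ass("#FF8800"))  # orange: red and blue bytes swap places
```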
For Cafés con OfiLibre we use soft subs for YouTube (the platform handles them natively) and hard subs for versions distributed through other channels.
YouTube API: automated uploads
The last step in the workflow is uploading the edited and subtitled video to YouTube. The YouTube Data API v3 allows automating this task from a script.
Prerequisites
Before using the API you need to create a project in the Google Cloud Console, enable the YouTube Data API v3, and obtain OAuth 2.0 credentials. The process is:
- Create a project in Google Cloud Console.
- Enable the “YouTube Data API v3”.
- Configure the OAuth consent screen.
- Create “OAuth client ID” credentials for a desktop application.
- Download the client_secrets.json file.
Dependencies
pip install google-api-python-client google-auth-oauthlib
Upload script
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

# youtube.upload covers videos.insert; captions.insert requires youtube.force-ssl.
SCOPES = [
    "https://www.googleapis.com/auth/youtube.upload",
    "https://www.googleapis.com/auth/youtube.force-ssl",
]

def authenticate():
    """Authenticates with OAuth 2.0 and returns the service."""
    flow = InstalledAppFlow.from_client_secrets_file(
        "client_secrets.json", SCOPES
    )
    credentials = flow.run_local_server(port=0)
    return build("youtube", "v3", credentials=credentials)

def upload_video(youtube, filepath, title, description, tags):
    """Uploads a video to YouTube and returns its video ID."""
    body = {
        "snippet": {
            "title": title,
            "description": description,
            "tags": tags,
            "categoryId": "27",  # Education
            "defaultLanguage": "es",
        },
        "status": {
            "privacyStatus": "unlisted",  # Unlisted initially
            "selfDeclaredMadeForKids": False,
        },
    }
    # chunksize=-1 sends the file in a single request; resumable=True allows retrying.
    media = MediaFileUpload(filepath, chunksize=-1, resumable=True)
    request = youtube.videos().insert(
        part="snippet,status", body=body, media_body=media
    )
    response = request.execute()
    print(f"Video uploaded: https://www.youtube.com/watch?v={response['id']}")
    return response["id"]

def upload_captions(youtube, video_id, srt_path, language="es"):
    """Uploads subtitles to an already uploaded video."""
    body = {
        "snippet": {
            "videoId": video_id,
            "language": language,
            "name": f"Subtitles ({language})",
        }
    }
    media = MediaFileUpload(srt_path)
    youtube.captions().insert(
        part="snippet", body=body, media_body=media
    ).execute()
    print(f"Subtitles ({language}) uploaded successfully.")
Usage
youtube = authenticate()

video_id = upload_video(
    youtube,
    filepath="cafe-2026-05-14-edited.mp4",
    title="Free software licenses - Café con OfiLibre",
    description="Talk about the most common free license types.",
    tags=["free software", "OfiLibre", "URJC", "licenses"],
)

upload_captions(youtube, video_id, "subtitles.srt", language="es")
The first time it runs, a browser window will open for authorization. Credentials can be saved to a local file so that subsequent runs require no manual intervention.
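One common way to cache the credentials is sketched below. It is not necessarily how our scripts do it. The file name token.json and the function name load_credentials are assumptions; the calls used (Credentials.from_authorized_user_file, refresh, to_json) belong to the google-auth library that is installed together with the dependencies listed earlier.

```python
from pathlib import Path


def load_credentials(scopes, client_secrets="client_secrets.json",
                     token_path="token.json"):
    """Return OAuth credentials, reusing a cached token file when possible."""
    # Imported here so the Google libraries are only required when this runs.
    from google.auth.transport.requests import Request
    from google.oauth2.credentials import Credentials
    from google_auth_oauthlib.flow import InstalledAppFlow

    creds = None
    if Path(token_path).exists():
        creds = Credentials.from_authorized_user_file(token_path, scopes)

    if creds and creds.expired and creds.refresh_token:
        # Silent renewal: no browser window is needed.
        creds.refresh(Request())
    elif not creds or not creds.valid:
        # First run, or the cached token is unusable: ask for consent in the browser.
        flow = InstalledAppFlow.from_client_secrets_file(client_secrets, scopes)
        creds = flow.run_local_server(port=0)

    # Save the (possibly refreshed) token for the next run.
    Path(token_path).write_text(creds.to_json(), encoding="utf-8")
    return creds
```

The returned object can be passed to build("youtube", "v3", credentials=...) in place of the credentials obtained in authenticate(). The token file grants access to the channel, so it should be kept out of version control.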
Putting it all together: the complete script
In practice, the entire workflow is chained in a bash script that takes the edited video file as a parameter and runs all steps sequentially:
#!/bin/bash
# automate-cafe.sh - Complete post-production workflow
# Usage: ./automate-cafe.sh edited_video.mp4 "Cafe title" "Description"
set -euo pipefail  # stop at the first failing step or missing argument

VIDEO="$1"
TITLE="$2"
DESC="$3"
# Whisper names its output after the file name alone, so drop any directory part.
BASENAME="$(basename "${VIDEO%.*}")"

echo "=== 1. Transcribing with Whisper ==="
whisper "$VIDEO" --language es --model medium --output_format srt \
  --output_dir ./subs/

echo "=== 2. Diarizing with Pyannote ==="
python3 diarize.py "$VIDEO" "./subs/${BASENAME}.srt" \
  --output "./subs/${BASENAME}_diarized.srt"

echo "=== 3. Embedding subtitles with FFmpeg ==="
ffmpeg -i "$VIDEO" -i "./subs/${BASENAME}_diarized.srt" \
  -c copy -c:s mov_text \
  "${BASENAME}_final.mp4"

echo "=== 4. Uploading to YouTube ==="
python3 upload_youtube.py \
  --video "${BASENAME}_final.mp4" \
  --srt "./subs/${BASENAME}_diarized.srt" \
  --title "$TITLE" \
  --description "$DESC"

echo "=== Process complete ==="
Hardware requirements
To run the entire workflow on a local machine, the requirements are moderate:
- CPU: any modern processor (i5/Ryzen 5 or better). CPU-based transcription takes 3 to 5 times the video duration.
- GPU (recommended): an NVIDIA card with around 5 GB of VRAM for the Whisper medium model (see the model table above). With a GPU, transcribing a 15-minute video takes roughly 2-3 minutes.
- RAM: 8 GB minimum, 16 GB recommended.
- Storage: each edited episode takes between 200 MB and 1 GB depending on resolution.
Licenses of the software used
All software mentioned in this guide is free or open source:
- Whisper: MIT license.
- Pyannote: MIT license.
- FFmpeg: LGPL / GPL licenses.
- google-api-python-client: Apache 2.0 license.
This means you can use, modify, and redistribute them freely, for both personal and institutional projects.
Conclusion
Automating audiovisual production is not just about efficiency; it also helps make content accessible. Automatic subtitles allow people with hearing disabilities to follow the talks, and diarization adds context about who is speaking.
The workflow we have described (Whisper, Pyannote, FFmpeg, YouTube API) is built on free tools, performs all transcription, diarization, and subtitle processing locally without third-party services, and can be adapted to any type of periodic audiovisual content.
If you want to check the full source code of the scripts we use, you can find them in our repository. And if you produce audiovisual content at your faculty or department, feel free to contact OfiLibre so we can help you set up a similar workflow.