https://github.com/xsa-dev/video_to_text
assembly converter util video-to-text whisper-cpp
- Host: GitHub
- URL: https://github.com/xsa-dev/video_to_text
- Owner: xsa-dev
- Created: 2025-10-18T07:42:01.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-20T20:24:23.000Z (8 months ago)
- Last Synced: 2025-10-26T13:29:24.788Z (8 months ago)
- Topics: assembly, converter, util, video-to-text, whisper-cpp
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Video to Text Converter
A simple utility that extracts the audio track from each video file and transcribes it using either the AssemblyAI API or whisper-cli, saving the resulting text alongside the source video.
## Installation
1. Install the Python dependencies:
```bash
pip install -r requirements.txt
```
2. Install ffmpeg (required for audio extraction):
```bash
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update
sudo apt install ffmpeg
# Windows
# Download from https://ffmpeg.org/download.html
```
3. Choose your transcription method:
### Option A: Using whisper-cli (Recommended)
1. Install whisper-cli:
```bash
# macOS
# The Homebrew formula is named whisper-cpp; it provides the whisper-cli binary
brew install whisper-cpp
# Or build from source: https://github.com/ggerganov/whisper.cpp
```
2. Download a whisper model (e.g., `ggml-large-v3-turbo.bin`) and place it in the project directory, or set `model_path` in the `[whisper]` section of `config.toml` to its location.
### Option B: Using AssemblyAI API
1. Provide your AssemblyAI API key using one of the following methods:
```bash
# Preferred: environment variable
export ASSEMBLYAI_API_KEY="your_api_key_here"
# Optional: add api_key to the [assemblyai] section in config.toml
```
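The two ways of supplying the key can be combined in a single lookup. The following is a minimal sketch, not the project's actual code: the function name `resolve_api_key` is hypothetical, and the precedence shown (environment variable first, then `config.toml`) is an assumption, since the README does not say which source wins when both are set.

```python
import os
from typing import Mapping


def resolve_api_key(config: dict, env: Mapping[str, str] = os.environ) -> str:
    """Return the AssemblyAI API key from the environment or the parsed config.

    Assumption: the environment variable takes precedence over config.toml.
    """
    key = env.get("ASSEMBLYAI_API_KEY", "").strip()
    if not key:
        # Fall back to api_key in the [assemblyai] section of config.toml.
        key = str(config.get("assemblyai", {}).get("api_key", "")).strip()
    if not key:
        raise RuntimeError(
            "No AssemblyAI API key found: set ASSEMBLYAI_API_KEY "
            "or assemblyai.api_key in config.toml"
        )
    return key
```

Failing early with a clear message avoids extracting audio for every video only to have the first upload rejected.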
## Configuration
All settings live in `config.toml`:
- `video.paths` — list of absolute or relative paths to the video files to process
- `transcription.method` — the transcription backend, either `"whisper"` or `"assemblyai"`
- `whisper` — whisper-cli related options (model path, language, threads, etc.)
- `assemblyai` — AssemblyAI related options (API key, speech model, language, formatting)
- `audio` — audio extraction parameters (sample rate, channels, codec)
### Example configuration
```toml
[video]
paths = [
"/path/to/video1.mp4",
"/path/to/video2.mp4"
]
[transcription]
# Method: "whisper" or "assemblyai"
method = "whisper"
[whisper]
# Path to the whisper model file
model_path = "ggml-large-v3-turbo.bin"
# Language code (e.g., "ru", "en", "auto")
language = "ru"
# Number of threads
threads = 16
# Suppress non-speech tokens
suppress_nst = true
# Max context
max_context = 128
# No prompt
no_prompt = true
# Best of N
best_of = 7
[assemblyai]
# Leave blank to rely on ASSEMBLYAI_API_KEY
api_key = ""
speech_model = "universal"
language = "en"
punctuate = true
format_text = true
[audio]
sample_rate = 16000
channels = 1
codec = "pcm_s16le"
```
## Usage
1. Update `config.toml` with video paths and your preferred transcription method.
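2. If using AssemblyAI, obtain an API key from [AssemblyAI](https://www.assemblyai.com/) and set it in the `ASSEMBLYAI_API_KEY` environment variable or as `api_key` in the `[assemblyai]` section of `config.toml`.
3. If using whisper-cli, ensure the model file exists at the path given by `model_path` in the `[whisper]` section (the example configuration expects it in the project directory).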
4. Run the script:
```bash
python main.py
```
## What the script does
1. Loads configuration from `config.toml`.
2. Extracts audio from each video file to WAV using the `[audio]` settings (16 kHz, mono, 16-bit PCM in the example configuration).
3. Transcribes the audio using the selected method (whisper-cli or AssemblyAI).
4. Saves the transcript to a `.txt` file next to the source video.
5. Removes temporary audio files once transcription completes.
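For the whisper-cli path, the steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not the code in `main.py`: the function names are hypothetical, the fallback defaults are illustrative, and only a subset of the `[whisper]` options is mapped to flags (`-m`, `-f`, `-l`, `-t`, `-otxt`, `-of`). It assumes `ffmpeg` and `whisper-cli` are on `PATH`.

```python
import subprocess
import tempfile
from pathlib import Path


def ffmpeg_command(video: Path, wav: Path, audio: dict) -> list[str]:
    """Build the ffmpeg call that extracts the audio track as WAV."""
    return [
        "ffmpeg", "-y",                                     # overwrite existing output
        "-i", str(video),
        "-vn",                                              # drop the video stream
        "-acodec", str(audio.get("codec", "pcm_s16le")),
        "-ar", str(audio.get("sample_rate", 16000)),        # sample rate in Hz
        "-ac", str(audio.get("channels", 1)),               # 1 = mono
        str(wav),
    ]


def whisper_command(wav: Path, out_base: Path, whisper: dict) -> list[str]:
    """Build the whisper-cli call; -of takes the output path without extension."""
    return [
        "whisper-cli",
        "-m", str(whisper["model_path"]),
        "-f", str(wav),
        "-l", str(whisper.get("language", "auto")),
        "-t", str(whisper.get("threads", 4)),
        "-otxt",                                            # write a plain-text transcript
        "-of", str(out_base),
    ]


def transcribe_with_whisper(video: Path, config: dict) -> Path:
    """Extract audio, transcribe it, and return the transcript path."""
    with tempfile.TemporaryDirectory() as tmp:
        wav = Path(tmp) / f"{video.stem}.wav"
        subprocess.run(ffmpeg_command(video, wav, config.get("audio", {})), check=True)
        subprocess.run(
            whisper_command(wav, video.with_suffix(""), config["whisper"]), check=True
        )
    # The temporary directory, and the WAV file in it, is deleted on leaving the with block.
    return video.with_suffix(".txt")
```

Building the commands as argument lists rather than shell strings keeps paths containing spaces intact, and `check=True` stops the run as soon as either tool exits with an error.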
## Output format
For each video file `video.mp4`, the script creates a paired transcript file `video.txt`.
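The pairing rule can be expressed with `pathlib`. The helper name `transcript_path` is hypothetical and only illustrates the naming convention described above.

```python
from pathlib import Path


def transcript_path(video: str) -> Path:
    """Return the transcript path: same directory and stem, .txt extension."""
    return Path(video).with_suffix(".txt")
```

Under this rule only the final extension is replaced, so `clip.final.mkv` maps to `clip.final.txt`. Two videos that share a stem in the same directory (for example `video.mp4` and `video.mkv`) would map to the same `video.txt`.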