https://github.com/idvoretskyi/voice-transcriber
Single-binary Ukrainian media-to-text transcription powered by Google Gemini via Vertex AI
- Host: GitHub
- URL: https://github.com/idvoretskyi/voice-transcriber
- Owner: idvoretskyi
- License: mit
- Created: 2025-07-11T18:43:36.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2026-04-14T20:08:34.000Z (about 2 months ago)
- Last Synced: 2026-04-14T20:23:36.218Z (about 2 months ago)
- Language: Go
- Homepage: https://github.com/idvoretskyi/ukrainian-voice-transcriber
- Size: 14.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Security: SECURITY.md
# Voice Transcriber
Single-binary media-to-text transcription with automatic language detection, powered by Google Gemini via Vertex AI.
## Features
- **Automatic language detection** — Gemini identifies the spoken language from audio (default)
- Specify language explicitly with `--language` using an ISO 639-1 code (e.g. `uk`, `en`, `de`)
- Accepts **audio and video files** as input
- **No Cloud Storage required** — audio bytes sent inline to Gemini
- FFmpeg used only for video-to-audio extraction; audio files go straight to Gemini
- Handles files up to ~8.4 hours in a single request (no chunking)
- Selectable Gemini model via `--model` flag (default: `gemini-3.1-flash-lite-preview`)
- Single static binary — no extra runtime dependencies beyond FFmpeg for video
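
The video path described above (FFmpeg extracts the audio, then the audio goes to Gemini) could be sketched roughly as follows. This is an illustration only: the function names `ffmpegArgs` and `extractAudio`, the mono 16 kHz WAV target, and the exact FFmpeg options are assumptions, not the project's actual invocation.

```go
package main

import (
	"fmt"
	"os/exec"
)

// ffmpegArgs builds an argument list that drops the video stream and writes
// mono 16 kHz WAV audio. The sample rate, channel count and container are
// assumptions made for this sketch, not settings confirmed by the project.
func ffmpegArgs(videoPath, audioPath string) []string {
	return []string{
		"-y",            // overwrite the output file if it already exists
		"-i", videoPath, // input video file
		"-vn",          // discard the video stream
		"-ac", "1",     // downmix to a single channel
		"-ar", "16000", // resample to 16 kHz
		"-f", "wav",    // force the WAV container
		audioPath,
	}
}

// extractAudio runs ffmpeg and includes its combined output in the error
// when the command fails. It requires ffmpeg to be on PATH.
func extractAudio(videoPath, audioPath string) error {
	out, err := exec.Command("ffmpeg", ffmpegArgs(videoPath, audioPath)...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ffmpeg failed: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Print the command line instead of running it, so the sketch works
	// without ffmpeg installed. Call extractAudio to perform the extraction.
	fmt.Println("ffmpeg", ffmpegArgs("input/meeting.mp4", "meeting.wav"))
}
```

Keeping the argument list in its own function makes the FFmpeg options easy to inspect and test without spawning a process.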
## Quick Start
### Prerequisites
```bash
# Go 1.25+
brew install go                 # macOS
# sudo apt install golang-go    # Ubuntu/Debian (the distro package may be older than 1.25; see https://go.dev/dl/)

# FFmpeg (only required for video files)
brew install ffmpeg             # macOS
# sudo apt install ffmpeg       # Ubuntu/Debian
```
### Install
```bash
go install github.com/idvoretskyi/voice-transcriber/cmd/voice-transcriber@latest
```
This installs the `voice-transcriber` binary into `$(go env GOPATH)/bin` (or into `$GOBIN` if that is set). Add that directory to your `PATH` if it is not already there:
```bash
export PATH="$(go env GOPATH)/bin:$PATH"
```
### Google Cloud setup
```bash
# Authenticate
gcloud auth login
gcloud auth application-default login
# Set project and enable Vertex AI
gcloud config set project YOUR_PROJECT_ID
gcloud services enable aiplatform.googleapis.com
```
The project is also read from the `GOOGLE_CLOUD_PROJECT` environment variable if set.
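
As a rough illustration of that lookup, the sketch below returns the project ID from `GOOGLE_CLOUD_PROJECT` when it is set and otherwise uses a fallback value, such as the one configured with `gcloud config set project`. The function name `resolveProject` and the precedence (environment variable first) are assumptions; the README does not state which source wins when both are present.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// resolveProject returns the first non-empty project ID among the
// GOOGLE_CLOUD_PROJECT environment variable and a fallback value.
// The precedence is an assumption made for this sketch.
// getenv is injected so the lookup can be tested without touching the
// real process environment.
func resolveProject(getenv func(string) string, fallback string) (string, error) {
	if p := strings.TrimSpace(getenv("GOOGLE_CLOUD_PROJECT")); p != "" {
		return p, nil
	}
	if p := strings.TrimSpace(fallback); p != "" {
		return p, nil
	}
	return "", errors.New("no Google Cloud project configured")
}

func main() {
	project, err := resolveProject(os.Getenv, "")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(project)
}
```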
### Usage
```bash
# Transcribe a video file (language auto-detected, audio extracted via FFmpeg)
voice-transcriber transcribe input/meeting.mp4
# Transcribe an audio file directly
voice-transcriber transcribe input/interview.mp3
# Specify output file
voice-transcriber transcribe input/meeting.mp4 -o transcript.txt
# Force a specific language (ISO 639-1 code)
voice-transcriber transcribe input/meeting.mp4 --language uk
# Use a different model or location
voice-transcriber transcribe input/meeting.mp4 --model gemini-3-flash-preview
voice-transcriber transcribe input/meeting.mp4 --model gemini-2.5-flash --location us-central1
# Show version
voice-transcriber version
```
## CLI Reference
```
Usage:
  voice-transcriber transcribe [media-file] [flags]
  voice-transcriber version

Flags:
      --language string   Language for transcription: 'auto' for automatic detection,
                          or ISO 639-1 code (e.g. uk, en, de) (default: auto)
      --model string      Gemini model to use
                          (default: gemini-3.1-flash-lite-preview)
      --location string   Vertex AI location; Gemini 3.x models require global
                          (default: global)
  -o, --output string     Output file path
                          (default: output//.txt)
  -v, --verbose           Enable verbose output
  -q, --quiet             Suppress all output except results
```
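
To make the defaults and constraints in the reference concrete, here is a minimal sketch that parses the three string flags with Go's standard `flag` package and checks two rules stated above: the language is `auto` or a two-letter code, and Gemini 3.x models need the `global` location. Everything here is hypothetical: the real CLI may use a different flag library, the `options` type and function names are invented for illustration, the language check only tests the shape of the code rather than the ISO 639-1 registry, and treating a `gemini-3` name prefix as "3.x" is an assumption. Note that the standard `flag` package stops at the first positional argument, so flags must come before the media file in this sketch.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// options holds the subset of documented flags covered by this sketch.
type options struct {
	language, model, location string
}

// parseFlags registers the flags with the defaults listed in the CLI
// reference, parses args, and validates the result.
func parseFlags(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("transcribe", flag.ContinueOnError)
	fs.StringVar(&o.language, "language", "auto", "'auto' or an ISO 639-1 code")
	fs.StringVar(&o.model, "model", "gemini-3.1-flash-lite-preview", "Gemini model to use")
	fs.StringVar(&o.location, "location", "global", "Vertex AI location")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	return o, validate(o)
}

// validate applies the two constraints described in the reference.
func validate(o options) error {
	if o.language != "auto" && !isTwoLowerLetters(o.language) {
		return fmt.Errorf("language %q is neither 'auto' nor a two-letter code", o.language)
	}
	// Assumption: 3.x models are recognised by the "gemini-3" name prefix.
	if strings.HasPrefix(o.model, "gemini-3") && o.location != "global" {
		return fmt.Errorf("model %q requires --location global", o.model)
	}
	return nil
}

// isTwoLowerLetters reports whether s consists of exactly two ASCII
// lowercase letters. It does not consult the ISO 639-1 registry.
func isTwoLowerLetters(s string) bool {
	if len(s) != 2 {
		return false
	}
	for i := 0; i < len(s); i++ {
		if s[i] < 'a' || s[i] > 'z' {
			return false
		}
	}
	return true
}

func main() {
	o, err := parseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("language=%s model=%s location=%s\n", o.language, o.model, o.location)
}
```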
## Supported Formats
| Type | Extensions |
|------|------------|
| **Audio** — sent directly to Gemini | `.wav` `.mp3` `.flac` `.ogg` `.m4a` `.aac` `.webm` `.pcm` |
| **Video** — audio extracted via FFmpeg | `.mp4` `.mkv` `.mov` `.avi` `.wmv` `.flv` `.ts` `.mpeg` `.3gp` |
Extension matching is case-insensitive. Maximum file size: 10 GB.
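
The table and the case-insensitivity rule can be sketched as a small lookup. The extension sets below are copied from the table; the `mediaKind` type and the `classify` function are hypothetical names used for illustration and may not match the project's code.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// mediaKind says how a file would be handled.
type mediaKind int

const (
	kindUnsupported mediaKind = iota
	kindAudio                 // sent directly to Gemini
	kindVideo                 // audio extracted via FFmpeg first
)

// Extension sets taken from the Supported Formats table.
var audioExts = map[string]bool{
	".wav": true, ".mp3": true, ".flac": true, ".ogg": true,
	".m4a": true, ".aac": true, ".webm": true, ".pcm": true,
}

var videoExts = map[string]bool{
	".mp4": true, ".mkv": true, ".mov": true, ".avi": true, ".wmv": true,
	".flv": true, ".ts": true, ".mpeg": true, ".3gp": true,
}

// classify lowercases the file extension before the lookup, which makes
// the match case-insensitive.
func classify(path string) mediaKind {
	ext := strings.ToLower(filepath.Ext(path))
	switch {
	case audioExts[ext]:
		return kindAudio
	case videoExts[ext]:
		return kindVideo
	}
	return kindUnsupported
}

func main() {
	for _, p := range []string{"input/interview.mp3", "input/MEETING.MP4", "notes.txt"} {
		fmt.Println(p, classify(p))
	}
}
```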
## Building from Source
```bash
git clone https://github.com/idvoretskyi/voice-transcriber.git
cd voice-transcriber
make build # produces ./voice-transcriber
make test # run tests with race detector
make lint # run golangci-lint
```
## License
MIT — see [LICENSE](LICENSE) for details.