https://github.com/rtfirst/voice-to-text

Cross-platform Push-to-Talk speech-to-text — local Whisper transcription (CUDA/MPS) with optional Anthropic API correction and live VU meter overlay. Windows 11 + macOS.
https://github.com/rtfirst/voice-to-text

cuda macos push-to-talk python speech-to-text voice-input whisper windows

Last synced: 18 days ago
JSON representation

Cross-platform Push-to-Talk speech-to-text — local Whisper transcription (CUDA/MPS) with optional Anthropic API correction and live VU meter overlay. Windows 11 + macOS.

Host: GitHub
URL: https://github.com/rtfirst/voice-to-text
Owner: rtfirst
License: mit
Created: 2026-04-06T05:28:47.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-04-06T06:28:49.000Z (3 months ago)
Last Synced: 2026-04-06T08:42:44.662Z (3 months ago)
Topics: cuda, macos, push-to-talk, python, speech-to-text, voice-input, whisper, windows
Language: Python
Size: 39.1 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

README

# Voice-to-Text

A cross-platform background application for speech input via Push-to-Talk. Transcribes locally using OpenAI Whisper (GPU-accelerated) and optionally corrects text via the Anthropic API (Haiku). The transcribed text is automatically pasted into the active application — editor, browser, or terminal.

Supports **Windows 11** and **macOS**.

## Features

- **Push-to-Talk** — hold a configurable hotkey to record, release to transcribe and paste
- **Local transcription** — runs Whisper on your GPU, no cloud dependency for speech-to-text
- **Optional AI correction** — fixes grammar and punctuation via Anthropic API
- **Live VU meter** — pill-shaped overlay at the bottom of the screen reacts to your voice
- **Smart paste** — auto-detects window type and uses the appropriate paste method
- **Configurable via tray menu** — hotkey, model size, language, auto-correction
- **Persistent settings** — saved to `settings.json`, survives restarts
- **Cross-platform** — Windows (CUDA) and macOS (MPS / CPU)

## Requirements

- Python 3.13+
- [OpenAI Whisper](https://github.com/openai/whisper) with PyTorch

### Windows
- NVIDIA GPU with CUDA support (tested with RTX 3070, 8 GB VRAM)
- PyTorch with CUDA
- Packages: `pywin32`

### macOS
- Apple Silicon (MPS acceleration) or Intel (CPU fallback)
- PyTorch with MPS support (macOS 12.3+)
- Accessibility permissions required (System Settings → Privacy & Security → Accessibility)

### Both platforms
- `openai-whisper`, `anthropic`, `pystray`, `Pillow`, `numpy`, `sounddevice`

## Installation

```bash
pip install sounddevice
```

All other dependencies should already be present (PyTorch, Whisper, Anthropic SDK, etc.).

### macOS additional setup

Grant Accessibility permissions to your terminal or Python to allow global hotkey detection and keystroke simulation:

**System Settings → Privacy & Security → Accessibility** → add Terminal / iTerm2 / Python

### API Key

For optional text correction, an Anthropic API key is required. Create a `.env` file in the project root:

```
ANTHROPIC_API_KEY=sk-ant-...
```

## Usage

```bash
# With console output (for debugging)
python main.py

# Without console window (Windows only)
pythonw main.py
```

| Action | Description |
|--------|-------------|
| **Hold hotkey** | Start recording (VU meter reacts live) |
| **Release hotkey** | Stop recording, transcribe, paste into active app |
| **Right-click tray icon** | Settings and quit |

### VU Meter Overlay

| State | Display |
|-------|---------|
| Idle | Dark segments, semi-transparent |
| Recording | Live level: green → yellow → red |
| Transcribing | All segments yellow |
| Done | All segments green (briefly) |

## Settings (Tray Menu)

- **Hotkey** — Ctrl+Space, Alt+Space, Ctrl+F9, F13, and more
- **Model** — Whisper model size: tiny, small, medium, large-v3, turbo
- **Language** — auto, Deutsch, English
- **Auto-Correction** — text correction via Anthropic API on/off

Settings are saved to `settings.json` and persist across restarts.

## GPU Acceleration

The application auto-detects the best available compute device:

| Platform | Device | Notes |
|----------|--------|-------|
| Windows + NVIDIA | CUDA | Best performance |
| macOS Apple Silicon | MPS | Good performance on M1/M2/M3 |
| Any | CPU | Fallback, slower |

## Autostart

```bash
# Enable
python setup_autostart.py

# Disable
python setup_autostart.py --disable
```

- **Windows**: adds a registry entry under `HKCU\...\Run`
- **macOS**: creates a LaunchAgent plist in `~/Library/LaunchAgents/`

## Project Structure

```
voice-to-text/
main.py Entry point
setup_autostart.py Autostart management (Windows + macOS)
requirements.txt
src/
voice_to_text/
__init__.py
__main__.py python -m voice_to_text
app.py Main orchestration
audio.py Microphone recording (sounddevice)
config.py Configuration + persistent settings
hotkey.py Hotkey dispatcher
overlay.py VU meter overlay (tkinter)
paste.py Paste dispatcher
transcriber.py Whisper + Anthropic API
tray.py System tray menu (pystray)
platform/
__init__.py Platform detection
hotkey_win.py Windows: GetAsyncKeyState polling
hotkey_mac.py macOS: Quartz event tap
paste_win.py Windows: win32 clipboard + SendInput
paste_mac.py macOS: pbcopy + osascript
overlay_win.py Windows: Win32 layered window
overlay_mac.py macOS: tkinter alpha
```

## Paste Detection (Windows)

On Windows, the tool detects the foreground window type:

- **Standard apps** (editor, browser, IDE) → Ctrl+V
- **Windows Terminal** → Ctrl+Shift+V
- **mintty / Git Bash** → Shift+Insert
- **Legacy cmd.exe** → Shift+Insert

On macOS, Cmd+V is used universally.

## License

[MIT](LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/rtfirst/voice-to-text

Awesome Lists containing this project

README