An open API service indexing awesome lists of open source software.

https://github.com/aimer1124/local-voice-input

๐ŸŽ™๏ธ ๅฎŒๅ…จ็ฆป็บฟ็š„ๆœฌๅœฐ AI ่ฏญ้Ÿณ Prompt ่พ“ๅ…ฅๅ™จ โ€” Whisper + Ollama + Raycast๏ผŒ้›ถไบ‘็ซฏไพ่ต–๏ผŒ้š็ง 100% ่‡ชๆŽง
https://github.com/aimer1124/local-voice-input

bash llm local-ai macos offline-ai ollama privacy prompt-engineering raycast swift voice-input whisper whisper-cpp

Last synced: 23 days ago
JSON representation

๐ŸŽ™๏ธ ๅฎŒๅ…จ็ฆป็บฟ็š„ๆœฌๅœฐ AI ่ฏญ้Ÿณ Prompt ่พ“ๅ…ฅๅ™จ โ€” Whisper + Ollama + Raycast๏ผŒ้›ถไบ‘็ซฏไพ่ต–๏ผŒ้š็ง 100% ่‡ชๆŽง

Awesome Lists containing this project

README

          

# ๐ŸŽ™๏ธ local-voice-input โ€” Local AI Voice Prompt Input

[![CI](https://github.com/aimer1124/local-voice-input/actions/workflows/release.yml/badge.svg)](https://github.com/aimer1124/local-voice-input/actions/workflows/release.yml)
[![Latest release](https://img.shields.io/github/v/release/aimer1124/local-voice-input?color=brightgreen)](https://github.com/aimer1124/local-voice-input/releases/latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](./LICENSE)
[![Platform](https://img.shields.io/badge/platform-macOS%2013%2B%20%7C%20Apple%20Silicon-lightgrey)](#requirements)
[![Roadmap](https://img.shields.io/badge/roadmap-v1.1%20%E2%86%92%20v2.0-blue)](./ROADMAP.md)

> **Fully offline** voice-to-text tool for programmers: press a hotkey, speak, get cleaned-up text pasted at your cursor.
>
> _CLI tool name: `vinput`_

**100% local. Audio never leaves your machine. No cloud, no accounts.**

[ไธญๆ–‡ README](./README.md) ยท [CHANGELOG](./CHANGELOG.md) ยท [ROADMAP](./ROADMAP.md) ยท [Contributing](./CONTRIBUTING.md)

![demo](./assets/demo.gif)

---

## โœจ Features

- ๐Ÿ”’ **Fully offline**: Whisper.cpp + Ollama, your voice stays on-device
- ๐Ÿง  **AI intent refinement**: local LLM filters fillers and self-corrections, turns rambling into a structured prompt
- โšก **Fast**: 10s of Chinese audio โ†’ text in 2โ€“4s
- ๐ŸŽ›๏ธ **Toggle hotkey**: press to start, press again to stop โ€” intuitive
- ๐Ÿ–ฅ๏ธ **Multi-monitor HUD**: frosted-glass overlay near the bottom-center of whichever screen your cursor is on
- ๐ŸŽฏ **Hotword-aware**: customizable hotwords boost recognition for technical terms
- ๐Ÿ”Œ **Zero-config**: works out of the box; every knob is configurable

---

## ๐ŸŽฌ Flow

```
Press โŒ˜โ‡งSpace (Raycast hotkey)
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๐ŸŽ™๏ธ Recording... โ”‚ โ† Pop sound
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ Speak your prompt
โ–ผ
Press โŒ˜โ‡งSpace again โ† Tink sound
โ”‚
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๐Ÿ’ญ Transcribing... โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ โ†“ Whisper.cpp
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ ๐Ÿค– Polishing with AI... โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚ โ†“ Ollama (qwen2.5:3b)
โ”‚ โ†“ pbcopy + โŒ˜V
โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ โœ“ Done โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚
โ–ผ
Cleaned text appears at cursor
```

---

## ๐Ÿง  How It Works

### Architecture

```
Raycast Global Hotkey
โ””โ”€ voice-input.sh (Raycast Script Command)
โ””โ”€ vinput_bg.sh (main logic)
โ”‚
โ”œโ”€ Lock dir (atomic mkdir) = mutex + state machine
โ”‚ โ”œโ”€ First press โ†’ record mode
โ”‚ โ””โ”€ Second press โ†’ toggle mode (SIGINT to rec)
โ”‚
โ”œโ”€ SoX rec โ†’ Whisper-cli โ†’ Ollama โ†’ pbcopy + โŒ˜V
โ”‚
โ”œโ”€ 30s guard timeout (safety)
โ”‚
โ””โ”€ HUD (Swift binary) โ†’ screen-center overlay
```

### Key Mechanisms

#### Toggle Lock (mkdir + PID file)

`/tmp/vinput.lock.d` doubles as a mutex AND state machine:

| State | mkdir | PID file | Behavior |
|---|---|---|---|
| Fresh session | succeeds | created | enter record mode |
| Mid-recording press | fails | exists | toggle mode, SIGINT to rec |
| Mid-transcription press | fails | gone | "still processing previous" HUD |
| Stale lock (crash) | fails | dead PID | auto-clean, restart |

`mkdir` is atomic and portable (macOS has no `flock`).

#### Recording Control (USE_VAD)

**Default USE_VAD=0** (recommended):
- `rec` starts capturing immediately regardless of volume
- Second hotkey press โ†’ instant SIGINT stop
- 30s hard timeout as safety net

**Optional USE_VAD=1**:
- SoX silence filter: stop after 1.5s of silence
- Only works in quiet environments

#### Whisper Transcription

- Model: `ggml-large-v3-turbo-q5_0` (547MB, Metal backend)
- Hotwords injected via `--prompt` for better technical term recognition
- Forced `-l zh` for Chinese, English terms guided by prompt

#### LLM Refinement (with short-text skip)

- Text shorter than `SHORT_TEXT_THRESHOLD` (15 chars default) โ†’ skip LLM, save 1โ€“2s
- Longer text โ†’ Ollama with `keep_alive=30m` for warm model
- Fallback to raw Whisper output on failure

#### Screen HUD

- ~90 lines of Swift compiled to 92KB binary
- `NSVisualEffectView` with `.hudWindow` material (matches system volume HUD)
- `/tmp/vinput_hud.pid` for singleton pattern: new HUD kills previous
- Mouse-passthrough, cross-Space, auto-width
- Multi-screen aware via `NSEvent.mouseLocation`

#### UTF-8 Encoding

Raycast-spawned processes don't inherit Terminal's LANG, causing `pbcopy` to misinterpret UTF-8 bytes. Script forces:
```bash
export LANG="${LANG:-en_US.UTF-8}"
export LC_ALL="${LC_ALL:-en_US.UTF-8}"
```

---

## ๐Ÿ“ฆ Installation

### Homebrew Tap (recommended ยท v1.1.1+)

```bash
brew tap aimer1124/tap
brew install local-voice-input
vinput setup
```

`brew install` only deploys scripts and the HUD binary to `libexec/`. The follow-up `vinput setup` is the bootstrap step that:

1. Installs `sox`, `jq`, `whisper-cpp`, `ollama` via brew (skips if present)
2. Downloads Whisper `large-v3-turbo-q5_0` (~547MB)
3. Symlinks `vinput / vinput.sh / vinput_bg.sh / hud` into `~/.whisper_models/`
4. Writes default config `~/.config/vinput.conf` + hotwords list
5. Copies the Raycast command script to `~/.config/raycast-scripts/`
6. Starts the Ollama service, pulls `qwen2.5:3b` (~2GB), pre-warms it

Upgrades: `brew upgrade local-voice-input` (the scripts are symlinks, so version bumps land instantly).

### From source (developer / no brew tap)

```bash
git clone https://github.com/aimer1124/local-voice-input.git
cd local-voice-input
./install.sh
```

`install.sh` covers the same ground as `vinput setup` but without going through the brew tap: it installs deps with `brew install` directly, downloads the Whisper model, compiles or downloads the HUD binary, copies scripts to `~/.whisper_models/`, and warms up Ollama.

### Manual steps (cannot be automated)

1. **Privacy โ†’ Microphone**: enable Raycast (sox will prompt on first use)
2. **Privacy โ†’ Accessibility**: enable Raycast (for auto โŒ˜V paste)
3. **Raycast Settings โ†’ Extensions โ†’ Script Commands**:
- Add Script Directory: `~/.config/raycast-scripts`
- Bind a hotkey to `๐ŸŽ™๏ธ ่ฏญ้Ÿณ่พ“ๅ…ฅ` (recommended: `โŒ˜โ‡งSpace`)

### Requirements

- macOS 13+
- Apple Silicon (M1/M2/M3/M4)
- 16GB RAM recommended (8GB works)
- 3GB disk space
- A working microphone (โš ๏ธ 3-pole TRS music headphones may route input to a phantom port)

---

## ๐Ÿš€ Usage

1. Click into any text field
2. Press your hotkey (e.g., โŒ˜โ‡งSpace)
3. Speak after the Pop sound
4. Press the hotkey again to stop
5. Wait 2โ€“4s โ€” text appears at cursor

### Performance budget (10s Chinese audio)

| Stage | Time |
|---|---|
| Recording | however long you speak |
| Whisper | ~2s |
| Ollama | ~1s (short skip) / ~2s (long) |
| Paste | <0.1s |
| **Overhead** | **2โ€“4s** |

---

## โš™๏ธ Configuration

All settings live in `~/.config/vinput.conf`:

```bash
# Whisper ASR
MODEL_PATH="$HOME/.whisper_models/ggml-large-v3-turbo-q5_0.bin"
WHISPER_LANG="zh"
WHISPER_THREADS=8

# Ollama
OLLAMA_MODEL="qwen2.5:3b"
SHORT_TEXT_THRESHOLD=15

# Behavior
AUTO_PASTE=1
USE_VAD=0
MAX_REC_SECONDS=30
```

Hotwords go in `~/.config/vinput_hotwords.txt` (one per line).

---

## ๐Ÿ†š vs Commercial IMEs

| | vinput | Commercial IMEs |
|---|---|---|
| Privacy | โœ… Fully local | โŒ Cloud |
| Offline | โœ… | โŒ |
| Technical terms | โœ… Custom hotwords + LLM | โš ๏ธ Generic |
| Intent refinement | โœ… AI prompt distillation | โŒ Transcription only |
| Streaming output | โŒ Wait until done | โœ… Real-time |
| Direct insertion | โš ๏ธ via clipboard | โœ… System IME |
| Dialects | โš ๏ธ Mandarin only | โœ… Many |

**Best positioning**: use vinput as an **"AI Prompt dictation button"** alongside your system IME โ€” vinput for long prompts to Claude/Cursor/ChatGPT, system IME for chat/passwords/short replies.

---

## ๐Ÿ“„ License

MIT โ€” see [LICENSE](./LICENSE)

---

## ๐Ÿ™ Acknowledgements

- [whisper.cpp](https://github.com/ggerganov/whisper.cpp)
- [Ollama](https://ollama.com/)
- [Qwen](https://github.com/QwenLM/Qwen)
- [Raycast](https://www.raycast.com/)
- [SoX](http://sox.sourceforge.net/)