https://github.com/deepgram/voice-keyboard-linux

Linux virtual keyboard driver which types what you say using Deepgram Flux STT API

Host: GitHub
URL: https://github.com/deepgram/voice-keyboard-linux
Owner: deepgram
License: isc
Created: 2025-07-11T19:38:10.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-11-06T18:08:15.000Z (8 months ago)
Last Synced: 2025-11-06T20:20:40.769Z (8 months ago)
Language: Rust
Homepage:
Size: 147 KB
Stars: 6
Watchers: 0
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

# Voice Keyboard

Voice keyboard is a demo application showcasing Deepgram's new turn-taking speech-to-text API: **Flux**.

A voice-controlled Linux virtual keyboard that converts speech to text and types it into any application.

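Because it creates a virtual input device at the kernel level (through `/dev/uinput`) instead of integrating with a particular toolkit or desktop environment, the typed text reaches any Linux application that accepts keyboard input.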

## Features

- **Voice-to-Text**: Real-time speech recognition using Deepgram's **Flux** API service (turn-taking STT)
- **Virtual Keyboard**: Creates a virtual input device that works with all applications
- **Incremental Typing**: Smart transcript updates with minimal backspacing for real-time corrections

## Architecture

The application solves a common Linux privilege problem:
- **Virtual keyboard creation** requires root access to `/dev/uinput`
- **Audio input** requires user-space access to PipeWire/PulseAudio

**Solution**: The application starts with root privileges, creates the virtual keyboard, then drops privileges to access the user's audio session.
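
The privilege drop described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes the process was started through `sudo`, which exports `SUDO_UID` and `SUDO_GID`, and the `OriginalUser` shown here is a hypothetical stand-in that only borrows the name from the component list. The `setgid`/`setuid` declarations are written by hand to avoid an external crate, and supplementary groups are not handled.

```rust
use std::io;

// Declared by hand so the sketch needs no external crate; the `libc` crate
// exposes the same functions. uid_t and gid_t are u32 on Linux.
unsafe extern "C" {
    fn setgid(gid: u32) -> i32;
    fn setuid(uid: u32) -> i32;
}

/// Hypothetical stand-in for the user that invoked `sudo`.
#[derive(Debug, PartialEq)]
struct OriginalUser {
    uid: u32,
    gid: u32,
}

impl OriginalUser {
    /// Parse the values that sudo exports as SUDO_UID and SUDO_GID.
    fn from_sudo_vars(uid: Option<&str>, gid: Option<&str>) -> Option<Self> {
        Some(OriginalUser {
            uid: uid?.trim().parse().ok()?,
            gid: gid?.trim().parse().ok()?,
        })
    }

    /// Drop the group first: once the uid is no longer 0, the process
    /// is not allowed to change its gid any more.
    /// Supplementary groups are left untouched in this sketch.
    fn drop_privileges(&self) -> io::Result<()> {
        // SAFETY: both calls take plain integers and touch no Rust memory.
        if unsafe { setgid(self.gid) } != 0 {
            return Err(io::Error::last_os_error());
        }
        if unsafe { setuid(self.uid) } != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }
}

fn main() {
    let uid = std::env::var("SUDO_UID").ok();
    let gid = std::env::var("SUDO_GID").ok();
    match OriginalUser::from_sudo_vars(uid.as_deref(), gid.as_deref()) {
        // The virtual keyboard would be created here, before the drop.
        Some(user) => match user.drop_privileges() {
            Ok(()) => println!("now running as uid {} gid {}", user.uid, user.gid),
            Err(e) => eprintln!("could not drop privileges: {e}"),
        },
        None => eprintln!("SUDO_UID/SUDO_GID not set; was this started with sudo?"),
    }
}
```

The order of the two calls is the important design point: calling `setuid` first would leave the process unable to give up the root group.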

## Installation

### Prerequisites

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://rustup.rs | sh

# Install required system packages (Fedora/RHEL)
sudo dnf install alsa-lib-devel

# Install required system packages (Ubuntu/Debian)
sudo apt install libasound2-dev
```

### Build

```bash
git clone https://github.com/deepgram/voice-keyboard-linux
cd voice-keyboard-linux
cargo build
```

### Acquire a Deepgram API key

You’ll need a Deepgram API key to authenticate with Flux.

- Create or manage keys in the Deepgram console: [Create additional API keys](https://developers.deepgram.com/docs/create-additional-api-keys)
- Export the key so the app can pick it up (recommended):
```bash
export DEEPGRAM_API_KEY="dg_your_api_key_here"
```
- The client sends the header `Authorization: Token <DEEPGRAM_API_KEY>`, where `<DEEPGRAM_API_KEY>` is the value of your key.
- For CI or systemd services, set `DEEPGRAM_API_KEY` in the environment for the service user.
- Security tip: treat API keys like passwords. Prefer env vars over committing keys to files.
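
As a rough illustration of the steps above, the header value can be derived from the environment variable like this. The function name `auth_header_value` is hypothetical, and rejecting an empty key is an assumption about sensible behavior rather than something the project documents.

```rust
/// Build the value of the `Authorization` header from an API key.
/// Returns `None` when the key is empty or only whitespace.
fn auth_header_value(api_key: &str) -> Option<String> {
    let key = api_key.trim();
    if key.is_empty() {
        None
    } else {
        Some(format!("Token {key}"))
    }
}

fn main() {
    let key = std::env::var("DEEPGRAM_API_KEY").ok();
    match key.as_deref().and_then(|k| auth_header_value(k)) {
        // Never print the key itself; only report that it was found.
        Some(_) => println!("Authorization header ready"),
        None => eprintln!("DEEPGRAM_API_KEY is missing or empty"),
    }
}
```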

## Usage

### Easy Method (Recommended)

Use the provided runner script:

```bash
./run.sh
```

### Manual Method

```bash
# Build and run with proper privilege handling
cargo build
sudo -E ./target/debug/voice-keyboard --test-stt
```

**Important**: Always use `sudo -E`. Without `-E`, sudo resets the environment, so the variables needed for audio access (and an exported `DEEPGRAM_API_KEY`) would not reach the application.

## Speech-to-Text Service

This application uses **Deepgram Flux**, the company's new turn‑taking STT API. The default WebSocket URL is `wss://api.deepgram.com/v2/listen`.

## Command Line Options

```bash
voice-keyboard [OPTIONS]

OPTIONS:
    --test-audio     Test audio input and show levels
    --test-stt       Test speech-to-text functionality (default if no other mode specified)
    --debug-stt      Debug speech-to-text (print transcripts without typing)
    --stt-url <URL>  Custom STT service URL (default: wss://api.deepgram.com/v2/listen)
    -h, --help       Print help information
    -V, --version    Print version information
```

**Note**: If no mode is specified, the application defaults to `--test-stt` behavior.
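
The defaulting rule can be sketched with a small argument scan. This is an illustration only: the `Mode` enum and `select_mode` function are hypothetical, the real binary may well use an argument-parsing crate, and the "last mode flag wins" behavior is an assumption that the README does not specify.

```rust
#[derive(Debug, PartialEq)]
enum Mode {
    TestAudio,
    TestStt,
    DebugStt,
}

/// Pick the run mode from the command-line arguments.
/// Falls back to `TestStt`, the documented default.
fn select_mode<'a>(args: impl IntoIterator<Item = &'a str>) -> Mode {
    let mut mode = Mode::TestStt;
    for arg in args {
        match arg {
            "--test-audio" => mode = Mode::TestAudio,
            "--test-stt" => mode = Mode::TestStt,
            "--debug-stt" => mode = Mode::DebugStt,
            _ => {} // other options (such as --stt-url) are ignored here
        }
    }
    mode
}

fn main() {
    let args: Vec<String> = std::env::args().skip(1).collect();
    let mode = select_mode(args.iter().map(String::as_str));
    println!("selected mode: {mode:?}");
}
```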

## How It Works

1. **Initialization**: Application starts with root privileges
2. **Virtual Keyboard**: Creates the virtual input device through `/dev/uinput` as root
3. **Privilege Drop**: Drops to original user privileges
4. **Audio Access**: Accesses PipeWire/PulseAudio in user space
5. **Speech Recognition**: Streams audio to **Deepgram Flux** STT service
6. **Incremental Typing**: Updates text in real-time with smart backspacing
7. **Turn Finalization**: Clears tracking on "EndOfTurn" events (user presses Enter manually)

### Transcript Handling

The application provides sophisticated real-time transcript updates:

- **Incremental Updates**: As speech is recognized, the application updates the typed text by finding the common prefix between the current and new transcript, backspacing only the changed portion, and typing the new ending
- **Smart Backspacing**: Minimizes cursor movement by only removing characters that actually changed
- **Turn Management**: On "EndOfTurn" events, the application clears its internal tracking but doesn't automatically press Enter, allowing users to review before submitting
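
The common-prefix approach can be sketched as below. This is a minimal illustration under assumptions: `TranscriptTracker` and `Edit` are hypothetical names, and the sketch only computes how many backspaces to send and which text to type, leaving the actual key events to the virtual keyboard.

```rust
/// Hypothetical tracker for the text typed during the current turn.
#[derive(Default)]
struct TranscriptTracker {
    typed: String,
}

/// Keystrokes needed to turn the previously typed text into a new transcript.
#[derive(Debug, PartialEq)]
struct Edit {
    backspaces: usize,
    to_type: String,
}

impl TranscriptTracker {
    /// Compare the new transcript with what was typed so far and
    /// return the smallest "backspace, then type" edit.
    fn update(&mut self, new_transcript: &str) -> Edit {
        // Count matching leading characters (not bytes), so multi-byte
        // UTF-8 text is never split in the middle of a character.
        let common = self
            .typed
            .chars()
            .zip(new_transcript.chars())
            .take_while(|(a, b)| a == b)
            .count();
        let edit = Edit {
            backspaces: self.typed.chars().count() - common,
            to_type: new_transcript.chars().skip(common).collect(),
        };
        self.typed = new_transcript.to_string();
        edit
    }

    /// Forget the typed text at the end of a turn. Enter is not pressed.
    fn end_turn(&mut self) {
        self.typed.clear();
    }
}

fn main() {
    let mut tracker = TranscriptTracker::default();
    for transcript in ["hello", "hello word", "hello world"] {
        let edit = tracker.update(transcript);
        println!("{} backspace(s), then type {:?}", edit.backspaces, edit.to_type);
    }
    tracker.end_turn();
}
```

In this run, the correction from "hello word" to "hello world" costs one backspace and two typed characters instead of retyping the whole line.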

## About Deepgram Flux (Early Access)

- **Endpoint**: `wss://api.deepgram.com/v2/listen`
- **What it is**: Flux is Deepgram's turn‑taking, low‑latency STT API designed for conversational experiences.
- **Authentication**: Send an `Authorization` header. Common forms:
  - `Token <API_KEY>` (what this app uses)
  - `token <API_KEY>` or `Bearer <API_KEY>` are also accepted by the platform
- **Message types** (each server message includes a JSON `type` field):
  - `Connected`: initial connection confirmation
  - `TurnInfo`: streaming transcription updates with fields: `event` (`Update`, `StartOfTurn`, `Preflight`, `SpeechResumed`, `EndOfTurn`), `turn_index`, `audio_window_start`, `audio_window_end`, `transcript`, `words[] { word, confidence }`, `end_of_turn_confidence`
  - `Error`: fatal error with fields: `code`, `description` (may also include a close code)
  - `Configuration`: echoes/acknowledges configuration (e.g., thresholds) when provided
- **Client close protocol**: After sending your final audio, send the control message `{ "type": "CloseStream" }`. The server will flush any remaining responses and then close the WebSocket.
- **Update cadence**: Flux produces updates about every **240 ms** with a typical worst‑case latency of ~**500 ms**.
- **Common query parameters** (as supported by the preview spec):
  - `model`, `encoding`, `sample_rate`, `preflight_threshold`, `eot_threshold`, `eot_timeout_ms`, `keyterm`, `mip_opt_out`, `tag`
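
To make the endpoint, query parameters, and close message concrete, here is a small sketch that assembles the connection URL and holds the closing control message. The helper `build_listen_url` is hypothetical, the parameter values in `main` are placeholders chosen for illustration, and the sketch assumes the values contain only URL-safe characters (no percent-encoding is performed).

```rust
/// Default Flux endpoint, as listed above.
const DEFAULT_FLUX_URL: &str = "wss://api.deepgram.com/v2/listen";

/// Control message to send as a text frame after the final audio.
const CLOSE_STREAM: &str = r#"{ "type": "CloseStream" }"#;

/// Append `key=value` pairs to the base URL as a query string.
/// Assumes keys and values need no percent-encoding.
fn build_listen_url(base: &str, params: &[(&str, &str)]) -> String {
    let query: Vec<String> = params.iter().map(|(k, v)| format!("{k}={v}")).collect();
    if query.is_empty() {
        base.to_string()
    } else {
        format!("{base}?{}", query.join("&"))
    }
}

fn main() {
    // Placeholder values; consult the Flux documentation for valid settings.
    let url = build_listen_url(
        DEFAULT_FLUX_URL,
        &[("encoding", "linear16"), ("sample_rate", "16000")],
    );
    println!("connect to: {url}");
    println!("on shutdown send: {CLOSE_STREAM}");
}
```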

## Security

- **Minimal Root Time**: Only root during virtual keyboard creation
- **Environment Preservation**: Maintains user's audio session access
- **Clean Privilege Drop**: Properly drops both user and group privileges
- **No System Changes**: No permanent system configuration required

## Troubleshooting

### Audio Issues

If you get "Host is down" or "I/O error" when testing audio:

1. **Use `sudo -E`**: Always preserve environment variables
2. **Check PipeWire**: Ensure PipeWire is running: `systemctl --user status pipewire`
3. **Test without sudo**: Try `./target/debug/voice-keyboard --test-audio` (will fail on keyboard creation but audio should work)

### Permission Issues

If you get "Permission denied" for `/dev/uinput`:

1. **Check uinput module**: `sudo modprobe uinput`
2. **Verify device exists**: `ls -la /dev/uinput`
3. **Use sudo**: The application is designed to run with `sudo -E`

## Development

### Project Structure

```
src/
├── main.rs # Main application and privilege dropping
├── virtual_keyboard.rs # Virtual keyboard device management
├── audio_input.rs # Audio capture and processing
├── stt_client.rs # WebSocket STT client
└── input_event.rs # Linux input event constants
```

### Key Components

- **OriginalUser**: Captures and restores user context
- **VirtualKeyboard**: Manages uinput device lifecycle with smart transcript updates
- **AudioInput**: Cross-platform audio capture
- **SttClient**: WebSocket-based speech-to-text client
- **AudioBuffer**: Manages audio chunking for STT streaming
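
The chunking role of `AudioBuffer` could look roughly like the sketch below. It is not the project's implementation: the struct layout, the 16-bit sample type, and the little-endian byte output are all assumptions made for illustration, and the chunk length in `main` is an arbitrary small number.

```rust
/// Hypothetical audio chunker; it only borrows the `AudioBuffer` name.
struct AudioBuffer {
    samples: Vec<i16>,
    chunk_len: usize,
}

impl AudioBuffer {
    fn new(chunk_len: usize) -> Self {
        assert!(chunk_len > 0, "chunk_len must be positive");
        AudioBuffer { samples: Vec::new(), chunk_len }
    }

    /// Append captured samples and return every complete chunk as
    /// little-endian bytes, ready to send as a binary WebSocket message.
    /// Leftover samples stay buffered until the next call.
    fn push(&mut self, input: &[i16]) -> Vec<Vec<u8>> {
        self.samples.extend_from_slice(input);
        let mut chunks = Vec::new();
        while self.samples.len() >= self.chunk_len {
            let chunk: Vec<u8> = self
                .samples
                .drain(..self.chunk_len)
                .flat_map(|s| s.to_le_bytes())
                .collect();
            chunks.push(chunk);
        }
        chunks
    }
}

fn main() {
    let mut buffer = AudioBuffer::new(4);
    for block in [&[1i16, 2, 3, 4, 5, 6][..], &[7, 8][..]] {
        for chunk in buffer.push(block) {
            println!("send {} bytes", chunk.len());
        }
    }
}
```

Fixed-size chunks keep the message cadence independent of how many samples the audio callback happens to deliver at once.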

## License

ISC License. See LICENSE.txt
