https://github.com/robacarp/whisper-cry
A Crystal wrapper for Whisper.CPP
- Host: GitHub
- URL: https://github.com/robacarp/whisper-cry
- Owner: robacarp
- License: MIT
- Created: 2026-03-07T06:11:21.000Z (about 1 month ago)
- Default Branch: master
- Last Pushed: 2026-03-08T22:54:33.000Z (about 1 month ago)
- Last Synced: 2026-04-04T01:59:39.625Z (5 days ago)
- Topics: whisper-cpp
- Language: Crystal
- Homepage:
- Size: 35.2 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# whisper-cry
Crystal bindings for [whisper.cpp](https://github.com/ggml-org/whisper.cpp), providing local speech-to-text transcription using OpenAI's Whisper models. Version tracks whisper.cpp releases (currently v1.8.3).
## Installation
1. Add the dependency to your `shard.yml`:
```yaml
dependencies:
  whisper-cry:
    github: robacarp/whisper-cry
```
2. Run `shards install`
3. Build the native libraries:
```sh
cd lib/whisper-cry && make
```
This clones whisper.cpp v1.8.3, builds it as a static library, and copies the `.a` files into `vendor/lib/`. Requires `cmake` and a C++ compiler. See the [whisper.cpp build documentation](https://github.com/ggml-org/whisper.cpp#building-the-project) for platform-specific details and options.
4. Download a Whisper model (e.g. the base English model):
```sh
curl -L -o ggml-base.en.bin https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin
```
See the [whisper.cpp models directory](https://github.com/ggml-org/whisper.cpp/tree/master/models) for all available models.
5. Optimize the model for your hardware (optional but recommended):
The whisper.cpp project provides documentation and scripts for optimizing models for specific hardware, including quantization:
- macOS [CoreML](https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#core-ml-support)
- [OpenVINO](https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#openvino-support)
- [Nvidia](https://github.com/ggml-org/whisper.cpp?tab=readme-ov-file#nvidia-gpu-support)
## Usage
```crystal
require "whisper-cry"

whisper = Whisper.new("/path/to/ggml-base.en.bin")
segments = whisper.transcribe_file("audio.wav")

segments.each do |segment|
  puts "#{segment.start_timestamp} --> #{segment.end_timestamp}"
  puts segment.text
end

whisper.close
```
Audio files must be 16-bit PCM WAV, mono, 16kHz. Convert with ffmpeg:
```sh
ffmpeg -i input.mp3 -ar 16000 -ac 1 -f wav output.wav
```
### API
#### `Whisper.new(model_path, use_gpu = false)`
Loads a GGML-format model file and initializes the inference context. Set `use_gpu: true` to enable Metal acceleration on macOS. Raises `Whisper::Error` if the model file is missing or fails to load.
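Since loading can fail at runtime, a guarded initialization is often useful. A minimal sketch (the model path is illustrative):

```crystal
require "whisper-cry"

# Attempt to load the model; Whisper::Error is raised if the file
# is missing or fails to load, per the description above.
whisper = begin
  Whisper.new("models/ggml-base.en.bin", use_gpu: true)
rescue ex : Whisper::Error
  STDERR.puts "Failed to load model: #{ex.message}"
  exit 1
end
```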
#### `#transcribe_file(path, language = "en", n_threads = 4, translate = false)`
Transcribes a WAV file and returns an `Array(Whisper::Segment)`. The file must be 16-bit signed PCM, mono, 16kHz.
#### `#transcribe(samples, language = "en", n_threads = 4, translate = false)`
Transcribes pre-loaded `Float32` audio samples (normalized to `[-1.0, 1.0]`, mono, 16kHz). Useful when you already have audio data in memory.
Options:
- **language**: BCP-47 code (e.g. `"en"`, `"es"`), or `nil` for auto-detection
- **n_threads**: CPU threads for inference
- **translate**: when `true`, translates to English regardless of source language
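As a sketch of the in-memory path, assuming `samples` accepts an `Array(Float32)` (check the shard source for the exact collection type):

```crystal
require "whisper-cry"

# One second of silence at 16kHz: 16,000 Float32 samples in [-1.0, 1.0].
samples = Array(Float32).new(16_000, 0.0_f32)

whisper = Whisper.new("models/ggml-base.en.bin")
# language: nil requests auto-detection, per the options above.
segments = whisper.transcribe(samples, language: nil, n_threads: 8)
segments.each { |segment| puts segment.text }
whisper.close
```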
#### `#close`
Frees the underlying whisper context. Safe to call multiple times. Also called automatically by `#finalize`.
#### `#version`, `#model_type`, `#multilingual?`, `#system_info`
Query the whisper.cpp version string, loaded model type (e.g. `"base"`), multilingual support, and available CPU features.
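For example, printing this metadata after loading a model (path illustrative):

```crystal
require "whisper-cry"

whisper = Whisper.new("models/ggml-base.en.bin")
puts whisper.version        # whisper.cpp version string
puts whisper.model_type     # e.g. "base"
puts whisper.multilingual?  # false for *.en models
puts whisper.system_info    # available CPU features
whisper.close
```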
### `Whisper::Segment`
Each segment represents a span of recognized speech:
| Method | Returns |
|---|---|
| `#text` | Transcribed text |
| `#start_ms` / `#end_ms` | Timing in milliseconds |
| `#start_seconds` / `#end_seconds` | Timing in seconds |
| `#duration_ms` | Segment duration in milliseconds |
| `#start_timestamp` / `#end_timestamp` | Formatted as `"HH:MM:SS.mmm"` |
| `#no_speech_probability` | `Float32` (0.0-1.0), higher = likely not speech |
| `#speaker_turn_next` | `true` if next segment is a different speaker |
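Using only the accessors above, segments can be written out as an SRT subtitle file. A sketch (the `srt_timestamp` helper is hypothetical, not part of the shard; SRT uses a comma before the milliseconds):

```crystal
require "whisper-cry"

# Hypothetical helper: milliseconds -> "HH:MM:SS,mmm" for SRT.
def srt_timestamp(ms : Int) : String
  hours, rem = ms.divmod(3_600_000)
  minutes, rem = rem.divmod(60_000)
  seconds, millis = rem.divmod(1_000)
  "%02d:%02d:%02d,%03d" % {hours, minutes, seconds, millis}
end

whisper = Whisper.new("models/ggml-base.en.bin")
segments = whisper.transcribe_file("audio.wav")

File.open("out.srt", "w") do |io|
  segments.each_with_index(1) do |segment, index|
    io.puts index
    io.puts "#{srt_timestamp(segment.start_ms)} --> #{srt_timestamp(segment.end_ms)}"
    io.puts segment.text
    io.puts
  end
end

whisper.close
```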
## Development
Run tests:
```sh
crystal spec
```
Tests cover `Segment` formatting/conversion, WAV file parsing and validation, and `Whisper` initialization error handling. No model file is needed to run the test suite.
## License
[MIT](LICENSE)