https://github.com/kurkigal/speech-to-text-service

Speech-to-text service powered by Faster-Whisper, FastAPI, and Typer.

# Speech-to-Text Service

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code Style: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.109+-009688.svg?style=flat&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)

A high-performance, asynchronous speech-to-text service powered by **Faster-Whisper**, **FastAPI**, and **Typer**.

This project provides a robust architecture for audio transcription, exposing both a RESTful API for microservices integration and a CLI for offline batch processing. It features modular design, lazy model loading, and rigorous configuration management, making it suitable for both research and production environments.

## Project Status (Demo / MVP)

This repository is currently a **demo / MVP** implementation intended to showcase a clean, production-oriented architecture for speech-to-text services (FastAPI + Faster-Whisper + CLI).

While the core API/CLI workflow is functional, the project is **actively evolving** and several capabilities are still planned or experimental. Expect breaking changes, refactors, and incremental improvements as the roadmap items are implemented.

Contributions, suggestions, and issue reports are welcome.

## Key Features

- **High Performance:** Utilizes `faster-whisper` (CTranslate2) for up to 4x faster inference than OpenAI's original implementation.
- **Production-Oriented:** FastAPI application factory pattern with a health check endpoint (`/v1/health`) and efficient resource management.
- **Dual Interface:**
  - **REST API:** Fully typed endpoints for seamless integration.
  - **CLI Tool:** Typer-based command-line interface for local testing and automation.
- **Robust Engineering:**
  - Lazy-loading strategies to optimize memory usage.
  - Audio validation and normalization (16 kHz resampling) utilities.
  - Centralized configuration via Pydantic Settings (environment variable driven).
- **Quality Assurance:** Comprehensive `pytest` suite with async HTTP fixtures and service mocking.
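
The lazy-loading idea from the list above can be pictured with a minimal sketch. `LazyModel` and `fake_loader` are hypothetical names and not the project's actual code; in the real service the loader would presumably construct a Faster-Whisper model instead of returning a dictionary.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Defer an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._model: Optional[T] = None
        self._lock = threading.Lock()

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> T:
        # Double-checked locking: concurrent callers trigger only one load.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model


load_calls = []


def fake_loader() -> dict:
    # Stand-in for the real model constructor (assumption for illustration).
    load_calls.append(1)
    return {"name": "base"}


model = LazyModel(fake_loader)  # nothing is loaded yet
```

Nothing is loaded when `LazyModel` is created; the first `model.get()` pays the loading cost and later calls reuse the same object, which keeps startup fast and memory free until a request actually needs the model.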

## Project Structure

The project follows a modern `src`-layout to prevent import errors and separate source code from tests/scripts.

```text
.
├── pyproject.toml                # Build metadata, dependencies, and tool configs
├── README.md                     # Project documentation
├── scripts/
│   └── download_model.py         # Helper script to pre-fetch model weights
├── src/
│   └── stt_service/
│       ├── app.py                # FastAPI application factory
│       ├── cli.py                # CLI entrypoint (Typer)
│       ├── config.py             # Environment & Settings management
│       ├── models.py             # Shared Data Transfer Objects (DTOs)
│       ├── api/
│       │   └── routes.py         # API Route definitions
│       ├── services/
│       │   └── transcription.py  # Core business logic adapter
│       └── utils/
│           └── audio.py          # Audio processing utilities
└── tests/
    ├── conftest.py               # Pytest fixtures
    ├── test_api.py               # Integration tests
    └── test_transcription_service.py
```

## Getting Started

### Prerequisites

* Python 3.9 or higher
* FFmpeg (required for audio processing)
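
A quick way to confirm both prerequisites before installing is a small check like the one below. This is a sketch, not part of the repository; `check_prerequisites` is a hypothetical helper that only looks at the interpreter version and whether an `ffmpeg` executable is on `PATH`.

```python
import shutil
import sys
from typing import Callable, List, Optional, Tuple


def check_prerequisites(
    version: Tuple[int, int] = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if version < (3, 9):
        problems.append(f"Python 3.9+ required, found {version[0]}.{version[1]}")
    # shutil.which returns None when the executable is not on PATH.
    if which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the current interpreter and `PATH`; the parameters exist only so the check can be exercised without touching the real environment.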

### Installation

1. **Clone and Setup Virtual Environment**
```powershell
# Create a virtual environment
python -m venv .venv

# Activate environment (Windows)
.\.venv\Scripts\activate
# For Linux/Mac: source .venv/bin/activate

```

2. **Install Dependencies**
```powershell
# Install project in editable mode
pip install -e .

# Or include the development dependencies as well (testing, linting)
pip install -e ".[dev]"

```

3. **Configuration (Optional)**

The service is configured via environment variables. You can create a `.env` file in the root directory.

| Variable | Default | Description |
| --- | --- | --- |
| `STT_WHISPER_MODEL_SIZE` | `base` | Model size (tiny, base, small, medium, large-v2) |
| `STT_WHISPER_COMPUTE_TYPE` | `int8` | Quantization type (`float16` for GPU, `int8` for CPU) |
| `STT_DEVICE` | `auto` | Inference device (`auto`, `cuda`, or `cpu`) |

4. **Download Model Weights**

Run this before starting the service so the first request does not time out while the weights download.

```powershell
python scripts/download_model.py

```
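
For orientation, the `STT_*` variables from the configuration table in step 3 could be read roughly as follows. The project itself uses Pydantic Settings; this sketch swaps in a stdlib dataclass so it runs without extra dependencies, and `SttSettings` and `from_env` are hypothetical names. Unlike Pydantic Settings, it does not parse a `.env` file.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SttSettings:
    """Defaults mirror the configuration table in this README."""

    whisper_model_size: str = "base"
    whisper_compute_type: str = "int8"
    device: str = "auto"

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "SttSettings":
        # Fall back to the process environment when no mapping is given.
        env = os.environ if env is None else env
        defaults = cls()
        return cls(
            whisper_model_size=env.get(
                "STT_WHISPER_MODEL_SIZE", defaults.whisper_model_size
            ),
            whisper_compute_type=env.get(
                "STT_WHISPER_COMPUTE_TYPE", defaults.whisper_compute_type
            ),
            device=env.get("STT_DEVICE", defaults.device),
        )
```

Any variable that is not set keeps its documented default, so `SttSettings.from_env({"STT_DEVICE": "cpu"})` changes only the device.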

## Usage

### 1. Running the API Server

Start the server using the CLI wrapper or Uvicorn directly.

```powershell
# Using the CLI wrapper
stt-service serve --host 0.0.0.0 --port 8000

# OR using Uvicorn directly (--reload restarts on code changes; for development only)
uvicorn stt_service.app:create_app --factory --reload

```

*Swagger UI will be available at: `http://localhost:8000/docs`*

### 2. Using the CLI

Transcribe audio files directly from your terminal.

```powershell
stt-service transcribe path/to/audio.wav --language en

```
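
For batch processing, the same command can be driven from a short script. The sketch below is an illustration only: `build_command`, `find_audio`, and `transcribe_all` are hypothetical helpers, and the set of file suffixes is an assumption to adjust to your data. The command line itself is the one shown above.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: which extensions count as audio is up to you.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}


def build_command(audio_path: Path, language: str = "en") -> List[str]:
    """Argument list for one `stt-service transcribe` invocation."""
    return ["stt-service", "transcribe", str(audio_path), "--language", language]


def find_audio(root: Path) -> List[Path]:
    """Recursively collect audio files under root, in sorted order."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def transcribe_all(root: Path, language: str = "en") -> None:
    for path in find_audio(root):
        # check=True raises CalledProcessError if the CLI exits non-zero.
        subprocess.run(build_command(path, language), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.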

## API Reference

* **GET** `/v1/health`
  Returns service status and loaded model information.

* **POST** `/v1/transcribe`
  Upload an audio file for transcription.
  Supports parameters for language and beam size.
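
The two endpoints above could be called from Python with the standard library alone, roughly as sketched here. The form field names (`file`, `language`, `beam_size`), the way parameters are passed, and the JSON response shape are assumptions, since this README does not specify them; check the Swagger UI at `/docs` for the actual schema. `get_health`, `encode_multipart`, and `transcribe` are hypothetical helper names.

```python
import json
import uuid
from pathlib import Path
from typing import Dict, Tuple
from urllib import request

BASE_URL = "http://localhost:8000"  # host/port from the serve example


def get_health(base_url: str = BASE_URL) -> dict:
    with request.urlopen(f"{base_url}/v1/health") as resp:
        return json.load(resp)


def encode_multipart(
    fields: Dict[str, str], filename: str, data: bytes
) -> Tuple[bytes, str]:
    """Build a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # Assumption: the upload field is called "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    chunks.append(file_header.encode() + data + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def transcribe(
    path: Path, language: str = "en", beam_size: int = 5, base_url: str = BASE_URL
) -> dict:
    # Assumption: language and beam_size are sent as form fields.
    body, content_type = encode_multipart(
        {"language": language, "beam_size": str(beam_size)},
        path.name,
        path.read_bytes(),
    )
    req = request.Request(
        f"{base_url}/v1/transcribe",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With the server running, `get_health()` followed by `transcribe(Path("path/to/audio.wav"))` exercises both endpoints; if the service expects query parameters instead of form fields, only `transcribe` needs to change.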

## Development

To maintain code quality, we use `ruff` for linting and formatting.

```powershell
# Run tests
pytest

# Run linter
ruff check .

# Format code
ruff format .

```

## Roadmap (Planned Improvements)

The following items are not yet implemented and represent the next iterations for this demo/MVP:

* [ ] Speaker diarization.
* [ ] WebSocket support for real-time streaming transcription.
* [ ] Database integration for persistent transcript storage.
* [ ] Web-based UI for easier file uploads.

---