https://github.com/eja/s2s

Lightweight local STT/TTS API server powered by Rust and sherpa-onnx.

api rust stt tts

Host: GitHub
URL: https://github.com/eja/s2s
Owner: eja
License: gpl-3.0
Created: 2026-05-22T10:33:37.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-23T10:26:49.000Z (about 1 month ago)
Last Synced: 2026-05-23T12:12:59.207Z (about 1 month ago)
Topics: api, rust, stt, tts
Language: Rust
Homepage: https://eja.it
Size: 45.9 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# S2S

A high-performance, lightweight API server written in Rust that provides local, privacy-conscious Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities. By leveraging the `sherpa-onnx` framework, S2S offers efficient local inference with minimal latency, requiring no external cloud dependencies.

The project aims to provide a drop-in local alternative for speech processing, with endpoint paths modeled on the OpenAI audio API (`/v1/audio/transcriptions`, `/v1/audio/speech`).

## Key Features

- **Local Inference:** All processing is done locally on your hardware.

- **Request Tracing:** Integrated logging providing real-time insights into IP addresses, status codes, and request latency.

- **Automated Model Management:** Built-in bootstrap logic to download necessary models automatically when using the `--auto` flag.

- **Flexible Service Fallbacks:** The server starts as long as at least one model is present. If only one model is loaded, requests to the missing service will return `404 Not Found`.

- **OpenAI-Compatible Voice Directory:** Exposes a standard `/v1/audio/voices` list, allowing client integrations to discover voices dynamically.

- **Broad STT Language Support:** Supports 25 languages, including English, Spanish, German, French, and Russian (see the full list under Language & Voice Support).

- **Flexible TTS:** Integration with the **Kokoro** model, supporting over 50 distinct voices across 9 major languages.

- **Robust STT:** Powered by the **NVIDIA Parakeet TDT** model for accurate transcriptions.

---

## Getting Started

### Installation

Download the latest executable for your platform from the [Releases](https://github.com/eja/s2s/releases) page.

### Running the Server

The application requires at least one of the two models to be present locally in order to run. Execute the binary to start the server:

```bash
./s2s
```

If neither model is found on your system, the server reports this and exits. To have the server download and configure the required ONNX models automatically (~1 GB total), pass the `--auto` flag:

```bash
./s2s --auto
```

### Configuration Options

The server can be customized via command-line arguments:

| Argument | Description | Default |
| :--- | :--- | :--- |
| `--host` | The IP address to bind the server to | `127.0.0.1` |
| `--port` | The port to listen on | `35248` |
| `--kokoro` | Path to the Kokoro TTS model directory | `./models/kokoro...` |
| `--parakeet` | Path to the Parakeet STT model directory | `./models/sherpa...` |
| `--threads` | Number of threads for inference | `4` |
| `--auto` | Automatically download missing models | `false` |
| `--log` | Path to a file for persistent logging | `stderr` |
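
As an illustration of how these flags combine, the following Python sketch assembles a command line and launches the server as a child process. The helper names `build_command` and `start_server` are hypothetical, and the sketch assumes each flag takes its value as a separate space-separated argument, which the table does not confirm; only `--auto` is shown in this README as a bare switch.

```python
import subprocess


def build_command(binary="./s2s", host="127.0.0.1", port=35248,
                  threads=4, auto=False, log=None):
    """Assemble an argv list from the flags in the table above.

    Assumption: value flags are passed as "--flag value" pairs.
    """
    cmd = [binary, "--host", host, "--port", str(port),
           "--threads", str(threads)]
    if auto:
        # --auto is a bare switch, as in `./s2s --auto`.
        cmd.append("--auto")
    if log is not None:
        # Without --log, the table says logging goes to stderr.
        cmd += ["--log", log]
    return cmd


def start_server(**options):
    """Launch the server in the background and return the process handle."""
    return subprocess.Popen(build_command(**options))
```

For example, `start_server(port=8080, auto=True)` would run `./s2s --host 127.0.0.1 --port 8080 --threads 4 --auto`, provided the binary is in the current directory.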

---

## API Reference

> **Note:** If the TTS or STT model is missing at startup, the server still launches, but any requests to the missing endpoints will return `404 Not Found`.
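
A client that may talk to a server running with only one model can treat `404 Not Found` as "service not loaded" rather than as a fatal error. The sketch below shows one way to do that with Python's standard library; `call_or_none` is a hypothetical helper, and note that a 404 can equally mean a mistyped path, so this is only a heuristic.

```python
import urllib.error


def call_or_none(request_fn):
    """Run request_fn() and return its result.

    Returns None if the server answers 404, which this README says
    happens when the corresponding model was not loaded at startup.
    Any other HTTP error is re-raised unchanged.
    """
    try:
        return request_fn()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```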

### 1. Speech-to-Text (STT)

**Endpoint:** `POST /v1/audio/transcriptions`

Transcribe an audio file to text. The endpoint expects a `multipart/form-data` request containing a WAV file. The model automatically detects the language from the supported list.

**Request:**

```bash
curl http://127.0.0.1:35248/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@audio.wav"
```

**Response:**

```json
{
  "text": "Hello world, this is a local transcription."
}
```
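
The same request can be made from Python without third-party packages. This is a minimal sketch, assuming the server accepts a single multipart field named `file` (as in the curl example) and returns the JSON shape shown above; `encode_multipart` and `transcribe` are hypothetical helper names, and the `audio/wav` part content type is an assumption.

```python
import json
import urllib.request
import uuid


def encode_multipart(field, filename, data, content_type="audio/wav"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://127.0.0.1:35248"):
    """POST a WAV file to /v1/audio/transcriptions and return the text."""
    with open(path, "rb") as f:
        body, ctype = encode_multipart("file", "audio.wav", f.read())
    req = urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": ctype},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

With the server running, `transcribe("audio.wav")` returns the value of the `text` field.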

### 2. Text-to-Speech (TTS)

**Endpoint:** `POST /v1/audio/speech`

Synthesize text into audio.

**Request Body:**

| Field | Type | Description |
| :--- | :--- | :--- |
| `input` | String | The text to be synthesized |
| `voice` | String | (Optional) The voice ID (Default: `af_alloy`) |

**Example:**

```bash
curl http://127.0.0.1:35248/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, I am a locally hosted voice.",
    "voice": "af_bella"
  }' --output output.wav
```
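
For scripted use, the request can be built in Python with the standard library. The sketch below is illustrative only: `build_speech_request` and `synthesize` are hypothetical names, and it assumes the response body is the raw audio to be written to disk, as the `--output output.wav` in the curl example suggests.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:35248"


def build_speech_request(text, voice=None, base_url=BASE_URL):
    """Build the POST request for /v1/audio/speech.

    `voice` is omitted from the payload when None, so the server
    applies its documented default (af_alloy).
    """
    payload = {"input": text}
    if voice is not None:
        payload["voice"] = voice
    return urllib.request.Request(
        base_url + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, voice=None, base_url=BASE_URL):
    """Send the request, save the audio to out_path, return bytes written."""
    with urllib.request.urlopen(build_speech_request(text, voice, base_url)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

With the server running, `synthesize("Hello, I am a locally hosted voice.", "output.wav", voice="af_bella")` mirrors the curl call above.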

### 3. Voice Discovery

**Endpoint:** `GET /v1/audio/voices`

Retrieve the list of available TTS voices sorted alphabetically.

**Request:**

```bash
curl http://127.0.0.1:35248/v1/audio/voices
```

**Response:**

```json
{
  "voices": [
    { "id": "af_alloy", "name": "af_alloy" },
    { "id": "af_aoede", "name": "af_aoede" }
  ]
}
```
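
A client can use this endpoint to populate a voice picker at startup. The following sketch extracts the voice IDs from a response of the shape shown above; `parse_voice_ids` and `list_voices` are hypothetical helper names, and fields other than `id` are ignored.

```python
import json
import urllib.request


def parse_voice_ids(body):
    """Return the list of voice IDs from a /v1/audio/voices response body.

    Accepts str or bytes; a missing "voices" key yields an empty list.
    """
    data = json.loads(body)
    return [voice["id"] for voice in data.get("voices", [])]


def list_voices(base_url="http://127.0.0.1:35248"):
    """Fetch the voice list from a running server."""
    with urllib.request.urlopen(base_url + "/v1/audio/voices") as resp:
        return parse_voice_ids(resp.read())
```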

---

## Language & Voice Support

### Speech-to-Text (STT) Languages

S2S supports transcription for the following languages:

| | | | | |
| :--- | :--- | :--- | :--- | :--- |
| Bulgarian (`bg`) | Croatian (`hr`) | Czech (`cs`) | Danish (`da`) | Dutch (`nl`) |
| English (`en`) | Estonian (`et`) | Finnish (`fi`) | French (`fr`) | German (`de`) |
| Greek (`el`) | Hungarian (`hu`) | Italian (`it`) | Latvian (`lv`) | Lithuanian (`lt`) |
| Maltese (`mt`) | Polish (`pl`) | Portuguese (`pt`) | Romanian (`ro`) | Slovak (`sk`) |
| Slovenian (`sl`) | Spanish (`es`) | Swedish (`sv`) | Russian (`ru`) | Ukrainian (`uk`) |

### Text-to-Speech (TTS) Voices

For TTS, the language is determined automatically based on the prefix of the selected voice.

| Language | Voice Prefix | Examples |
| :--- | :--- | :--- |
| **English (US)** | `af_`, `am_` | `af_alloy`, `af_sky`, `am_adam`, `am_echo` |
| **English (UK)** | `bf_`, `bm_` | `bf_alice`, `bm_daniel` |
| **Spanish** | `ef_`, `em_` | `ef_dora`, `em_alex` |
| **French** | `ff_` | `ff_siwis` |
| **Hindi** | `hf_`, `hm_` | `hf_alpha`, `hm_psi` |
| **Italian** | `if_`, `im_` | `if_sara`, `im_nicola` |
| **Japanese** | `jf_`, `jm_` | `jf_alpha`, `jm_kumo` |
| **Portuguese** | `pf_`, `pm_` | `pf_dora`, `pm_santa` |
| **Chinese** | `zf_`, `zm_` | `zf_xiaobei`, `zm_yunxi` |
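
A client that wants to group or filter voices by language can mirror the table above with a simple lookup. This is a client-side sketch only, not a description of how the server resolves languages internally; `VOICE_PREFIX_LANGUAGE` and `language_for_voice` are hypothetical names, and the mapping contains only the prefixes listed in the table.

```python
# Prefix (the part before the first underscore) -> language, per the table.
VOICE_PREFIX_LANGUAGE = {
    "af": "English (US)", "am": "English (US)",
    "bf": "English (UK)", "bm": "English (UK)",
    "ef": "Spanish", "em": "Spanish",
    "ff": "French",
    "hf": "Hindi", "hm": "Hindi",
    "if": "Italian", "im": "Italian",
    "jf": "Japanese", "jm": "Japanese",
    "pf": "Portuguese", "pm": "Portuguese",
    "zf": "Chinese", "zm": "Chinese",
}


def language_for_voice(voice_id):
    """Return the language for a voice ID, or None if the prefix is unknown."""
    prefix, sep, _ = voice_id.partition("_")
    if not sep:
        # No underscore, so there is no prefix to look up.
        return None
    return VOICE_PREFIX_LANGUAGE.get(prefix)
```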

---

## Requirements

- **Operating System:** Linux, macOS, or Windows.

- **Audio Format:** For STT, input must be in **WAV** format (16 kHz mono recommended).

- **Disk Space:** Approximately 1.5 GB for models and dependencies.
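
Before uploading a file for transcription, it can help to check it against the recommended audio format. The sketch below uses Python's built-in `wave` module; `describe_wav` and `matches_recommendation` are hypothetical helper names. Note that `wave` reads uncompressed PCM WAV files and raises `wave.Error` on other encodings, and that 16 kHz mono is a recommendation in this README, not a stated hard requirement.

```python
import wave


def describe_wav(path):
    """Read basic parameters from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": wav.getnframes() / rate,
        }


def matches_recommendation(path):
    """True if the file is 16 kHz mono, the format recommended for STT."""
    info = describe_wav(path)
    return info["channels"] == 1 and info["sample_rate"] == 16000
```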

## Acknowledgments

- [sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) for the underlying inference engine.

- [Kokoro](https://github.com/hexgrad/Kokoro) for the TTS weights.

- [NVIDIA](https://nvidia.com) for the Parakeet TDT ASR models.
