https://github.com/alez007/yasha
Yet another self-hosted agent
https://github.com/alez007/yasha
Last synced: 2 months ago
JSON representation
Yet another self-hosted agent
- Host: GitHub
- URL: https://github.com/alez007/yasha
- Owner: alez007
- License: mit
- Created: 2025-08-05T13:01:43.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2026-04-05T05:32:24.000Z (2 months ago)
- Last Synced: 2026-04-05T07:52:32.684Z (2 months ago)
- Language: Python
- Size: 2.25 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
Awesome Lists containing this project
README
# Yasha
Self-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings) on a single GPU, exposing an OpenAI-compatible API. Built on [vLLM](https://github.com/vllm-project/vllm) and [Ray](https://github.com/ray-project/ray).
## Requirements
- **NVIDIA GPU** — 16 GB+ VRAM recommended for a full stack (LLM + TTS + STT + embeddings); 8 GB is sufficient for lighter setups
- **Docker** with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
- **HuggingFace token** for gated models
## Features
- **Multi-model on a single GPU** — run chat, embedding, STT, and TTS models simultaneously with tunable per-model GPU memory allocation
- **Per-model isolated deployments** — each model runs in its own Ray Serve deployment with independent lifecycle, health checks, and failure isolation
- **OpenAI-compatible API** — drop-in replacement for any OpenAI SDK client
- **Streaming** — SSE streaming for chat completions and TTS audio
- **Tool/function calling** — auto tool choice with configurable parsers
- **Plugin system** — opt-in TTS backends installed as isolated uv workspace packages
- **Multi-GPU support** — assign models to specific GPUs by index or named Ray resource, with full tensor parallelism support
- **Client disconnect detection** — cancels in-flight inference when the client disconnects, freeing GPU resources immediately
- **Ray dashboard** — monitor deployments, resources, and request logs
## Supported OpenAI Endpoints
| Endpoint | Usecase |
|---|---|
| `POST /v1/chat/completions` | Chat / text generation (streaming and non-streaming) |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/audio/transcriptions` | Speech-to-text |
| `POST /v1/audio/translations` | Audio translation |
| `POST /v1/audio/speech` | Text-to-speech (SSE streaming or single-response) |
| `GET /v1/models` | List available models |
## Plugin Support
Yasha's TTS system is built around a plugin architecture — each TTS backend is an opt-in package with its own isolated dependencies. Plugins ship inside this repo (`plugins/`) or can be installed from PyPI.
To enable plugins, pass them as extras at sync time:
```bash
uv sync --extra kokoro
uv sync --extra kokoro --extra orpheus # multiple plugins
```
When using Docker, set the `YASHA_PLUGINS` environment variable:
```
YASHA_PLUGINS=kokoro,orpheus
```
For a full guide on writing your own plugin, see [Plugin Development](docs/plugins.md).
## Getting Started
Pull the latest image from GHCR:
```bash
docker pull ghcr.io/alez007/yasha:latest
```
Grab an example config for your GPU and edit it to your liking:
```bash
docker run --rm ghcr.io/alez007/yasha:latest cat /yasha/config/models.example.16GB.yaml > models.yaml
```
Start the server, mounting your config and a cache directory so models are only downloaded once:
```bash
docker run --rm --shm-size=8g --gpus all \
-e HF_TOKEN=your_token_here \
-e YASHA_PLUGINS=your_plugins_here \
-v ./models.yaml:/yasha/config/models.yaml \
-v ./models-cache:/yasha/.cache/models \
-p 8265:8265 -p 8000:8000 ghcr.io/alez007/yasha:latest
```
- API: `http://localhost:8000`
- Ray dashboard: `http://localhost:8265`
Example configs are included for 8 GB, 16 GB, 24 GB, and 2×16 GB GPU setups.
## Development
See [Development](docs/development.md) for instructions on building the dev image, running with live source mounting, and attaching to the container.
## Model Configuration
See [Model Configuration](docs/model-configuration.md) for a full reference of all `models.yaml` fields, GPU pinning options, and environment variables.
## Home Assistant Integration
Yasha can serve as a voice backend for [Home Assistant](https://www.home-assistant.io/) via the Wyoming protocol. See [Home Assistant Integration](docs/home-assistant.md) for setup instructions.