**yapit**: Listen to anything. Open-source TTS for documents, web pages, and text.
[Website](https://yapit.md) | [CLI](https://github.com/yapit-tts/yapit-cli) | [Self-Host](#self-hosting) | [Architecture](docs/architecture.md)

---
Paste a URL or upload a PDF. Yapit renders the document and reads it aloud.
- Handles the documents other TTS tools can't: academic papers with math, citations, figures, tables, messy formatting. Equations get spoken descriptions, citations become prose, page noise is skipped. The original content displays faithfully.
- 170+ voices across 15 languages: premium voices, or free local synthesis that runs entirely in your browser with no account needed.
- Vim-style keyboard shortcuts, document outliner, media key support, adjustable speed, dark mode, share by link.
- Markdown export: append `/md` to any document URL to get clean markdown via curl; `/md-annotated` includes TTS annotations (see the example below).
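For example (a sketch; `<document-path>` stands in for whatever path your document has, since the exact URL scheme depends on your instance):

```bash
# Clean markdown, ready for piping into other tools
curl https://yapit.md/<document-path>/md
# The same document with TTS annotations included
curl https://yapit.md/<document-path>/md-annotated
```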
Powered by [Gemini](https://ai.google.dev/gemini-api), [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M), [Inworld TTS](https://inworld.ai), [DocLayout-YOLO](https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench), [defuddle](https://github.com/kepano/defuddle).
## Self-hosting
```bash
git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit
cp .env.selfhost.example .env.selfhost # edit to enable optional features (AI-extraction, custom TTS models)
make self-host
```
Open [http://localhost](http://localhost). Data persists across restarts.
To stop: `make self-host-down`.
### Multi-user mode
By default, yapit runs in **single-user mode**: no login required, all features unlocked. `.env.selfhost` is self-documenting; see its comments for the optional features (AI extraction, custom TTS models).
If you want user accounts with login (e.g., for a family or small team), set `AUTH_ENABLED=true` in `.env.selfhost`, uncomment the Stack Auth section below it, and run `make self-host-auth` instead. This adds Stack Auth and ClickHouse containers. Note that in single-user mode all requests share one user, so everyone on the network sees the same document library.
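A minimal sketch of the change in `.env.selfhost` (the actual Stack Auth variables are documented in the file's commented-out section, so they are not reproduced here):

```env
AUTH_ENABLED=true
# ...plus the uncommented Stack Auth section that follows it in the file
```

Then bring the stack up with `make self-host-auth` instead of `make self-host`.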
### Custom TTS voices
Use any server implementing the OpenAI `/v1/audio/speech` API ([vLLM-Omni](https://github.com/vllm-project/vllm-omni), [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI), [AllTalk](https://github.com/erew123/alltalk_tts), [Chatterbox TTS](https://github.com/devnen/Chatterbox-TTS-Server), etc.).
Add to `.env.selfhost`:
```env
OPENAI_TTS_BASE_URL=http://your-tts-server:8091/v1
OPENAI_TTS_API_KEY=your-key-or-empty
OPENAI_TTS_MODEL=your-model-name
```
Voices are auto-discovered if the server supports `GET /v1/audio/voices`. Otherwise set `OPENAI_TTS_VOICES=voice1,voice2,...`.
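To check what your server supports, query the discovery endpoint directly (using the base URL from the snippet above); if it returns a 404, fall back to `OPENAI_TTS_VOICES`:

```bash
curl http://your-tts-server:8091/v1/audio/voices
```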
#### Example: OpenAI TTS
OpenAI doesn't support voice auto-discovery, so `OPENAI_TTS_VOICES` is required.
```env
OPENAI_TTS_BASE_URL=https://api.openai.com/v1
OPENAI_TTS_API_KEY=sk-...
OPENAI_TTS_MODEL=tts-1
OPENAI_TTS_VOICES=alloy,echo,fable,nova,onyx,shimmer
```
#### Example: Qwen3-TTS via vLLM-Omni
Requires GPU. The default stage config assumes >=16GB VRAM. For 8GB cards (e.g., RTX 3070 Ti), create a custom config with lower sequence lengths and memory utilization — see the [stage config reference](https://docs.vllm.ai/projects/vllm-omni/en/stable/configuration/stage_configs/).
```bash
pip install vllm-omni
vllm-omni serve Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice \
  --omni --port 8091 --trust-remote-code --enforce-eager \
  --stage-configs-path /path/to/stage_configs.yaml  # only if you have low VRAM; `max_model_len: 1024` should work on 8GB
```
Then configure yapit:
```env
OPENAI_TTS_BASE_URL=http://your-gpu-host:8091/v1
OPENAI_TTS_API_KEY=EMPTY
OPENAI_TTS_MODEL=Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
```
Voices are auto-discovered from the server (9 built-in speakers for CustomVoice models).
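As a quick smoke test, you can hit the standard OpenAI-compatible speech endpoint yourself (the voice value is a placeholder, so substitute one the server reports; the output audio format depends on the server):

```bash
curl http://your-gpu-host:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice", "input": "Hello from yapit.", "voice": "<discovered-voice>"}' \
  --output out.wav
```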
### AI document extraction
Vision-based PDF/image processing works with any OpenAI-compatible API.
Add to `.env.selfhost`:
```env
AI_PROCESSOR=openai
AI_PROCESSOR_BASE_URL=https://openrouter.ai/api/v1 # or your vLLM/Ollama endpoint
AI_PROCESSOR_API_KEY=your-key
AI_PROCESSOR_MODEL=qwen/qwen3-vl-235b-a22b-instruct # any vision-capable model
```
Or use Google Gemini directly (with batch-mode support): `AI_PROCESSOR=gemini` + `GOOGLE_API_KEY=your-key`.
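The equivalent `.env.selfhost` entries:

```env
AI_PROCESSOR=gemini
GOOGLE_API_KEY=your-key
```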
### GPU workers for Kokoro TTS & YOLO figure detection
Kokoro and YOLO run as pull-based workers — any machine with Redis access can join. Connect from the local network or via Tailscale. GPU and CPU workers run side-by-side; faster workers naturally pull more jobs. Scale by running more containers on any machine that can reach Redis.
Prereq: Docker 25+, [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) with [CDI enabled](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html), network access to the Redis instance.
```bash
# One-time GPU setup: generate CDI spec + enable CDI in Docker
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Add {"features": {"cdi": true}} to /etc/docker/daemon.json, then:
sudo systemctl restart docker
git clone --depth 1 https://github.com/yapit-tts/yapit.git && cd yapit
# Pull only the images you need
docker compose -f docker-compose.worker.yml pull kokoro-gpu yolo-gpu
# Start 2 Kokoro + 1 YOLO worker
REDIS_URL=redis://your-redis-host:6379/0 docker compose -f docker-compose.worker.yml up -d \
  --scale kokoro-gpu=2 --scale yolo-gpu=1 kokoro-gpu yolo-gpu
```
Adjust `--scale` to your GPU. A 4GB card fits 2 Kokoro + 1 YOLO comfortably.
#### NVIDIA MPS (recommended for multiple workers per GPU)
[MPS](https://docs.nvidia.com/deploy/mps/) lets multiple workers share one GPU context — less VRAM overhead, no context switching. Without MPS, each worker gets its own CUDA context (~300MB each). The compose file mounts the MPS pipe automatically; just start the daemon.
```bash
sudo tee /etc/systemd/system/nvidia-mps.service > /dev/null <<'EOF'
[Unit]
Description=NVIDIA Multi-Process Service (MPS)
After=nvidia-persistenced.service

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/sh -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-mps
```
## Roadmap
Next:
- Support exporting audio as MP3.
- Support word-level highlighting for Kokoro English.
Later:
- Support the thinking parameter for Gemini.
- Support the temperature parameter for Inworld.
- Support AI-transform for websites.
## Development
```bash
uv sync # install Python dependencies
npm install --prefix frontend # install frontend dependencies
make dev-env 2>/dev/null || touch .env # decrypt secrets, or create empty .env
make dev-cpu # start backend services (Docker Compose)
cd frontend && npm run dev # start frontend
make test-local # run tests
```
See [agent/knowledge/dev-setup.md](agent/knowledge/dev-setup.md) for full setup instructions.
The `agent/knowledge/` directory is the project's in-depth knowledge base, maintained jointly with Claude during development.