{"id":48726817,"url":"https://github.com/alez007/modelship","last_synced_at":"2026-04-24T10:05:02.753Z","repository":{"id":309344627,"uuid":"1032537538","full_name":"alez007/modelship","owner":"alez007","description":"Self-hosted, multi-model AI inference server. Run LLMs, TTS, STT, embeddings, and image generation with an OpenAI-compatible API.","archived":false,"fork":false,"pushed_at":"2026-04-19T09:46:28.000Z","size":1796,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-19T12:32:33.409Z","etag":null,"topics":["ai","ai-platform","diffusers","embeddings","image-generation","inference","llm","openai","ray","self-hosted","self-hosted-ai","stt","tts","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alez007.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-05T13:01:43.000Z","updated_at":"2026-04-19T09:46:32.000Z","dependencies_parsed_at":"2026-04-07T10:01:12.590Z","dependency_job_id":null,"html_url":"https://github.com/alez007/modelship","commit_stats":null,"previous_names":["alez007/yasha","alez007/modelship"],"tags_count":30,"template":false,"template_full_name":null,"purl":"pkg:github/alez007/modelship","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alez007%2Fmodelship","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alez007%2Fmodelship/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alez007%2Fmodelship/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alez007%2Fmodelship/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alez007","download_url":"https://codeload.github.com/alez007/modelship/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alez007%2Fmodelship/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32218294,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T09:47:08.147Z","status":"ssl_error","status_checked_at":"2026-04-24T09:46:41.165Z","response_time":64,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ai-platform","diffusers","embeddings","image-generation","inference","llm","openai","ray","self-hosted","self-hosted-ai","stt","tts","vllm"],"created_at":"2026-04-11T23:03:19.827Z","updated_at":"2026-04-24T10:05:02.736Z","avatar_url":"https://github.com/alez007.png","language":"Python","readme":"# Modelship\n\n[![CI](https://github.com/alez007/modelship/actions/workflows/ci.yml/badge.svg)](https://github.com/alez007/modelship/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)\n\nSelf-hosted, multi-model AI inference server. Runs LLMs alongside specialized models (TTS, speech-to-text, embeddings, image generation) on GPU or CPU, exposing an OpenAI-compatible API. Built on [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) with pluggable inference backends: [vLLM](https://github.com/vllm-project/vllm) for high-throughput GPU inference, [HuggingFace Transformers](https://github.com/huggingface/transformers) for CPU and lightweight GPU workloads, [llama.cpp](https://github.com/abetlen/llama-cpp-python) for high-efficiency GGUF models on CPU, [Diffusers](https://github.com/huggingface/diffusers) for image generation, and a plugin system for custom backends.\n\n## Why Modelship?\n\nMost self-hosted inference tools focus on running a single model. Modelship is for when you need **multiple models running simultaneously** — an LLM, a TTS engine, a speech-to-text model, an embedding model, and an image generator — all behind a single OpenAI-compatible API, with fine-grained control over GPU memory allocation across them.\n\n- **One server, many models** — run a full AI stack (chat + TTS + STT + embeddings + image gen) on a single machine instead of juggling separate services\n- **GPU memory control** — allocate exact GPU fractions per model (e.g. 70% for the LLM, 5% for TTS) so everything fits on your hardware\n- **Mix and match backends** — use vLLM for high-throughput GPU inference, Transformers or llama.cpp for CPU-only workloads, Diffusers for images, and plugins for custom backends — in the same deployment\n- **Drop-in OpenAI replacement** — any OpenAI SDK client works out of the box, making it easy to integrate with existing apps and tools like [Home Assistant](docs/home-assistant.md)\n\n## Architecture\n\n```mermaid\ngraph TD\n    Client[\"Client (OpenAI SDK / curl)\"]\n    API[\"FastAPI Gateway\u003cbr/\u003eOpenAI-compatible API\u003cbr/\u003e:8000\"]\n\n    Client --\u003e|HTTP| API\n    API --\u003e|round-robin| LLM_GPU\n    API --\u003e|round-robin| LLM_CPU\n    API --\u003e|round-robin| TTS\n    API --\u003e|round-robin| STT\n    API --\u003e|round-robin| EMB\n    API --\u003e|round-robin| IMG\n\n    subgraph GPU0[\"GPU 0 — vLLM\"]\n        LLM_GPU[\"LLM Deployment\u003cbr/\u003ee.g. Llama 3.1 8B\u003cbr/\u003e70% GPU\"]\n        TTS[\"TTS Deployment\u003cbr/\u003ee.g. Kokoro 82M\u003cbr/\u003e5% GPU\"]\n    end\n\n    subgraph GPU1[\"GPU 1 — Mixed backends\"]\n        STT[\"STT Deployment (vLLM)\u003cbr/\u003ee.g. Whisper Large\u003cbr/\u003e50% GPU\"]\n        EMB[\"Embedding Deployment\u003cbr/\u003ee.g. Nomic Embed\u003cbr/\u003e50% GPU\"]\n    end\n\n    subgraph CPU[\"CPU — Transformers / llama.cpp\"]\n        LLM_CPU[\"LLM Deployment\u003cbr/\u003ee.g. Qwen3-0.6B\u003cbr/\u003eCPU-only\"]\n        STT_CPU[\"STT Deployment\u003cbr/\u003ee.g. Whisper Small\u003cbr/\u003eCPU-only\"]\n    end\n\n    subgraph GPU2[\"GPU 2 — Diffusers\"]\n        IMG[\"Image Generation\u003cbr/\u003ee.g. SDXL Turbo\u003cbr/\u003e35% GPU\"]\n    end\n```\n\nEach model runs as an isolated [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) deployment with its own lifecycle, health checks, and resource budget. Five inference backends are available:\n\n| Backend | Best for | GPU required |\n|---|---|---|\n| **vLLM** | High-throughput chat, embeddings, transcription | Yes |\n| **llama.cpp** | High-efficiency quantized GGUF models (chat, embeddings) | No |\n| **Transformers** | Chat, embeddings, transcription, TTS on CPU or lightweight GPU | No |\n| **Diffusers** | Image generation | Yes |\n| **Custom (plugins)** | TTS backends (Kokoro, Bark, Orpheus) | No |\n\nModels can be deployed across multiple GPUs, run on CPU-only, or both — multiple deployments of the same model (e.g. one on GPU via vLLM, one on CPU via Transformers) are load-balanced with round-robin routing. Each deployment can also scale horizontally with `num_replicas`.\n...\n\n## Requirements\n\n- **Docker** (or Python 3.12+ with `uv` for local development)\n- **NVIDIA GPU** (optional) — 16 GB+ VRAM recommended for a full stack (LLM + TTS + STT + embeddings) via vLLM; 8 GB is sufficient for lighter setups. Not required when using the Transformers backend on CPU\n- **[NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)** — required only when running GPU models in Docker\n- **HuggingFace token** for gated models\n\n## Features\n\n- **Multi-model, multi-GPU** — run chat, embedding, STT, TTS, and image generation models simultaneously across one or more GPUs with tunable per-model GPU memory allocation\n- **CPU-only support** — run models without a GPU using the Transformers backend (chat, embeddings, transcription, TTS). Useful for development, testing, or small models that don't need GPU acceleration\n- **Multiple inference backends** — vLLM for high-throughput GPU inference, HuggingFace Transformers for CPU and lightweight GPU workloads, Diffusers for image generation, and a plugin system for custom backends\n- **Per-model isolated deployments** — each model runs in its own Ray Serve deployment with independent lifecycle, health checks, failure isolation, and configurable replica count\n- **OpenAI-compatible API** — drop-in replacement for any OpenAI SDK client\n- **Streaming** — SSE streaming for chat completions and TTS audio\n- **Tool/function calling** — auto tool choice with configurable parsers\n- **Plugin system** — opt-in TTS backends installed as isolated uv workspace packages\n- **Multi-GPU \u0026 hybrid routing** — assign models to specific GPUs or run them on CPU-only; deploy the same model on both GPU and CPU and requests are load-balanced via round-robin; full tensor parallelism support for large models spanning multiple GPUs\n- **Client disconnect detection** — cancels in-flight inference when the client disconnects, freeing GPU resources immediately\n- **Prometheus metrics \u0026 Grafana dashboard** — built-in observability with custom `modelship:*` metrics, vLLM engine stats, and Ray cluster metrics on a single scrape endpoint; pre-built Grafana dashboard included\n- **Ray dashboard** — monitor deployments, resources, and request logs\n\n## Supported OpenAI Endpoints\n\n| Endpoint | Usecase |\n|---|---|\n| `POST /v1/chat/completions` | Chat / text generation (streaming and non-streaming) |\n| `POST /v1/embeddings` | Text embeddings |\n| `POST /v1/audio/transcriptions` | Speech-to-text |\n| `POST /v1/audio/translations` | Audio translation |\n| `POST /v1/audio/speech` | Text-to-speech (SSE streaming or single-response) |\n| `POST /v1/images/generations` | Image generation |\n| `GET /v1/models` | List available models |\n\n## Quick Start\n\nThe fastest way to try Modelship: run a quantized 7B chat model on a laptop — no GPU required. Copy-paste this block and you'll have an OpenAI-compatible API on `http://localhost:8000` in a few minutes (first run downloads ~4.5 GB of weights into `./models-cache`).\n\n```bash\nmkdir -p models-cache \u0026\u0026 cat \u003e models.yaml \u003c\u003c'EOF'\nmodels:\n  - name: qwen\n    model: lmstudio-community/Qwen2.5-7B-Instruct-GGUF\n    usecase: generate\n    loader: llama_cpp\n    num_cpus: 3\n    llama_cpp_config:\n      hf_filename: \"*Q4_K_M.gguf\"\nEOF\n\ndocker run --rm --shm-size=8g \\\n  -v ./models.yaml:/modelship/config/models.yaml \\\n  -v ./models-cache:/.cache \\\n  -p 8000:8000 \\\n  ghcr.io/alez007/modelship:latest-cpu\n```\n\nImages are multi-arch (amd64 + arm64), so this works on Apple Silicon and ARM Linux hosts too.\n\nOnce the server is up (look for `Deployed app 'modelship api' successfully`), call it:\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\": \"qwen\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}'\n```\n\nOr point any OpenAI SDK at it — no code changes, just swap `base_url`:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8000/v1\", api_key=\"not-needed\")\nresp = client.chat.completions.create(\n    model=\"qwen\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n)\nprint(resp.choices[0].message.content)\n```\n\n### GPU (vLLM, Diffusers)\n\nFor high-throughput GPU inference, use the standard image and add `--gpus all`. You'll also need the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) and an `HF_TOKEN` for gated models. Example `models.yaml` entries for vLLM, Diffusers, and multi-GPU setups live in [docs/model-configuration.md](docs/model-configuration.md); ready-to-run configs are in [config/examples/](config/examples/).\n\n```bash\ndocker run --rm --shm-size=8g --gpus all \\\n  -e HF_TOKEN=your_token_here \\\n  -e RAY_HEAD_GPU_NUM=1 \\\n  -v ./models.yaml:/modelship/config/models.yaml \\\n  -v ./models-cache:/.cache \\\n  -p 8000:8000 \\\n  ghcr.io/alez007/modelship:latest\n```\n\nHitting an error? Check [docs/troubleshooting.md](docs/troubleshooting.md).\n\n## Plugin Support\n\nModelship's TTS system is built around a plugin architecture — each TTS backend is an opt-in package with its own isolated dependencies. Plugins ship inside this repo (`plugins/`) or can be installed from PyPI.\n\nTo enable plugins, pass them as extras at sync time:\n\n```bash\nuv sync --extra kokoro\nuv sync --extra kokoro --extra orpheus  # multiple plugins\n```\n\nWhen using Docker, set the `MSHIP_PLUGINS` environment variable:\n\n```\nMSHIP_PLUGINS=kokoro,orpheus\n```\n\nFor a full guide on writing your own plugin, see [Plugin Development](docs/plugins.md).\n\n## Documentation\n\n- [Development](docs/development.md) — dev environment setup, building, and running locally\n- [Model Configuration](docs/model-configuration.md) — full `models.yaml` reference, GPU pinning, environment variables\n- [Architecture](docs/architecture.md) — system design, request lifecycle, plugin loading\n- [Plugin Development](docs/plugins.md) — writing custom TTS backends\n- [Home Assistant Integration](docs/home-assistant.md) — Wyoming protocol setup for voice automation\n- [Monitoring \u0026 Logging](docs/monitoring.md) — Prometheus metrics, Grafana dashboard, structured logging, health checks\n- [Troubleshooting](docs/troubleshooting.md) — common first-run errors and fixes\n- [Roadmap](ROADMAP.md) — what's planned next and where to contribute\n\n## Monitoring\n\nModelship exposes Prometheus metrics (Ray cluster, Ray Serve, vLLM, and custom `modelship:*` metrics) through a single scrape endpoint on port 8079. Metrics are **enabled by default** — set `MSHIP_METRICS=false` to disable. A pre-built Grafana dashboard is included.\n\nLogging supports structured JSON output (`MSHIP_LOG_FORMAT=json`) and request ID correlation across Ray actor boundaries. Logs can be shipped to a remote syslog server (`--log-target syslog://host:514`) or an OpenTelemetry collector (`--otel-endpoint http://collector:4317`). Set `MSHIP_LOG_LEVEL` to `TRACE` for full request/response payloads, or `DEBUG` for detailed diagnostics without payloads.\n\nSee [Monitoring \u0026 Logging](docs/monitoring.md) for full details.\n\n## Production Readiness\n\nModelship is actively used but not yet hardened for production. Key gaps today: no rate limiting, `/health` is a no-op, thin test coverage, no Helm chart, no Prometheus alerting rules. See the full [Production Readiness Plan](docs/production-readiness.md) for the scorecard and roadmap.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on setting up the dev environment, code style, and submitting pull requests.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falez007%2Fmodelship","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falez007%2Fmodelship","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falez007%2Fmodelship/lists"}