{"id":34570758,"url":"https://github.com/erans/vllm-jukebox","last_synced_at":"2026-05-24T20:32:38.316Z","repository":{"id":328872028,"uuid":"1117080668","full_name":"erans/vllm-jukebox","owner":"erans","description":"Server that multiplexes multiple LLM models through vLLM backends with automatic model swapping, multi-GPU scheduling, and graceful request draining","archived":false,"fork":false,"pushed_at":"2025-12-24T07:13:16.000Z","size":196,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-25T20:50:56.883Z","etag":null,"topics":["inference","vllm","vllm-jukebox"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erans.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-15T20:00:22.000Z","updated_at":"2025-12-25T20:25:44.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/erans/vllm-jukebox","commit_stats":null,"previous_names":["erans/vllm-jukebox"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/erans/vllm-jukebox","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erans%2Fvllm-jukebox","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erans%2Fvllm-jukebox/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erans%2Fvllm-jukebox/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erans%2Fvllm-jukebox/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erans","download_url":"https://codeload.github.com/erans/vllm-jukebox/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erans%2Fvllm-jukebox/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33450398,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-24T19:21:36.376Z","status":"ssl_error","status_checked_at":"2026-05-24T19:21:10.562Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inference","vllm","vllm-jukebox"],"created_at":"2025-12-24T09:35:01.303Z","updated_at":"2026-05-24T20:32:38.310Z","avatar_url":"https://github.com/erans.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vLLM Jukebox\n\nvLLM Jukebox is an OpenAI-compatible HTTP server that can run in either:\n\n- **Swap mode**: fronts a **single** vLLM instance and automatically **swaps the loaded model** based on each incoming request’s `model`.\n- **Scheduler mode**: runs **multiple concurrent** vLLM instances (one per configured GPU set + port), routes requests by `model`, and can evict non-pinned instances (LRU) to make room for larger models.\n\nThis is useful when:\n- You want one stable OpenAI-compatible endpoint, but multiple models (with swap-on-demand).\n- You’re OK with only one model being loaded at a time (no multi-instance/zero-downtime swaps).\n- You have multiple GPUs and want multiple models served concurrently (scheduler mode).\n\n## Features\n\n- OpenAI-ish endpoints: `POST /v1/chat/completions`, `POST /v1/completions`, `POST /v1/responses`\n- Anthropic protocol support: `POST /v1/messages`\n- Graceful swaps: drains in-flight requests before restarting vLLM\n- Aliases: multiple client-facing model names can point to one underlying model config\n- Prometheus metrics at `GET /metrics`\n- Operational endpoints: `GET /health`, `GET /status`\n\n## Requirements\n\n- Go (to build `jukebox`)\n- A way to run vLLM:\n  - Recommended: `uvx` (default `vllm.binary: \"uvx\"`), which runs vLLM as `uvx vllm serve ...`\n  - Alternative: set `vllm.binary: \"vllm\"` if you have `vllm` installed directly\n- A model available to vLLM:\n  - A local filesystem path (e.g. `/models/Qwen2.5-0.5B-Instruct`), or\n  - A HuggingFace model ID (e.g. `Qwen/Qwen2.5-0.5B-Instruct`)\n\nNotes:\n- Some HuggingFace models are gated (e.g. `meta-llama/*`) and require access + `HUGGING_FACE_HUB_TOKEN` (or `HF_TOKEN`).\n- vLLM needs sufficient free GPU memory; tune `gpu_memory_utilization` / `max_model_len` if you hit OOMs.\n\n## Installation\n\n### Download pre-built binary\n\nDownload the latest release from the [Releases page](https://github.com/erans/vllm-jukebox/releases):\n\n```bash\n# Download and extract\ncurl -sL https://github.com/erans/vllm-jukebox/releases/latest/download/jukebox-linux-amd64.tar.gz | tar xz\nchmod +x jukebox\n```\n\n### Build from source\n\n```bash\ngo build -o bin/jukebox ./cmd/jukebox\n```\n\n## Run\n\n```bash\n./bin/jukebox -config configs/example.yaml\n```\n\n## Configuration\n\nJukebox is configured with a single YAML file.\n\n### Minimal config\n\n```yaml\nserver:\n  host: \"127.0.0.1\"\n  port: 8080\n\nvllm:\n  # Use \"uvx\" to run `uvx vllm serve ...`\n  # You can also set this to an absolute path, e.g. \"/usr/local/bin/uvx\".\n  binary: \"uvx\"\n  port: 8000\n\nmodels:\n  qwen:\n    path: \"Qwen/Qwen2.5-0.5B-Instruct\"\n```\n\n### Multiple models + aliases\n\n```yaml\nserver:\n  host: \"0.0.0.0\"\n  port: 8080\n  log_requests: true\n\nvllm:\n  binary: \"uvx\"\n  port: 8000\n  startup_timeout: 300s\n  shutdown_timeout: 30s\n  drain_timeout: 60s\n  swap_cooldown: 30s\n  swap_wait_timeout: 60s\n  defaults:\n    gpu_memory_utilization: 0.70\n    dtype: \"float16\"\n    max_model_len: 2048\n\nbehavior:\n  # Optional: preload one model at startup.\n  default_model: \"qwen\"\n  # If true, rewrites response JSON/SSE `model` fields to match the request.\n  rewrite_model_name: true\n\nmodels:\n  qwen:\n    path: \"Qwen/Qwen2.5-0.5B-Instruct\"\n    tensor_parallel_size: 1\n\n  # Alias example (client asks for \"gpt-3.5-turbo\", but we serve qwen)\n  gpt-3.5-turbo:\n    alias: qwen\n```\n\n### Scheduler mode (multi-instance, multi-GPU)\n\nIn scheduler mode, each non-alias model declares an **exact GPU set** and a **minimum free VRAM requirement per GPU**. Jukebox allocates a unique port per running instance from the configured port range and sets `CUDA_VISIBLE_DEVICES` automatically (do not set it in `env`).\n\n```yaml\nserver:\n  host: \"0.0.0.0\"\n  port: 8080\n\nvllm:\n  binary: \"uvx\"\n  startup_timeout: 300s\n  shutdown_timeout: 30s\n  drain_timeout: 60s\n  swap_wait_timeout: 60s\n\nscheduler:\n  port_range_start: 8100\n  port_range_end: 8199\n  max_instances: 8\n  min_instance_uptime: 30s\n  nvidia_smi_binary: \"nvidia-smi\"\n\nmodels:\n  small:\n    path: \"Qwen/Qwen2.5-0.5B-Instruct\"\n    gpus: [0]\n    min_free_mem_mb_per_gpu: 4000\n\n  big:\n    path: \"/models/Meta-Llama-3-70B-Instruct\"\n    gpus: [0,1,2,3]\n    min_free_mem_mb_per_gpu: 40000\n\n  pinned-hot:\n    path: \"/models/Some-Always-On-Model\"\n    gpus: [4]\n    min_free_mem_mb_per_gpu: 16000\n    pinned: true\n```\n\nModel naming rules:\n- Client-facing model names are the YAML keys under `models:`.\n- Aliases (`alias: other_name`) let you support multiple names for the same underlying model config.\n- Jukebox forwards the *configured* `path` to vLLM (so vLLM sees a real model ID/path even if the client requested an alias).\n\n## Endpoints\n\nModel-bearing endpoints (extract `model` from body, ensure a backend instance is ready, then proxy to vLLM):\n- `POST /v1/responses`\n- `POST /v1/chat/completions`\n- `POST /v1/completions`\n- `POST /v1/embeddings`\n- `POST /v1/tokenize`\n- `POST /v1/detokenize`\n\nAnthropic endpoints (proxy to vLLM):\n- `POST /v1/messages`\n\nNotes:\n- These endpoints require a `model` field in the JSON body (matching OpenAI semantics).\n- In scheduler mode, each request is routed to the vLLM instance for that model (potentially triggering eviction/start).\n\nJukebox endpoints:\n- `GET /v1/models` (returns configured models, not vLLM’s)\n- `GET /health`\n- `GET /status` (bind to localhost / protect in production)\n- `GET /metrics` (Prometheus)\n\n## Example request\n\n```bash\ncurl -sS http://127.0.0.1:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"qwen\",\n    \"messages\": [{\"role\":\"user\",\"content\":\"Say hello in one short sentence.\"}]\n  }'\n```\n\n## Smoke tests\n\nLightweight (no real vLLM required):\n```bash\n./scripts/smoke.sh\n```\n\nScheduler mode (no real vLLM / no GPU required; uses fake `nvidia-smi` + fake vLLM server):\n```bash\n./scripts/smoke_scheduler.sh\n```\n\nReal vLLM integration (opt-in; requires `uvx` and a model):\n```bash\nRUN_VLLM_SMOKE=1 VLLM_MODEL=Qwen/Qwen2.5-0.5B-Instruct ./scripts/smoke_vllm.sh\n```\n\nUseful knobs for the real smoke test:\n- `VLLM_GPU_MEMORY_UTILIZATION`\n- `VLLM_MAX_MODEL_LEN`\n- `VLLM_DTYPE`\n- `VLLM_STARTUP_TIMEOUT_SECS`\n- `JBOX_REQUEST_TIMEOUT_SECS`\n\n## Troubleshooting\n\n- vLLM exits immediately with a gated-model error: export `HUGGING_FACE_HUB_TOKEN` (or `HF_TOKEN`) and ensure you have access.\n- vLLM fails with GPU memory errors: stop other GPU-heavy processes, or lower `gpu_memory_utilization` / `max_model_len`.\n- Model load/scheduling takes time: the triggering request can block (up to `swap_wait_timeout`); other requests that require scheduling get `503` with `Retry-After`. Requests for already-ready models continue serving.\n- Scheduler mode: do not set `CUDA_VISIBLE_DEVICES` in `vllm.default_env` or `models.\u003cname\u003e.env` (the scheduler owns it).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferans%2Fvllm-jukebox","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferans%2Fvllm-jukebox","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferans%2Fvllm-jukebox/lists"}