https://github.com/embeddedllm/vllm-responses
- Host: GitHub
- URL: https://github.com/embeddedllm/vllm-responses
- Owner: EmbeddedLLM
- License: apache-2.0
- Created: 2026-02-25T05:06:21.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-25T03:59:37.000Z (3 months ago)
- Last Synced: 2026-03-25T17:15:21.801Z (3 months ago)
- Language: Python
- Homepage: https://embeddedllm.github.io/vllm-responses/
- Size: 1.02 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# vLLM Responses
FastAPI gateway that exposes an OpenAI-style **Responses API** (`/v1/responses`) in front of a vLLM **OpenAI-compatible** server (`/v1/chat/completions`), with:
- SSE streaming event shape + ordering
- `previous_response_id` statefulness (ResponseStore)
- gateway-executed built-in tool: `code_interpreter`
- gateway-hosted MCP tools (`tools[].type="mcp"` with configured `server_label`)
Current MCP boundary:
- `tools[].type="mcp"` is gateway-hosted MCP resolved via `VR_MCP_CONFIG_PATH`.
- Request-declared MCP targets (`server_url`, `connector_id`) are not supported yet.
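As a rough illustration of the boundary above, the helper below builds a `tools[]` entry for a gateway-hosted MCP server and rejects the request-declared fields that are not supported yet. `build_mcp_tool` and the `github` label are hypothetical names used only for this sketch; the label must match whatever is configured in the file that `VR_MCP_CONFIG_PATH` points to. The check is a client-side convenience, not the gateway's own validation logic.

```python
# Hypothetical client-side helper; not part of vllm-responses.
UNSUPPORTED_MCP_FIELDS = ("server_url", "connector_id")


def build_mcp_tool(server_label: str, **extra: object) -> dict:
    """Build a tools[] entry that targets a gateway-hosted MCP server."""
    rejected = [name for name in UNSUPPORTED_MCP_FIELDS if name in extra]
    if rejected:
        # Request-declared targets are not supported yet, so fail early.
        raise ValueError(f"unsupported request-declared MCP fields: {rejected}")
    return {"type": "mcp", "server_label": server_label, **extra}


# "github" is an assumed label; use one from your own MCP config.
print(build_mcp_tool("github"))
```

The resulting dict can be passed in the `tools` list of a `/v1/responses` request.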
**[📚 Full User Documentation](https://embeddedllm.github.io/vllm-responses/)** (Guides, API Reference, Examples)
Design docs (maintainer-facing): `design_docs/index.md`.
## Install
The `vllm-responses` CLI is provided by the Python package in `responses/`.
**Prerequisites:** Python 3.12+ and `uv`.
### Install from a prebuilt wheel (Linux x86_64) (Recommended)
Download a prebuilt wheel (`vllm_responses-*.whl`) from GitHub Releases (preferred) or a CI run artifact, then install it:
```bash
uv venv --python=3.12
source .venv/bin/activate
uv pip install vllm
uv pip install path/to/vllm_responses-*.whl
```
On Linux x86_64 wheels, the Code Interpreter server binary is bundled, so **Bun is not required**.
Currently, wheels are only built for Linux x86_64.
Installing `vllm-responses` provides:
- `vllm-responses` for the standalone supervisor mode
- `vllm` as a CLI shim that supports `vllm serve --responses` and delegates all non-Responses paths to the upstream `vllm` Python package
### Install from source (repo checkout) (Development)
```bash
git clone https://github.com/EmbeddedLLM/vllm-responses
cd vllm-responses
uv venv --python=3.12
source .venv/bin/activate
uv pip install vllm
uv pip install -e ./responses
# Development: enable Code Interpreter via Bun fallback
# - Required for source checkouts when running with `code_interpreter` enabled (default)
cd responses/python/vllm_responses/tools/code_interpreter
bun install
export VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1
cd -
vllm-responses --help
```
Verify installation:
```bash
vllm-responses --help
vllm --help
```
### Optional dependency sets (extras)
Install any combination via:
```bash
uv pip install -e './responses[<extra1>,<extra2>]'
```
Available extras:
- `docs`: MkDocs toolchain (contributors).
- `lint`: Ruff + Markdown formatting.
- `test`: Pytest + coverage + load testing tools.
- `tracing`: OpenTelemetry tracing support (only needed if you enable `VR_TRACING_ENABLED=true`).
- `build`: Package build/publish tools.
- `all`: Everything above.
## Build a wheel from source
If you want to produce a local wheel from this checkout, build from the
`responses/` package directory.
### Rebuild the bundled Code Interpreter binary (Linux x86_64 only)
This step is only needed if you want the wheel to include a freshly compiled
Code Interpreter binary.
```bash
bash scripts/ci/prebuild_code_interpreter_linux_x86_64.sh responses
```
The script writes the bundled executable under:
- `responses/python/vllm_responses/tools/code_interpreter/bin/linux/x86_64/code-interpreter-server`
### Build wheel and sdist
```bash
uv pip install -e './responses[build]'
cd responses
python -m build --wheel --sdist
```
Build artifacts are written to:
- `responses/dist/`
On Linux x86_64, wheels built after the prebuild step bundle the native Code
Interpreter binary. On other platforms, use the source-install Bun fallback or
disable Code Interpreter.
## Run
### Remote-upstream gateway mode (`vllm-responses serve`)
Prereqs:
- If `code_interpreter` is enabled (default), the first start may download the Pyodide runtime (~400MB) into a cache directory (see `VR_PYODIDE_CACHE_DIR`). This requires `tar` to be installed.
- For non-Linux platforms (or source installs without the bundled binary), you can disable the tool via `--code-interpreter disabled`. For development you can also enable the Bun-based fallback via `VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1`.
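A small preflight check can catch a missing `tar` before the first start. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper and is not provided by `vllm-responses`.

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(
    required: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names in `required` that cannot be found on PATH."""
    return [name for name in required if which(name) is None]


missing = missing_tools(["tar"])
if missing:
    print(f"Install before starting with code_interpreter enabled: {missing}")
else:
    print("Preflight OK")
```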
External upstream (you start vLLM yourself; the `/v1` suffix on the upstream URL is optional):
```bash
vllm-responses serve --upstream http://127.0.0.1:8457
```
The Responses endpoint is:
- `POST http://127.0.0.1:5969/v1/responses`
Remote access note:
- If you bind the gateway with `--gateway-host 0.0.0.0`, use the machine’s IP/hostname to connect (not `0.0.0.0`).
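To make the remote access note concrete, the sketch below derives the `base_url` a client should use from the bind address. `client_base_url` is a hypothetical helper, `192.0.2.10` is a documentation-only example address, and port `5969` is taken from the endpoint shown above.

```python
def client_base_url(bind_host: str, port: int, reachable_host: str) -> str:
    """Return the base URL a client should use for a gateway bound to `bind_host`."""
    wildcard_hosts = {"0.0.0.0", "::"}
    # A wildcard bind address is not connectable; use the machine's IP/hostname instead.
    host = reachable_host if bind_host in wildcard_hosts else bind_host
    if ":" in host:
        # Bare IPv6 literals must be bracketed inside a URL.
        host = f"[{host}]"
    return f"http://{host}:{port}/v1"


print(client_base_url("0.0.0.0", 5969, "192.0.2.10"))
```

The returned string can be passed as `base_url` to the OpenAI Python SDK client.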
### Integrated runtime (`vllm serve --responses`)
Prereq:
- Install upstream `vllm` first, then install `vllm-responses` into the same environment.
Example:
```bash
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3.5-0.8B \
  --responses \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --host 0.0.0.0 \
  --port 8457
```
CLI help:
- `vllm serve --help` shows upstream vLLM help
- `vllm serve --responses --help` shows the Responses-owned integrated flags
### Optional: ResponseStore hot cache (Redis)
`previous_response_id` hydration reads the previous response state from the DB. For multi-worker deployments, you can optionally enable a Redis-backed hot cache to reduce DB reads/latency.
Env vars (default off):
- `VR_RESPONSE_STORE_CACHE=1`
- `VR_RESPONSE_STORE_CACHE_TTL_SECONDS=3600`
Redis connection:
- `VR_REDIS_HOST`, `VR_REDIS_PORT`
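The idea behind the hot cache can be sketched as a read-through cache with a TTL. The code below is an in-memory illustration only: `HotCache` and `fake_db` are hypothetical names, a plain dict stands in for Redis, and none of this reflects how the gateway's ResponseStore is actually implemented.

```python
import time
from typing import Callable, Optional


class HotCache:
    """In-memory stand-in for a TTL cache in front of a slower response store."""

    def __init__(
        self,
        load_from_db: Callable[[str], Optional[dict]],
        ttl_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}  # id -> (expiry, state)

    def get(self, response_id: str) -> Optional[dict]:
        now = self._clock()
        hit = self._entries.get(response_id)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh hit: no DB read
        state = self._load(response_id)  # miss or expired: read from the DB
        if state is not None:
            self._entries[response_id] = (now + self._ttl, state)
        return state


db_reads: list[str] = []


def fake_db(response_id: str) -> Optional[dict]:
    db_reads.append(response_id)  # record every simulated DB read
    return {"id": response_id}


cache = HotCache(fake_db, ttl_seconds=3600)
cache.get("resp_1")
cache.get("resp_1")  # served from the cache while the entry is fresh
print(f"DB reads: {len(db_reads)}")
```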
## Quick smoke test (OpenAI Python SDK)
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5969/v1", api_key="dummy")

# Turn 1: ask the model to run code through the code_interpreter tool.
with client.responses.stream(
    model="MiniMaxAI/MiniMax-M2.1",
    input=[{"role": "user", "content": "You MUST call the code_interpreter tool. Execute: 2+2. Reply with ONLY the number."}],
    tools=[{"type": "code_interpreter"}],
    tool_choice="auto",
    include=["code_interpreter_call.outputs"],
) as stream:
    for evt in stream:
        # Skip incremental delta events; print only the other event types.
        if getattr(evt, "type", "").endswith(".delta"):
            continue
        print(getattr(evt, "type", evt))
    r1 = stream.get_final_response().id

# Turn 2: continue from the stored response via previous_response_id.
with client.responses.stream(
    model="MiniMaxAI/MiniMax-M2.1",
    previous_response_id=r1,
    input=[{"role": "user", "content": "What number did you just compute? Reply with ONLY the number."}],
    tool_choice="none",
) as stream:
    for evt in stream:
        if getattr(evt, "type", "").endswith(".delta"):
            continue
        print(getattr(evt, "type", evt))
```
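The event filter used in the smoke test can be tried offline without a running gateway. The sketch below applies the same `.delta` check to a list of event type strings; `non_delta_event_types` is a hypothetical helper, and the sample names follow the OpenAI Responses streaming event naming but are illustrative, not a recorded sequence from this gateway.

```python
from typing import Iterable, Iterator


def non_delta_event_types(event_types: Iterable[str]) -> Iterator[str]:
    """Yield only the event types that are not incremental `.delta` events."""
    for event_type in event_types:
        if not event_type.endswith(".delta"):
            yield event_type


# Illustrative sample only; a real stream will contain more event types.
sample = [
    "response.created",
    "response.output_text.delta",
    "response.output_text.done",
    "response.completed",
]
print(list(non_delta_event_types(sample)))
```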