https://github.com/renswickd/deploy-gpt-oss-in-ec2
Local deployment of gpt-oss-20b model in AWS EC2 instance.
https://github.com/renswickd/deploy-gpt-oss-in-ec2
docker ec2-instance fastapi gpt-oss-20b gpu ollama
Last synced: about 2 months ago
JSON representation
Local deployment of gpt-oss-20b model in AWS EC2 instance.
- Host: GitHub
- URL: https://github.com/renswickd/deploy-gpt-oss-in-ec2
- Owner: renswickd
- Created: 2025-09-05T11:53:51.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-09-06T10:15:22.000Z (10 months ago)
- Last Synced: 2025-09-06T12:09:52.912Z (10 months ago)
- Topics: docker, ec2-instance, fastapi, gpt-oss-20b, gpu, ollama
- Language: Python
- Homepage:
- Size: 14.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Local LLM FastAPI App (Ollama)
A FastAPI service that calls a locally hosted Ollama model for chat-style interactions. Uses the official `ollama` Python client. Designed with simple, clean structure and reasonable best practices.
## Prerequisites
- Python 3.10+
- [Ollama](https://ollama.com) running within container (default `http://localhost:11434`)
- Considering the extensive resource usage, `llama3:8b` model available in local Ollama. (for testing the app functionalities)
## Configuration
The app reads settings from `.env` in the project root.
Required:
- Instance sizing for 20B (recommended):
- GPU: 24 GB VRAM fits 4-bit quantization comfortably (e.g., g5.2xlarge).
- CPU-only: favor 32–64 GB RAM; e.g., c7i.4xlarge (16 vCPU, 32 GB) for better headroom.
Optional:
- `OLLAMA_HOST` — default `http://ollama:11434`
- `APP_NAME` — default `Local gpt-oss-20b Chat API`
- `DEBUG` — default `false`
- `CORS_ORIGINS` — comma-separated list (defaults to `*`)
## Install & Run
```bash
python -m venv .venv
source .venv/bin/activate # Windows: .venv\\Scripts\\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
```
## Docker
Build the image:
```bash
docker build -t local-llm-api:latest .
```
### Docker Compose
Run an Ollama container too (optional):
```bash
docker compose --profile ollama up --build
```
The API is available at `http://localhost:8000`. Health check: `GET /api/health`.
### Docker/Compose host configuration
- If both run in compose, use the Ollama service name, e.g. `OLLAMA_HOST=http://ollama:11434`.
## AWS EC2 Deployment
To deploy this app with the `gpt-oss-20b` model on AWS EC2 (instance sizing, GPU options, Compose configurations, and hardening), see:
- `docs/aws-ec2-deployment.md`
Quick starts on EC2:
- Host Ollama (CPU or GPU on host):
- Compose-managed Ollama (same box):
- `export OLLAMA_HOST=http://ollama:11434 && COMPOSE_PROFILES=ollama docker compose up --build -d`
- `docker compose --profile ollama exec ollama ollama pull gpt-oss-20b`
Helper scripts:
- `bash scripts/ollama_container_setup.sh --model gpt-oss-20b [--gpus]` to run Ollama in a container and pull the model.
## Helper Script
Run a single script to build the image, pull the model, and start services:
```bash
# Host Ollama (default):
bash scripts/build_and_pull.sh --model "$OLLAMA_MODEL"
```
Notes:
- If `--model` is omitted, the script reads `OLLAMA_MODEL` from `.env`.
## Endpoints
- `GET /` — basic info
- `GET /api/health` — app + Ollama health
- `POST /api/chat` — chat completion
### POST /api/chat
Body:
```json
{
"messages": [
{"role": "system", "content": "You are helpful assistant."},
{"role": "user", "content": "What is Langchain in one line?"}
],
"stream": false,
"temperature": 0.2
}
```
- If `stream: true`, the response is newline-delimited JSON (`application/x-ndjson`) compatible with Ollama streaming chunks.
- If `stream: false` or omitted, returns a single JSON object from Ollama.
## Notes
- Startup performs a best-effort health check but does not fail the app if Ollama is down.