https://github.com/smoothemerson/ragscope

Q&A over documents using RAG (FastAPI + ChromaDB + Ollama + MLflow)

chromadb docker fastapi langchain llm llm-evaluation mlflow ollama rag retrieval-augmented-generation self-hosted vector-database


# RAG API with MLflow Evaluation Dashboard

A portfolio-grade Q&A API that lets you upload PDF/text documents and ask questions about them using Retrieval-Augmented Generation (RAG). Every query is logged as an MLflow run with operational metrics and LLM-as-judge quality scores.

**Fully offline — no external API keys required.**

---

## Architecture

```
┌─────────────────────────────────────────────────────┐
│                   Docker Compose                    │
│                                                     │
│  ┌──────────────────────────┐  ┌──────────────┐     │
│  │ FastAPI :8000            │  │    MLflow    │     │
│  │ └─ Chroma (embedded)     │  │    :5000     │     │
│  └────┬─────────────────────┘  └──────────────┘     │
│       │                                             │
│       ▼                                             │
│  ┌──────────┐                                       │
│  │  Ollama  │ (qwen3.5:9b · llama3.2 · nomic-embed) │
│  │  :11434  │                                       │
│  └──────────┘                                       │
└─────────────────────────────────────────────────────┘
```

Chroma runs **embedded** inside the API container (no separate ChromaDB service). Vector data is persisted to a named Docker volume (`chroma_data`) via `CHROMA_PERSIST_DIR`.

**RAG Pipeline:**
1. User uploads a document → `POST /ingest`
2. Text is extracted, chunked (4,000 characters per chunk with a 20-character overlap), and embedded with `nomic-embed-text`
3. Embeddings are stored in the embedded Chroma vector store (persisted to volume)
4. User asks a question → `POST /query`
5. The question is embedded and the top-k chunks are retrieved from Chroma by cosine similarity
6. The retrieved chunks and the question are passed to `qwen3.5:9b` (configurable via `OLLAMA_MODEL`) through a LangChain `RunnableSequence`
7. Answer is returned; metrics and quality scores are logged to MLflow under experiment `ragscope`
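
Steps 2 and 5 of the pipeline can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the README says chunking is done with LangChain's `RecursiveCharacterTextSplitter` and retrieval by Chroma, whereas the hypothetical helpers `chunk_text` and `top_k` below use a plain sliding window and a brute-force cosine ranking to show the same ideas.

```python
import math


def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Simplified stand-in for a recursive splitter: it ignores separators
    such as paragraphs or sentences and cuts purely by character count.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts `step` characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 4) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

With the default settings, each chunk starts 3,980 characters after the previous one, so consecutive chunks share only 20 characters of context.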

---

## Prerequisites

- Docker and Docker Compose installed
- ~10 GB free disk space (for Ollama models)

The `./mlflow/data` and `./mlflow/artifacts` directories are created automatically by Docker when the bind mounts are resolved on first startup.

---

## Quickstart

**Step 1 — set your hardware profile in `.env`:**

| Hardware | `COMPOSE_PROFILES` value |
|---|---|
| CPU | `cpu` |
| NVIDIA GPU | `gpu-nvidia` |
| AMD GPU (ROCm) | `gpu-amd` |

```bash
# .env
COMPOSE_PROFILES=cpu # or gpu-nvidia or gpu-amd
```

> **Warning:** `COMPOSE_PROFILES` must be exactly one of `cpu`, `gpu-nvidia`, or `gpu-amd`.
> Any other value (including leaving it blank) will cause no Ollama service to start and the API will fail to connect.
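
Because a wrong profile only shows up later as a connection failure, it can help to check `.env` before starting the stack. The following is a small sketch, not part of the repository: `read_compose_profile` is a hypothetical helper, and its parsing is deliberately simplified (it treats everything after `#` as a comment and ignores quoting).

```python
VALID_PROFILES = {"cpu", "gpu-nvidia", "gpu-amd"}


def read_compose_profile(env_text: str) -> str:
    """Return the COMPOSE_PROFILES value from .env-style text, or raise ValueError."""
    value = ""
    for line in env_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line.startswith("COMPOSE_PROFILES="):
            value = line.split("=", 1)[1].strip()
    if value not in VALID_PROFILES:
        raise ValueError(
            f"COMPOSE_PROFILES must be one of {sorted(VALID_PROFILES)}, got {value!r}"
        )
    return value
```

A typical call would be `read_compose_profile(open(".env").read())`, run from the directory that contains the `.env` file.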

**Step 2 — start the stack:**

```bash
docker compose up
```

Wait for all three Ollama models to finish pulling (progress is logged in the `api` service output). Then open:

- FastAPI docs: http://localhost:8000/docs
- MLflow UI: http://localhost:5000

---

## Example Usage

### Ingest a document

```bash
curl -X POST http://localhost:8000/ingest \
  -F "file=@/path/to/your/document.pdf"
```

```json
{"status": "ok", "chunks_stored": 42, "filename": "document.pdf"}
```

### Query the RAG pipeline

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the main topic of the document?", "top_k": 4}'
```

```json
{
  "answer": "The document covers...",
  "sources": ["chunk text 1", "chunk text 2"]
}
```
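
The same query can be issued from Python. This is a sketch using only the standard library; the function names `build_query_request` and `ask` are illustrative, and the request and response shapes are taken from the curl example above.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # address from the Quickstart section


def build_query_request(question: str, top_k: int = 4) -> urllib.request.Request:
    """Build the POST /query request without sending it."""
    body = json.dumps({"question": question, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{API_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question: str, top_k: int = 4) -> dict:
    """Send the question to the running stack and return the decoded JSON reply."""
    with urllib.request.urlopen(build_query_request(question, top_k)) as resp:
        return json.load(resp)
```

Calling `ask("What is the main topic of the document?")` against a running stack should return a dictionary with the `answer` and `sources` keys shown above.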

### Health check

```bash
curl http://localhost:8000/health
```

```json
{"status": "ok", "chromadb": "ok", "ollama": "ok"}
```
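
Since the models take a while to pull on first startup, a script may want to wait until every component reports `ok`. The sketch below assumes the payload shape shown above; the helper names, and any status value other than `"ok"`, are hypothetical. `fetch` is any callable that returns the decoded `/health` JSON.

```python
import time


def failing_components(health: dict) -> list[str]:
    """Return the names of fields in a /health payload whose value is not "ok"."""
    return sorted(name for name, state in health.items() if state != "ok")


def wait_until_healthy(fetch, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll `fetch` until every component is "ok" or the attempts run out."""
    for _ in range(attempts):
        try:
            if not failing_components(fetch()):
                return True
        except OSError:
            pass  # the API is not reachable yet (URLError is a subclass of OSError)
        time.sleep(delay)
    return False
```

Passing `fetch` as a parameter keeps the polling logic independent of the HTTP client, so it can be exercised without a running server.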

---

## MLflow Dashboard

Every call to `POST /query` creates one MLflow run under the **ragscope** experiment.

Access the dashboard at **http://localhost:5000** and select the `ragscope` experiment.

Each run logs:
- **GenAI Quality Scores** (via MLflow GenAI scorers, evaluated by `llama3.2`):
  - `retrieval_groundedness`: is the answer grounded in the retrieved context?
  - `answer_relevancy`: does the answer address the question?
  - `hallucination`: does the answer contain information not supported by the context?
  - `safety`: is the answer free of harmful content?

Quality scores use a separate LLM judge (`llama3.2`) via MLflow GenAI's built-in scorers (`RetrievalGroundedness`, `AnswerRelevancy`, `Hallucination`, `Safety`).
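
Runs can also be read programmatically through the MLflow tracking server's REST API instead of the UI. The sketch below only builds the requests and flattens a response; the helper names are illustrative. The README does not specify under which metric keys (or as which trace assessments) the scores are stored, so treat the keys returned by `run_metrics` as something to inspect rather than rely on.

```python
import json
import urllib.parse
import urllib.request

MLFLOW_URL = "http://localhost:5000"  # MLflow UI address from the Quickstart


def experiment_lookup_url(name: str = "ragscope") -> str:
    """URL for GET experiments/get-by-name, which returns the experiment ID."""
    query = urllib.parse.urlencode({"experiment_name": name})
    return f"{MLFLOW_URL}/api/2.0/mlflow/experiments/get-by-name?{query}"


def runs_search_request(experiment_id: str, max_results: int = 20) -> urllib.request.Request:
    """Build the POST runs/search request for one experiment."""
    body = json.dumps({"experiment_ids": [experiment_id], "max_results": max_results})
    return urllib.request.Request(
        f"{MLFLOW_URL}/api/2.0/mlflow/runs/search",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_metrics(run: dict) -> dict:
    """Flatten one run's metric list from a runs/search response into {key: value}."""
    return {m["key"]: m["value"] for m in run.get("data", {}).get("metrics", [])}
```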

---

## Environment Variables

| Variable | Default | Description |
|-----------------------|-----------------------|------------------------------------------|
| `OLLAMA_MODEL` | `qwen3.5:9b` | Ollama model for answer generation |
| `OLLAMA_JUDGE_MODEL` | `llama3.2` | Ollama model for LLM-as-judge scoring |
| `OLLAMA_EMBED_MODEL` | `nomic-embed-text` | Ollama model for embeddings |
| `CHROMA_PERSIST_DIR` | `/chroma/data` | Path inside the container where Chroma persists its data (mounted to `chroma_data` volume) |
| `MLFLOW_TRACKING_URI` | `http://mlflow:5000` | MLflow tracking server URI |

Override any variable by setting it before running `docker compose up`:

```bash
OLLAMA_MODEL=llama3.1 docker compose up
```
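
Inside the application, this kind of override is typically resolved by reading each variable with a fallback. A minimal sketch, assuming the defaults from the table above (the `load_settings` helper is hypothetical and not taken from the repository):

```python
import os

# Defaults copied from the Environment Variables table.
DEFAULTS = {
    "OLLAMA_MODEL": "qwen3.5:9b",
    "OLLAMA_JUDGE_MODEL": "llama3.2",
    "OLLAMA_EMBED_MODEL": "nomic-embed-text",
    "CHROMA_PERSIST_DIR": "/chroma/data",
    "MLFLOW_TRACKING_URI": "http://mlflow:5000",
}


def load_settings(environ=None) -> dict:
    """Return every setting, preferring the environment over the default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```

With `OLLAMA_MODEL=llama3.1` set, `load_settings()["OLLAMA_MODEL"]` yields `llama3.1` while the other four keys keep their defaults.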

---

## How It Works

1. **Document Ingestion** (`POST /ingest`):
   - File uploaded as `multipart/form-data`
   - PDF → `PyPDFLoader.load_and_split()`; TXT → `TextLoader`
   - Split with `RecursiveCharacterTextSplitter` (`chunk_size=4000`, `chunk_overlap=20`)
   - Embedded with `nomic-embed-text` via Ollama
   - Stored in embedded Chroma (persisted to `chroma_data` volume)

2. **Query** (`POST /query`):
   - Question embedded with `nomic-embed-text`
   - Top-k chunks retrieved from Chroma by cosine similarity
   - LangChain `RunnableSequence` (`PromptTemplate | ChatOllama`) runs `qwen3.5:9b` (or `OLLAMA_MODEL`) with retrieved context
   - Answer extracted from `AIMessage.content` and returned with source chunks

3. **MLflow Logging**:
   - Experiment name: `ragscope`
   - `autolog()` enabled on startup via `src/tracking/setup.py`
   - MLflow GenAI `evaluate()` runs scorers (`RetrievalGroundedness`, `AnswerRelevancy`, `Hallucination`, `Safety`) using judge model (`llama3.2`)
   - All traces and scores visible in MLflow UI under the GenAI section

4. **Model Warm-up**:
   - On startup, the API pulls all three Ollama models via `POST /api/pull`
   - FastAPI does not accept requests until all models are confirmed available
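
The warm-up step can be sketched as "list what Ollama already has, then pull what is missing". The code below is an illustration under assumptions, not the repository's implementation: the helper names are hypothetical, `OLLAMA_URL` assumes port 11434 is reachable from where the script runs, and `tags_payload` stands for the decoded response of Ollama's `GET /api/tags`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama port reachable from the caller
REQUIRED_MODELS = ["qwen3.5:9b", "llama3.2", "nomic-embed-text"]  # default models


def normalize(name: str) -> str:
    """Ollama lists untagged models as `name:latest`, so compare in that form."""
    return name if ":" in name else f"{name}:latest"


def missing_models(required: list[str], tags_payload: dict) -> list[str]:
    """Return the required models that are absent from a GET /api/tags response."""
    present = {normalize(m["name"]) for m in tags_payload.get("models", [])}
    return [name for name in required if normalize(name) not in present]


def pull_request(model: str) -> urllib.request.Request:
    """Build a non-streaming POST /api/pull request for one model."""
    body = json.dumps({"model": model, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/pull",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A startup hook could send `pull_request(name)` for each entry returned by `missing_models` and begin serving only once that list is empty.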