https://github.com/sendhello/llm-rag-fundation

Last synced: 17 days ago
JSON representation

Host: GitHub
URL: https://github.com/sendhello/llm-rag-fundation
Owner: sendhello
Created: 2026-05-20T14:02:50.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-28T12:28:20.000Z (29 days ago)
Last Synced: 2026-05-28T14:14:38.540Z (29 days ago)
Language: Python
Size: 156 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# LLM RAG Foundation

A hands-on FastAPI playground for learning the **Anthropic Claude API**: structured output via `tool_use`, token streaming over Server-Sent Events, and prompt caching. This repo is the *foundation* layer — a thin, readable codebase — on top of which a real Retrieval-Augmented Generation service will be built.

> Status: educational / learning log. Not production-ready.

---

## What you'll learn from this code

- How to **force a JSON-shaped response** from Claude by passing a Pydantic JSON Schema as a `tool` and pinning `tool_choice` to it.
- How to **stream tokens** from `messages.stream` and adapt them to the SSE `data: ... \n\n` framing that browsers and `EventSource` consumers expect.
- How **ephemeral prompt caching** works in practice — including reading `cache_creation_input_tokens` vs `cache_read_input_tokens` from the usage payload.
- How to **wire an async Anthropic client into FastAPI** through `Depends`, so each request gets a fresh repository without leaking state.

---

## Status & Roadmap

| | |
|---|---|
| ✅ Done | 3 endpoints, structured extraction (haiku), SSE chat streaming (sonnet), prompt-cached code review (sonnet) |
| 🚧 Next | Embedding pipeline, vector store, retrieval-augmented `/ask` endpoint, evals |

The `rag` in the repo name is a deliberate forward-reference — see the [Roadmap](#roadmap) section at the bottom.

---

## Endpoints

| Method | Path | What it demonstrates |
|---|---|---|
| `POST` | `/extract` | Structured extraction — turns a free-form job description into a typed `JobInfo` object. |
| `POST` | `/chat/stream` | Streaming — yields model tokens as Server-Sent Events. |
| `POST` | `/analyze` | Code review with **prompt caching** enabled (`cache_control: ephemeral`). |

Interactive docs are live at [`/docs`](http://localhost:8000/docs) (Swagger) and [`/redoc`](http://localhost:8000/redoc) once the server is running.

---

## Models in use

| Endpoint | Claude model | Why |
|---|---|---|
| `/extract` | `claude-haiku-4-5` | Cheapest, fastest model — extraction is short, deterministic, schema-bound. |
| `/chat/stream` | `claude-sonnet-4-6` | Balanced quality for open-ended chat. |
| `/analyze` | `claude-sonnet-4-6` | Stronger reasoning for code review; prompt caching amortises re-reads of large code blobs. |

The `ClaudeModel` enum in [ai.py](ai.py) also lists `claude-opus-4-7` and `claude-mythos-preview` — they are wired but currently unused, ready to be swapped in for experiments.

---

## Project structure

```
llm-rag-foundation/
├── main.py # FastAPI app + 3 endpoints, DI via Depends(get_clause_repo)
├── ai.py # ClaudeRepo — async Anthropic client wrapper, one method per endpoint
├── schema.py # Pydantic models: JobInfo (tool input_schema), Chat, ReviewResult
├── settings.py # pydantic-settings — reads API_KEY env var
├── pyproject.toml # Poetry config, Python ^3.14
├── .env.example # Template — copy to .env and fill in API_KEY
└── README.md
```

---

## Setup

Requires **Python 3.14** and **Poetry**.

```bash
# 1. Install dependencies
poetry install

# 2. Configure your Anthropic key
cp .env.example .env
echo "API_KEY=sk-ant-..." > .env # or edit by hand

# 3. Run the dev server
poetry run uvicorn main:app --reload
```

The server starts on `http://localhost:8000`. Open `/docs` for the Swagger UI.

> The env var is called `API_KEY` (not `ANTHROPIC_API_KEY`) — see [settings.py](settings.py).

---

## API examples (curl)

### `POST /extract` — structured extraction

```bash
curl -X POST http://localhost:8000/extract \
-H "Content-Type: application/json" \
-d '{
"text": "Senior Python Engineer at Acme Co. (Melbourne, hybrid). Build async services with FastAPI and Postgres. Visa sponsorship available. AUD 150-180k."
}'
```

Trimmed response:

```json
{
"job_title": "Senior Python Engineer",
"job_type": "permanent_full_time",
"company_name": "Acme Co.",
"city": "Melbourne",
"job_flexibility": "hybrid",
"key_skills": ["Python", "FastAPI", "Postgres", "async"],
"salary": "AUD 150-180k",
"is_available_sponsorship": true
}
```

### `POST /chat/stream` — SSE streaming

```bash
curl -N -X POST http://localhost:8000/chat/stream \
-H "Content-Type: application/json" \
-d '{"chat_id": "demo-1", "message": "In one sentence, what is RAG?"}'
```

The `-N` disables curl's output buffering so you see tokens arrive live:

```
data: Retrieval-Augmented

data: Generation combines a

data: retriever with an LLM ...

data: [DONE]
```

### `POST /analyze` — code review with prompt caching

Note: `/analyze` takes `code` as a **query parameter** (it's declared as a bare `str` in the route, not a Pydantic model), so URL-encode the snippet:

```bash
curl -X POST "http://localhost:8000/analyze?code=def%20add(a,b):%0A%20%20%20%20return%20a+b"
```

Trimmed response:

```json
{
"reviews": [
{
"line": 1,
"code_of_line": "def add(a,b):",
"review": "Missing type hints; PEP 8 recommends a space after the comma."
}
]
}
```

Server logs reveal the caching effect — on the second identical call, watch `cache_read_input_tokens` jump and `cache_creation_input_tokens` drop to zero.

---

## Notes on Claude API usage (the non-obvious bits)

**Why `tools` + `tool_choice` instead of asking for JSON in the prompt.**
Claude's `tools` mechanism takes a JSON Schema (here, generated by `JobInfo.model_json_schema()`) and **forces** the model to emit a `tool_use` block whose `input` validates against that schema. By pinning `tool_choice={"type": "tool", "name": "extract_job_info"}` we guarantee the model invokes our tool rather than returning prose — no fragile post-hoc JSON parsing required.

**Why `temperature=0.0` on extraction.**
Extraction has a single correct answer per field. Zero temperature gives reproducible output and makes regressions easier to spot during evals.

**SSE framing.**
The `text_stream` generator from `client.messages.stream(...)` yields plain text chunks. To make them consumable by browser `EventSource` / `htmx` SSE / OpenAI-style clients, each chunk is wrapped in `data: \n\n` and the stream is terminated with `data: [DONE]\n\n`. The blank line after each `data:` is part of the SSE protocol — not a typo.

**Prompt caching.**
Passing `cache_control={"type": "ephemeral"}` to `messages.create` asks Anthropic to cache the prefix of the prompt for ~5 minutes. On a cache hit you pay a discounted price on those tokens and skip the per-token compute. The two new fields in `response.usage` tell you what happened:

- `cache_creation_input_tokens` — tokens **written** into the cache this turn (first call after a change).
- `cache_read_input_tokens` — tokens **served from** the cache (subsequent calls).

For a code-review service that re-reads the same long snippet across follow-ups, this is a substantial latency and cost win.

---

## Roadmap

The next phase turns this foundation into a real RAG service:

- **Document ingestion** — chunking strategies (fixed-size, semantic, recursive), token accounting.
- **Embedding store** — likely PGVector or Qdrant; benchmarking recall vs cost.
- **Retrieval layer** — hybrid search (dense + BM25), reranking with a cross-encoder.
- **`POST /ask` endpoint** — retrieve top-k chunks, stuff them into a Claude prompt as cached context (great fit for prompt caching), stream the answer with inline citations.
- **Evals** — a small golden set + an LLM-as-judge to track regressions as the retrieval pipeline changes.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sendhello/llm-rag-fundation

Awesome Lists containing this project

README