{"id":50691440,"url":"https://github.com/sendhello/llm-rag-fundation","last_synced_at":"2026-06-09T03:06:43.757Z","repository":{"id":359237470,"uuid":"1244694938","full_name":"sendhello/llm-rag-fundation","owner":"sendhello","description":null,"archived":false,"fork":false,"pushed_at":"2026-05-28T12:28:20.000Z","size":160,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T14:14:38.540Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sendhello.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-20T14:02:50.000Z","updated_at":"2026-05-28T12:28:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/sendhello/llm-rag-fundation","commit_stats":null,"previous_names":["sendhello/llm-rag-fundation"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sendhello/llm-rag-fundation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sendhello%2Fllm-rag-fundation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sendhello%2Fllm-rag-fundation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sendhello%2Fllm-rag-fundation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sendhello%2Fllm-rag-fundation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sendhello","download_url":"https://codeload.github.com/sendhello/llm-rag-fundation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sendhello%2Fllm-rag-fundation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34089399,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-09T03:06:43.219Z","updated_at":"2026-06-09T03:06:43.752Z","avatar_url":"https://github.com/sendhello.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM RAG Foundation\n\nA hands-on FastAPI playground for learning the **Anthropic Claude API**: structured output via `tool_use`, token streaming over Server-Sent Events, and prompt caching. This repo is the *foundation* layer — a thin, readable codebase — on top of which a real Retrieval-Augmented Generation service will be built.\n\n\u003e Status: educational / learning log. Not production-ready.\n\n---\n\n## What you'll learn from this code\n\n- How to **force a JSON-shaped response** from Claude by passing a Pydantic JSON Schema as a `tool` and pinning `tool_choice` to it.\n- How to **stream tokens** from `messages.stream` and adapt them to the SSE `data: ... \\n\\n` framing that browsers and `EventSource` consumers expect.\n- How **ephemeral prompt caching** works in practice — including reading `cache_creation_input_tokens` vs `cache_read_input_tokens` from the usage payload.\n- How to **wire an async Anthropic client into FastAPI** through `Depends`, so each request gets a fresh repository without leaking state.\n\n---\n\n## Status \u0026 Roadmap\n\n| | |\n|---|---|\n| ✅ Done | 3 endpoints, structured extraction (haiku), SSE chat streaming (sonnet), prompt-cached code review (sonnet) |\n| 🚧 Next | Embedding pipeline, vector store, retrieval-augmented `/ask` endpoint, evals |\n\nThe `rag` in the repo name is a deliberate forward-reference — see the [Roadmap](#roadmap) section at the bottom.\n\n---\n\n## Endpoints\n\n| Method | Path | What it demonstrates |\n|---|---|---|\n| `POST` | `/extract` | Structured extraction — turns a free-form job description into a typed `JobInfo` object. |\n| `POST` | `/chat/stream` | Streaming — yields model tokens as Server-Sent Events. |\n| `POST` | `/analyze` | Code review with **prompt caching** enabled (`cache_control: ephemeral`). |\n\nInteractive docs are live at [`/docs`](http://localhost:8000/docs) (Swagger) and [`/redoc`](http://localhost:8000/redoc) once the server is running.\n\n---\n\n## Models in use\n\n| Endpoint | Claude model | Why |\n|---|---|---|\n| `/extract` | `claude-haiku-4-5` | Cheapest, fastest model — extraction is short, deterministic, schema-bound. |\n| `/chat/stream` | `claude-sonnet-4-6` | Balanced quality for open-ended chat. |\n| `/analyze` | `claude-sonnet-4-6` | Stronger reasoning for code review; prompt caching amortises re-reads of large code blobs. |\n\nThe `ClaudeModel` enum in [ai.py](ai.py) also lists `claude-opus-4-7` and `claude-mythos-preview` — they are wired but currently unused, ready to be swapped in for experiments.\n\n---\n\n## Project structure\n\n```\nllm-rag-foundation/\n├── main.py          # FastAPI app + 3 endpoints, DI via Depends(get_clause_repo)\n├── ai.py            # ClaudeRepo — async Anthropic client wrapper, one method per endpoint\n├── schema.py        # Pydantic models: JobInfo (tool input_schema), Chat, ReviewResult\n├── settings.py      # pydantic-settings — reads API_KEY env var\n├── pyproject.toml   # Poetry config, Python ^3.14\n├── .env.example     # Template — copy to .env and fill in API_KEY\n└── README.md\n```\n\n---\n\n## Setup\n\nRequires **Python 3.14** and **Poetry**.\n\n```bash\n# 1. Install dependencies\npoetry install\n\n# 2. Configure your Anthropic key\ncp .env.example .env\necho \"API_KEY=sk-ant-...\" \u003e .env   # or edit by hand\n\n# 3. Run the dev server\npoetry run uvicorn main:app --reload\n```\n\nThe server starts on `http://localhost:8000`. Open `/docs` for the Swagger UI.\n\n\u003e The env var is called `API_KEY` (not `ANTHROPIC_API_KEY`) — see [settings.py](settings.py).\n\n---\n\n## API examples (curl)\n\n### `POST /extract` — structured extraction\n\n```bash\ncurl -X POST http://localhost:8000/extract \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"text\": \"Senior Python Engineer at Acme Co. (Melbourne, hybrid). Build async services with FastAPI and Postgres. Visa sponsorship available. AUD 150-180k.\"\n  }'\n```\n\nTrimmed response:\n\n```json\n{\n  \"job_title\": \"Senior Python Engineer\",\n  \"job_type\": \"permanent_full_time\",\n  \"company_name\": \"Acme Co.\",\n  \"city\": \"Melbourne\",\n  \"job_flexibility\": \"hybrid\",\n  \"key_skills\": [\"Python\", \"FastAPI\", \"Postgres\", \"async\"],\n  \"salary\": \"AUD 150-180k\",\n  \"is_available_sponsorship\": true\n}\n```\n\n### `POST /chat/stream` — SSE streaming\n\n```bash\ncurl -N -X POST http://localhost:8000/chat/stream \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"chat_id\": \"demo-1\", \"message\": \"In one sentence, what is RAG?\"}'\n```\n\nThe `-N` disables curl's output buffering so you see tokens arrive live:\n\n```\ndata: Retrieval-Augmented\n\ndata:  Generation combines a\n\ndata:  retriever with an LLM ...\n\ndata: [DONE]\n```\n\n### `POST /analyze` — code review with prompt caching\n\nNote: `/analyze` takes `code` as a **query parameter** (it's declared as a bare `str` in the route, not a Pydantic model), so URL-encode the snippet:\n\n```bash\ncurl -X POST \"http://localhost:8000/analyze?code=def%20add(a,b):%0A%20%20%20%20return%20a+b\"\n```\n\nTrimmed response:\n\n```json\n{\n  \"reviews\": [\n    {\n      \"line\": 1,\n      \"code_of_line\": \"def add(a,b):\",\n      \"review\": \"Missing type hints; PEP 8 recommends a space after the comma.\"\n    }\n  ]\n}\n```\n\nServer logs reveal the caching effect — on the second identical call, watch `cache_read_input_tokens` jump and `cache_creation_input_tokens` drop to zero.\n\n---\n\n## Notes on Claude API usage (the non-obvious bits)\n\n**Why `tools` + `tool_choice` instead of asking for JSON in the prompt.**\nClaude's `tools` mechanism takes a JSON Schema (here, generated by `JobInfo.model_json_schema()`) and **forces** the model to emit a `tool_use` block whose `input` validates against that schema. By pinning `tool_choice={\"type\": \"tool\", \"name\": \"extract_job_info\"}` we guarantee the model invokes our tool rather than returning prose — no fragile post-hoc JSON parsing required.\n\n**Why `temperature=0.0` on extraction.**\nExtraction has a single correct answer per field. Zero temperature gives reproducible output and makes regressions easier to spot during evals.\n\n**SSE framing.**\nThe `text_stream` generator from `client.messages.stream(...)` yields plain text chunks. To make them consumable by browser `EventSource` / `htmx` SSE / OpenAI-style clients, each chunk is wrapped in `data: \u003ctext\u003e\\n\\n` and the stream is terminated with `data: [DONE]\\n\\n`. The blank line after each `data:` is part of the SSE protocol — not a typo.\n\n**Prompt caching.**\nPassing `cache_control={\"type\": \"ephemeral\"}` to `messages.create` asks Anthropic to cache the prefix of the prompt for ~5 minutes. On a cache hit you pay a discounted price on those tokens and skip the per-token compute. The two new fields in `response.usage` tell you what happened:\n\n- `cache_creation_input_tokens` — tokens **written** into the cache this turn (first call after a change).\n- `cache_read_input_tokens` — tokens **served from** the cache (subsequent calls).\n\nFor a code-review service that re-reads the same long snippet across follow-ups, this is a substantial latency and cost win.\n\n---\n\n## Roadmap\n\nThe next phase turns this foundation into a real RAG service:\n\n- **Document ingestion** — chunking strategies (fixed-size, semantic, recursive), token accounting.\n- **Embedding store** — likely PGVector or Qdrant; benchmarking recall vs cost.\n- **Retrieval layer** — hybrid search (dense + BM25), reranking with a cross-encoder.\n- **`POST /ask` endpoint** — retrieve top-k chunks, stuff them into a Claude prompt as cached context (great fit for prompt caching), stream the answer with inline citations.\n- **Evals** — a small golden set + an LLM-as-judge to track regressions as the retrieval pipeline changes.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsendhello%2Fllm-rag-fundation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsendhello%2Fllm-rag-fundation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsendhello%2Fllm-rag-fundation/lists"}