https://github.com/shu-vro/learn-agent
A RAG system to talk about "Attention Is All You Need"
agent docling langchain llm
- Host: GitHub
- URL: https://github.com/shu-vro/learn-agent
- Owner: shu-vro
- Created: 2026-04-13T22:34:42.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-27T19:05:25.000Z (19 days ago)
- Last Synced: 2026-05-27T20:21:40.344Z (19 days ago)
- Topics: agent, docling, langchain, llm
- Language: Python
- Homepage:
- Size: 4.91 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Multimodal RAG over Research Papers
This project builds a local multimodal RAG pipeline over one or more papers using Docling ingestion, Qdrant retrieval, and Ollama generation.
Default paper sources:
- https://arxiv.org/pdf/1706.03762
- https://arxiv.org/pdf/2603.15031
It uses:
- Docling for PDF parsing, markdown extraction, and artifact/image generation
- Qdrant (hybrid dense + sparse retrieval via `langchain-qdrant`)
- `Octen/Octen-Embedding-0.6B` for dense embeddings
- Ollama `gemma4:e2b` for response generation
- Optional formula OCR (`pix2tex` or Ollama vision, controlled by `--equation-ocr-lib`)
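The component list above mentions hybrid dense + sparse retrieval. One common way to merge two ranked result lists is reciprocal rank fusion (RRF). The README does not say which fusion method this project's Qdrant setup uses, so the sketch below is only an illustration of the general idea; the function name, the constant `k=60`, and the chunk IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional choice, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a dense and a sparse retriever.
dense_hits = ["chunk-3", "chunk-1", "chunk-7"]
sparse_hits = ["chunk-1", "chunk-9", "chunk-3"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that rank well in both lists (here `chunk-1` and `chunk-3`) end up ahead of documents that appear in only one.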
## Project Structure
- `src/module/upload_docs.py`: ingestion workflow from source PDF(s) into Qdrant
- `src/module/rag_agent.py`: retrieval + strict context-grounded QA/chat agent
- `src/vector_store/qdrant_store.py`: Qdrant client, collection helpers, hybrid vector store
- `src/lib/docling_lib.py`: Docling conversion, chunking, and artifact extraction
- `main.py`: CLI entrypoint (`ingest`, `ask`, `chat`)
## Prerequisites
1. Install dependencies:
```bash
uv sync
```
2. Make sure Qdrant is running (default: `localhost:6333`).
Example local run:
```bash
docker run --rm -p 6333:6333 qdrant/qdrant
```
3. Make sure Ollama is running.
4. Pull required Ollama model(s):
```bash
ollama pull gemma4:e2b
```
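Before ingesting, it can help to confirm that both services from the steps above are reachable. The following is a minimal sketch, not part of the repository: it only checks that a TCP port accepts connections. The Qdrant address comes from the default stated above; the Ollama port `11434` is an assumption based on Ollama's usual default, and the helper names are hypothetical.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


SERVICES = {
    "Qdrant": ("localhost", 6333),   # default from this README
    "Ollama": ("localhost", 11434),  # assumption: Ollama's usual default port
}


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


for name, ok in check_services().items():
    print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

An open port does not prove the service is healthy or that the model has been pulled; it only rules out the most common "nothing is listening" failure.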
## Build the Index
```bash
uv run main.py ingest --rebuild
```
This ingests documents into the Qdrant collection (default: `store`) and writes extracted artifacts under `data/artifacts`.
## Ask a Single Question
```bash
uv run main.py ask "What is the core idea of scaled dot-product attention?"
```
Optional: force re-ingestion before asking.
```bash
uv run main.py ask "What is the core idea of scaled dot-product attention?" --rebuild
```
## Start Interactive Chat
```bash
uv run main.py chat
```
Optional: force re-ingestion before chat.
```bash
uv run main.py chat --rebuild
```
## Current Agent Behavior
- Answers are constrained to the retrieved context; if the information is missing, the agent explicitly says it could not find it in the indexed context.
- Responses are streamed token-by-token in the terminal.
- Source lines are printed after each answer (`type`, `source`, `page`, `image`).
- `chat` mode keeps in-memory conversation state and enables `SummarizationMiddleware` (trigger: 500 tokens, keep last 2 messages).
- Usage metadata is printed in `ask` mode per call and aggregated at the end of `chat` mode.
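The summarization settings listed above (trigger at 500 tokens, keep the last 2 messages) can be pictured with a small stand-alone sketch. This is not LangChain's `SummarizationMiddleware` and not the project's code: the whitespace token count, the `(role, text)` message format, and the `summarize` callback are all assumptions made for illustration.

```python
def count_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def compact_history(messages, summarize, trigger_tokens=500, keep_last=2):
    """Summarize older messages once the history exceeds trigger_tokens.

    messages: list of (role, text) tuples, oldest first.
    summarize: callable that turns a list of messages into one summary string.
    """
    total = sum(count_tokens(text) for _, text in messages)
    if total <= trigger_tokens or len(messages) <= keep_last:
        return messages  # below the threshold, or nothing old enough to fold
    split = len(messages) - keep_last
    older, recent = messages[:split], messages[split:]
    summary = summarize(older)
    return [("system", f"Summary of earlier conversation: {summary}")] + recent


def fake_summarize(msgs):
    """Stand-in for an LLM call; a real summarizer would condense the text."""
    return f"{len(msgs)} earlier messages"


history = [
    ("user", "word " * 300),
    ("assistant", "word " * 300),
    ("user", "What is multi-head attention?"),
    ("assistant", "It runs several attention heads in parallel."),
]
compacted = compact_history(history, fake_summarize)
print(len(history), "->", len(compacted))
```

The two most recent messages survive verbatim, while everything older is folded into a single summary message, which keeps the prompt short in long chat sessions.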
## Useful Options
- Ingest multiple sources:
```bash
uv run main.py ingest --source https://arxiv.org/pdf/1706.03762 --source https://arxiv.org/pdf/2603.15031
```
- Set retrieval depth:
```bash
uv run main.py ask "Summarize encoder-decoder attention" --top-k 8
```
- Disable all vision enrichment:
```bash
uv run main.py ingest --rebuild --no-vision
```
- Disable only image descriptions:
```bash
uv run main.py ingest --rebuild --no-image-description
```
- Disable only formula transcription:
```bash
uv run main.py ingest --rebuild --no-formula-transcription
```
- Select formula OCR backend:
```bash
uv run main.py ingest --rebuild --equation-ocr-lib llm
```
Note: during `ask` and `chat`, retrieval in `rag_agent` currently reads from the default collection (`store`).
## Help
```bash
uv run main.py --help
```
```
usage: main.py [-h] [--source SOURCES] [--collection-name COLLECTION_NAME] [--artifacts-dir ARTIFACTS_DIR]
               [--embedding-model EMBEDDING_MODEL] [--llm-model LLM_MODEL] [--vision-model VISION_MODEL] [--top-k TOP_K]
               [--no-vision] [--no-image-description] [--no-formula-transcription] [--equation-ocr-lib {local,llm}]
               {ingest,ask,chat} ...

Multimodal RAG over Attention Is All You Need using Docling + Qdrant + Ollama.

positional arguments:
  {ingest,ask,chat}
    ingest              Ingest the paper and write vectors to Qdrant.
    ask                 Ask one question to the RAG agent.
    chat                Run an interactive RAG chat session.

options:
  -h, --help            show this help message and exit
  --source SOURCES      Paper source URL or local file path. Repeat this flag to ingest multiple papers.
  --collection-name COLLECTION_NAME, --index-dir COLLECTION_NAME
                        Qdrant collection name for indexed paper documents.
  --artifacts-dir ARTIFACTS_DIR
                        Directory for extracted markdown and images.
  --embedding-model EMBEDDING_MODEL
                        SentenceTransformer embedding model name.
  --llm-model LLM_MODEL
                        Ollama text generation model for QA.
  --vision-model VISION_MODEL
                        Ollama vision model used for image descriptions.
  --top-k TOP_K         Number of retrieved chunks for each question.
  --no-vision           Disable all vision features (image descriptions and formula transcription).
  --no-image-description
                        Disable image descriptions while keeping other vision features enabled.
  --no-formula-transcription
                        Disable formula LaTeX transcription from formula images.
  --equation-ocr-lib {local,llm}
                        Formula OCR backend for LaTeX transcription (local=pix2tex, llm=Ollama vision).
```
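To make the interplay of the three vision flags concrete, here is a reduced `argparse` sketch. It is not the project's `main.py`: it covers only a subset of the options above, sets no defaults because the README does not state them, and attaches the shared options to each subcommand (a simplification so that forms like `ask "..." --top-k 8` parse). `build_parser` and `resolve_vision` are hypothetical names.

```python
import argparse


def build_parser():
    # Shared options (a subset of the help text above), reused by every subcommand.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("--source", dest="sources", action="append")
    common.add_argument("--top-k", type=int)
    common.add_argument("--rebuild", action="store_true")
    common.add_argument("--no-vision", action="store_true")
    common.add_argument("--no-image-description", action="store_true")
    common.add_argument("--no-formula-transcription", action="store_true")
    common.add_argument("--equation-ocr-lib", choices=["local", "llm"])

    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("ingest", parents=[common])
    ask = sub.add_parser("ask", parents=[common])
    ask.add_argument("question")
    sub.add_parser("chat", parents=[common])
    return parser


def resolve_vision(args):
    """Map the three --no-* flags onto two feature switches."""
    return {
        "image_description": not (args.no_vision or args.no_image_description),
        "formula_transcription": not (args.no_vision or args.no_formula_transcription),
    }


args = build_parser().parse_args(
    ["ingest", "--rebuild", "--no-image-description", "--equation-ocr-lib", "llm"]
)
print(args.command, resolve_vision(args))
```

In this sketch `--no-vision` acts as a master switch, while the two narrower flags each turn off one feature, matching the descriptions in the help text.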
> [!CAUTION]
> This project is a work in progress and may contain incomplete features, bugs, or suboptimal implementations. It is intended for educational and experimental purposes only. Use at your own risk.