https://github.com/piplus2/longreads-rag
Retrieval-Augmented Generation system for querying 3000+ long-read sequencing papers from PubMed/PMC
chromadb fastapi longreads longreadsequencing mlflow rag rag-chatbot
- Host: GitHub
- URL: https://github.com/piplus2/longreads-rag
- Owner: piplus2
- License: mit
- Created: 2026-05-20T06:08:25.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-28T14:11:07.000Z (28 days ago)
- Last Synced: 2026-05-28T14:17:15.550Z (28 days ago)
- Topics: chromadb, fastapi, longreads, longreadsequencing, mlflow, rag, rag-chatbot
- Language: Python
- Homepage:
- Size: 402 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Long-Read Sequencing Literature RAG
A Retrieval-Augmented Generation system for querying the long-read sequencing scientific literature.
Fetches papers from PubMed, PMC, and Europe PMC, builds a ChromaDB vector index, and answers natural
language questions grounded in the literature with source citations.

## Architecture
```
PubMed / PMC / Europe PMC
        │
        ▼
src/fetch.py     ← fetch abstracts + full text via Entrez & Europe PMC APIs
        │
        ▼  data/raw/papers.json
src/index.py     ← chunk → embed (BAAI/bge-small-en-v1.5) → ChromaDB collection
        │        ← MLflow tracks embedding model, chunk size, corpus stats
        ▼  data/chromadb/
src/rag.py       ← retrieve top-k chunks → build prompt → LLM answer
        │
        ▼
app/main.py      ← FastAPI: POST /ask  GET /health  GET /stats
        │        ← MLflow tracks every query + latency
        ▼
frontend/        ← React/Vite UI (dark theme, demo mode, animated sources)
```
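The "chunk" step in the diagram can be pictured with a minimal sketch like the one below. It assumes word-based windows; the real `src/index.py` may split by tokens or characters instead. The function name `chunk_text` and the default `chunk_size` and `overlap` values are hypothetical; only the parameter names come from the MLflow table in this README.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into word windows; consecutive windows share `overlap` words.

    Hypothetical helper: word-based splitting is an assumption, not
    necessarily what src/index.py does.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of a somewhat larger index.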
## Quickstart
```bash
# 1. Install
pip install -r requirements.txt
# 2. Set your email for NCBI (required by their API policy)
export ENTREZ_EMAIL="your@email.com"
# 3. Set your LLM API key (skip if using Ollama)
export ANTHROPIC_API_KEY="sk-..." # or OPENAI_API_KEY
# 4. Fetch papers (~3000 abstracts, +full text where available)
python -m src.fetch --fetch_full
# Optionally include Europe PMC papers (adds ~1000 more)
python -m src.fetch --fetch_full --include_europe_pmc
# 5. Build the ChromaDB index (tracked in MLflow); --device takes "cpu" or "cuda"
python -m src.index --device cuda
# Re-index from scratch (drops and recreates the collection)
python -m src.index --reset --device cuda
# 6. Try a query from the CLI
python -m src.rag --query "What are the main error modes of Oxford Nanopore sequencing?"
# 7. Start the API
uvicorn app.main:app --reload
# 8. Start the frontend (separate terminal)
npm create vite@latest frontend -- --template react
# Replace the generated frontend/src/App.jsx with the App.jsx from this repository,
# then install dependencies and start the dev server
cd frontend && npm install && npm run dev
# Open http://localhost:5173
# 9. View MLflow experiments
mlflow ui
# Open http://localhost:5000
```
## LLM backends
The RAG pipeline supports three backends, selected via `--llm`:
| Backend | Flag | Requirement |
| ---------------- | ----------------- | --------------------------------------------------------------- |
| Ollama (default) | `--llm ollama` | `ollama pull llama3.1:8b` |
| Anthropic Claude | `--llm anthropic` | `ANTHROPIC_API_KEY` + uncomment `anthropic` in requirements.txt |
| OpenAI | `--llm openai` | `OPENAI_API_KEY` + uncomment `openai` in requirements.txt |
```bash
# Use Ollama (local, free)
ollama pull llama3.1:8b
python -m src.rag --query "..." --llm ollama
# Use Claude
python -m src.rag --query "..." --llm anthropic
# Use OpenAI
python -m src.rag --query "..." --llm openai
```
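Before calling a hosted backend it can help to check that its API key is set. The sketch below encodes the requirement column of the table above; the names `REQUIRED_ENV` and `missing_requirement` are hypothetical and not part of `src/rag.py`.

```python
import os
from typing import Mapping, Optional

# Environment variable each backend needs, per the table above.
# Ollama runs locally and needs no API key.
REQUIRED_ENV = {
    "ollama": None,
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_requirement(llm: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset env var for backend `llm`, or None if ready."""
    env = os.environ if env is None else env
    if llm not in REQUIRED_ENV:
        raise ValueError(f"unknown backend {llm!r}; expected one of {sorted(REQUIRED_ENV)}")
    var = REQUIRED_ENV[llm]
    if var is not None and not env.get(var):
        return var
    return None
```

For example, `missing_requirement("anthropic", {})` returns `"ANTHROPIC_API_KEY"`, while `missing_requirement("ollama", {})` returns `None`.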
## Example API usage
```bash
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"query": "How does PacBio HiFi compare to Nanopore for structural variant detection?", "top_k": 5}'
```
Response:
```json
{
  "query": "How does PacBio HiFi compare to Nanopore...",
  "answer": "Based on the literature, PacBio HiFi shows higher base accuracy (~99.9%) [1][2] while Nanopore offers...",
  "latency_ms": 1240.3,
  "sources": [
    {
      "pmid": "38291847",
      "title": "Benchmarking long-read sequencing for structural variant detection",
      "year": "2024",
      "authors": "Li H, Feng X, Chu C",
      "score": 0.8821,
      "has_full": true
    }
  ]
}
```
```
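The same call can be made from Python with only the standard library. This is a minimal sketch based on the request and response shapes shown above; the helper names are hypothetical, and it assumes that citation markers such as `[1]` in the answer follow the order of the `sources` list.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/ask"  # same endpoint as the curl example


def build_ask_request(query: str, top_k: int = 5, url: str = API_URL) -> urllib.request.Request:
    """Build the POST /ask request with a JSON body."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def format_sources(response: dict) -> list:
    """Render each source as a numbered line (numbering by list order is an assumption)."""
    lines = []
    for i, src in enumerate(response.get("sources", []), start=1):
        lines.append(
            f"[{i}] {src['title']} ({src['year']}), PMID {src['pmid']}, score {src['score']:.2f}"
        )
    return lines


def ask(query: str, top_k: int = 5) -> dict:
    """Send the request to a running API and return the decoded JSON response."""
    with urllib.request.urlopen(build_ask_request(query, top_k), timeout=60) as resp:
        return json.load(resp)
```

With the API running, `ask("...")["answer"]` gives the answer text and `format_sources(...)` gives one printable line per cited paper.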
## Frontend
A React/Vite single-page app at `frontend/` connects to the FastAPI backend.
- Dark theme, keyboard-first (Enter to submit)
- Demo mode — toggle in the header to preview with mock data (no API required)
- Animated source cards with cosine similarity score bars
- Example query buttons for quick exploration
```bash
cd frontend
npm install
npm run dev # http://localhost:5173
```
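The score bars show a cosine similarity between the query embedding and each chunk embedding. As a rough illustration, and not the code path ChromaDB itself uses, the score and a top-k ranking can be computed as follows; `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k: int = 5):
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]
```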
## Docker
```bash
docker build -t longread-rag .
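# Mount your data directory so the index persists.
# Note: with --network host the container shares the host's ports, so Docker ignores -p 8000:8000.
docker run -p 8000:8000 \
  --network host \
  -v "$(pwd)/data:/app/data" \
  -e OLLAMA_HOST=http://172.17.0.1:11434 \
  longread-rag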
```
## Experiment tracking
Every indexing run and every query is logged to MLflow:
| Run type | Logged params | Logged metrics |
| -------- | ----------------------------------- | ----------------------------------------------------------- |
| Indexing | model, chunk_size, overlap, backend | n_papers, n_full_text, n_chunks, new_chunks, total_in_index |
| Query | query text, top_k | latency_ms, n_sources |
```bash
mlflow ui # http://localhost:5000
```
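A query run matching the table above could be logged roughly as follows. This is a sketch, not the code in `app/main.py`: the function names, the experiment name, the param key `query`, and the 250-character truncation are all assumptions. MLflow caps the length of parameter values, so long queries are shortened before logging.

```python
def query_run_payload(query: str, top_k: int, latency_ms: float, n_sources: int):
    """Build the params and metrics dicts for one query run (hypothetical helper)."""
    params = {"query": query[:250], "top_k": top_k}  # truncation length is an assumption
    metrics = {"latency_ms": float(latency_ms), "n_sources": float(n_sources)}
    return params, metrics


def log_query_run(query: str, top_k: int, latency_ms: float, n_sources: int,
                  experiment: str = "longread-rag-queries") -> None:
    """Log one query as an MLflow run; the experiment name is hypothetical."""
    import mlflow  # imported lazily so query_run_payload works without MLflow installed

    params, metrics = query_run_payload(query, top_k, latency_ms, n_sources)
    mlflow.set_experiment(experiment)
    with mlflow.start_run(run_name="query"):
        mlflow.log_params(params)
        mlflow.log_metrics(metrics)
```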
## Corpus sources
| Source | Coverage | Flag |
| --------------- | --------------------------- | ---------------------- |
| PubMed (Entrez) | ~3,000 abstracts | default |
| PubMed Central | full text where open-access | `--fetch_full` |
| Europe PMC | ~1,000 additional papers | `--include_europe_pmc` |
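
For orientation, the PubMed search behind the default source can be expressed as a plain NCBI E-utilities ESearch URL. This is a sketch only: `src/fetch.py` is described as using Entrez, so it may go through a client library instead, and the helper name `esearch_url`, the default `retmax`, and any search term passed in are assumptions.

```python
import os
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, retmax: int = 3000, email: str = "") -> str:
    """Build an ESearch URL for PubMed; falls back to ENTREZ_EMAIL for the contact email."""
    email = email or os.environ.get("ENTREZ_EMAIL", "")
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    if email:
        params["email"] = email  # NCBI asks API users to identify themselves
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

Fetching the URL returns a JSON document listing matching PMIDs, which a fetch step can then resolve to abstracts.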
## Project structure
```
longread_rag/
├── src/
│   ├── fetch.py          # PubMed/PMC/Europe PMC data collection
│   ├── index.py          # Chunking, embedding, ChromaDB
│   └── rag.py            # Retrieval + generation pipeline
├── app/
│   └── main.py           # FastAPI endpoints
├── frontend/
│   └── src/App.jsx       # React/Vite UI
├── data/
│   ├── raw/              # papers.json (gitignored)
│   └── chromadb/         # ChromaDB collection (gitignored)
├── Dockerfile
├── requirements.txt
└── README.md
```
## Author
Paolo Inglese