{"id":50579113,"url":"https://github.com/piplus2/longreads-rag","last_synced_at":"2026-06-05T00:30:47.771Z","repository":{"id":359051469,"uuid":"1244286570","full_name":"piplus2/longreads-rag","owner":"piplus2","description":"Retrieval-Augmented Generation system for querying 3000+ long-read sequencing papers from PubMed/PMC","archived":false,"fork":false,"pushed_at":"2026-05-28T14:11:07.000Z","size":412,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T14:17:15.550Z","etag":null,"topics":["chromadb","fastapi","longreads","longreadsequencing","mlflow","rag","rag-chatbot"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/piplus2.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-20T06:08:25.000Z","updated_at":"2026-05-28T14:12:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/piplus2/longreads-rag","commit_stats":null,"previous_names":["piplus2/longreads-rag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/piplus2/longreads-rag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piplus2%2Flongreads-rag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piplus2%2Flongreads-rag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piplus2%2Flongreads-rag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piplus2%2Flongreads-rag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/piplus2","download_url":"https://codeload.github.com/piplus2/longreads-rag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/piplus2%2Flongreads-rag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33926275,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-04T02:00:06.755Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chromadb","fastapi","longreads","longreadsequencing","mlflow","rag","rag-chatbot"],"created_at":"2026-06-05T00:30:47.472Z","updated_at":"2026-06-05T00:30:47.762Z","avatar_url":"https://github.com/piplus2.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Long-Read Sequencing Literature RAG\n\nA Retrieval-Augmented Generation system for querying the long-read sequencing scientific literature.\nFetches papers from PubMed, PMC, and Europe PMC, builds a ChromaDB vector index, and answers natural\nlanguage questions grounded in the literature with source citations.\n\n![screenshot](assets/screenshot.png)\n\n## Architecture\n\n```\nPubMed / PMC / Europe PMC\n         │\n         ▼\nsrc/fetch.py        ← fetch abstracts + full text via Entrez \u0026 Europe PMC APIs\n         │\n         ▼  data/raw/papers.json\nsrc/index.py        ← chunk → embed (BAAI/bge-small-en-v1.5) → ChromaDB collection\n         │           ← MLflow tracks embedding model, chunk size, corpus stats\n         ▼  data/chromadb/\nsrc/rag.py          ← retrieve top-k chunks → build prompt → LLM answer\n         │\n         ▼\napp/main.py         ← FastAPI: POST /ask  GET /health  GET /stats\n         │           ← MLflow tracks every query + latency\n         ▼\nfrontend/           ← React/Vite UI (dark theme, demo mode, animated sources)\n```\n\n## Quickstart\n\n```bash\n# 1. Install\npip install -r requirements.txt\n\n# 2. Set your email for NCBI (required by their API policy)\nexport ENTREZ_EMAIL=\"your@email.com\"\n\n# 3. Set your LLM API key (skip if using Ollama)\nexport ANTHROPIC_API_KEY=\"sk-...\"   # or OPENAI_API_KEY\n\n# 4. Fetch papers (~3000 abstracts, +full text where available)\npython -m src.fetch --fetch_full\n\n# Optionally include Europe PMC papers (adds ~1000 more)\npython -m src.fetch --fetch_full --include_europe_pmc\n\n# 5. Build the ChromaDB index (tracked in MLflow) using the selected device \"cpu\" or \"cuda\"\npython -m src.index --device cuda\n\n# Re-index from scratch (drops and recreates the collection)\npython -m src.index --reset --device cuda\n\n# 6. Try a query from the CLI\npython -m src.rag --query \"What are the main error modes of Oxford Nanopore sequencing?\"\n\n# 7. Start the API\nuvicorn app.main:app --reload\n\n# 8. Start the frontend (separate terminal)\nnpm create vite@latest frontend -- --template react\n# Replace the App.jsx from frontend/src\n# then start the frontend\ncd frontend \u0026\u0026 npm install \u0026\u0026 npm run dev\n# Open http://localhost:5173\n\n# 9. View MLflow experiments\nmlflow ui\n# Open http://localhost:5000\n```\n\n## LLM backends\n\nThe RAG pipeline supports three backends, selected via `--llm`:\n\n| Backend          | Flag              | Requirement                                                     |\n| ---------------- | ----------------- | --------------------------------------------------------------- |\n| Ollama (default) | `--llm ollama`    | `ollama pull llama3.1:8b`                                       |\n| Anthropic Claude | `--llm anthropic` | `ANTHROPIC_API_KEY` + uncomment `anthropic` in requirements.txt |\n| OpenAI           | `--llm openai`    | `OPENAI_API_KEY` + uncomment `openai` in requirements.txt       |\n\n```bash\n# Use Ollama (local, free)\nollama pull llama3.1:8b\npython -m src.rag --query \"...\" --llm ollama\n\n# Use Claude\npython -m src.rag --query \"...\" --llm anthropic\n\n# Use OpenAI\npython -m src.rag --query \"...\" --llm openai\n```\n\n## Example API usage\n\n```bash\ncurl -X POST http://localhost:8000/ask \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"query\": \"How does PacBio HiFi compare to Nanopore for structural variant detection?\", \"top_k\": 5}'\n```\n\nResponse:\n```json\n{\n  \"query\": \"How does PacBio HiFi compare to Nanopore...\",\n  \"answer\": \"Based on the literature, PacBio HiFi shows higher base accuracy (~99.9%) [1][2] while Nanopore offers...\",\n  \"latency_ms\": 1240.3,\n  \"sources\": [\n    {\n      \"pmid\": \"38291847\",\n      \"title\": \"Benchmarking long-read sequencing for structural variant detection\",\n      \"year\": \"2024\",\n      \"authors\": \"Li H, Feng X, Chu C\",\n      \"score\": 0.8821,\n      \"has_full\": true\n    }\n  ]\n}\n```\n\n## Frontend\n\nA React/Vite single-page app at `frontend/` connects to the FastAPI backend.\n\n- Dark theme, keyboard-first (Enter to submit)\n- Demo mode — toggle in the header to preview with mock data (no API required)\n- Animated source cards with cosine similarity score bars\n- Example query buttons for quick exploration\n\n```bash\ncd frontend\nnpm install\nnpm run dev   # http://localhost:5173\n```\n\n## Docker\n\n```bash\ndocker build -t longread-rag .\n\n# Mount your data directory so the index persists\ndocker run -p 8000:8000 \\\n  --network host \\\n  -v $(pwd)/data:/app/data \\\n  -e OLLAMA_HOST=http://172.17.0.1:11434 \\\n  longread-rag\n```\n\n## Experiment tracking\n\nEvery indexing run and every query is logged to MLflow:\n\n| Run type | Logged params                       | Logged metrics                                              |\n| -------- | ----------------------------------- | ----------------------------------------------------------- |\n| Indexing | model, chunk_size, overlap, backend | n_papers, n_full_text, n_chunks, new_chunks, total_in_index |\n| Query    | query text, top_k                   | latency_ms, n_sources                                       |\n\n```bash\nmlflow ui   # http://localhost:5000\n```\n\n## Corpus sources\n\n| Source          | Coverage                    | Flag                   |\n| --------------- | --------------------------- | ---------------------- |\n| PubMed (Entrez) | ~3,000 abstracts            | default                |\n| PubMed Central  | full text where open-access | `--fetch_full`         |\n| Europe PMC      | ~1,000 additional papers    | `--include_europe_pmc` |\n\n## Project structure\n\n```\nlongread_rag/\n├── src/\n│   ├── fetch.py        # PubMed/PMC/Europe PMC data collection\n│   ├── index.py        # Chunking, embedding, ChromaDB\n│   └── rag.py          # Retrieval + generation pipeline\n├── app/\n│   └── main.py         # FastAPI endpoints\n├── frontend/\n│   └── src/App.jsx     # React/Vite UI\n├── data/\n│   ├── raw/            # papers.json (gitignored)\n│   └── chromadb/       # ChromaDB collection (gitignored)\n├── Dockerfile\n├── requirements.txt\n└── README.md\n```\n\n## Author\n\nPaolo Inglese\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpiplus2%2Flongreads-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpiplus2%2Flongreads-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpiplus2%2Flongreads-rag/lists"}