{"id":29103174,"url":"https://github.com/semanticclimate/llmrag","last_synced_at":"2026-02-02T08:02:50.338Z","repository":{"id":300102233,"uuid":"1004380651","full_name":"semanticClimate/llmrag","owner":"semanticClimate","description":"A template Github repository for running your own AI LLM RAG for a literature review project","archived":false,"fork":false,"pushed_at":"2025-08-08T11:23:34.000Z","size":221952,"stargazers_count":1,"open_issues_count":3,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-08T13:14:54.436Z","etag":null,"topics":["ai","llm","rag"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/semanticClimate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-18T14:33:25.000Z","updated_at":"2025-08-08T11:23:37.000Z","dependencies_parsed_at":"2025-08-08T13:18:14.770Z","dependency_job_id":null,"html_url":"https://github.com/semanticClimate/llmrag","commit_stats":null,"previous_names":["semanticclimate/llmrag"],"tags_count":2,"template":true,"template_full_name":null,"purl":"pkg:github/semanticClimate/llmrag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2Fllmrag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2Fllmrag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2Fllmrag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClim
ate%2Fllmrag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/semanticClimate","download_url":"https://codeload.github.com/semanticClimate/llmrag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2Fllmrag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29007381,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-02T06:37:10.400Z","status":"ssl_error","status_checked_at":"2026-02-02T06:37:09.383Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","llm","rag"],"created_at":"2025-06-28T23:08:44.873Z","updated_at":"2026-02-02T08:02:50.333Z","avatar_url":"https://github.com/semanticClimate.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IPCC RAG System 🌍📚\n\nCite as: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16779401.svg)](https://doi.org/10.5281/zenodo.16779401)\n\n**A Local Retrieval-Augmented Generation (RAG) System for IPCC Climate Reports**\n\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![PRs 
Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com)\n\n## 🎯 What is this?\n\nThis system helps researchers, policymakers, and students quickly find and understand information from IPCC (Intergovernmental Panel on Climate Change) reports. Think of it as a **smart research assistant** that can:\n\n- 📖 **Load IPCC chapters** from your computer\n- 🔍 **Answer questions** about climate science\n- 📍 **Show you exactly where** information comes from (paragraph IDs)\n- 🤖 **Run entirely on your computer** (no internet needed after setup)\n\n## 🚀 Quick Start (5 minutes)\n\n### 1. Install Python\nFirst, make sure you have Python 3.12+ installed:\n- **Windows**: Download from [python.org](https://www.python.org/downloads/)\n- **Mac**: Install with [Homebrew](https://brew.sh): `brew install python@3.12`\n- **Linux**: `sudo apt install python3.12 python3.12-venv python3-pip`\n\n**Note**: This system requires Python 3.12 or higher for optimal performance and compatibility.\n\n### 2. Download and Setup\n\n#### Fast Installation (Recommended)\n**For users experiencing slow installation (especially on Windows):**\n\n```bash\n# Download the project\ngit clone https://github.com/semanticClimate/llmrag.git\ncd llmrag\n\n# Use the optimized installation script\n# Windows:\ninstall_fast.bat\n\n# Unix/Linux/Mac:\n./install_fast.sh\n```\n\n#### Standard Installation\n```bash\n# Download the project\ngit clone https://github.com/semanticClimate/llmrag.git\ncd llmrag\n\n# Install with optimization flags (faster)\npip install --upgrade pip setuptools wheel\npip install --use-feature=fast-deps --only-binary=:all: -e .\n\n# Or install required packages\npip install -r requirements.txt\n```\n\n### 3. 
Try it out!\n```bash\n# Start the web interface\nstreamlit run streamlit_app.py\n\n# Or use the command line\npython -m llmrag.cli list-chapters\npython -m llmrag.cli ask \"What are the main findings about temperature trends?\" --chapter wg1/chapter02\n```\n\n## 📚 Learning Resources\n\n### For Beginners\n- **What is RAG?**: [LangChain RAG Tutorial](https://python.langchain.com/docs/use_cases/question_answering/)\n- **Climate Science**: [IPCC FAQ](https://www.ipcc.ch/about/faq/)\n- **Python Basics**: [Python.org Tutorial](https://docs.python.org/3/tutorial/)\n\n### For Researchers\n- **RAG Systems**: [Retrieval-Augmented Generation Paper](https://arxiv.org/abs/2005.11401)\n- **Vector Databases**: [ChromaDB Documentation](https://docs.trychroma.com/)\n- **Embeddings**: [Sentence Transformers Guide](https://www.sbert.net/)\n\n### For Developers\n- **Streamlit**: [Streamlit Documentation](https://docs.streamlit.io/)\n- **HuggingFace**: [Transformers Tutorial](https://huggingface.co/docs/transformers/tutorials)\n- **Vector Search**: [FAISS Tutorial](https://github.com/facebookresearch/faiss/wiki)\n\n## 🎮 How to Use\n\n### Web Interface (Recommended for beginners)\n1. Run `streamlit run streamlit_app.py`\n2. Open your browser to `http://localhost:8501`\n3. 
Select a chapter and start asking questions!\n\n### Command Line (For power users)\n```bash\n# See available chapters\npython -m llmrag.cli list-chapters\n\n# Ask a question\npython -m llmrag.cli ask \"What causes global warming?\" --chapter wg1/chapter02\n\n# Interactive mode\npython -m llmrag.cli interactive --chapter wg1/chapter02\n```\n\n### Python Code (For developers)\n```python\nfrom llmrag.chapter_rag import ask_chapter\n\n# Ask a question about a chapter\nresult = ask_chapter(\n    question=\"What are the main climate change impacts?\",\n    chapter_name=\"wg1/chapter02\"\n)\n\nprint(f\"Answer: {result['answer']}\")\nprint(f\"Sources: {result['paragraph_ids']}\")\n```\n\n## 📁 What's Included\n\n```\nllmrag/\n├── 📖 IPCC Chapters          # Climate report data\n├── 🤖 RAG System            # Question answering engine\n├── 🌐 Web Interface         # User-friendly browser app\n├── 💻 Command Line Tools    # Power user interface\n├── 🔧 Processing Pipeline   # Data preparation tools\n└── 📊 Documentation         # Guides and tutorials\n```\n\n## 🛠️ System Components\n\n### Core RAG System\n- **Document Loading**: Processes IPCC HTML chapters\n- **Text Chunking**: Breaks documents into searchable pieces\n- **Vector Search**: Finds relevant information quickly\n- **Answer Generation**: Creates coherent responses\n- **Source Tracking**: Shows exactly where answers come from\n\n### User Interfaces\n- **Streamlit Web App**: Beautiful, interactive interface\n- **Command Line**: Fast, scriptable interface\n- **Python API**: For integration with other tools\n\n### Data Processing\n- **HTML Cleaning**: Removes formatting, keeps content\n- **Paragraph IDs**: Tracks information sources\n- **Semantic Chunking**: Keeps related information together\n\n## 🔬 Technical Details\n\n### Models Used\n- **Embeddings**: Sentence Transformers (all-MiniLM-L6-v2)\n- **Language Model**: GPT-2 Large (774M parameters)\n- **Vector Database**: ChromaDB (local storage)\n\n### Performance\n- 
**Speed**: Answers in 2-5 seconds\n- **Accuracy**: Based on IPCC content only\n- **Memory**: ~2GB RAM for full system\n- **Storage**: ~500MB for all chapters\n\n## 🤝 Contributing\n\nWe welcome contributions! Here's how to help:\n\n### For Non-Developers\n- 📝 **Report bugs** or suggest improvements\n- 📚 **Test the system** with your research questions\n- 📖 **Improve documentation** or write tutorials\n- 🌍 **Share with colleagues** who might find it useful\n\n### For Developers\n- 🔧 **Fix bugs** or add features\n- 🧪 **Add tests** to ensure quality\n- 📦 **Improve packaging** or deployment\n- 🚀 **Optimize performance**\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- **IPCC**: For providing the climate science reports\n- **HuggingFace**: For the language models and tools\n- **ChromaDB**: For the vector database\n- **Streamlit**: For the web interface framework\n- **Open Source Community**: For all the amazing tools we build upon\n\n## 📞 Support\n\n- **Issues**: [GitHub Issues](https://github.com/semanticClimate/llmrag/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/semanticClimate/llmrag/discussions)\n- **Email**: your.email@example.com\n\n## 📈 Roadmap\n\n- [ ] **More IPCC Chapters**: Add WG2 and WG3 reports\n- [ ] **Better Models**: Upgrade to larger language models\n- [ ] **Multi-language**: Support for non-English reports\n- [ ] **Collaborative Features**: Share questions and answers\n- [ ] **Mobile App**: iOS and Android versions\n\n---\n\n**Made with ❤️ for climate science research**\n\n## 🚀 Quickstart for Collaborators\n\n### Prerequisites\n- Python 3.9 or higher (Python 3.12 recommended)\n- Git\n\n### 1. Clone the Repository\n```bash\ngit clone https://github.com/semanticClimate/llmrag.git\ncd llmrag\n```\n\n### 2. 
Set Up Virtual Environment\n\n**Windows (Command Prompt):**\n```cmd\npython -m venv venv\nvenv\\Scripts\\activate\n```\n\n**Windows (PowerShell):**\n```powershell\npython -m venv venv\n.\\venv\\Scripts\\Activate.ps1\n```\n\n**macOS/Linux:**\n```bash\npython3 -m venv venv\nsource venv/bin/activate\n```\n\n### 3. Install Dependencies\n```bash\npip install -e .\npip install -r requirements.txt\n```\n\n### 4. Run Tests\n```bash\n# Run all tests\npython -m pytest tests/ -v\n\n# Run with coverage\ncoverage run --source=llmrag -m pytest tests/\ncoverage report -m\n```\n\n### 5. Test IPCC HTML Ingestion (New Feature)\n```bash\npython test_ipcc_ingestion.py\n```\n\nThis will:\n- Ingest the IPCC Chapter 4 HTML file with paragraph IDs\n- Test the RAG pipeline with climate-related queries\n- Show which paragraph IDs were used to generate answers\n\n## 🔧 Features\n\n### HTML Ingestion with Paragraph ID Tracking\nThe system now supports ingesting HTML documents and tracking paragraph IDs for source attribution:\n\n- **HTML Splitter**: Extracts text while preserving paragraph IDs from HTML elements\n- **RAG Pipeline**: Returns paragraph IDs used in generating answers\n- **Test Script**: `test_ipcc_ingestion.py` demonstrates the feature with IPCC content\n\n### Example Output\n```\nQuery: What are the main scenarios used in climate projections?\nAnswer: [Generated answer]\nParagraph IDs found: ['4.1_p3', '4.3.2.2_p2']\n```\n\n## 🐛 Troubleshooting\n\n### Slow Installation Issues\n**If `pip install -e .` takes more than 10 minutes:**\n\n1. **Use the fast installation scripts**:\n   ```bash\n   # Windows\n   install_fast.bat\n   \n   # Unix/Linux/Mac\n   ./install_fast.sh\n   ```\n\n2. **Try staged installation**:\n   ```bash\n   pip install --upgrade pip setuptools wheel\n   pip install pyyaml lxml pytest rich streamlit toml\n   pip install chromadb langchain\n   pip install --only-binary=:all: transformers sentence-transformers\n   pip install -e .\n   ```\n\n3. 
**Use conda for heavy packages** (Windows):\n   ```bash\n   conda install -c conda-forge transformers sentence-transformers\n   pip install -e .\n   ```\n\n### Windows-Specific Issues\n- **Virtual Environment**: Make sure to use the correct activation script for your shell\n- **Dependencies**: If you encounter issues with `lxml` or `transformers`, try:\n  ```bash\n  pip install lxml transformers\n  ```\n- **DLL Errors**: Ensure you have the latest Python and pip versions\n- **Visual Studio Build Tools**: Install Visual Studio Build Tools 2019+ for compilation\n\n### General Issues\n- **Python Version**: We recommend Python 3.12. Some libraries may have compatibility issues with older versions\n- **Virtual Environment**: *ALWAYS USE A VIRTUAL ENVIRONMENT* to avoid conflicts\n- **NumPy Conflicts**: If you have NumPy in your global environment, it may cause issues. Use a clean virtual environment\n- **Network Issues**: Large downloads may time out during installation. Use `pip install --timeout 300` for a longer timeout\n\n## 📁 Project Structure\n\n```\nllmrag/\n├── llmrag/                    # Main package\n│   ├── chunking/             # Text splitting (including HTML)\n│   ├── embeddings/           # Embedding models\n│   ├── models/               # LLM models\n│   ├── pipelines/            # RAG pipeline\n│   └── retrievers/           # Vector stores\n├── tests/                    # Test suite\n│   └── ipcc/                # IPCC test data\n├── test_ipcc_ingestion.py   # IPCC ingestion test script\n└── requirements.txt          # Dependencies\n```\n\n## 📝 Development\n\nFor chat history and development notes, see:\n- `./project.md` - Project documentation\n- `./all_code.py` - Development history (Messy)\n\n## 🧪 Testing\n\nThe test suite includes:\n- Unit tests for all components\n- Integration tests for the RAG pipeline\n- HTML ingestion tests with paragraph ID tracking\n- IPCC content tests\n\nRun tests with:\n```bash\npython -m pytest tests/ -v\n```\n\nExpected 
result:\n```\n=========================================== 10 passed in 27.61s ============================================\n```\n\n# TEST\n```\ngit clone https://github.com/semanticClimate/llmrag/\n```\nThen\n```\ncd llmrag\n```\n\nSet up and activate a virtual environment (on Mac):\n```\npython3.12 -m venv venv\nsource venv/bin/activate\n```\nRun the tests (this should take about 0.5 min):\n```\npip install -r requirements.txt\ncoverage run --source=llmrag -m unittest discover -s tests\ncoverage report -m\n```\n\nResult:\n```\n..Device set to use cpu\n..Retrieved: [('Paris is the capital of France.', 0.2878604531288147)]\n.Retrieved: [('Paris is the capital of France.', 0.37026578187942505)]\n.\n----------------------------------------------------------------------\nRan 6 tests in 20.267s\n\nOK\n```\n\nTo print the coverage:\n\n```\ncoverage report -m\n```\n\n# BUGS\n\nWe run on Python 3.12. This can cause problems with some libraries, such as NumPy. Although `numpy` is not included in `llmrag` at present, it may be in your environment. *ALWAYS USE A VIRTUAL ENVIRONMENT*\n(PMR found differences between NumPy on Python 3.11 and 3.12.)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemanticclimate%2Fllmrag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsemanticclimate%2Fllmrag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemanticclimate%2Fllmrag/lists"}