{"id":29792276,"url":"https://github.com/armanjscript/fusion-rag","last_synced_at":"2026-04-10T07:43:04.516Z","repository":{"id":306245764,"uuid":"1025513265","full_name":"armanjscript/Fusion-RAG","owner":"armanjscript","description":"A powerful web-based application designed to answer questions based on the content of uploaded PDF documents. This project leverages the **Fusion-in-Decoder (FiD)** approach for **Retrieval-Augmented Generation (RAG)**, combining semantic similarity, technical term relevance, and recency to deliver accurate and contextually relevant responses","archived":false,"fork":false,"pushed_at":"2025-07-24T11:19:58.000Z","size":9,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-24T15:53:38.215Z","etag":null,"topics":["chroma","chromadb","fusion-rag","langchain","langchain-ollama","ollama","pypdf","qwen2-5","rag","rag-chatbot","scikit-learn","streamlit","tf-idf-score","tf-idf-vectorizer","vector-database"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/armanjscript.png","metadata":{"files":{"readme":"README.markdown","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-24T11:17:43.000Z","updated_at":"2025-07-24T15:47:58.000Z","dependencies_parsed_at":"2025-07-24T15:53:40.348Z","dependency_job_id":"454d348f-d787-473d-a0cb-193e8e7e23fc","html_url":"https://github.com/armanjscript/Fusion-RAG","commit_stats":null,"previous_names":["armanjscript/fusion-rag"],"tags_count":null,"template":false,"template_full_name"
:null,"purl":"pkg:github/armanjscript/Fusion-RAG","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/armanjscript%2FFusion-RAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/armanjscript%2FFusion-RAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/armanjscript%2FFusion-RAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/armanjscript%2FFusion-RAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/armanjscript","download_url":"https://codeload.github.com/armanjscript/Fusion-RAG/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/armanjscript%2FFusion-RAG/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267447603,"owners_count":24088640,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chroma","chromadb","fusion-rag","langchain","langchain-ollama","ollama","pypdf","qwen2-5","rag","rag-chatbot","scikit-learn","streamlit","tf-idf-score","tf-idf-vectorizer","vector-database"],"created_at":"2025-07-28T01:06:12.288Z","updated_at":"2026-04-10T07:42:59.479Z","avatar_url":"https://github.com/armanjscript.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"
# Fusion RAG: A Streamlit-Powered PDF Q\u0026A System with Fusion-in-Decoder\n\n[![GitHub Stars](https://img.shields.io/github/stars/armanjscript/Fusion-RAG?style=social)](https://github.com/armanjscript/Fusion-RAG)\n[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n\n## Description\n\nWelcome to **Fusion RAG**, a powerful web-based application designed to answer questions based on the content of uploaded PDF documents. This project leverages the **Fusion-in-Decoder (FiD)** approach for **Retrieval-Augmented Generation (RAG)**, combining semantic similarity, technical term relevance, and recency to deliver accurate and contextually relevant responses. Built with modern technologies like **Streamlit**, **LangChain**, **Chroma**, and **Ollama**, Fusion RAG is ideal for researchers, students, or professionals who need to extract insights from documents efficiently.\n\nThe application features a user-friendly interface where you can upload PDFs, ask questions in a conversational format, and receive detailed, real-time responses with citations to the source documents for transparency and verifiability.\n\n## Features\n\n| Feature | Description |\n|---------|-------------|\n| **PDF Upload \u0026 Processing** | Upload multiple PDF files, which are automatically split into chunks and indexed for querying. |\n| **Advanced Retrieval** | Uses a custom FiDRetriever that combines semantic similarity (via Chroma), technical term relevance (via TF-IDF), and recency to fetch the most relevant document chunks. |\n| **Conversational Interface** | Ask questions in a chat-like interface and receive detailed answers based on document content. |\n| **Real-Time Responses** | Answers are streamed in real-time for a seamless user experience. |\n| **Error Handling** | Robust error handling and cleanup mechanisms ensure reliable operation. 
|\n| **Source Citations** | Responses include references to the source documents, enhancing trust and verifiability. |\n\n## How It Works\n\nThe Fusion RAG system employs a sophisticated pipeline to process documents and generate answers. Here’s a detailed breakdown of the process:\n\n### Document Processing\n- **PDF Loading**: Uploaded PDFs are loaded using `PyPDFLoader` from LangChain.\n- **Text Splitting**: Documents are split into manageable chunks (1000 characters, 200-character overlap) using `RecursiveCharacterTextSplitter`.\n- **Metadata Tagging**: Each chunk is tagged with metadata, such as the source file name, for traceability.\n\n### Retrieval\n- **Vector Search**: Documents are embedded into a high-dimensional vector space using `OllamaEmbeddings` (model: `nomic-embed-text:latest`) and stored in a Chroma vector store. This enables semantic similarity searches.\n- **FiDRetriever**: A custom retriever that scores documents based on:\n  - **Semantic Similarity**: Using vector embeddings to find contextually relevant chunks.\n  - **Technical Term Relevance**: Using TF-IDF to prioritize chunks with important keywords.\n  - **Recency**: Prioritizing recently seen documents for relevance.\n- The top 2 most relevant documents are retrieved for each query.\n\n### Generation\n- **FiDChain**: Combines the retriever with an `OllamaLLM` (model: `qwen2.5:latest`, temperature: 0.3) to generate a detailed response.\n- The response is structured to include an analysis summary, recommended approach, reference standards, and citations to the source documents.\n\n### Diagram of the RAG Pipeline\n```mermaid\nflowchart LR\n    A([\"User Query\"])\n    A --\u003e B[\"FiDRetriever\"]\n    B --\u003e C[\"Vector Search (Chroma)\"]\n    C --\u003e D[\"Embeddings (OllamaEmbeddings)\"]\n    B --\u003e E[\"TF-IDF Scoring\"]\n    B --\u003e F[\"Recency Scoring\"]\n    C --\u003e G[\"Retrieved Documents\"]\n    E --\u003e G\n    F --\u003e G\n    G --\u003e H[\"Formatter\"]\n    H 
--\u003e I[\"Prompt Template\"]\n    I --\u003e J[\"LLM\"]\n    J --\u003e K[\"Output Parser\"]\n    K --\u003e L[\"Response\"]\n```\n\nThis diagram can be rendered in GitHub to visualize the pipeline from query to response.\n\n## Environment Setup\n\nTo run Fusion RAG, you’ll need to set up the following:\n\n- **Python 3.8 or later**: Ensure Python is installed on your system. Download from [python.org](https://www.python.org/downloads/).\n- **Ollama**: A tool for running large language models locally. Install it based on your operating system:\n  - **Windows**: Download the installer from [Ollama Download](https://ollama.com/download) and run it.\n  - **macOS**: Download the installer from [Ollama Download](https://ollama.com/download), unzip it, and drag the `Ollama.app` to your Applications folder.\n  - **Linux**: Run the installation script as per the [Ollama GitHub repository](https://github.com/ollama/ollama).\n- **Python Libraries**: Install the required dependencies listed in `requirements.txt`.\n\n## Installation\n\nFollow these steps to set up the project:\n\n1. **Clone the Repository**:\n   ```bash\n   git clone https://github.com/armanjscript/Fusion-RAG.git\n   ```\n2. **Navigate to the Project Directory**:\n   ```bash\n   cd Fusion-RAG\n   ```\n3. **Install Dependencies**:\n   ```bash\n   pip install -r requirements.txt\n   ```\n   The `requirements.txt` file includes dependencies like `streamlit`, `langchain`, `chromadb`, `langchain-ollama`, `scikit-learn`, and others.\n\n4. **Start Ollama**:\n   Ensure the Ollama service is running on your system. Follow the instructions from the [Ollama GitHub repository](https://github.com/ollama/ollama) to start the service.\n\n5. **Run the Streamlit App**:\n   ```bash\n   streamlit run fusion_rag.py\n   ```\n   This will launch the app in your default web browser.\n\n## Usage\n\n1. 
**Upload PDFs**:\n   - In the sidebar, use the file uploader to select one or more PDF files.\n   - The files are saved locally in the `uploaded_pdfs` directory and indexed in a Chroma database (`chroma_db`).\n\n2. **Ask Questions**:\n   - Enter your query in the chat input field in the main interface.\n   - The chatbot will retrieve relevant document chunks, generate a response, and stream it in real-time.\n   - Responses include citations to the source documents for reference.\n\n3. **Clear Documents**:\n   - Use the \"Clear All Documents\" button in the sidebar to reset the application by deleting uploaded files and the vector store.\n\n## Configuration\n\nThe system uses fixed parameters for optimal performance:\n- **Temperature**: Set to 0.3 for controlled randomness in the language model’s responses.\n- **Retrieval**: Retrieves the top 2 documents based on a combination of semantic similarity, TF-IDF scoring, and recency.\n\nTo modify these parameters, you can edit the `fusion_rag.py` file to adjust the `OllamaLLM` temperature or the `FiDRetriever` scoring logic.\n\n## Contributing\n\nWe welcome contributions to enhance Fusion RAG! To contribute:\n- Fork the repository.\n- Make your changes in a new branch.\n- Submit a pull request with a clear description of your changes.\n\nFor detailed guidelines, refer to the [CONTRIBUTING.md](CONTRIBUTING.md) file.\n\n## License\n\nThis project is licensed under the MIT License. 
See the [LICENSE](LICENSE) file for details.\n\n## Contact\n\nFor questions, feedback, or collaboration opportunities, please reach out:\n- **Email**: [armannew73@gmail.com]\n- **GitHub Issues**: Open an issue on this repository for bug reports or feature requests.\n\n## Acknowledgments\n\nThis project builds on the following open-source technologies:\n- [Streamlit](https://streamlit.io/) for the web interface\n- [LangChain](https://www.langchain.com/) for document processing and RAG pipeline\n- [Chroma](https://www.trychroma.com/) for vector storage\n- [Ollama](https://ollama.com/) for local language models and embeddings\n- [Scikit-learn](https://scikit-learn.org/) for TF-IDF vectorization\n\nThank you for exploring Fusion RAG! We hope it simplifies your document analysis tasks.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farmanjscript%2Ffusion-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farmanjscript%2Ffusion-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farmanjscript%2Ffusion-rag/lists"}