{"id":19516914,"url":"https://github.com/chdl17/ai-document-retrieval-using-llama","last_synced_at":"2026-01-31T07:02:55.728Z","repository":{"id":262158066,"uuid":"886389206","full_name":"chdl17/AI-Document-Retrieval-Using-Llama","owner":"chdl17","description":"This project implements an AI-powered pipeline to retrieve relevant document chunks and answer user queries based on PDF documents.","archived":false,"fork":false,"pushed_at":"2024-11-10T21:39:45.000Z","size":514,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-29T01:34:14.523Z","etag":null,"topics":["chromadb","huggingface","langchain","llama3","meta","streamlit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chdl17.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-10T21:35:26.000Z","updated_at":"2024-11-10T21:39:48.000Z","dependencies_parsed_at":"2024-11-10T22:31:00.711Z","dependency_job_id":"2eafbe13-a153-41c8-8f9c-94b88af0ba76","html_url":"https://github.com/chdl17/AI-Document-Retrieval-Using-Llama","commit_stats":null,"previous_names":["chdl17/ai-document-retrieval-using-llama"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chdl17%2FAI-Document-Retrieval-Using-Llama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chdl17%2FAI-Document-Retrieval-Using-Llama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/chdl17%2FAI-Document-Retrieval-Using-Llama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chdl17%2FAI-Document-Retrieval-Using-Llama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chdl17","download_url":"https://codeload.github.com/chdl17/AI-Document-Retrieval-Using-Llama/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249155823,"owners_count":21221668,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chromadb","huggingface","langchain","llama3","meta","streamlit"],"created_at":"2024-11-11T00:01:02.343Z","updated_at":"2026-01-31T07:02:50.694Z","avatar_url":"https://github.com/chdl17.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AI Document Retrieval and Question-Answering Pipeline\n\nThis project implements an AI-powered pipeline to retrieve relevant document chunks and answer user queries based on PDF documents. It leverages **LangChain**, **Hugging Face**, and **Chroma** for document processing, embedding generation, and question-answering tasks using a **Llama** language model. 
The solution is built as a web application using **Streamlit**.\n\n### Table of Contents\n- [Overview](#overview)\n- [Project Flow](#project-flow)\n- [Technologies](#technologies)\n- [Installation](#installation)\n- [Usage](#usage)\n- [File Structure](#file-structure)\n- [License](#license)\n\n### Overview\n\nThis pipeline takes a PDF file as input, extracts its text, and stores that text as vector embeddings. A **Llama language model** then answers user queries based on the retrieved document chunks. The key components of the pipeline include:\n\n1. **PDF Document Loading**: Load PDF documents and split them into manageable chunks.\n2. **Document Vectorization**: Convert the chunks into vector embeddings using a Hugging Face embedding model.\n3. **Vector Store Creation**: Store these embeddings in a Chroma vector store for efficient retrieval.\n4. **Retrieval-Based Question Answering**: A **RetrievalQA** pipeline is used to retrieve relevant document chunks and answer user queries.\n5. **Interactive Streamlit UI**: Users can interact with the system to upload PDFs and query the model via a simple Streamlit web interface.\n\n### Project Flow\n\n#### Step-by-Step Process:\n\n1. **PDF Document Upload**:\n    - The user uploads a PDF document to the system via the **Streamlit interface**.\n    - The PDF is loaded using the **PyPDFLoader** class from LangChain.\n\n2. **Document Chunking**:\n    - The loaded PDF content is split into smaller, manageable chunks using **RecursiveCharacterTextSplitter**. This keeps each chunk within the model's input limits.\n\n3. **Embedding Generation**:\n    - The text chunks are embedded using a **Hugging Face embeddings** model.\n    - The embeddings are then stored in **Chroma**, a vector store, to allow for efficient retrieval during the question-answering phase.\n\n4. 
**Model Initialization**:\n    - The tokenizer and the **Llama model** are loaded using **Hugging Face Transformers**.\n    - The model is moved to either the CPU or GPU for processing.\n\n5. **Retrieval QA Setup**:\n    - The **RetrievalQA** chain is created, linking the **Llama model** and the **Chroma vector store**.\n    - This setup lets the chain query the vector store for the most relevant document chunks when answering user questions.\n\n6. **User Query and Answer Generation**:\n    - When the user inputs a query, the **RetrievalQA** chain searches for the most relevant document chunks using cosine similarity.\n    - The retrieved document chunks are fed into the **Llama model**, which generates the final answer based on the information retrieved.\n\n7. **Streamlit Interface**:\n    - The user interacts with the pipeline through a simple web-based interface built using **Streamlit**.\n    - Users can upload PDFs and input their queries, receiving answers directly in the browser.\n\n### Technologies\n\n- **LangChain**: For managing document loading, text splitting, and the question-answering chain.\n- **Hugging Face**: For using pre-trained models (Llama, embeddings).\n- **Chroma**: A vector store for efficient retrieval of document embeddings.\n- **Streamlit**: For building the interactive web interface.\n- **PyTorch**: For running and managing the Llama model.\n\n### Installation\n\n1. Clone the repository:\n\n    ```bash\n    git clone https://github.com/chdl17/AI-Document-Retrieval-Using-Llama.git\n    cd AI-Document-Retrieval-Using-Llama\n    ```\n\n2. Create and activate a virtual environment:\n\n    ```bash\n    python -m venv venv\n    source venv/bin/activate   # On Windows, use venv\\Scripts\\activate\n    ```\n\n3. Install dependencies:\n\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n4. 
**Set up environment variables**:\n   - Create a `.env` file in the root directory of the project with the following content (keep this file out of version control, since it holds your access token):\n\n    ```bash\n    MODEL_NAME=\"your_model_name_here\"\n    HUGGINGFACE=\"your_huggingface_token_here\"\n    ```\n\n### Usage\n\n1. **Run the Streamlit application**:\n\n    ```bash\n    streamlit run app.py\n    ```\n\n2. **Upload a PDF**:\n   - Open your browser and navigate to the Streamlit app.\n   - Upload a PDF document.\n\n3. **Ask a question**:\n   - Once the document is loaded and processed, enter a query about the content of the uploaded PDF.\n\n4. **View the answer**:\n   - The model returns an answer based on the most relevant sections of the document.\n\n### File Structure\n\n```bash\n.\n├── app.py                      # Streamlit app entry point\n├── rag_pipeline.py             # Pipeline logic and functions\n├── requirements.txt            # Python dependencies\n├── .env                        # Environment variables (MODEL_NAME, HUGGINGFACE token)\n└── README.md                   # Project documentation\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchdl17%2Fai-document-retrieval-using-llama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchdl17%2Fai-document-retrieval-using-llama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchdl17%2Fai-document-retrieval-using-llama/lists"}