{"id":17973832,"url":"https://github.com/prgrmcode/retrieval-based-qa-llm","last_synced_at":"2026-04-16T04:01:27.778Z","repository":{"id":259987680,"uuid":"879981438","full_name":"prgrmcode/retrieval-based-qa-llm","owner":"prgrmcode","description":"This repository contains a Jupyter notebook that demonstrates how to build a retrieval-based question-answering system using LangChain and Hugging Face. The notebook guides you through the process of setting up the environment, loading and processing documents, generating embeddings, and querying the system to retrieve relevant info from documents.","archived":false,"fork":false,"pushed_at":"2024-10-31T15:21:08.000Z","size":43,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-03T23:29:05.956Z","etag":null,"topics":["gemma2-2b","huggingface","huggingface-transformers","langchain","llama3","llm","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prgrmcode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-28T22:36:10.000Z","updated_at":"2024-10-31T15:21:11.000Z","dependencies_parsed_at":"2024-10-29T00:18:53.237Z","dependency_job_id":"95f58a82-a562-4122-82c0-8ee0fab09463","html_url":"https://github.com/prgrmcode/retrieval-based-qa-llm","commit_stats":null,"previous_names":["prgrmcode/retrieval-based-qa-llm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/prgrmcode/retrieval-based-qa-llm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prgrmcode%2Fretrieval-based-qa-llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prgrmcode%2Fretrieval-based-qa-llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prgrmcode%2Fretrieval-based-qa-llm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prgrmcode%2Fretrieval-based-qa-llm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prgrmcode","download_url":"https://codeload.github.com/prgrmcode/retrieval-based-qa-llm/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prgrmcode%2Fretrieval-based-qa-llm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31870516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15
T15:24:51.572Z","status":"online","status_checked_at":"2026-04-16T02:00:06.042Z","response_time":69,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gemma2-2b","huggingface","huggingface-transformers","langchain","llama3","llm","python"],"created_at":"2024-10-29T17:03:31.272Z","updated_at":"2026-04-16T04:01:27.693Z","avatar_url":"https://github.com/prgrmcode.png","language":"Jupyter Notebook","readme":"# Retrieval-Based Question Answering with LangChain and Hugging Face\n\nThis repository contains a Jupyter notebook that demonstrates how to build a retrieval-based question-answering system using LangChain and Hugging Face. The notebook guides you through the process of setting up the environment, loading and processing documents, generating embeddings, and querying the system to retrieve relevant documents.\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Setup](#setup)\n  - [Parsing Documents](#parsing-documents)\n  - [Generating Embeddings](#generating-embeddings)\n  - [Retrieving Relevant Documents](#retrieving-relevant-documents)\n  - [Using LangChain](#using-langchain)\n- [Examples](#examples)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Installation\n\nTo set up the environment and install the necessary dependencies, follow these steps:\n\n1. **Clone the repository**:\n\n   ```sh\n   git clone https://github.com/prgrmcode/retrieval-based-qa-llm.git\n   cd retrieval-based-qa-llm\n   ```\n\n2. 
**Create a virtual environment**:\n\n   ```sh\n   python -m venv .venv\n   source .venv/bin/activate  # On Windows use `.venv\\Scripts\\activate`\n   ```\n\n3. **Install dependencies**:\n   ```sh\n   pip install -r requirements.txt\n   ```\n\n## Usage\n\n### Setup\n\n1. **Install required packages**:\n\n   ```python\n   %pip install -Uqqq rich tiktoken wandb langchain unstructured tabulate pdf2image chromadb\n   %pip install --upgrade transformers\n   %pip install -U \"huggingface_hub[cli]\"\n   %pip install -U langchain-community\n   %pip install -U langchain_huggingface\n   %pip install sentence-transformers\n   ```\n\n2. **Login to Hugging Face**:\n\n   ```python\n   !huggingface-cli login\n   ```\n\n3. **Configure Hugging Face API token**:\n\n   ```python\n   import os\n   from getpass import getpass\n\n   if os.getenv(\"HUGGINGFACE_API_TOKEN\") is None:\n       os.environ[\"HUGGINGFACE_API_TOKEN\"] = getpass(\"Paste your Hugging Face API token from: https://huggingface.co/settings/tokens\\n\")\n\n   assert os.getenv(\"HUGGINGFACE_API_TOKEN\", \"\").startswith(\"hf_\"), \"This doesn't look like a valid Hugging Face API token\"\n   print(\"Hugging Face API token is configured\")\n   ```\n\n4. **Configure W\u0026B tracing**:\n   ```python\n   os.environ[\"LANGCHAIN_WANDB_TRACING\"] = \"true\"\n   os.environ[\"WANDB_PROJECT\"] = \"llmapps\"\n   ```\n\n### Parsing Documents\n\n1. 
**Load documents from the specified directory**:\n\n   ```python\n   import time\n   from langchain_community.document_loaders import DirectoryLoader, TextLoader\n\n   def find_md_files(directory):\n       start_time = time.time()\n       loader = DirectoryLoader(directory, glob=\"**/*.md\", loader_cls=TextLoader, show_progress=True)\n       documents = loader.load()\n       end_time = time.time()\n       print(f\"Time taken to load documents: {end_time - start_time:.2f} seconds\")\n       return documents\n\n   documents = find_md_files(directory=\"docs_sample/\")\n   print(f\"Number of documents loaded: {len(documents)}\")\n   ```\n\n2. **Count tokens in each document**:\n\n   ```python\n   # Note: `tokenizer` is initialized in the \"Generating Embeddings\" section below;\n   # run that cell first if you execute the notebook out of order.\n   def count_tokens(documents):\n       token_counts = [len(tokenizer.encode(document.page_content)) for document in documents]\n       return token_counts\n\n   token_counts = count_tokens(documents)\n   print(f\"Token counts: {token_counts}\")\n   ```\n\n3. **Split documents into sections**:\n\n   ```python\n   from langchain.text_splitter import MarkdownTextSplitter\n\n   md_text_splitter = MarkdownTextSplitter(chunk_size=1000)\n   document_sections = md_text_splitter.split_documents(documents)\n   print(f\"Number of document sections: {len(document_sections)}\")\n   print(f\"Max tokens in a section: {max(count_tokens(document_sections))}\")\n   ```\n\n### Generating Embeddings\n\n1. **Initialize the tokenizer and model**:\n\n   ```python\n   import torch\n   import transformers\n\n   model_id = \"meta-llama/Meta-Llama-3-8B-Instruct\"\n   tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\n   model = transformers.AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map=\"auto\")\n   ```\n\n2. 
**Generate embeddings using HuggingFaceEmbeddings**:\n\n   ```python\n   from langchain_community.vectorstores import Chroma\n   from langchain_huggingface import HuggingFaceEmbeddings\n\n   model_name = \"sentence-transformers/all-mpnet-base-v2\"\n   model_kwargs = {\"device\": \"cuda\"}\n   encode_kwargs = {\"normalize_embeddings\": False}\n   embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs)\n\n   db = Chroma.from_documents(document_sections, embeddings)\n   ```\n\n### Retrieving Relevant Documents\n\n1. **Create a retriever from the database**:\n\n   ```python\n   retriever = db.as_retriever(search_kwargs=dict(k=3))\n   ```\n\n2. **Run a query to retrieve relevant documents**:\n\n   ```python\n   query = \"How can I share my W\u0026B report with my team members in a public W\u0026B project?\"\n   docs = retriever.invoke(query)\n\n   for doc in docs:\n       print(doc.metadata[\"source\"])\n   ```\n\n### Using LangChain\n\n1. **Create a RetrievalQA chain**:\n\n   ```python\n   from langchain.chains import RetrievalQA\n   from langchain_huggingface import HuggingFacePipeline\n   from transformers import pipeline\n   from tqdm import tqdm\n\n   pipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer, max_new_tokens=70)\n   llm = HuggingFacePipeline(pipeline=pipe)\n\n   qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=retriever)\n   ```\n\n2. **Run the query using the RetrievalQA chain**:\n\n   ```python\n   from IPython.display import Markdown, display\n\n   with tqdm(total=1, desc=\"Running RetrievalQA\") as pbar:\n       result = qa.run(query)\n       pbar.update(1)\n\n   display(Markdown(result))\n   ```\n\n## Examples\n\nThe `examples.txt` file contains example inputs and outputs for various tasks. These examples can help you understand the expected behavior of the models and scripts.\n\n## Contributing\n\nContributions are welcome! 
If you have any ideas, suggestions, or improvements, please open an issue or submit a pull request.\n\n## License\n\nThis project is licensed under the MIT License.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprgrmcode%2Fretrieval-based-qa-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprgrmcode%2Fretrieval-based-qa-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprgrmcode%2Fretrieval-based-qa-llm/lists"}