{"id":31445955,"url":"https://github.com/semanticclimate/rag-llm-with-pdf-xml","last_synced_at":"2025-09-30T23:53:12.416Z","repository":{"id":307627524,"uuid":"1028245754","full_name":"semanticClimate/RAG-LLM-with-PDF-XML","owner":"semanticClimate","description":null,"archived":false,"fork":false,"pushed_at":"2025-08-01T09:40:54.000Z","size":87,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-04T18:59:38.677Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/semanticClimate.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-29T08:41:20.000Z","updated_at":"2025-08-01T09:40:57.000Z","dependencies_parsed_at":"2025-08-01T10:03:22.154Z","dependency_job_id":null,"html_url":"https://github.com/semanticClimate/RAG-LLM-with-PDF-XML","commit_stats":null,"previous_names":["semanticclimate/rag-llm-with-pdf-xml"],"tags_count":0,"template":false,"template_full_name":"semanticClimate/notebook-template","purl":"pkg:github/semanticClimate/RAG-LLM-with-PDF-XML","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2FRAG-LLM-with-PDF-XML","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2FRAG-LLM-with-PDF-XML/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2FRAG-LLM-with-PDF-XML/releases","manifests_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2FRAG-LLM-with-PDF-XML/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/semanticClimate","download_url":"https://codeload.github.com/semanticClimate/RAG-LLM-with-PDF-XML/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/semanticClimate%2FRAG-LLM-with-PDF-XML/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277773147,"owners_count":25874567,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-30T02:00:09.208Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-30T23:53:10.787Z","updated_at":"2025-09-30T23:53:12.411Z","avatar_url":"https://github.com/semanticClimate.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File\n\n\u003ca href=\"https://colab.research.google.com/github/semanticClimate/RAG-LLM-with-PDF-XML/blob/main/FSCI2025_RAG_LLM_PDF.ipynb\" target=\"_blank\"\u003e\n  \u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" style=\"max-width: 100%;\"\u003e\n\u003c/a\u003e\n\nDOI Zenodo badge: 
\n\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16675979.svg)](https://doi.org/10.5281/zenodo.16675979)\n\nCitation:\n\nBarbhuiya, S., Alwi, K. K., Kumari, R., S., A., Jawed, M., Simon, W., Yadav, G., \u0026 Murray-Rust, P. (2025). RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File (0.2). Zenodo. https://doi.org/10.5281/zenodo.16675979\n\nDescription: \n\nThis notebook demonstrates how to build a semantic question-answering system over scientific PDFs using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). It enables users to upload PDFs, extract content, embed it into a vector store, and query the document using natural language.\n\n**Key Features**\n- PDF Upload \u0026 Text Extraction: Extract raw text from research papers using PyMuPDF\n- Text Chunking \u0026 Embeddings: Split text into chunks and generate embeddings with models from libraries such as sentence-transformers\n- RAG Pipeline:\n    - Store document chunks in a FAISS vector index\n    - Retrieve top-matching chunks based on user queries\n    - Generate context-aware answers with an LLM\n- Natural Language Q\u0026A: Ask questions like “What is the main finding?” or “What methods were used?” and get answers grounded in passages retrieved from the paper\n\n[Link to Notebook](https://colab.research.google.com/drive/17J9wEvkQvdaeOihN3N13u_ln5Oez8ssd?usp=sharing)\n\nReviewers \u0026 review process: \\\u003cAdd reviewers and review process link\\\u003e \n\n---\n\nSoftware citation information: [CITATION.cff](CITATION.cff)\n\nLicense: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ | License information: 
[LICENSE](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemanticclimate%2Frag-llm-with-pdf-xml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsemanticclimate%2Frag-llm-with-pdf-xml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemanticclimate%2Frag-llm-with-pdf-xml/lists"}