https://github.com/semanticclimate/rag-llm-with-pdf-xml

Host: GitHub
URL: https://github.com/semanticclimate/rag-llm-with-pdf-xml
Owner: semanticClimate
License: apache-2.0
Created: 2025-07-29T08:41:20.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-01T09:40:54.000Z (11 months ago)
Last Synced: 2025-09-04T18:59:38.677Z (10 months ago)
Language: Jupyter Notebook
Size: 85 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

README

# RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File

DOI Zenodo badge:

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16675979.svg)](https://doi.org/10.5281/zenodo.16675979)

Citation:

Barbhuiya, S., Alwi, K. K., Kumari, R., S., A., Jawed, M., Simon, W., Yadav, G., & Murray-Rust, P. (2025). RAG-LLM Pipeline for Extracting and Generating Insights from PDF/XML File (0.2). Zenodo. https://doi.org/10.5281/zenodo.16675979

Description:

This notebook demonstrates how to build a semantic question-answering system over scientific PDFs using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs). It enables users to upload PDFs, extract content, embed it into a vector store, and query the document using natural language.
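The extraction and chunking steps might look roughly like the sketch below. It is not the notebook's actual code: the function names `extract_pdf_text` and `chunk_text`, the word-based splitting, and the default `chunk_size` and `overlap` values are all assumptions chosen for illustration. The only third-party call is PyMuPDF's `fitz.open` with `page.get_text()`, imported lazily so the chunker runs without it.

```python
def extract_pdf_text(path: str) -> str:
    """Return the plain text of every page in a PDF, joined by newlines."""
    # PyMuPDF is imported here so chunk_text below works even if it is not installed.
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    (Hypothetical helper; sizes are illustrative, not taken from the notebook.)
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()  # splits on any whitespace, including newlines
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

A typical call would be `chunks = chunk_text(extract_pdf_text("paper.pdf"))`, where `paper.pdf` is a placeholder path. Overlapping chunks trade some storage for better recall at chunk boundaries.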

**Key Features**
- PDF Upload & Text Extraction: Extract raw text from research papers using PyMuPDF
- Text Chunking & Embeddings: Split the text into chunks and generate embeddings with models from libraries such as sentence-transformers
- RAG Pipeline:
  - Store document chunks in a FAISS vector database
  - Retrieve top-matching chunks based on user queries
  - Generate context-aware answers with an LLM
- Natural Language Q&A: Ask questions like “What is the main finding?” or “What methods were used?” and get answers grounded in passages retrieved from the paper
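
The retrieve-then-generate flow listed above can be illustrated with a dependency-free sketch. It deliberately swaps in simpler techniques: a word-count vector stands in for a sentence-transformers embedding, and a brute-force cosine ranking stands in for a FAISS index. The names `embed`, `cosine`, `retrieve`, and `build_prompt`, the prompt wording, and the three sample chunks are all made up for illustration and are not taken from the notebook.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a neural embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (brute force, stand-in for FAISS)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text sent to the LLM: retrieved chunks first, then the question."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Invented sample chunks, purely for demonstration.
chunks = [
    "We measured soil moisture at twelve sites using capacitance probes.",
    "The main finding is that drought frequency doubled over the study period.",
    "Funding was provided by a national research council.",
]
question = "What is the main finding?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline, the resulting `prompt` would be passed to whichever LLM the notebook uses. Restricting the model to the retrieved context is what keeps answers tied to the paper instead of the model's general knowledge.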

[Link to Notebook](https://colab.research.google.com/drive/17J9wEvkQvdaeOihN3N13u_ln5Oez8ssd?usp=sharing)

Reviewers & review process: \

---

Software citation information: [CITATION.cff](CITATION.cff)

License: Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ | License information: [LICENSE](LICENSE)
