Projects in Awesome Lists tagged with pymupdf

https://github.com/pymupdf/pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps

Last synced: 09 Sep 2025

https://github.com/pymupdf/PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps

Last synced: 08 Apr 2025

https://github.com/artifexsoftware/pdf2docx

Open source Python library for converting PDF to DOCX.

docx extract-table pdf-converter pdf-to-word pymupdf

Last synced: 14 May 2025

https://github.com/ArtifexSoftware/pdf2docx

Open source Python library for converting PDF to DOCX.

docx extract-table pdf-converter pdf-to-word pymupdf

Last synced: 28 Mar 2025

https://github.com/CBIhalsen/PolyglotPDF

(eBook, PDFs Translation) A multilingual eBook processing tool supporting all eBook formats. Features online and offline translation while preserving original layouts. Compatible with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.

deepseek ebook formulas latex math openai-api pdf pymupdf translation

Last synced: 18 Aug 2025

https://github.com/cbihalsen/polyglotpdf

(eBook, PDFs Translation) A multilingual eBook processing tool supporting all eBook formats. Features online and offline translation while preserving original layouts. Compatible with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.

deepseek ebook formulas latex math openai-api pdf pymupdf translation

Last synced: 14 May 2025

https://github.com/Krasjet/pdf.tocgen

A CLI toolset to generate table of contents for PDF files automatically.

cli pdf pdf-document pdf-files pymupdf scraping table-of-contents toc-generator

Last synced: 15 May 2025

https://github.com/krasjet/pdf.tocgen

A CLI toolset to generate table of contents for PDF files automatically.

cli pdf pdf-document pdf-files pymupdf scraping table-of-contents toc-generator

Last synced: 08 Apr 2025

https://github.com/lucasrla/remarks

Extract annotations (highlights and scribbles) from PDF, EPUB, and notebooks marked with reMarkable tablets. Export to Markdown, PDF, PNG, SVG

annotations epub highlighting markdown obsidian ocr ocrmypdf pdf pdf-converter pymupdf remarkable-tablet roamresearch svg-images zotero

Last synced: 13 Sep 2025

https://github.com/zain-bin-arshad/pdf-viewer

A pure-Python PDF viewer that provides the same functionality as other popular PDF viewers.

fitz pdf pdf-viewer pdf-viewer-python pure-python pymupdf pysimplegui python python-pdf

Last synced: 18 Aug 2025

https://github.com/vb64/markdown-pdf

Markdown to PDF renderer

markdown markdown-it pdf pymupdf

Last synced: 04 Apr 2025

https://github.com/pancham1603/discord-pdf

View PDF files in Discord text channels without downloading

discord-bot pymongo pymupdf

Last synced: 08 Oct 2025

https://github.com/ahmedtrb/pdf_highlight_extractor

A Python application built with PySide6 and PyMuPDF that extracts highlighted text from PDF files and categorizes it by color, allowing users to save and organize highlighted content in a Markdown file.

highlights-extractor pdf pymupdf pyside6 python

Last synced: 11 May 2025

https://github.com/vickypandey14/convert-pdf-into-image-by-python

This Python script converts each page of a PDF document into separate image files. It utilizes the PyMuPDF library (fitz) to handle PDF operations and the Python Imaging Library (PIL) for image processing.

pdf-converter pymupdf pymupdf-fitz python python-script

Last synced: 09 Apr 2025

https://github.com/jfriedlein/h2a_pdf-highlightedtext_to_annotation

Python tool to extract highlighted text from a PDF file and write this text into the content of each annotation

annotation docear executable highlight-text pdf pymupdf python

Last synced: 01 Mar 2025

https://github.com/jfriedlein/h2afreeplane_pdf-highlightedtext_to_freeplane_synch

Freeplane script to organise highlighted text and notes from PDF files as a Freeplane mindmap

annotation docear freeplane-addon highlight-text notes pdf pymupdf

Last synced: 01 Mar 2025

https://github.com/muneeb1030/finetune-tiny-llama

Fine-tuning the Tiny Llama model to mimic my professor's writing style using the Llama Factory. The project involves data collection, preprocessing, preparation, fine-tuning, and evaluation.

data data-preparation data-preprocessing finetuning llama-factory llm pymupdf selenium-python spacy tinyllama webscraping

Last synced: 28 Dec 2025

https://github.com/elias-jhsph/scienceai

An AI-powered scientific literature search engine that uses OpenAI's language models to analyze research papers. It enables users to extract data, ask complex questions, and perform ad hoc literature reviews, handling hundreds of papers simultaneously without needing metadata.

ai data-extraction dictdatabase flask literature-review llm openai pymupdf research-project research-tool scientific-publications scientific-research

Last synced: 27 Dec 2025

https://github.com/gokulgowthams/askdocs_gen-ai

Private document question-answering application: the user uploads a document and then asks questions about its content.

chromadb deeplearning docx2txt faiss generativeai huggingface langchain llama2 llamaindex openaiembeddings pdfplumber pinecone pymupdf sentencetransformer streamlit

Last synced: 18 Aug 2025

https://github.com/nlqthinh/wibuchatbot

Anime Waifu Chatbot - An AI-powered chatbot with an anime waifu personality! 🌸 Features include chatting, generating anime images, summarizing websites, reading PDFs, retrieving stock prices, and more. Built with Python, Gradio, and OpenAI's API.

anime chatbot diffusers gradio langchain openai pymupdf torch transformers yfinance

Last synced: 03 Aug 2025

https://github.com/tech-c-p/conversai

ConversAI is an innovative conversational AI framework designed for intelligent text extraction and querying across various document formats and web content, leveraging advanced natural language processing techniques.

beautifulsoup chatbot genai gradio groq langchain large-language-models llama3 mlops nlp ocr pymupdf python

Last synced: 01 Sep 2025

https://github.com/vikas-kashyap97/resume-screening

AI-Powered Research Summarizer is a web app that uses Google’s Gemini 1.5 Pro to generate tailored, clear summaries of research papers. It supports PDF uploads, multiple summary styles, and exports to DOCX or PDF.

google-generative-ai langchain-python pymupdf reportlab serper-api streamlit

Last synced: 23 Aug 2025

https://github.com/diocrafts/ai-book-summarizer

📚 AI-Powered Book PDF Knowledge Extractor & Summarizer. Transform your PDF books into structured knowledge effortlessly! This tool leverages AI to analyze books page by page, extracting key insights, definitions, and concepts, and organizes them into Markdown summaries for easier study.

ai ai-powered-tools automation book-summary document-analysis educational-tools knowledge-extraction machine-learning markdown natural-language-processing openai pdf pdf-processing pdf-summarization pymupdf python study-materials text-analysis text-summarization

Last synced: 14 Jul 2025

https://github.com/renan-siqueira/python-pdf-tool

This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing a choice among different text extraction libraries and supporting both a single PDF file and a directory containing multiple PDF files.

mit-license pdf pdf-extractor pdf-to-text pdfminer pdfplumber pymupdf pypdf2 python

Last synced: 01 Apr 2025

https://github.com/hemaldholakiya12/pdfchat

A web app that allows users to upload PDFs and interact with them through a Q&A interface. The application extracts text from PDFs, generates embeddings, stores them in a FAISS database, and retrieves relevant information to provide context-aware answers using a large language model.

ai api cors embeddings faiss fastapi groq huggingface langchain llama3 llm pdf pdf-processing pymupdf python question-answering semantic-search text-splitting transformers vector-store

Last synced: 30 Oct 2025

https://github.com/boyac/pygamgee

PyGamgee enhances learning and decision-making with a local, open-source language model for efficient, private access.

ai ai-assistants ai-mentor ai-tutor aicpa deepseek embeddings faiss langchain memory nlp ollama on-premise openai pymupdf python self-hosted self-study vectorstore

Last synced: 24 Feb 2025

https://github.com/paolpal/pdfwizard

Toolkit for PDF editing.

bleed bleeding fitz mirror mirror-bleeding mirrorbleed pdf py pymupdf pypdf python

Last synced: 19 Jul 2025

https://github.com/benitomartin/scraping-to-sql

Open Source Contribution to Justicio Project

beautifulsoup fitz mysql pymupdf python requests

Last synced: 20 Feb 2025

https://github.com/ks6088ts-labs/extractor-python

A data extract tool written in Python

fitz gpt-4-vision openai pymupdf

Last synced: 22 Feb 2025

https://github.com/coycs/pdf-streamlit

PDF tools, written with Python, deployed on Streamlit

pdf pymupdf python streamlit

Last synced: 17 Mar 2025

https://github.com/bilalhameed248/pdf-document-extraction

Python PDF-to-HTML Converter: Transforming PDF Documents into Structured HTML Tags. - Feb 2022 - Jun 2023

document extraction fitz parser parsing pdf pymupdf pymupdf-fitz python python3

Last synced: 08 Oct 2025

https://github.com/jasoncobra3/floorplan-dimractor

A sophisticated Python pipeline for automatically extracting dimensions and cabinet codes from architectural floorplan PDFs. This tool converts various dimension formats into standardized measurements and provides structured output with visualization capabilities.

architecture-tools automation-tools blueprint-analysis cad-automation computer-vision dimension-extraction document-processing document-processing-pipeline floorplan-analysis image-processing measurement-tools opencv pdf-parser pdf-processing pdfplumber pymupdf streamlit text-detection

Last synced: 08 Oct 2025

https://github.com/gerdguerrero/study-chatbot

AI Study Assistant. Upload PDFs, chat with your study materials, and generate practice exams with answer keys. Built with Streamlit, powered by OpenAI GPT-4, and enhanced with RAG (ChromaDB) for intelligent document retrieval.

embeddings gpt-4o gpt4 gpt4o-mini langchain openai openai-api pdfplumber pymupdf pypdf2 python rag rag-chatbot streamlit streamlit-webapp

Last synced: 30 Dec 2025

https://github.com/orengrinker/pdfllm

The PDF Question Answering App uses Streamlit for a user-friendly interface where users can upload PDFs and ask questions. It employs LlamaIndex to index PDF content and PyMuPDF4LLM to parse files, enabling efficient, accurate answers based on the document’s text.

llamaindex openai pymupdf pymupdf4llm python3 streamlit

Last synced: 12 Oct 2025

https://github.com/shefreenkaur/nlp_query_documents

This repository contains two implementations of an NLP document query system that processes PDF documents and ranks them based on relevance to user queries.

easyocr naive-bayes nlp numpy ppmi pymupdf tf-idf

Last synced: 29 Dec 2025

https://github.com/prakshal0809/rag-chatbot

Developed a RAG-based chatbot for seamless integration with an e-hospital platform, enhancing response accuracy by 30% through reliable, trusted medical data sources. Processed more than 500 pages of medical data, enabling real-time symptom analysis and disease suggestions.

javascript langchain openai pinecone pymupdf python reactjs

Last synced: 16 Oct 2025

https://github.com/venkatarangan/productsdigest

A Python-based web scraper that fetches details from specified product webpages, especially Amazon product pages.

amazon beautifulsoup4 pdf-generation pymupdf selenium-python

Last synced: 15 Jun 2025

https://github.com/marek-jakub/siters

A simple .pdf file reader, written in Python.

pdf-viewer pymupdf pyside6 python3 qt6

Last synced: 09 Apr 2025

https://github.com/hreikin/pdf-toolbox

Extract content from PDFs and convert or create new documents from the content in multiple output formats.

adobe document-conversion document-converter document-creation document-creator document-extraction image-extraction pandoc pymupdf pypandoc python python3 scrapy text-extraction

Last synced: 09 Jul 2025

https://github.com/al-shwaib/book-preparation-for-printing

A web application for preparing books and magazines for offset printing. Automatically arranges PDF pages for commercial A3 printing, supporting both Arabic (RTL) and English (LTR) books.

a3-printing arabic-books book-preparation commercial-printing flask-application offset-printing order-to-print pdf-processing pymupdf rtl-support

Last synced: 20 Mar 2025

https://github.com/philippe2023/rag-question-answering-app

An AI-powered Question Answering application that uses Retrieval-Augmented Generation (RAG) to provide accurate and context-aware answers from uploaded PDF documents.

deep-translator langchain ollama pymupdf python3 streamlit transformers

Last synced: 10 Jul 2025

https://github.com/dipanshudhage/crop-and-fertiliser-recommendation-system

The Crop and Fertilizer Recommendation System leverages machine learning to assist farmers in selecting the best crops and fertilizers based on soil nutrient data. By analyzing soil test reports (images/PDFs), the system provides AI-driven recommendations for optimal crop growth and fertilizer use, tailored to the farmer’s specific soil conditions.

machine-learning pymupdf python streamlit tesseract-ocr

Last synced: 07 May 2025

https://github.com/ivan-ayub97/encorpdf_es

EncorPDF Viewer is a sleek and efficient application designed for viewing PDF documents. Tailored for those who simply want to open and navigate PDF files without unnecessary features, distractions, or intrusive ads, it offers a straightforward and hassle-free user experience.

pdf pdf-files pdf-viewer pymupdf pyqt5 pyqt5-desktop-application python python3 viewer

Last synced: 14 Jul 2025

https://github.com/nas-research/knowledge-model

Our knowledge system systematically ingests, processes, and indexes open-access life science publications. It supports internal research by providing precise question-answering and efficient retrieval from a continuously updated repository of scientific literature

accelerate aws boto3 dataingestion keras lifesciences llama llama3 llm numpy pymupdf pytorch researchsupport sqlalchemy tensorflow textextraction

Last synced: 30 Dec 2025

https://github.com/shefreenkaur/web-scraping-and-word-frequencies

This project analyzes word frequencies in BC Legislative documents using Stanford CoreNLP and Python. The program extracts text from PDF documents, processes it using natural language processing techniques, and generates a comprehensive word frequency analysis.

analytics chromedriver easyocr nlp numpy pandas pymupdf python selenium stanfordcorenlp webscraping wordfrequency

Last synced: 28 Mar 2025

https://github.com/amirlogic/pymupdf-webapp

PyMuPDF webapp based on CherryPy

cherrypy pymupdf python3

Last synced: 28 Mar 2025

https://github.com/timothy-bartlett/pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

data-science extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction text-processing text-shaping xps

Last synced: 17 Mar 2025

https://github.com/rohit-2301/hiresense

HireSense is an AI-powered resume classifier that uses NLP and Machine Learning to predict the best-fit job role from a PDF resume. Built with Streamlit, it features a clean UI for uploading resumes and instantly suggests roles like Data Scientist, Full Stack Developer, and DevOps Engineer.

joblib ml nlp pymupdf python scikit-learn streamlit tfidfvectorizer

Last synced: 22 Jul 2025

https://github.com/esnanta/ai-chatbot-informasi-publik-berbasis-dokumen

This project is an early prototype of an AI-based chatbot designed to present information related to regulations.

aichatbot chatbot cross-encoder fastapi nltk pymupdf python sentence-transformers uvicorn yii2

Last synced: 29 Mar 2025

https://github.com/aphp/edspdf-mupdf

MuPDF extension for EDS-PDF

edspdf extractor mupdf pdf pymupdf

Last synced: 20 Mar 2025

https://github.com/jluster96/pdf-package-analyzer

A comprehensive Python tool for analyzing PDF files and determining the best PDF processing library for each file. The analyzer tests PDFs against multiple libraries (pypdf, PyMuPDF, pdfplumber) and provides detailed compatibility reports and recommendations.

analysis analytics compatibility ocr pdf pdf-document pdfplumber pymupdf pypdf pypdf2 python python3 quality testing text-processing text-shaping

Last synced: 05 Oct 2025

https://github.com/mohandshamada/rag-mcp

RAG MCP server for AI documentation

claude-ai langchain mcp-server pymupdf rag

Last synced: 11 Nov 2025

https://github.com/ivan-ayub97/encorpdf

EncorPDF Viewer is an efficient viewer application designed for reading PDF documents.

pdf pdf-files pdf-viewer pymupdf pyqt5 pyqt5-desktop-application python python3 viewer

Last synced: 23 Jul 2025

https://github.com/antoniotejada/srdine

Generates an enhanced Dungeons and Dragons 5e SRD PDF

dnd dnd5e dungeons-and-dragons dungeons-and-dragons-5e pdf pymupdf srd5 wizards-srd

Last synced: 22 Mar 2025

https://github.com/lingesh81051/similar-template-document-matching-and-fraud-detection

An automated system for a health insurance company to streamline document processing, including template matching and fraud detection, resulting in reduction of processing time.

numpy opencv opencv-python pillow pymupdf pytesseract pytesseract-ocr python tkinter

Last synced: 06 Mar 2025

https://github.com/sdam-au/ocr

Experiments with OCR using Python.

ocr ocr-python pymupdf pytesseract tesseract tool

Last synced: 15 May 2025

https://github.com/ananthakrishnan12/resume-analyzer-using-bert

Resume Analyzer Using BERT

bert-embeddings bert-model cosine-similarity nlp-parsing pdf pdftotext pymupdf python3 spacy-nlp streamlit transformers

Last synced: 15 Mar 2025

https://github.com/esnanta/docu-query

This project is an early prototype of an AI-based chatbot designed to present information related to regulations.

aichatbot chatbot cross-encoder fastapi nltk pymupdf python sentence-transformers uvicorn yii2

Last synced: 31 Mar 2025

https://github.com/olonok69/nim_llamaindex

LlamaIndex integration with NVIDIA NIM

llamaindex nvidia-nim pymupdf pymupdf4llm python rag streamlit

Last synced: 24 Mar 2025

https://github.com/terry-li-hm/prometheus

PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude

ai-tools claude-code document-processing fastmcp mcp-server pdf-processing pdf-splitter prometheus pymupdf python text-extraction

Last synced: 03 Sep 2025

https://github.com/vaidehishyara14/ayurveda-pdf-q-a-chatbot

An intelligent chatbot that allows users to upload text-based Ayurveda PDFs and ask questions based on the content using RAG (Retrieval-Augmented Generation) combining semantic search and LLM-based responses.

embeddings fastapi fiass langchain langchain-groq llama3 llm pdf pdfprocessing pymupdf python question-answering text-splitting vector-database

Last synced: 03 Jul 2025

https://github.com/hase3b/scprag

This repository implements a Retrieval-Augmented Generation (RAG) system for the Supreme Court of Pakistan, utilizing different LLMs, embedding models, and retrieval and generation enhancement strategies. It processes SCP judgments, applies chunking, and generates legal summaries and answers based on relevant case data.

beautifulsoup4 embedding-models huggingface langchain legal-corpus llama llm mistral nlp ocr pdf2image pinecone pymupdf pytesseract regex retreival retrieval-augmented-generation selenium vectorstore

Last synced: 31 Dec 2025