Projects in Awesome Lists tagged with pymupdf
A curated list of projects in awesome lists tagged with pymupdf.
https://github.com/pymupdf/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
Last synced: 09 Sep 2025
https://github.com/pymupdf/PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
Last synced: 08 Apr 2025
https://github.com/artifexsoftware/pdf2docx
Open source Python library for converting PDF to DOCX.
docx extract-table pdf-converter pdf-to-word pymupdf
Last synced: 14 May 2025
https://github.com/ArtifexSoftware/pdf2docx
Open source Python library for converting PDF to DOCX.
docx extract-table pdf-converter pdf-to-word pymupdf
Last synced: 28 Mar 2025
https://github.com/CBIhalsen/PolyglotPDF
(eBook,PDFs Translation) A multilingual eBook processing tool supporting all eBook formats. Features online and offline translation while preserving original layouts. Compatible with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.
deepseek ebook formulas latex math openai-api pdf pymupdf translation
Last synced: 18 Aug 2025
https://github.com/cbihalsen/polyglotpdf
(eBook,PDFs Translation) A multilingual eBook processing tool supporting all eBook formats. Features online and offline translation while preserving original layouts. Compatible with both scanned and digital PDFs. Elegant user interface. The world's highest-performing open-source layout-preserving eBook translator.
deepseek ebook formulas latex math openai-api pdf pymupdf translation
Last synced: 14 May 2025
https://github.com/Krasjet/pdf.tocgen
A CLI toolset to generate table of contents for PDF files automatically.
cli pdf pdf-document pdf-files pymupdf scraping table-of-contents toc-generator
Last synced: 15 May 2025
https://github.com/krasjet/pdf.tocgen
A CLI toolset to generate table of contents for PDF files automatically.
cli pdf pdf-document pdf-files pymupdf scraping table-of-contents toc-generator
Last synced: 08 Apr 2025
https://github.com/lucasrla/remarks
Extract annotations (highlights and scribbles) from PDF, EPUB, and notebooks marked with reMarkable tablets. Export to Markdown, PDF, PNG, SVG
annotations epub highlighting markdown obsidian ocr ocrmypdf pdf pdf-converter pymupdf remarkable-tablet roamresearch svg-images zotero
Last synced: 13 Sep 2025
https://github.com/zain-bin-arshad/pdf-viewer
A pure-Python PDF viewer that provides the same functionality as other popular PDF viewers.
fitz pdf pdf-viewer pdf-viewer-python pure-python pymupdf pysimplegui python python-pdf
Last synced: 18 Aug 2025
https://github.com/vb64/markdown-pdf
Markdown to pdf renderer
markdown markdown-it pdf pymupdf
Last synced: 04 Apr 2025
https://github.com/pancham1603/discord-pdf
View PDF files in Discord text channels without downloading them.
Last synced: 08 Oct 2025
https://github.com/ahmedtrb/pdf_highlight_extractor
A Python application built with PySide6 and PyMuPDF that extracts highlighted text from PDF files and categorizes it by color, allowing users to save and organize highlighted content in a Markdown file.
highlights-extractor pdf pymupdf pyside6 python
Last synced: 11 May 2025
https://github.com/vickypandey14/convert-pdf-into-image-by-python
This Python script converts each page of a PDF document into separate image files. It utilizes the PyMuPDF library (fitz) to handle PDF operations and the Python Imaging Library (PIL) for image processing.
pdf-converter pymupdf pymupdf-fitz python python-script
Last synced: 09 Apr 2025
https://github.com/jfriedlein/h2a_pdf-highlightedtext_to_annotation
Python tool to extract highlighted text from a pdf file and write this text into the content of each annotation
annotation docear executable highlight-text pdf pymupdf python
Last synced: 01 Mar 2025
https://github.com/jfriedlein/h2afreeplane_pdf-highlightedtext_to_freeplane_synch
Freeplane script to organise highlighted text and notes from pdf files as Freeplane mindmap
annotation docear freeplane-addon highlight-text notes pdf pymupdf
Last synced: 01 Mar 2025
https://github.com/muneeb1030/finetune-tiny-llama
Fine-tuning the Tiny Llama model to mimic my professor's writing style using the Llama Factory. The project involves data collection, preprocessing, preparation, fine-tuning, and evaluation.
data data-preparation data-preprocessing finetuning llama-factory llm pymupdf selenium-python spacy tinyllama webscraping
Last synced: 28 Dec 2025
https://github.com/elias-jhsph/scienceai
An AI-powered scientific literature search engine that uses OpenAI's language models to analyze research papers. It enables users to extract data, ask complex questions, and perform ad hoc literature reviews, handling hundreds of papers simultaneously without needing metadata.
ai data-extraction dictdatabase flask literature-review llm openai pymupdf research-project research-tool scientific-publications scientific-research
Last synced: 27 Dec 2025
https://github.com/gokulgowthams/askdocs_gen-ai
A private document question-answering application: the user uploads a document and can then ask questions about its content.
chromadb deeplearning docx2txt faiss generativeai huggingface langchain llama2 llamaindex openaiembeddings pdfplumber pinecone pymupdf sentencetransformer streamlit
Last synced: 18 Aug 2025
https://github.com/nlqthinh/wibuchatbot
Anime Waifu Chatbot - An AI-powered chatbot with an anime waifu personality! 🌸 Features include chatting, generating anime images, summarizing websites, reading PDFs, retrieving stock prices, and more. Built with Python, Gradio, and OpenAI's API.
anime chatbot diffusers gradio langchain openai pymupdf torch transformers yfinance
Last synced: 03 Aug 2025
https://github.com/tech-c-p/conversai
ConversAI is an innovative conversational AI framework designed for intelligent text extraction and querying across various document formats and web content, leveraging advanced natural language processing techniques.
beautifulsoup chatbot genai gradio groq langchain large-language-models llama3 mlops nlp ocr pymupdf python
Last synced: 01 Sep 2025
https://github.com/vikas-kashyap97/resume-screening
AI-Powered Research Summarizer is a web app that uses Google’s Gemini 1.5 Pro to generate tailored, clear summaries of research papers. It supports PDF uploads, multiple summary styles, and exports to DOCX or PDF.
google-generative-ai langchain-python pymupdf reportlab serper-api streamlit
Last synced: 23 Aug 2025
https://github.com/diocrafts/ai-book-summarizer
📚 AI-Powered Book PDF Knowledge Extractor & Summarizer. Transform your PDF books into structured knowledge effortlessly! This tool leverages AI to analyze books page by page, extracting key insights, definitions, and concepts, and organizes them into Markdown summaries for easier study.
ai ai-powered-tools automation book-summary document-analysis educational-tools knowledge-extraction machine-learning markdown natural-language-processing openai pdf pdf-processing pdf-summarization pymupdf python study-materials text-analysis text-summarization
Last synced: 14 Jul 2025
https://github.com/renan-siqueira/python-pdf-tool
This project facilitates the extraction of text from PDF files using various Python libraries. It is designed to be flexible, allowing the choice among different text extraction libraries and supporting both single PDF file and directory containing multiple PDF files.
mit-license pdf pdf-extractor pdf-to-text pdfminer pdfplumber pymupdf pypdf2 python
Last synced: 01 Apr 2025
https://github.com/hemaldholakiya12/pdfchat
A web app that allows users to upload PDFs and interact with them through a Q&A interface. The application extracts text from PDFs, generates embeddings, stores them in a FAISS database, and retrieves relevant information to provide context-aware answers using a large language model.
ai api cors embeddings faiss fastapi groq huggingface langchain llama3 llm pdf pdf-processing pymupdf python question-answering semantic-search text-splitting transformers vector-store
Last synced: 30 Oct 2025
https://github.com/boyac/pygamgee
PyGamgee enhances learning and decision-making with a local, open-source language model for efficient, private access.
ai ai-assistants ai-mentor ai-tutor aicpa deepseek embeddings faiss langchain memory nlp ollama on-premise openai pymupdf python self-hosted self-study vectorstore
Last synced: 24 Feb 2025
https://github.com/paolpal/pdfwizard
Toolkit for pdf editing.
bleed bleeding fitz mirror mirror-bleeding mirrorbleed pdf py pymupdf pypdf python
Last synced: 19 Jul 2025
https://github.com/benitomartin/scraping-to-sql
Open Source Contribution to Justicio Project
beautifulsoup fitz mysql pymupdf python requests
Last synced: 20 Feb 2025
https://github.com/ks6088ts-labs/extractor-python
A data extract tool written in Python
fitz gpt-4-vision openai pymupdf
Last synced: 22 Feb 2025
https://github.com/coycs/pdf-streamlit
PDF tools, written with Python, deployed on Streamlit
Last synced: 17 Mar 2025
https://github.com/bilalhameed248/pdf-document-extraction
Python PDF-to-HTML Converter: Transforming PDF Documents into Structured HTML Tags. - Feb 2022 - Jun 2023
document extraction fitz parser parsing pdf pymupdf pymupdf-fitz python python3
Last synced: 08 Oct 2025
https://github.com/jasoncobra3/floorplan-dimractor
A sophisticated Python pipeline for automatically extracting dimensions and cabinet codes from architectural floorplan PDFs. This tool converts various dimension formats into standardized measurements and provides structured output with visualization capabilities.
architecture-tools automation-tools blueprint-analysis cad-automation computer-vision dimension-extraction document-processing document-processing-pipeline floorplan-analysis image-processing measurement-tools opencv pdf-parser pdf-processing pdfplumber pymupdf streamlit text-detection
Last synced: 08 Oct 2025
https://github.com/gerdguerrero/study-chatbot
AI Study Assistant. Upload PDFs, chat with your study materials, and generate practice exams with answer keys. Built with Streamlit, powered by OpenAI GPT-4, and enhanced with RAG (ChromaDB) for intelligent document retrieval.
embeddings gpt-4o gpt4 gpt4o-mini langchain openai openai-api pdfplumber pymupdf pypdf2 python rag rag-chatbot streamlit streamlit-webapp
Last synced: 30 Dec 2025
https://github.com/orengrinker/pdfllm
The PDF Question Answering App uses Streamlit for a user-friendly interface where users can upload PDFs and ask questions. It employs LlamaIndex to index PDF content and PyMuPDF4LLM to parse files, enabling efficient, accurate answers based on the document’s text.
llamaindex openai pymupdf pymupdf4llm python3 streamlit
Last synced: 12 Oct 2025
https://github.com/shefreenkaur/nlp_query_documents
This repository contains two implementations of an NLP document query system that processes PDF documents and ranks them based on relevance to user queries.
easyocr naive-bayes nlp numpy ppmi pymupdf tf-idf
Last synced: 29 Dec 2025
https://github.com/prakshal0809/rag-chatbot
Developed a RAG-based chatbot for seamless integration with an e-hospital platform, enhancing response accuracy by 30% through reliable, trusted medical data sources. Processed more than 500 pages of medical data, enabling real-time symptom analysis and disease suggestions.
javascript langchain openai pinecone pymupdf python reactjs
Last synced: 16 Oct 2025
https://github.com/venkatarangan/productsdigest
A Python-based web scraper that fetches details from specified product webpages, especially Amazon product pages.
amazon beautifulsoup4 pdf-generation pymupdf selenium-python
Last synced: 15 Jun 2025
https://github.com/marek-jakub/siters
A simple .pdf file reader, written in Python.
pdf-viewer pymupdf pyside6 python3 qt6
Last synced: 09 Apr 2025
https://github.com/hreikin/pdf-toolbox
Extract content from PDFs and convert it, or create new documents from it, in multiple output formats.
adobe document-conversion document-converter document-creation document-creator document-extraction image-extraction pandoc pymupdf pypandoc python python3 scrapy text-extraction
Last synced: 09 Jul 2025
https://github.com/al-shwaib/book-preparation-for-printing
A web application for preparing books and magazines for offset printing. Automatically arranges PDF pages for commercial A3 printing, supporting both Arabic (RTL) and English (LTR) books.
a3-printing arabic-books book-preparation commercial-printing flask-application offset-printing order-to-print pdf-processing pymupdf rtl-support
Last synced: 20 Mar 2025
https://github.com/philippe2023/rag-question-answering-app
An AI-powered Question Answering application that uses Retrieval-Augmented Generation (RAG) to provide accurate and context-aware answers from uploaded PDF documents.
deep-translator langchain ollama pymupdf python3 streamlit transformers
Last synced: 10 Jul 2025
https://github.com/dipanshudhage/crop-and-fertiliser-recommendation-system
The Crop and Fertilizer Recommendation System leverages machine learning to assist farmers in selecting the best crops and fertilizers based on soil nutrient data. By analyzing soil test reports (images/PDFs), the system provides AI-driven recommendations for optimal crop growth and fertilizer use, tailored to the farmer’s specific soil conditions.
machine-learning pymupdf python streamlit tesseract-ocr
Last synced: 07 May 2025
https://github.com/ivan-ayub97/encorpdf_es
EncorPDF Viewer is a sleek and efficient application designed for viewing PDF documents. Tailored for those who simply want to open and navigate PDF files without unnecessary features, distractions, or intrusive ads, it offers a straightforward and hassle-free user experience.
pdf pdf-files pdf-viewer pymupdf pyqt5 pyqt5-desktop-application python python3 viewer
Last synced: 14 Jul 2025
https://github.com/nas-research/knowledge-model
Our knowledge system systematically ingests, processes, and indexes open-access life science publications. It supports internal research by providing precise question-answering and efficient retrieval from a continuously updated repository of scientific literature
accelerate aws boto3 dataingestion keras lifesciences llama llama3 llm numpy pymupdf pytorch researchsupport sqlalchemy tensorflow textextraction
Last synced: 30 Dec 2025
https://github.com/shefreenkaur/web-scraping-and-word-frequencies
This project analyzes word frequencies in BC Legislative documents using Stanford CoreNLP and Python. The program extracts text from PDF documents, processes it using natural language processing techniques, and generates a comprehensive word frequency analysis.
analytics chromedriver easyocr nlp numpy pandas pymupdf python selenium stanfordcorenlp webscraping wordfrequency
Last synced: 28 Mar 2025
https://github.com/amirlogic/pymupdf-webapp
PyMuPDF webapp based on CherryPy
Last synced: 28 Mar 2025
https://github.com/timothy-bartlett/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction text-processing text-shaping xps
Last synced: 17 Mar 2025
https://github.com/rohit-2301/hiresense
HireSense is an AI-powered resume classifier that uses NLP and Machine Learning to predict the best-fit job role from a PDF resume. Built with Streamlit, it features a clean UI for uploading resumes and instantly suggests roles like Data Scientist, Full Stack Developer, and DevOps Engineer.
joblib ml nlp pymupdf python scikit-learn streamlit tfidfvectorizer
Last synced: 22 Jul 2025
https://github.com/esnanta/ai-chatbot-informasi-publik-berbasis-dokumen
This project is an early prototype of an AI-based chatbot designed to present information related to regulations.
aichatbot chatbot cross-encoder fastapi nltk pymupdf python sentence-transformers uvicorn yii2
Last synced: 29 Mar 2025
https://github.com/jluster96/pdf-package-analyzer
A comprehensive Python tool for analyzing PDF files and determining the best PDF processing library for each file. The analyzer tests PDFs against multiple libraries (pypdf, PyMuPDF, pdfplumber) and provides detailed compatibility reports and recommendations.
analysis analytics compatibility ocr pdf pdf-document pdfplumber pymupdf pypdf pypdf2 python python3 quality testing text-processing text-shaping
Last synced: 05 Oct 2025
https://github.com/mohandshamada/rag-mcp
RAG MCP server for AI documentation
claude-ai langchain mcp-server pymupdf rag
Last synced: 11 Nov 2025
https://github.com/ivan-ayub97/encorpdf
EncorPDF Viewer is an efficient viewing application designed for viewing PDF documents.
pdf pdf-files pdf-viewer pymupdf pyqt5 pyqt5-desktop-application python python3 viewer
Last synced: 23 Jul 2025
https://github.com/antoniotejada/srdine
Generates an enhanced Dungeons and Dragons 5e SRD PDF
dnd dnd5e dungeons-and-dragons dungeons-and-dragons-5e pdf pymupdf srd5 wizards-srd
Last synced: 22 Mar 2025
https://github.com/lingesh81051/similar-template-document-matching-and-fraud-detection
An automated system for a health insurance company to streamline document processing, including template matching and fraud detection, resulting in reduction of processing time.
numpy opencv opencv-python pillow pymupdf pytesseract pytesseract-ocr python tkinter
Last synced: 06 Mar 2025
https://github.com/sdam-au/ocr
Experiments with OCR using Python.
ocr ocr-python pymupdf pytesseract tesseract tool
Last synced: 15 May 2025
https://github.com/ananthakrishnan12/resume-analyzer-using-bert
Resume Analyzer Using BERT
bert-embeddings bert-model cosine-similarity nlp-parsing pdf pdftotext pymupdf python3 spacy-nlp streamlit transformers
Last synced: 15 Mar 2025
https://github.com/esnanta/docu-query
This project is an early prototype of an AI-based chatbot designed to present information related to regulations.
aichatbot chatbot cross-encoder fastapi nltk pymupdf python sentence-transformers uvicorn yii2
Last synced: 31 Mar 2025
https://github.com/olonok69/nim_llamaindex
LlamaIndex integration with NVIDIA NIM
llamaindex nvidia-nim pymupdf pymupdf4llm python rag streamlit
Last synced: 24 Mar 2025
https://github.com/terry-li-hm/prometheus
PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude
ai-tools claude-code document-processing fastmcp mcp-server pdf-processing pdf-splitter prometheus pymupdf python text-extraction
Last synced: 03 Sep 2025
https://github.com/vaidehishyara14/ayurveda-pdf-q-a-chatbot
An intelligent chatbot that allows users to upload text-based Ayurveda PDFs and ask questions based on the content using RAG (Retrieval-Augmented Generation) combining semantic search and LLM-based responses.
embeddings fastapi fiass langchain langchain-groq llama3 llm pdf pdfprocessing pymupdf python question-answering text-splitting vector-database
Last synced: 03 Jul 2025
https://github.com/hase3b/scprag
This repository implements a Retrieval-Augmented Generation (RAG) system for the Supreme Court of Pakistan, utilizing different LLMs, embedding models, and retrieval and generation enhancement strategies. It processes SCP judgments, applies chunking, and generates legal summaries and answers based on relevant case data.
beautifulsoup4 embedding-models huggingface langchain legal-corpus llama llm mistral nlp ocr pdf2image pinecone pymupdf pytesseract regex retreival retrieval-augmented-generation selenium vectorstore
Last synced: 31 Dec 2025