Projects in Awesome Lists tagged with pdf-parser
A curated list of projects in awesome lists tagged with pdf-parser.
https://github.com/opendatalab/mineru
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 06 Jan 2026
https://github.com/opendatalab/MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 24 Mar 2025
https://github.com/py-pdf/pypdf
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
help-wanted pdf pdf-documents pdf-manipulation pdf-parser pdf-parsing pypdf2 python
Last synced: 11 Dec 2025
https://github.com/py-pdf/PyPDF2
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
help-wanted pdf pdf-documents pdf-manipulation pdf-parser pdf-parsing pypdf2 python
Last synced: 17 Aug 2025
https://github.com/mstamy2/PyPDF2
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
help-wanted pdf pdf-documents pdf-manipulation pdf-parser pdf-parsing pypdf2 python
Last synced: 02 Apr 2025
https://github.com/dromara/yft-design
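An open-source version of Gaoding Design (稿定设计) built on fabric.js. A beautiful and powerful online design tool with poster design and image editing features, suited to many scenarios such as poster generation, e-commerce product images, long-form article images, and cover editing for videos or WeChat official accounts.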
canvas-editor clipper element-plus fabric-editor fabricjs image-crop online-design online-editor pdf-editor pdf-parser poster-design psd-editor psd-parse text2path vue3-fabric
Last synced: 15 May 2025
https://github.com/adithya-s-k/marker-api
Easily deployable 🚀 API to convert PDF to markdown quickly with high accuracy.
api fastapi marker pdf-converter pdf-files pdf-parser pdf-parsing rest-api
Last synced: 16 May 2025
https://github.com/drmingler/docling-api
Easily deployable and scalable backend server that efficiently converts various document formats (pdf, docx, pptx, html, images, etc.) into Markdown. With support for both CPU and GPU processing, it is ideal for large-scale workflows and offers text/table extraction, OCR, and batch processing with sync/async endpoints.
api fastapi markdown-parser pdf-chatbot pdf-conversion pdf-converter pdf-parser pdf-parsing pdf-to-markdown
Last synced: 06 Oct 2025
https://github.com/iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
document-parser pdf-parser pdf-to-markdown text-extraction
Last synced: 13 Dec 2025
https://github.com/titipata/scipdf_parser
Python PDF parser for scientific publications: content and figures
grobid parser pdf pdf-parser python-parser scipdf-parser
Last synced: 16 May 2025
https://github.com/michelcrypt4d4mus/pdfalyzer
Analyze PDFs. With colors. And Yara.
malicious-pdf-files malware-analysis pdf pdf-documents pdf-format pdf-parser
Last synced: 21 Oct 2025
https://github.com/lazyFrogLOL/llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
chunking document-analysis llm nlp ocr pdf-parser pdfparser rag text-chunking
Last synced: 01 Apr 2025
https://github.com/ispras/dedoc
Dedoc is a library (service) for automating document parsing and bringing documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
doc document-analysis document-content-extraction documents docx docx-parser excel html html-parser logical-structure-extraction ocr odt pdf pdf-parser scanned-documents table-of-contents table-recognition txt
Last synced: 15 May 2025
https://github.com/sypht-team/sypht-python-client
A Python client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser pdf-parser python python3 python3-library receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-python-client
Last synced: 11 Jul 2025
https://github.com/sypht-team/sypht-java-client
A Java client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-retrieval information-retrieval-engine invoice invoice-parser java java8 pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-java-client
Last synced: 10 Apr 2025
https://github.com/drmingler/smart-llm-loader
smart-llm-loader is a lightweight yet powerful Python package that transforms any document into LLM-ready chunks. Spend less time on preprocessing headaches and more time building what matters. From RAG systems to chatbots to document Q&A, SmartLLMLoader handles the heavy lifting so you can focus on creating exceptional AI applications.
chatbot chunking claude gemini langchain llama-index markdown openai pdf-converter pdf-parser pdf-to-markdown rag
Last synced: 31 Jul 2025
https://github.com/genbs/poste-italiane-parser
A Python tool to parse PDF statements from Poste Italiane (Postepay, BancoPosta) and extract data as structured JSON.
bancoposta fintech pdf-parser personal-finance poste-italiane postepay
Last synced: 31 Oct 2025
https://github.com/ashutoshvarma/pyxpdf
Fast and memory-efficient Python PDF Parser based on xpdf sources
cython pdf pdf-converter pdf-parser pdfparser pdftohtml pdftopng pdftotext python xpdf xpdf-reader
Last synced: 13 Jul 2025
https://github.com/SimpleApp/PDFParser
Swift PDFParser for PDF parsing and text mining. Includes a TrueType font parser
Last synced: 21 Jul 2025
https://github.com/sypht-team/sypht-golang-client
A Golang client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields go golang golang-library golang-package invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-golang-client
Last synced: 20 Oct 2025
https://github.com/lianjiatech/bella-domify
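Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. It efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support intelligent applications such as RAG, knowledge bases, and full-text search.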
document-parser parser pdf-parser
Last synced: 04 Oct 2025
https://github.com/sylphxltd/pdf-reader-mcp
An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses pdf-parse.
ai-agent llm-tool mcp model-content-protocol nodejs pdf pdf-parse pdf-parser pdf-reader stdio typescript
Last synced: 17 Jun 2025
https://github.com/tarfin-labs/easy-pdf
PDF wrapper for Laravel
laravel pdf pdf-merge pdf-parser php tcpdf
Last synced: 17 Mar 2025
https://github.com/adrienjoly/hsbcstatementparser
Transforms PDF bank statements from HSBC into a list of operations in JSON or TSV format.
bank-statement conversion csv-export json-export pdf-converter pdf-parser tsv-format
Last synced: 13 Jul 2025
https://github.com/nlitsme/pypdfcrack
Investigation into PDF encryption
file-format pdf-encryption pdf-parser reverse-engineering
Last synced: 02 Aug 2025
https://github.com/sypht-team/sypht-node-client
A Node.js client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields invoice invoice-parser node node-module nodejs nodejs-client pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-node-client
Last synced: 13 Apr 2025
https://github.com/sypht-team/sypht-kotlin-client
A Kotlin client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser kotlin kotlin-android kotlin-library pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-kotlin-client
Last synced: 27 Jul 2025
https://github.com/vishwagauravin/pdf-parser-client-side
A lightweight, easy-to-use package for parsing text from PDF files on the client side without any server dependency.
client-side pdf pdf-parser pdf-reader pdfjs
Last synced: 08 Apr 2025
https://github.com/easonlai/chat_with_pdf_table
The contents of this repository showcase how to extract table data from a PDF file and preprocess it to facilitate word embedding. This preprocessing step enhances the readability of table data for language models and enables us to extract more contextual information from the tables.
azure-openai chroma chromadb embedding-models embedding-vectors embeddings langchain langchain-python pdf pdf-document-processor pdf-parser pdf-parsing python word-embeddings
Last synced: 25 Jun 2025
https://github.com/ashutoshvarma/libxpdf
Static library built from the source of www.xpdfreader.com with most dependencies built in
cplusplus cpp-library pdf pdf-parser pdf-viewer-component xpdf xpdf-reader
Last synced: 12 Apr 2025
https://github.com/aidayang/mineru-oneclick
An all-in-one MinerU package for install-free deployment and one-click launch
ai4science document-analysis extract-data layout-analysis markdown mineru ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser pdftojson pdftomarkdown python
Last synced: 12 Jul 2025
https://github.com/sypht-team/sypht-elixir-client
An Elixir client for the Sypht API https://sypht.com
api-client data-extraction document-capture elixir elixir-lang extract extract-data extract-fields information-retrieval information-retrieval-engine invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-api-elixir
Last synced: 13 Apr 2025
https://github.com/sypht-team/sypht-csharp-client
A C# / .NET client for the Sypht API
api-client data-extraction document-capture dot-net dotnet dotnet-cli dotnet-library extract extract-data-from-pdf extract-fields invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-csharp-client
Last synced: 13 Apr 2025
https://github.com/sidmishraw/cs-267-project
PDF parser plus Apriori and Simplicial Complex algorithm implementations
apriori-algorithm data-mining-algorithms pdf pdf-json pdf-parser text-mining
Last synced: 12 Apr 2025
https://github.com/sypht-team/sypht-ruby-client
A Ruby client for the Sypht API
api-client data-extraction document-capture extract extract-data-from-pdf extract-fields information-retrieval invoice invoice-parser pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning ruby ruby-gem ruby-library sypht sypht-api sypht-ruby-client
Last synced: 05 Mar 2025
https://github.com/j-sephb-lt-n/pdf-bank-statement-parser
Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data
bank banking document-parsing financial-analysis first-national-bank fnb pdf-parser pdf-parsing python
Last synced: 31 Aug 2025
https://github.com/cschen1205/spring-pdf-search-engine
PDF Search Engine implemented in Java and Spring Boot
elastic-search pdf-parser pdf-upload search-engine spring-boot
Last synced: 12 Oct 2025
https://github.com/aleff-github/pdf-parser-virustotal-based
PDF Parser based on VirusTotal API
pdf pdf-parser python python3 virustotal virustotal-api virustotal-parser virustotal-pdf-parser virustotal-python
Last synced: 28 Apr 2025
https://github.com/sylphlab/pdf-reader-mcp
An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses pdf-parse.
ai-agent llm-tool mcp model-content-protocol nodejs pdf pdf-parse pdf-parser pdf-reader stdio typescript
Last synced: 11 Apr 2025
https://github.com/pspdfkit/nutrient-pdf-mcp-server
A powerful Model Context Protocol server for LLM-driven PDF document analysis and exploration
ai-integration llm-tools mcp pdf pdf-parser pdf-tools python
Last synced: 05 Sep 2025
https://github.com/eli64s/pdflex
CLI for merging PDF contexts.
pdf-automation pdf-converter pdf-data-extraction pdf-document pdf-document-parser pdf-document-processor pdf-extractor pdf-generator pdf-library pdf-manipulation pdf-parser pdf-processor pdf-python pdf-regex pdf-search pdf-text-extraction pdf-tools python-pdf python-pdf-tools
Last synced: 07 Oct 2025
https://github.com/flazefy/gudangku-laravel
GudangKu helps you manage your belongings, from home supplies and food stock to furniture. Set reminders for cleaning or for restocking your home supplies. The app can also generate reports to create shopping or maintenance lists. Start organizing your inventory with GudangKu’s features. Created using Laravel.
api-testing cronjob csv-export firebase firebase-storage integration-testing laravel mailer migrations mysql pdf pdf-parser php rest-api seeding statistics swagger task-scheduler telegram-bot unit-testing
Last synced: 01 Jul 2025
https://github.com/hfrewreeft/pdf-reader-mcp
An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses pdf-parse.
ai-agent llm-tool mcp model-content-protocol nodejs pdf pdf-parse pdf-parser pdf-reader stdio typescript
Last synced: 29 Jun 2025
https://github.com/shtse8/pdf-reader-mcp
An MCP server built with Node.js/TypeScript that allows AI agents to securely read PDF files (local or URL) and extract text, metadata, or page counts. Uses pdf-parse.
ai-agent llm-tool mcp model-content-protocol nodejs pdf pdf-parse pdf-parser pdf-reader stdio typescript
Last synced: 05 Apr 2025
https://github.com/bratergit/hacktoberfest2020
Hacktoberfest 2020 - Build a desktop program that runs in the terminal and, given a PDF from Toro Investimentos with the day's brokerage notes, shows the income tax calculation for day trading the mini dollar and mini index on Bovespa.
bovespa hacktoberfest hacktoberfest2020 javascript nodejs pdf-parser toroinvestimentos
Last synced: 05 Apr 2025
https://github.com/siddhantsingh1230/snapcv
A Simple NLP Web App to create summaries of your CVs
nlp node pdf-parser react summarizer
Last synced: 12 Oct 2025
https://github.com/saviobatista/vitae
AI-powered résumé transformer: match your CV to any job and export in LaTeX PDF.
ai-resume career-tools document-processing job-applications latex openai oss pdf-parser resume-builder tailored-resume typescript vercel
Last synced: 07 Sep 2025
https://github.com/aqiftekhar/openaichatbot
A healthcare chatbot implemented using OpenAI that also receives PDF documents and images and prescribes based on a summary
nextjs openai openai-api pdf-parser react tesseract vision
Last synced: 31 Dec 2025
https://github.com/syedaliwaqar12/resume-parser
🚀 A beautiful, production-ready web app that extracts structured data from PDF resumes using AI and NLP. Built with React + TypeScript + FastAPI.
ai-resume-analysis-job-matching-ml-cv-parser automation document-processing fastapi file-upload heroku-deployment job-application-tool netlify-deployment nlp-processing pdf-parser python-api reactjs resume-parser spacy-nlp tailwind-css text-automation text-extraction typescript web-application
Last synced: 30 Dec 2025
https://github.com/luccahirae/invoice-extract-server
API for extracting data from invoices
express jest multer nodejs pdf-parser prisma
Last synced: 09 Apr 2025
https://github.com/petermosmans/apdfhelper
Fix links in PDF files, rewrite links, extract text annotations, remove pages
annotations calendar pdf pdf-converter pdf-extractor pdf-parser planner
Last synced: 16 Mar 2025
https://github.com/jasoncobra3/floorplan-dimractor
A sophisticated Python pipeline for automatically extracting dimensions and cabinet codes from architectural floorplan PDFs. This tool converts various dimension formats into standardized measurements and provides structured output with visualization capabilities.
architecture-tools automation-tools blueprint-analysis cad-automation computer-vision dimension-extraction document-processing document-processing-pipeline floorplan-analysis image-processing measurement-tools opencv pdf-parser pdf-processing pdfplumber pymupdf streamlit text-detection
Last synced: 08 Oct 2025
https://github.com/siddhantsingh1230/snapcv_backend
A Node Backend Server for SnapCV
expressjs node-nlp nodejs pdf-parser react
Last synced: 12 Oct 2025
https://github.com/jogemu/pdf2tree
Parse PDF and group elements based on enclosing lines. A node.js module that promisifies the pdf2json parser and structures the data in a way that is suitable for tables with merged cells.
data-table hierarchical-data merged-table-cells pdf-parser tree-structure
Last synced: 13 Oct 2025
https://github.com/byerlikaya/smartrag
SmartRAG is a production-ready .NET 9.0 library that provides a complete Retrieval-Augmented Generation (RAG) solution. Features include multi-provider AI support (OpenAI, Anthropic, Gemini), enterprise vector storage (Qdrant, Redis, SQLite), and intelligent document processing (PDF, Word, Text).
ai anthropic csharp document-processing document-qa dotnet enterprise-ai gemini llm machine-learning natural-language-processing openai pdf-parser qdrant rag redis retrieval-augmented-generation vector-database word-parser
Last synced: 27 Dec 2025
https://github.com/vinayaksandilya/notebook-front-end
Turn any PDF into a structured online course with modules, summaries, and key takeaways — powered by Node.js, MySQL, and AI models like GPT-4 & Claude.
ai claude course-generator education-tech fullstack gpt openai pdf-parser
Last synced: 08 Jul 2025
https://github.com/nihal-soni/summerify
AI tool for summarizing PDFs into short notes
bun framer-motion full-stack-application langcahin nextjs15 openai pdf-parser react shadncui stripe tailwind uploadthing
Last synced: 31 Aug 2025
https://github.com/patrixshah/resumescreening
Resume Screening: An AI Driven User Profile Screening Tool
chatgpt3 express jest mammoth multer nodejs openai pdf-parser typescript
Last synced: 30 Dec 2025
https://github.com/kelvinleandro/ufc-ira-calculator
Streamlit application that calculates the Academic Performance Index (Índice de Rendimento Acadêmico, IRA)
analytics dashboard data-visualization docker dockerfile github-actions google-drive-api pandas pdf-parser pdf-parsing plotly plotly-express postgres psycopg2 python python3 sql streamlit streamlit-webapp
Last synced: 16 Sep 2025
https://github.com/dills122/cardboard-crack
Web app for parsing/viewing Soccer Card Checklists
angular pdf-parser primeng soccer sports-cards
Last synced: 23 Mar 2025
https://github.com/souravupadhyay7/morvs_chat_bot
🤖 MORVS AI - An intelligent chat interface powered by Groq's LLaMA 3 model with PDF processing capabilities. Built with Next.js, React, TypeScript, and modern UI components.
ai-assistant ai-chatbot chat-interface conversational-ai cyberpunk-ui framer-motion groq nextjs pdf-parser pdf-processing real-time-chat shadcn-ui tailwindcss typescript
Last synced: 18 Aug 2025
https://github.com/sankeer28/pdf-searcher
Live website that parses multiple PDFs using PDF.js
Last synced: 27 Dec 2025
https://github.com/saniyaacharya04/resume-scanner-using-nlp
A live resume scanning and ranking tool built with Python, Streamlit, and NLP. Upload resumes, match them to job descriptions, and generate analytics dashboards and PDF reports.
dashboard job-matching nlp pdf-parser resume-scanner scikit-learn spacy streamlit transformers
Last synced: 31 Oct 2025
https://github.com/chinmaymisra/personal-finance-tracker
Upload Axis Bank statements as PDFs, automatically parse transactions, and view them cleanly in a modern UI. Handles invalid files and non-supported banks gracefully. Built using React (Vite) and FastAPI.
axis-bank bank-statement fastapi financial-application fullstack pdf-parser python react typescript vite
Last synced: 30 Dec 2025
https://github.com/sourik-10/prismai
QuickAI is a full-stack AI web application built with a modular client–server architecture. The project is primarily developed in JavaScript, with the frontend and backend kept in separate folders for better structure and scalability. It leverages modern web technologies and integrates AI-powered features to deliver intelligent interactions.
axios clerk clipdrop-api cloudinary cors dotenv expre gemini-api multer neon nodejs pdf-parser react react-router-dom tailwindcss toaster
Last synced: 30 Dec 2025
https://github.com/fayazk/document-metadata-extractor
A Python tool that uses Google's Gemini AI to automatically extract structured metadata from PDF and DOCX documents, saving results to Excel for easy analysis and organizing raw responses as JSON files.
content-indexing data-extraction document-management document-processing docx-parser excel-export gemini-ai-project generative-ai json-output metadata-extraction nlp pdf-parser python-automation text-analysis
Last synced: 01 Apr 2025
https://github.com/sypht-team/sypht-clojure-client
A Clojure client for the Sypht API
api-client clojure computer-vision data-extraction document-capture extract extract-data-from-pdf extract-fields information-extraction invoice invoice-parser machine-learning pdf-parser receipt-capture receipt-reader receipt-scanner receipt-scanning sypht sypht-api sypht-clojure-client
Last synced: 05 Sep 2025