An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with document-processing

A curated list of projects in awesome lists tagged with document-processing.

https://github.com/enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python

Last synced: 04 Apr 2025

https://github.com/dhlab-epfl/dhSegment

Generic framework for historical document processing

document-processing historical-data python3 segmentation tensorflow

Last synced: 15 Mar 2025

https://github.com/iamarunbrahma/pdf-to-markdown

Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction

Last synced: 10 Apr 2025

https://github.com/pspdfkit/nutrient-document-engine-mcp-server

A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.

agentic-ai document-processing document-processor mcp-server

Last synced: 05 Sep 2025

https://github.com/aws-solutions/enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

document-analysis document-processing

Last synced: 17 Jul 2025

https://github.com/cburschka/lyx

Unofficial mirror of git://git.lyx.org/lyx.git (updated daily; not affiliated with lyx.org).

document-processing latex lyx mirror

Last synced: 05 May 2025

https://github.com/jmanhype/dspy-multi-document-agents

An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.

ai distributed-systems document-processing knowledge-management nlp query-optimization vector-search

Last synced: 13 Apr 2025

https://github.com/martin-papy/qdrant-loader

Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.

cli-tool confluence-integration cursor-ide developer-tools document-processing embbedings enterprise-ready file-conversion git-integration jira-integration knowledge-base llm-integration mcp-server multi-project openai python rag semantic-search

Last synced: 16 Dec 2025

https://github.com/greed2411/tokyo

tokyo is a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.

apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing

Last synced: 07 May 2025

https://github.com/eklem/stopword-trainer

A module for creating stopword lists for any language, based on a set of documents.

document-processing information-retrieval nlp stopwords stopwords-removal

Last synced: 05 Jul 2025

https://github.com/b-a-m-n/flockparser

Distributed document RAG system with intelligent GPU/CPU orchestration. Auto-discovers heterogeneous nodes, routes workloads adaptively, and achieves 60x+ speedups through VRAM-aware load balancing. Privacy-first architecture with 4 interfaces (CLI, API, MCP, Web UI). Real distributed systems engineering, not just an API wrapper.

api auto-discovery chromadb cli distributed-rag document-processing gpu-orchestration heterogeneous-computing llm load-balancing mcp ollama privacy-first python rag semantic-search vector-database vram-aware web-ui workload-orchestration

Last synced: 14 Dec 2025

https://github.com/vakharwalad23/mark-minion

The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.

ai-powered cloudflare-worker content-extraction document-processing markdown-conversion puppeteer tweets-extraction typescript web-scraping

Last synced: 18 Jun 2025

https://github.com/h0neyp0t-466/pen2pdf

📝 Pen2PDF: AI-powered web app to transform handwritten notes, slides, PDFs & images into editable Markdown ✏️ → export as polished PDFs 📄. Features drag & drop 📤, real-time editing ⚡, responsive UI 📱, and Google Gemini 🤖 integration. Perfect for students, creators & pros 🚀.

ai-app ai-text-extraction document-processing express file-converter google-gemini handwritten-notes javascript markdown-editor nodejs ocr pdf-converter pdf-to-markdown pdf-tools pen2pdf ppt-to-pdf react text-extraction vite web-app

Last synced: 25 Sep 2025

https://github.com/aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai

This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.

aws document-processing fine-tuning huggingface idp llama multimodal qwen2-vl sagemaker sft swift

Last synced: 03 Oct 2025

https://github.com/saviobatista/vitae

AI-powered résumé transformer: match your CV to any job and export in LaTeX PDF.

ai-resume career-tools document-processing job-applications latex openai oss pdf-parser resume-builder tailored-resume typescript vercel

Last synced: 07 Sep 2025

https://github.com/jromero132/pdf-merger

A Python utility for merging multiple PDFs and images into a single PDF file. This tool maintains aspect ratios, centers content on custom-sized pages (default A4), and supports recursive directory processing. Perfect for organizing documents and creating cohesive PDF compilations.

aspect-ratio command-line-tool content-center cross-platform custom-page directory-recursive document-management document-processing file-conversion file-organization image-processing image-to-pdf multi-format-support open-source pdf-merger pdf-tools productivity-tool python python-utility python3

Last synced: 03 Apr 2025

https://github.com/patteg21/pigeon-evals

An end-to-end RAG pipeline that includes evaluations, iterations, and swappable components. At its core, it lets users try different embedding models and techniques.

benchmarking document-processing embeddings evaluation llm mcp nlp pipeline processing python rag retrieval-augmented-generation vector-database vector-search

Last synced: 17 Sep 2025

https://github.com/hasnaintypes/lawbotics-v2

LawBotics v2 is an AI-powered legal contract analysis platform that combines machine learning with modern web technologies to automate legal document review and clause extraction.

ai authentication clerk convex cuad-dataset document-processing fine-tuning full-stack langchain legal-document-analyzer legal-tech monorepo nextjs shadcn tailwindcss typescript

Last synced: 17 Aug 2025

https://github.com/x1ao4/doc-merger

Merge two relatively incomplete documents into one complete document via a Python script.

data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents

Last synced: 28 Jun 2025

https://github.com/natgluons/AI-docs-analyzer-API

A smart document processing system built with an open-source multimodal LLM and OCR (DocTR/TrOCR), using FastAPI, Supabase, PgVector, Azure Functions, and Neo4j to automate invoice analysis and identity document verification.

document-processing document-verification fastapi llm multimodal-large-language-models neo4j-graph ocr ocr-text-reader pgvector supabase

Last synced: 19 Jun 2025

https://github.com/trsdn/mistraldocai-mcp

MCP (Model Context Protocol) server for document-to-Markdown conversion using Mistral AI OCR. Compatible with Claude Desktop and other MCP clients.

claude-desktop document-processing markdown mcp-server mistral-ai ocr pdf-converter python typescript

Last synced: 25 Sep 2025

https://github.com/mancrurod/resume-optimization

Resume-Optimization automates resume enhancement using AI by converting .docx resumes into Markdown, tailoring them to specific job descriptions, and exporting the results in HTML and PDF formats.

automation career-development document-processing gpt-integration job-matching markdown-to-html natural-langauge-processing pdf-generation python resume-optimization resume-parser solid-principles

Last synced: 08 Apr 2025

https://github.com/ahnafnafee/local-llm-pdf-ocr

Convert scanned PDFs into searchable text locally using Vision LLMs (olmOCR). 100% private, offline, and free. Features a modern Web UI & CLI.

document-processing fastapi local-llm no-api-key ocr offline-ai olmocr pdf-ocr privacy-focused python searchable-pdf surya-ocr vision-llm web-ui

Last synced: 25 Dec 2025

https://github.com/cerno-ai/cerno-insight

High-performance RAG system for intelligent document Q&A with hybrid retrieval, GPU acceleration, and citation-backed answers. Upload docs, ask questions, get precise responses.

artificial-intelligence bm25 docker document-processing embeddings faiss fastapi llms local-first machine-learning natural-language-processing nextjs openai python rag rag-pipeline reranking retreival-augmented-generation semantic-search typescript

Last synced: 08 Nov 2025

https://github.com/thoth2357/watermark-removal

A program that helps remove watermarks from PDF documents.

document-processing watermarking

Last synced: 10 Oct 2025

https://github.com/jdm-github/debahra-efficio

DEHBARA (Efficio) is a React and Express-based web application designed to streamline service requests for DTI, SSS, and other document processing needs. It simplifies the process of requesting official papers and services, integrating cloud storage for efficient data management.

cloud-database document-processing dti express government-services react sss web-application

Last synced: 14 Apr 2025

https://github.com/quarkiverse/quarkus-docling

Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem

ai docling document-processing embedding quarkus quarkus-extension rag

Last synced: 11 Jul 2025

https://github.com/acsenrafilho/cucaracha

A bureaucratic cockroach (cucaracha) assistant that helps with document processing and analysis.

document-analysis document-classification document-processing optical-character-recognition python3

Last synced: 28 Oct 2025

https://github.com/roberto-a-cardenas/intellidoc-engine

Serverless OCR pipeline on AWS using Lambda, API Gateway, S3, and Textract. Accepts base64 PDFs and returns extracted text via API. Built with Terraform.

api-gateway aws aws-lambda cloud-engineering document-processing ocr s3 serverless terraform textract

Last synced: 01 Jul 2025

https://github.com/node0/timbermill

OCR-powered chat session renderer that slices long conversations into paginated, searchable PDFs

chat-archive chatgpt cv2 document-processing llm-tools ocr pdf-generation python

Last synced: 17 Apr 2025

https://github.com/guiss-guiss/scriptumai

ScriptumAI is an advanced Retrieval-Augmented Generation (RAG) platform designed for document ingestion, semantic search, and query processing.

ai document-ingestion document-processing file-upload flask language-model llama llm machine-learning multi-language nlp offline ollama pdf-processing private python rag retrieval-augmented-generation semantic-search text-analysis

Last synced: 28 Mar 2025

https://github.com/oeo/processor-rs

High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.

document-processing image-optimization parallel-processing rust tesseract-ocr text-extraction

Last synced: 10 Jun 2025

https://github.com/terry-li-hm/prometheus

PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude

ai-tools claude-code document-processing fastmcp mcp-server pdf-processing pdf-splitter prometheus pymupdf python text-extraction

Last synced: 03 Sep 2025

https://github.com/resetnetwork/n8n-nodes

A collection of custom n8n nodes for enhanced document processing, text splitting, and embeddings generation

ai document-processing embeddings langchain monorepo n8n n8n-community-nodes text-splitting typescript

Last synced: 11 Jun 2025

https://github.com/debugger404/rag-powered-gpt-4-chatbot

🚀 Revolutionize your data interaction with a cutting-edge chatbot built on Retrieval-Augmented Generation (RAG) and OpenAI’s GPT-4. Upload documents, create custom knowledge bases, and get precise, contextual answers. Ideal for research, business operations, customer support, and more!

ai-chatbot ai-powered-chatbot azure-openai business-chatbot custom-knowledge-base customer-support-chatbot document-chatbot document-processing gpt-4 knowledge-management knowledge-retrieval machine-learning-chatbot natural-language-processing openai pdf-search rag research-chatbot retrieval-augmented-generation semantic-search vector-database

Last synced: 07 Aug 2025

https://github.com/jcaperella29/document_cleaning_cli

A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR. 🚀

cli-tool computer-vision deep-learning denoising document-processing image-cleaning image-processing ocr pytesseract python scanned-documents

Last synced: 27 Nov 2025

https://github.com/fayazk/document-metadata-extractor

A Python tool that uses Google's Gemini AI to automatically extract structured metadata from PDF and DOCX documents, saving results to Excel for easy analysis and organizing raw responses as JSON files.

content-indexing data-extraction document-management document-processing docx-parser excel-export gemini-ai-project generative-ai json-output metadata-extraction nlp pdf-parser python-automation text-analysis

Last synced: 01 Apr 2025

https://github.com/samay-jain/voice_assistant_rag_system_using_langchain_and_streamlit

Voice Assistant RAG System using LangChain, Whisper, and Streamlit - A voice-enabled assistant that lets you ask questions by speaking, processes your custom documents, and responds with natural speech. Built with LangChain, Ollama, Whisper, ElevenLabs, and Streamlit.

ai-assistant document-processing elevenlabs faiss langchain llm ollama python rag retrieval-augmented-generation speech-recognition streamlit text-to-speech voice-assistant whisper

Last synced: 23 Jul 2025

https://github.com/maemresen/mae-ghostscript

mae-ghostscript is a Docker-based tool for compressing PDF files efficiently using Ghostscript. This containerized solution simplifies the process of PDF compression, providing a consistent environment that works across different platforms. Users can run the container by mounting their local directories and specifying the PDF to compress.

bash-scripting containerized-application docker document-processing ghostscript pdf-compression

Last synced: 01 Mar 2025

https://github.com/terilios/file-upload-embeddings

Enterprise-grade document intelligence platform leveraging vector embeddings and LLMs for advanced document processing, semantic search, and information retrieval.

artificial-intelligence docker document-processing enterprise-software fastapi machine-learning natural-language-processing python semantic-search vector-embeddings

Last synced: 16 Mar 2025

https://github.com/jasoncobra3/floorplan-dimractor

A sophisticated Python pipeline for automatically extracting dimensions and cabinet codes from architectural floorplan PDFs. This tool converts various dimension formats into standardized measurements and provides structured output with visualization capabilities.

architecture-tools automation-tools blueprint-analysis cad-automation computer-vision dimension-extraction document-processing document-processing-pipeline floorplan-analysis image-processing measurement-tools opencv pdf-parser pdf-processing pdfplumber pymupdf streamlit text-detection

Last synced: 08 Oct 2025

https://github.com/theogyeezy/rag-multi-agent-template

RAG-enabled multi-agent template using CrewAI and WatsonxAI. Supports ChromaDB, FAISS, and Pinecone, with document processing for PDF/DOCX/TXT. Includes legal, technical, and customer support examples.

agent ai crewai document-processing knowledge-base langchain multi-agent multiagent multiagenttemplate nlp python rag ragtemplate template vector-database watsonx watsonxai

Last synced: 10 Oct 2025

https://github.com/josh-janse/pdf-to-markdown-extractor

Convert PDF documents to clean markdown using Google's Gemini API.

ai document-processing gemini-api markdown nodejs pdf text-extraction text-extraction-from-pdf

Last synced: 18 Jun 2025

https://github.com/sdpdas/document_annotate_tool

Adds an annotation to each element in a document and identifies what it is.

document-processing python python-docx xml

Last synced: 05 Oct 2025

https://github.com/kaptinka/-gigtakaful-ai-insurance

Advanced AI fraud detection for Takaful motor insurance claims. Automate analysis of police reports and estimates with OCR and real-time analytics. 🚀💻

ai document-processing fraud-detection insurance machine-learning mongodb nlp ocr python real-time-analytics streamlit takaful xgboost

Last synced: 28 Jun 2025

https://github.com/kazkozdev/researchify

🔬 Scientific chatbot that instantly searches arXiv.org papers, transforming an ocean of preprints into clear research insights. Powered by local LLMs from Ollama.

academic-tools api artificial-intelligence arxiv chatbot document-analysis document-processing llm machine-learning nlp nlp-machine-learning ollama paper-search rag research-assistant research-tools scientific scientific-computing scientific-papers

Last synced: 05 Apr 2025

https://github.com/msaleme/mulesoft-idp-projects

🤖 MuleSoft Intelligent Document Processing (IDP) Projects - Automated document processing workflows with Salesforce & NetSuite integrations. Features purchase order processing, driver license extraction, and comprehensive error handling. Built with MuleSoft 4.6.2 & API-led connectivity patterns.

ai-ml api-led-connectivity automation dataweave document-processing enterprise-integration idp intelligent-document-processing java maven mulesoft netsuite salesforce

Last synced: 29 Jul 2025

https://github.com/syncfusion/document-sdk-blazor-demos

Explore the Syncfusion Blazor demos featuring our advanced PDF, Word, Excel, and PowerPoint document processing libraries.

blazor document-processing excel pdf powerpoint word

Last synced: 24 Sep 2025

https://github.com/bneweling/neuronode

🧠 Neuronode - Enterprise-grade Knowledge Management System with LiteLLM, Neo4j, and Vector Search. AI-powered document processing, intelligent relationship discovery, and advanced query orchestration.

ai document-processing enterprise knowledge-management litellm llm neo4j python typescript vector-search

Last synced: 23 Oct 2025

https://github.com/e-candeloro/credem_hack_2025

AI-powered document processing pipeline for Credem Hackathon 2025. Leverages Google Cloud AI services to intelligently extract, classify, and process HR documents through a robust ETL pipeline.

ai document-processing googlecloudplatform hackathon llm prompt-engineering python

Last synced: 04 Aug 2025

https://github.com/haasonsaas/gpt-oss-agent

Privacy-first AI agent system using local GPT-OSS models. Combines intelligent knowledge management with natural language file operations.

ai document-processing file-management gpt-oss knowledge-base local-ai natural-language ollama privacy python rag semantic-search

Last synced: 06 Aug 2025

https://github.com/chayannfamali/autohr

AI platform for recruitment automation that integrates artificial intelligence to analyze résumés and evaluate candidates. The system automatically processes documents in PDF/DOCX formats, extracts skills and experience, and then matches them against job requirements.

ai bootstrap django django-framework django-project document-processing hr-system machine-learning nlp python recruitment transformers

Last synced: 10 Aug 2025

https://github.com/aidalinfo/extract-kit

Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

ai-sdk document-processing pdf pdf-extraction vision-llm

Last synced: 15 Aug 2025

https://github.com/ma3u/neo4j-agentframework

🤖 Advanced AI Agent Framework for Neo4j Knowledge Graphs - Build intelligent agents that understand, analyze, and interact with graph databases through natural language. 417x faster than basic implementations.

ai-agents cypher docker document-processing graph-database knowledge-graph machine-learning neo4j python rag semantic-search vector-embeddings

Last synced: 05 Oct 2025

https://github.com/byerlikaya/smartrag

SmartRAG is a production-ready .NET 9.0 library that provides a complete Retrieval-Augmented Generation (RAG) solution. Features include multi-provider AI support (OpenAI, Anthropic, Gemini), enterprise vector storage (Qdrant, Redis, SQLite), and intelligent document processing (PDF, Word, Text).

ai anthropic csharp document-processing document-qa dotnet enterprise-ai gemini llm machine-learning natural-language-processing openai pdf-parser qdrant rag redis retrieval-augmented-generation vector-database word-parser

Last synced: 12 Dec 2025

https://github.com/nattapolch/work-order-pdf-extractor

AI-powered Work Order PDF Extractor with OpenAI GPT-4 Vision integration for automated text extraction and file organization

ai automation document-processing gui ocr openai pdf-processing python tkinter work-orders

Last synced: 19 Jun 2025

https://github.com/artemzarubin/xmldocumentprocessor

XmlDocumentProcessor: A .NET component for XML document processing. It analyzes XML content, performs keyword-based queries, and transforms data into HTML. Emphasizes design patterns such as the Strategy pattern, with a focus on class diagramming. Implements penalty for non-compliance.

c-sharp document-processing dotnet xml xml-processing

Last synced: 06 Mar 2025

https://github.com/zyrolasting/dynamic-xml

Apply keyword procedures in a given Racket namespace using X-expressions.

document-processing racket xml

Last synced: 17 Jun 2025

https://github.com/adhikaritusharaaa/document_cleaning_cli

A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR. 🚀

cli-tool computer-vision deep-learning denoising document-processing image-cleaning image-processing ocr pytesseract python scanned-documents

Last synced: 17 Jun 2025