Projects in Awesome Lists tagged with document-processing
A curated list of projects in awesome lists tagged with document-processing.
https://github.com/zipstack/unstract
LLM-Driven Extraction of Unstructured Data — Built for API Deployments & ETL Pipeline Workflows
api-deployments data-extraction document-processing etl-pipelines open-source-data-pipeline unstructured-data-extraction
Last synced: 13 May 2026
https://github.com/run-llama/liteparse
A fast, helpful, and open-source document parser
document-ocr document-processing ocr ocr-recognition pdf pdf-parser text-extraction
Last synced: 25 May 2026
https://github.com/enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python
Last synced: 04 Apr 2025
https://github.com/enoch3712/extractthinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python
Last synced: 14 May 2025
https://github.com/eclaire-labs/eclaire
Local-first, open-source AI assistant for your data. Unify tasks, notes, docs, photos, and bookmarks. Private, self-hosted, and extensible via APIs.
ai ai-assistant automation bookmark-manager bookmarks data-extraction document-processing llm local-first note-taking ocr on-device-ai open-source personal-knowledge-management privacy rest-api self-hosted task-management web-archiving
Last synced: 16 Jan 2026
https://github.com/yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
data-extraction document-processing fast image-extraction llm markdown pdf pdf-editor pdf-generation pdf-library pdf-parser pdf-to-markdown pdf-to-text pyo3 python rag rust text-extraction
Last synced: 13 May 2026
https://github.com/dhlab-epfl/dhSegment
Generic framework for historical document processing
document-processing historical-data python3 segmentation tensorflow
Last synced: 15 Mar 2025
https://github.com/wxyhgk/retain-pdf
Translate PDFs while preserving layout, formulas, and structure; suited to scientific and technical documents
document-ai document-processing layout-preserving ocr pdf scientific-papers translation typst
Last synced: 18 May 2026
https://github.com/awslabs/project-lakechain
⚡ Cloud-native, AI-powered, document processing pipelines on AWS.
aws aws-cdk computer-vision document-processing generative-ai hacktoberfest machine-learning natural-language-processing retrieval-augmented-generation serverless
Last synced: 16 May 2025
https://github.com/bzsanti/oxidizePdf
A PDF library for Rust
crates-io data-extraction digital-signatures document-processing encryption invoice ocr pdf pdf-generation pdf-library pdf-manipulation pdf-parser pdf-reader pdfa rust rust-library table-extraction text-extraction
Last synced: 29 Apr 2026
https://github.com/awslabs/rhubarb
A Python framework for multi-modal document understanding with Amazon Bedrock
amazon-bedrock document-processing generative-ai intelligent-document-processing multi-modal
Last synced: 02 Apr 2026
https://github.com/iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction
Last synced: 10 Apr 2025
https://github.com/steindani/pandoc-include
An include filter for Pandoc
document-processing markdown pandoc pandoc-filter
Last synced: 21 Oct 2025
https://github.com/pspdfkit/nutrient-document-engine-mcp-server
A Model Context Protocol (MCP) server implementation that exposes document processing capabilities through natural language, supporting both direct human interaction and AI agent tool calling.
agentic-ai document-processing document-processor mcp-server
Last synced: 05 Sep 2025
https://github.com/superdoc-dev/docx-corpus
The largest open corpus of .docx files for document processing research
bun common-crawl corpus dataset document-processing docx machine-learning nlp typescript word-documents
Last synced: 12 Mar 2026
https://github.com/aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
document-analysis document-processing
Last synced: 06 Mar 2026
https://github.com/martin-papy/qdrant-loader
Enterprise-ready vector database toolkit for building searchable knowledge bases from multiple data sources. Supports multi-project management, automatic ingestion from Confluence/JIRA/Git, intelligent file conversion (PDF/Office/images), and semantic search. Includes MCP server for seamless AI assistant integration.
cli-tool confluence-integration cursor-ide developer-tools document-processing embbedings enterprise-ready file-conversion git-integration jira-integration knowledge-base llm-integration mcp-server multi-project openai python rag semantic-search
Last synced: 12 May 2026
https://github.com/cburschka/lyx
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily. not affiliated with lyx.org.)
document-processing latex lyx mirror
Last synced: 05 May 2025
https://github.com/ahnafnafee/local-llm-pdf-ocr
Convert scanned PDFs into searchable text locally using Vision LLMs (olmOCR). 100% private, offline, and free. Features a modern Web UI & CLI.
document-processing fastapi local-llm no-api-key ocr offline-ai olmocr pdf-ocr privacy-focused python searchable-pdf surya-ocr vision-llm web-ui
Last synced: 24 Apr 2026
https://github.com/jmanhype/dspy-multi-document-agents
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
ai distributed-systems document-processing knowledge-management nlp query-optimization vector-search
Last synced: 13 Apr 2025
https://github.com/greed2411/tokyo
tokyo, a REST API that, given any type of document 📄, identifies the mime-type 🧐, suggests an extension 🦔, and also extracts text 💪.
apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing
Last synced: 07 May 2025
https://github.com/ibm/docling-graph
Transform unstructured documents into validated, rich and queryable knowledge graphs.
ai convert docling document-processing knowledge-graph
Last synced: 26 Jan 2026
https://github.com/eklem/stopword-trainer
A module for creating stopword lists for any language, based on a set of documents.
document-processing information-retrieval nlp stopwords stopwords-removal
Last synced: 05 Jul 2025
https://github.com/thammuio/doc-genius-ai
DocGenius AI - Generative AI Chatbot for your Documents
ai-agents cloudera cloudera-machine-learning cml data-analytics document-processing enterprise-ai-solutions genai genai-chatbot llm machine-learning retrieval-augmented-generation
Last synced: 23 Jan 2026
https://github.com/run-llama/llama-cloud-ts
TypeScript SDK for OCR and document parsing in the cloud with LlamaParse
agent agents document-agent document-processing information-extraction llamaparse ocr parser
Last synced: 08 Apr 2026
https://github.com/amadeusitgroup/docs2vecs
CLI that helps with docs splitting, embedding and exposing them in a seamless manner
azure-ai chromadb cli-tool data-ingestion docker document-processing embeddings llm mongodb natural-language-processing python rag semantic-search text-embedding vector-database
Last synced: 05 Mar 2026
https://github.com/abdur75648/urdu-text-detection
Text line detection for Urdu OCR (UTRNet)
contournet document-processing ocr text-detection urdu-ocr urdu-text-detection utrnet
Last synced: 30 Apr 2025
https://github.com/centralfloridaattorney/zmongo_retriever
Use data from MongoDB in LangChain, Llama and OpenAI
data-chunking data-retrieval database document-processing langchain llamacpp machine-learning mongo mongodb openai python
Last synced: 05 Mar 2026
https://github.com/codelined-ag/extracto
Your private document brain. PDFs in, RAG out. Self-hosted. Plug everywhere.
agents bun claude docker document-processing mcp mcp-server mistral nextjs ocr ollama openrouter pdf-ocr rag self-hosted vector-database vision-models
Last synced: 10 May 2026
https://github.com/b-a-m-n/flockparser
Distributed document RAG system with intelligent GPU/CPU orchestration. Auto-discovers heterogeneous nodes, routes workloads adaptively, and achieves 60x+ speedups through VRAM-aware load balancing. Privacy-first architecture with 4 interfaces (CLI, API, MCP, Web UI). Real distributed systems engineering, not just an API wrapper.
api auto-discovery chromadb cli distributed-rag document-processing gpu-orchestration heterogeneous-computing llm load-balancing mcp ollama privacy-first python rag semantic-search vector-database vram-aware web-ui workload-orchestration
Last synced: 14 Dec 2025
https://github.com/sourish-kanna/smartaudit-llm
Autonomous multi-agent system for intelligent invoice auditing using LLaMA 3 + Mistral. Features rule-based compliance checks, role-based summaries (Legal, Managerial, Accounting), and a React + FastAPI pipeline.
document-processing fastapi invoice-agent llama3 llm mistral multi-agent react
Last synced: 06 Feb 2026
https://github.com/rushi-balapure/pdf_2_json_extractor
A high-performance Python library for extracting structured content from PDF documents with layout-aware text extraction. pdf_to_json preserves document structure including headings (H1-H6) and body text, outputting clean JSON format.
cli-tool cpu-only cross-platform data-extraction document-parsing document-processing json layout-analysis nlp offline pdf pdf-extraction pdf-parser pdf-processing pdf-to-json python python-library structure-extraction text-extraction
Last synced: 21 Apr 2026
https://github.com/r-uben/socr
Multi-engine OCR with cascading fallback, quality audit, and figure extraction
deepseek document-processing gemini nougat ocr pdf
Last synced: 24 Apr 2026
https://github.com/johnsirmon/clearcouncil
ClearCouncil: Automated tools for collecting, organizing, and embedding publicly available local state county council documents (minutes, agendas) into LLMs. Python, JS, and wget scripts included for easy data retrieval and integration.
civic-tech data-retrieval document-processing gpt langchain langchain-python local-government open-data openai retrieval-augmented-generation transparency-enhancing-technologies wget
Last synced: 26 Feb 2026
https://github.com/vakharwalad23/mark-minion
The Ultimate Web Content Extraction & Conversion Tool for AI/LLM Applications. Convert almost any web content into clean Markdown with intelligent AI processing.
ai-powered cloudflare-worker content-extraction document-processing markdown-conversion puppeteer tweets-extraction typescript web-scraping
Last synced: 18 Jun 2025
https://github.com/caltechlibrary/popstar
Phone-Oriented Processing SofTware for ARchives
archiving digitization document-processing iphone libraries scanning shortcuts-app workflow-automation
Last synced: 20 Mar 2026
https://github.com/h0neyp0t-466/pen2pdf
"📝 Pen2PDF – AI-powered web app to transform handwritten notes, slides, PDFs & images into editable Markdown ✏️ → export as polished PDFs 📄. Features drag & drop 📤, real-time editing ⚡, responsive UI 📱, and Google Gemini 🤖 integration. Perfect for students, creators & pros 🚀."
ai-app ai-text-extraction document-processing express file-converter google-gemini handwritten-notes javascript markdown-editor nodejs ocr pdf-converter pdf-to-markdown pdf-tools pen2pdf ppt-to-pdf react text-extraction vite web-app
Last synced: 15 Mar 2026
https://github.com/np-compete/pageindex
Vectorless, reasoning-based RAG using hierarchical document indexing with Vertex AI. No embeddings, no vector DB - just LLM reasoning.
ai document-processing gemini hacktoberfest hierarchical-indexing llm machine-learning nlp pdf python rag retrieval-augmented-generation vertex-ai
Last synced: 10 Mar 2026
https://github.com/aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai
This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.
aws document-processing fine-tuning huggingface idp llama multimodal qwen2-vl sagemaker sft swift
Last synced: 03 Oct 2025
https://github.com/hasnaintypes/lawbotics-v2
LawBotics v2 is an AI-powered legal contract analysis platform that combines machine learning with modern web technologies to automate legal document review and clause extraction.
ai authentication clerk convex cuad-dataset document-processing fine-tuning full-stack langchain legal-document-analyzer legal-tech monorepo nextjs shadcn tailwindcss typescript
Last synced: 17 Aug 2025
https://github.com/quarkiverse/quarkus-docling
Docling simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem
ai docling document-processing embedding quarkus quarkus-extension rag
Last synced: 01 Jan 2026
https://github.com/zircote/rlm-rs-plugin
Claude Code plugin for processing documents 100x larger than context limits using the Recursive Language Model pattern. Rust-powered chunking, hybrid semantic + BM25 search, and sub-LLM orchestration.
ai-agents bm25 chunking claude-code claude-code-plugin document-processing hybrid-search llm long-context recursive-language-model rlm rust semantic-search sqlite
Last synced: 08 Apr 2026
https://github.com/shahin-ro/table-detection
Python tool for table extraction & Persian OCR. Uses OpenCV for table detection, Tesseract for text extraction, & Pandas for data output. Visualizes cells & text. Ideal for Persian documents! 📄✨
colab computer-vision data-extraction data-visualization document-processing image-analysis image-processing machine-learning matplotlib numpy ocr opencv pandas persian-ocr persian-text python table-detection table-extraction tesseract text-recognition
Last synced: 08 Apr 2026
https://github.com/jdm-github/debahra-efficio
DEHBARA (Efficio) is a React and Express-based web application designed to streamline service requests for DTI, SSS, and other document processing needs. It simplifies the process of requesting official papers and services, integrating cloud storage for efficient data management.
cloud-database document-processing dti express government-services react sss web-application
Last synced: 14 Apr 2025
https://github.com/x1ao4/doc-merger
Merge two relatively incomplete documents into one complete document via a Python script
data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents
Last synced: 28 Jun 2025
https://github.com/cerno-ai/cerno-insight
High-performance RAG system for intelligent document Q&A with hybrid retrieval, GPU acceleration, and citation-backed answers. Upload docs, ask questions, get precise responses.
artificial-intelligence bm25 docker document-processing embeddings faiss fastapi llms local-first machine-learning natural-language-processing nextjs openai python rag rag-pipeline reranking retreival-augmented-generation semantic-search typescript
Last synced: 08 Nov 2025
https://github.com/patteg21/pigeon-evals
An end-to-end RAG pipeline that includes evaluations, iterations, and swappable components. At its core, it lets users try different embedding models and techniques.
benchmarking document-processing embeddings evaluation llm mcp nlp pipeline processing python rag retrieval-augmented-generation vector-database vector-search
Last synced: 28 Apr 2026
https://github.com/trsdn/mistraldocai-mcp
MCP (Model Context Protocol) server for document-to-Markdown conversion using Mistral AI OCR. Compatible with Claude Desktop and other MCP clients.
claude-desktop document-processing markdown mcp-server mistral-ai ocr pdf-converter python typescript
Last synced: 25 Sep 2025
https://github.com/yuvaraj3855/preocr
Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.
computer-vision document-analysis document-classification document-intelligence document-processing document-understanding file-analysis image-processing layout-analysis ocr ocr-detection opencv pdf pdf-analysis pdf-parsing preprocessing python python-library text-detection text-extraction
Last synced: 16 Feb 2026
https://github.com/natgluons/AI-docs-analyzer-API
A smart document processing system built with an open-source multimodal LLM and OCR (DocTR/TrOCR), using FastAPI, Supabase, PgVector, Azure Functions, and Neo4j to automate invoice analysis and identity document verification.
document-processing document-verification fastapi llm multimodal-large-language-models neo4j-graph ocr ocr-text-reader pgvector supabase
Last synced: 19 Jun 2025
https://github.com/mancrurod/resume-optimization
Resume-Optimization automates resume enhancement using AI by converting .docx resumes into Markdown, tailoring them to specific job descriptions, and exporting the results in HTML and PDF formats.
automation career-development document-processing gpt-integration job-matching markdown-to-html natural-langauge-processing pdf-generation python resume-optimization resume-parser solid-principles
Last synced: 08 Apr 2025
https://github.com/jromero132/pdf-merger
A Python utility for merging multiple PDFs and images into a single PDF file. This tool maintains aspect ratios, centers content on custom-sized pages (default A4), and supports recursive directory processing. Perfect for organizing documents and creating cohesive PDF compilations.
aspect-ratio command-line-tool content-center cross-platform custom-page directory-recursive document-management document-processing file-conversion file-organization image-processing image-to-pdf multi-format-support open-source pdf-merger pdf-tools productivity-tool python python-utility python3
Last synced: 03 Apr 2025
https://github.com/jadenszewczak/pcx-automation
Automation toolkit for PCX document import/export operations
automation automations cli-app cli-application cli-tool cli-tools document-database document-generation document-generator document-management document-processing internal internal-tool internal-tools pcx python python-3 python-app python3
Last synced: 09 Oct 2025
https://github.com/thoth2357/watermark-removal
A program that helps remove watermarks from a PDF document
document-processing watermarking
Last synced: 10 Oct 2025
https://github.com/jromero132/pdf-splitter
PDF Splitter is a Python tool that takes a multi-page PDF file and splits it into individual PDF files, one for each page of the original document.
aspect-ratio command-line-tool content-center cross-platform custom-page document-management document-processing file-conversion file-organization image-processing image-to-pdf multi-format-support open-source pdf-merger pdf-splitter pdf-tools productivity-tool python python-utility python3
Last synced: 14 Oct 2025
https://github.com/bneweling/neuronode
🧠 Neuronode - Enterprise-grade Knowledge Management System with LiteLLM, Neo4j, and Vector Search. AI-powered document processing, intelligent relationship discovery, and advanced query orchestration.
ai document-processing enterprise knowledge-management litellm llm neo4j python typescript vector-search
Last synced: 02 May 2026
https://github.com/renamed-to/renamed-sdk
Multi-language SDKs (TypeScript, Python, Go, Java, C#, Ruby, Rust, Swift, PHP) for AI-powered document processing
ai csharp document-processing file-renaming-automation go java ocr pdf pdf-splitter php python ruby rust swift typescript
Last synced: 28 Jan 2026
https://github.com/jztan/pdf-mcp
Production-ready MCP server for PDF processing with intelligent caching. Extract text, search, and analyze PDFs with AI agents.
agentic-ai ai claude codex-cli copilot document-processing llm mcp model-context-protocol opencode pdf python
Last synced: 14 Mar 2026
https://github.com/acsenrafilho/cucaracha
A bureaucratic cockroach (cucaracha) assistant to help with document processing and analysis
document-analysis document-classification document-processing optical-character-recognition python3
Last synced: 28 Oct 2025
https://github.com/bjornmelin/pdfusion
A lightweight Python utility for effortlessly merging multiple PDF files into a single document.
automation batch-processing cli command-line-tool document-management document-processing file-management pdf pdf-manipulation pdf-merger pdf-tools pypdf2 python python-library utilities
Last synced: 27 Mar 2025
https://github.com/baughmann/tikara
The metadata and text content extractor for almost every file type.
apache-tika content-extraction document-parsing document-processing docx image-to-text java language-detection llm metadata metadata-extraction ml natural-language-processing ocr pdf-to-text retrieval-augmented-generation text-extraction text-mining
Last synced: 16 Feb 2026
https://github.com/ma3u/neo4j-agentframework
🚀 Hybrid RAG: Local Neo4j + BitNet.cpp RAG System and Azure SaaS deployment. Fast vector search, instant Docker deployment via GitHub Container Registry. Complete RAG pipeline with ultra-efficient LLMs for enterprise knowledge management.
ai-agents azure-openai bitnet cypher docker document-processing enterprise-ai github-container-registry graph-database hybrid-search knowledge-graph llm-inference machine-learning neo4j python quantization rag semantic-search vector-embeddings zero-build-time
Last synced: 29 Apr 2026
https://github.com/saviobatista/vitae
AI-powered résumé transformer: match your CV to any job and export in LaTeX PDF.
ai-resume career-tools document-processing job-applications latex openai oss pdf-parser resume-builder tailored-resume typescript vercel
Last synced: 08 May 2026
https://github.com/divyamohan1993/async-document-processing-workflow
Production-grade async document processing workflow system. Next.js 14 + FastAPI + Celery + Redis Pub/Sub + PostgreSQL. Real-time SSE progress, JWT auth, Docker Compose deployment.
async celery docker document-processing fastapi full-stack nextjs postgresql python react real-time redis tailwindcss typescript workflow
Last synced: 13 Apr 2026
https://github.com/adhikaritusharaaa/document_cleaning_cli
A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR. 🚀
cli-tool computer-vision deep-learning denoising document-processing image-cleaning image-processing ocr pytesseract python scanned-documents
Last synced: 17 Jun 2025
https://github.com/guiss-guiss/scriptumai
RAG Application ScriptumAI is an advanced Retrieval-Augmented Generation platform designed for document ingestion, semantic search, and query processing.
ai document-ingestion document-processing file-upload flask language-model llama llm machine-learning multi-language nlp offline ollama pdf-processing private python rag retrieval-augmented-generation semantic-search text-analysis
Last synced: 13 Apr 2026
https://github.com/jasoncobra3/floorplan-dimractor
A sophisticated Python pipeline for automatically extracting dimensions and cabinet codes from architectural floorplan PDFs. This tool converts various dimension formats into standardized measurements and provides structured output with visualization capabilities.
architecture-tools automation-tools blueprint-analysis cad-automation computer-vision dimension-extraction document-processing document-processing-pipeline floorplan-analysis image-processing measurement-tools opencv pdf-parser pdf-processing pdfplumber pymupdf streamlit text-detection
Last synced: 18 Apr 2026
https://github.com/e-candeloro/credem_hack_2025
AI-powered document processing pipeline for Credem Hackathon 2025. Leverages Google Cloud AI services to intelligently extract, classify, and process HR documents through a robust ETL pipeline.
ai document-processing googlecloudplatform hackathon llm prompt-engineering python
Last synced: 15 May 2026
https://github.com/theogyeezy/rag-multi-agent-template
RAG enabled multi agent template using CrewAI and WatsonxAI. Supports ChromaDB, FAISS, Pinecone with document processing for PDF/DOCX/TXT. Includes legal, technical, and customer support examples.
agent ai crewai document-processing knowledge-base langchain multi-agent multiagent multiagenttemplate nlp python rag ragtemplate template vector-database watsonx watsonxai
Last synced: 08 May 2026
https://github.com/agxp/docpulse
Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema, get back structured JSON with per-field confidence scores. Go, PostgreSQL, GPT
async document-extraction document-processing go gpt-4o json-schema llm multi-tenant ocr openai pdf postgresql rest-api structured-data tesseract worker
Last synced: 13 Mar 2026
https://github.com/bylickilabs/pdfanalyzer
PDF Analyzer is an efficient Python tool for automatically analyzing PDF documents.
automation cli document-analysis document-processing file-analyzer file-inspector metadata open-source pdf pdf-analysis pdf-extraction python reporting streamlit text-mining
Last synced: 27 Apr 2026
https://github.com/nattapolch/work-order-pdf-extractor
AI-powered Work Order PDF Extractor with OpenAI GPT-4 Vision integration for automated text extraction and file organization
ai automation document-processing gui ocr openai pdf-processing python tkinter work-orders
Last synced: 19 Jun 2025
https://github.com/vericle/intellyweave
AI-powered platform for OSINT intelligence analysis. Features archive discovery with hypothesis-driven investigation, GLiNER entity extraction, Mapbox geospatial visualization, network analysis, and document processing. Built with FastAPI, Next.js, Weaviate, and DSPy.
ai-agents archival-research document-processing dspy entity-extraction fastapi geospatial-visualization gliner historical-research intelligence-analysis knowledge-graph mapbox network-analysis nextjs osint python rag typescript vector-database weaviate
Last synced: 31 Jan 2026
https://github.com/joshuaevan/page-forge
Convert multi-page PDFs into clean, OCR-ready images — CLI, watch folder, web UI, and optional Google Drive sync.
cli docker document-processing fastapi google-drive image-processing ocr ocr-preprocessing pdf pdf-converter pdf-to-image pillow pymupdf python self-hosted
Last synced: 28 Apr 2026
https://github.com/byerlikaya/smartrag
SmartRAG is a production-ready .NET 9.0 library that provides a complete Retrieval-Augmented Generation (RAG) solution. Features include multi-provider AI support (OpenAI, Anthropic, Gemini), enterprise vector storage (Qdrant, Redis, SQLite), and intelligent document processing (PDF, Word, Text).
ai anthropic csharp document-processing document-qa dotnet enterprise-ai gemini llm machine-learning natural-language-processing openai pdf-parser qdrant rag redis retrieval-augmented-generation vector-database word-parser
Last synced: 05 Feb 2026
https://github.com/debugger404/rag-powered-gpt-4-chatbot
🚀 Revolutionize your data interaction with a cutting-edge chatbot built on Retrieval-Augmented Generation (RAG) and OpenAI’s GPT-4. Upload documents, create custom knowledge bases, and get precise, contextual answers. Ideal for research, business operations, customer support, and more!
ai-chatbot ai-powered-chatbot azure-openai business-chatbot custom-knowledge-base customer-support-chatbot document-chatbot document-processing gpt-4 knowledge-management knowledge-retrieval machine-learning-chatbot natural-language-processing openai pdf-search rag research-chatbot retrieval-augmented-generation semantic-search vector-database
Last synced: 16 May 2026
https://github.com/zircote/rlm-rs
Rust CLI implementing the Recursive Language Model (RLM) pattern for Claude Code. Process documents 100x larger than context windows through intelligent chunking, SQLite persistence, and recursive sub-LLM orchestration.
ai-tools chunking claude claude-code cli command-line context-window devtools document-processing llm mit-license mmap rayon recursive-language-model rlm rust rust-2024 semantic-chunking sqlite text-processing
Last synced: 08 Feb 2026
https://github.com/arthrod/html-to-docx
Convert HTML to DOCX with structure and formatting preserved.
conversion document-processing docx html office ooxml word
Last synced: 26 May 2026
https://github.com/artemzarubin/xmldocumentprocessor
XmlDocumentProcessor: A .NET component for XML document processing. It analyzes XML content, performs keyword-based queries, and transforms data into HTML. Emphasizes design patterns like Strategy pattern, with a focus on class diagramming. Implements penalty for non-compliance.
c-sharp document-processing dotnet xml xml-processing
Last synced: 21 Apr 2026
https://github.com/sdpdas/document_annotate_tool
Adds annotation to each element in document and defines what it is.
document-processing python python-docx xml
Last synced: 05 Oct 2025
https://github.com/syncfusion/document-sdk-blazor-demos
Explore the Syncfusion Blazor demos featuring our advanced PDF, Word, Excel, and PowerPoint document processing libraries.
blazor document-processing excel pdf powerpoint word
Last synced: 17 May 2026
https://github.com/alexfikl/uvt-scholarly
Generate citation lists and other documents for WUT (West University of Timișoara) hiring, accreditation, habilitation, etc.
document-processing uefiscdi uvt
Last synced: 25 Feb 2026
https://github.com/aget-framework/template-document-processor-aget
Production-ready template for creating document processing agents with LLM pipelines, security protocols, and multi-provider support
aget-framework document-processing llm python template
Last synced: 03 May 2026
https://github.com/u9401066/asset-aware-mcp
Asset-Aware MCP Server — AI Agent precisely accesses tables, figures, sections from PDFs + .docx round-trip editing (DFM) with 46 tools / 13 resources, segmentation export, layout overlay, OCR preprocessing, knowledge graph (LightRAG)
ai document-processing docx etl fastmcp knowledge-graph layout-analysis lightrag llm mcp mcp-server medical ocr pdf python rag segmentation
Last synced: 13 May 2026
https://github.com/PSPDFKit-labs/nutrient-agent-skill
Universal Agent Skill for document processing with Nutrient DWS API — works with Claude Code, Codex, Gemini CLI, Cursor, and 35+ more agents
agent-skills ai-agents claude-code codex cursor document-processing gemini-cli mcp nutrient pdf
Last synced: 07 Mar 2026
https://github.com/ceratops-code/pdf-form-tools
Template-aware Python helpers for filling layout-sensitive scanned PDF forms
document-processing forms opencv pdf pymupdf python
Last synced: 16 Apr 2026
https://github.com/credeed/credeed-pdf-to-markdown
Convert PDF to Markdown using AI so that agents can understand documents.
ai-agent credit-profile document-processing financial-report pdf-to-markdown risk-assessment
Last synced: 27 Mar 2026
https://github.com/maaarcooo/claude-skills
A collection of Claude Agentic Skills for PDF extraction, Anki flashcard generation, and revision notes creation
anki-flashcards claude-ai claude-skills document-processing markdown-converter pdf-extraction python revision-notes study-tools
Last synced: 06 Mar 2026
https://github.com/aidalinfo/extract-kit
Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.
ai-sdk document-processing pdf pdf-extraction vision-llm
Last synced: 15 Aug 2025
https://github.com/ojspace/md-anything
Local-first CLI + MCP server: Any file → AI-ready Markdown + JSON
ai-agents claude cursor document-processing markdown mcp pdf-conversion rag vscode
Last synced: 01 Apr 2026
https://github.com/nsourlos/ocr_and_rag
Tests of OCR and RAG with LLMs
cohere colpali document-processing gemini information-retrieval mistral ocr openai pdf-text-extraction qwen2-vl rag
Last synced: 06 May 2026
https://github.com/chernistry/shafi
Evidence-first legal QA system — 300+ DIFC documents, 16 retrieval experiments, 87.5% rejected after ablation. LangGraph + Qdrant + FastAPI. Agentic RAG Challenge 2026.
agentic-rag ai-agents competition document-processing evaluation fastapi langchain langgraph legal-ai legal-tech llm nlp openai python qdrant question-answering rag retrieval-augmented-generation vector-search
Last synced: 02 Apr 2026
https://github.com/syedaliwaqar12/resume-parser
🚀 A beautiful, production-ready web app that extracts structured data from PDF resumes using AI and NLP. Built with React + TypeScript + FastAPI.
ai-resume-analysis-job-matching-ml-cv-parser automation document-processing fastapi file-upload heroku-deployment job-application-tool netlify-deployment nlp-processing pdf-parser python-api reactjs resume-parser spacy-nlp tailwind-css text-automation text-extraction typescript web-application
Last synced: 02 Apr 2026
https://github.com/objones25/document-scanner-summarizer
📄 AI-powered document scanner and summarizer with OCR, supporting images, PDFs, DOCX, and web pages. Features interactive CLI with streaming responses from Claude, GPT, or Gemini.
ai anthropic cli document-processing nlp ocr openai pdf python tesseract
Last synced: 20 Apr 2026
https://github.com/terry-li-hm/prometheus
PDF Liberation MCP Server - Break large PDFs into digestible chunks for Claude
ai-tools claude-code document-processing fastmcp mcp-server pdf-processing pdf-splitter prometheus pymupdf python text-extraction
Last synced: 18 May 2026
https://github.com/sapientpants/doctrans
Privacy-first document translation powered by local AI (Ollama). Upload PDF, Word, and OpenDocument files, extract text via vision models, and translate — all processing stays on your device.
ai document-processing docx elixir local-ai ocr ollama pdf phoenix phoenix-liveview privacy rag semantic-search tailwindcss translation
Last synced: 22 Apr 2026
https://github.com/roberto-a-cardenas/intellidoc-engine
Serverless OCR pipeline on AWS using Lambda, API Gateway, S3, and Textract. Accepts base64 PDFs and returns extracted text via API. Built with Terraform.
api-gateway aws aws-lambda cloud-engineering document-processing ocr s3 serverless terraform textract
Last synced: 22 Apr 2026
https://github.com/valido-app/valido-app.github.io
Official website and download page for Valido - Professional PDF validation and data extraction tool for Windows
automation data-extraction-from-pdf document-processing document-validation pdf pdf-parser pdf-processing pdf-tools
Last synced: 23 Apr 2026
https://github.com/josh-janse/pdf-to-markdown-extractor
Convert PDF documents to clean markdown using Google's Gemini API.
ai document-processing gemini-api markdown nodejs pdf text-extraction text-extraction-from-pdf
Last synced: 07 May 2026
https://github.com/samay-jain/voice_assistant_rag_system_using_langchain_and_streamlit
Voice Assistant RAG System using LangChain, Whisper, and Streamlit - A voice-enabled assistant that lets you ask questions by speaking, processes your custom documents, and responds with natural speech. Built with LangChain, Ollama, Whisper, ElevenLabs, and Streamlit.
ai-assistant document-processing elevenlabs faiss langchain llm ollama python rag retrieval-augmented-generation speech-recognition streamlit text-to-speech voice-assistant whisper
Last synced: 10 Apr 2026