Projects in Awesome Lists tagged with document-processing
A curated list of projects in awesome lists tagged with document-processing.
https://github.com/enoch3712/extractthinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python
Last synced: 14 May 2025
https://github.com/dhlab-epfl/dhSegment
Generic framework for historical document processing
document-processing historical-data python3 segmentation tensorflow
Last synced: 15 Mar 2025
https://github.com/awslabs/project-lakechain
⚡ Cloud-native, AI-powered document processing pipelines on AWS.
aws aws-cdk computer-vision document-processing generative-ai hacktoberfest machine-learning natural-language-processing retrieval-augmented-generation serverless
Last synced: 16 May 2025
https://github.com/iamarunbrahma/pdf-to-markdown
Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.
document-conversion document-processing information-retrieval pdf-converter pdf-extraction pdf-parsing pdf-to-markdown python rag retrieval-augmented-generation text-extraction
Last synced: 10 Apr 2025
https://github.com/steindani/pandoc-include
An include filter for Pandoc
document-processing markdown pandoc pandoc-filter
Last synced: 30 Mar 2025
https://github.com/aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
document-analysis document-processing
Last synced: 15 Apr 2025
https://github.com/cburschka/lyx
Unofficial mirror of git://git.lyx.org/lyx.git (updates daily; not affiliated with lyx.org).
document-processing latex lyx mirror
Last synced: 05 May 2025
https://github.com/jmanhype/dspy-multi-document-agents
An advanced distributed knowledge fabric for intelligent document processing, featuring multi-document agents, optimized query handling, and semantic understanding.
ai distributed-systems document-processing knowledge-management nlp query-optimization vector-search
Last synced: 13 Apr 2025
https://github.com/greed2411/tokyo
tokyo is a REST API that, given any type of document 📄, identifies its MIME type 🧐, suggests an extension 🦔, and also extracts text 💪.
apache-tika clojure document-processing extension extract-text filetype mime-types ring text-extraction text-parser text-parsing
Last synced: 07 May 2025
https://github.com/eklem/stopword-trainer
A module for creating stopword lists for any language, based on a set of documents.
document-processing information-retrieval nlp stopwords stopwords-removal
Last synced: 14 Apr 2025
https://github.com/abdur75648/urdu-text-detection
Text line detection for Urdu OCR (UTRNet)
contournet document-processing ocr text-detection urdu-ocr urdu-text-detection utrnet
Last synced: 30 Apr 2025
https://github.com/centralfloridaattorney/zmongo_retriever
Use data from MongoDB in LangChain, Llama and OpenAI
data-chunking data-retrieval database document-processing langchain llamacpp machine-learning mongo mongodb openai python
Last synced: 10 Feb 2025
https://github.com/caltechlibrary/popstar
Phone-Oriented Processing SofTware for ARchives
archiving digitization document-processing iphone libraries scanning shortcuts-app workflow-automation
Last synced: 13 Apr 2025
https://github.com/aws-samples/sample-for-multi-modal-document-to-json-with-sagemaker-ai
This open-source project delivers a complete pipeline for converting multi-page documents (PDFs/images) into structured JSON using Vision LLMs on Amazon SageMaker. The solution leverages the SWIFT Framework to fine-tune models specifically for document understanding tasks.
aws document-processing fine-tuning huggingface idp llama multimodal qwen2-vl sagemaker sft swift
Last synced: 23 Mar 2025
https://github.com/baughmann/tikara
The metadata and text content extractor for almost every file type.
apache-tika content-extraction document-parsing document-processing docx image-to-text java language-detection llm metadata metadata-extraction ml natural-language-processing ocr pdf-to-text retrieval-augmented-generation text-extraction text-mining
Last synced: 13 May 2025
https://github.com/bjornmelin/pdfusion
A lightweight Python utility for effortlessly merging multiple PDF files into a single document.
automation batch-processing cli command-line-tool document-management document-processing file-management pdf pdf-manipulation pdf-merger pdf-tools pypdf2 python python-library utilities
Last synced: 27 Mar 2025
https://github.com/x1ao4/doc-merger
Merge two relatively incomplete documents into one complete document via a Python script.
data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents
Last synced: 20 Feb 2025
https://github.com/thoth2357/watermark-removal
A program that helps remove watermarks from a PDF document.
document-processing watermarking
Last synced: 25 Feb 2025
https://github.com/mancrurod/resume-optimization
Resume-Optimization automates resume enhancement using AI by converting .docx resumes into Markdown, tailoring them to specific job descriptions, and exporting the results in HTML and PDF formats.
automation career-development document-processing gpt-integration job-matching markdown-to-html natural-langauge-processing pdf-generation python resume-optimization resume-parser solid-principles
Last synced: 08 Apr 2025
https://github.com/jdm-github/debahra-efficio
DEHBARA (Efficio) is a React and Express-based web application designed to streamline service requests for DTI, SSS, and other document processing needs. It simplifies the process of requesting official papers and services, integrating cloud storage for efficient data management.
cloud-database document-processing dti express government-services react sss web-application
Last synced: 14 Apr 2025
https://github.com/jromero132/pdf-merger
A Python utility for merging multiple PDFs and images into a single PDF file. This tool maintains aspect ratios, centers content on custom-sized pages (default A4), and supports recursive directory processing. Perfect for organizing documents and creating cohesive PDF compilations.
aspect-ratio command-line-tool content-center cross-platform custom-page directory-recursive document-management document-processing file-conversion file-organization image-processing image-to-pdf multi-format-support open-source pdf-merger pdf-tools productivity-tool python python-utility python3
Last synced: 03 Apr 2025
https://github.com/jromero132/pdf-splitter
PDF Splitter is a Python tool that takes a multi-page PDF file and splits it into individual PDF files, one for each page of the original document.
aspect-ratio command-line-tool content-center cross-platform custom-page document-management document-processing file-conversion file-organization image-processing image-to-pdf multi-format-support open-source pdf-merger pdf-splitter pdf-tools productivity-tool python python-utility python3
Last synced: 03 Apr 2025
https://github.com/debugger404/rag-powered-gpt-4-chatbot
🚀 Revolutionize your data interaction with a cutting-edge chatbot built on Retrieval-Augmented Generation (RAG) and OpenAI’s GPT-4. Upload documents, create custom knowledge bases, and get precise, contextual answers. Ideal for research, business operations, customer support, and more!
ai-chatbot ai-powered-chatbot azure-openai business-chatbot custom-knowledge-base customer-support-chatbot document-chatbot document-processing gpt-4 knowledge-management knowledge-retrieval machine-learning-chatbot natural-language-processing openai pdf-search rag research-chatbot retrieval-augmented-generation semantic-search vector-database
Last synced: 08 Apr 2025
https://github.com/sdpdas/document_annotate_tool
Adds an annotation to each element in a document, identifying what it is.
document-processing python python-docx xml
Last synced: 01 Apr 2025
https://github.com/kazkozdev/researchify
🔬 Scientific chatbot that instantly searches arXiv.org papers, transforming an ocean of preprints into clear research insights. Powered by local LLMs from Ollama.
academic-tools api artificial-intelligence arxiv chatbot document-analysis document-processing llm machine-learning nlp nlp-machine-learning ollama paper-search rag research-assistant research-tools scientific scientific-computing scientific-papers
Last synced: 05 Apr 2025
https://github.com/artemzarubin/xmldocumentprocessor
XmlDocumentProcessor: A .NET component for XML document processing. It analyzes XML content, performs keyword-based queries, and transforms data into HTML. Emphasizes design patterns like Strategy pattern, with a focus on class diagramming. Implements penalty for non-compliance.
c-sharp document-processing dotnet xml xml-processing
Last synced: 06 Mar 2025
https://github.com/node0/timbermill
OCR-powered chat session renderer that slices long conversations into paginated, searchable PDFs
chat-archive chatgpt cv2 document-processing llm-tools ocr pdf-generation python
Last synced: 17 Apr 2025
https://github.com/oeo/processor-rs
High-performance document processing pipeline in Rust. Extracts text, performs OCR, and optimizes images from PDFs and other document formats with parallel processing and memory efficiency.
document-processing image-optimization parallel-processing rust tesseract-ocr text-extraction
Last synced: 15 Mar 2025
https://github.com/adhikaritusharaaa/document_cleaning_cli
A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR. 🚀
cli-tool computer-vision deep-learning denoising document-processing image-cleaning image-processing ocr pytesseract python scanned-documents
Last synced: 04 Mar 2025
https://github.com/fayazk/document-metadata-extractor
A Python tool that uses Google's Gemini AI to automatically extract structured metadata from PDF and DOCX documents, saving results to Excel for easy analysis and organizing raw responses as JSON files.
content-indexing data-extraction document-management document-processing docx-parser excel-export gemini-ai-project generative-ai json-output metadata-extraction nlp pdf-parser python-automation text-analysis
Last synced: 01 Apr 2025
https://github.com/maemresen/mae-ghostscript
mae-ghostscript is a Docker-based tool for compressing PDF files efficiently using Ghostscript. This containerized solution simplifies the process of PDF compression, providing a consistent environment that works across different platforms. Users can run the container by mounting their local directories and specifying the PDF to compress.
bash-scripting containerized-application docker document-processing ghostscript pdf-compression
Last synced: 01 Mar 2025
https://github.com/terilios/file-upload-embeddings
Enterprise-grade document intelligence platform leveraging vector embeddings and LLMs for advanced document processing, semantic search, and information retrieval.
artificial-intelligence docker document-processing enterprise-software fastapi machine-learning natural-language-processing python semantic-search vector-embeddings
Last synced: 16 Mar 2025
https://github.com/jcaperella29/document_cleaning_cli
A deep learning-based pipeline for cleaning scanned document images. Automatically removes noise, enhances text clarity, and optimizes images for OCR. 🚀
cli-tool computer-vision deep-learning denoising document-processing image-cleaning image-processing ocr pytesseract python scanned-documents
Last synced: 02 Mar 2025
https://github.com/zyrolasting/dynamic-xml
Apply keyword procedures in a given Racket namespace using X-expressions.
document-processing racket xml
Last synced: 17 Mar 2025
https://github.com/guiss-guiss/scriptumai
RAG Application ScriptumAI is an advanced Retrieval-Augmented Generation platform designed for document ingestion, semantic search, and query processing.
ai document-ingestion document-processing file-upload flask language-model llama llm machine-learning multi-language nlp offline ollama pdf-processing private python rag retrieval-augmented-generation semantic-search text-analysis
Last synced: 28 Mar 2025