An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with document-extraction

A curated list of projects in awesome lists tagged with document-extraction.

https://github.com/harishdeivanayagam/rowfill

Open-source platform for processing unstructured data (PDFs, images, audio files), built for knowledge workers

document document-extraction document-parsing image-ocr langgraph llama llm nextjs ocr ocr-javascript ollama openai pdf pdfs unstructured unstructured-data vision vision-api

Last synced: 13 Apr 2025

https://github.com/alephdata/ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

document-extraction documents email-forensics excel forensics forensics-investigations metadata-extraction ocr

Last synced: 07 May 2025

https://github.com/konfuzio-ai/konfuzio-sdk

Run OCR, extract information from documents, and classify them. In addition, annotate documents and build custom NLP and computer vision models tailored to your specific use cases. Find examples with code in the Tutorials section of dev.konfuzio.com and get inspiration from the Use Cases section of our blog: https://konfuzio.com/en/category/marketplace

computer-vision document-annotate document-annotation document-annotation-tool document-extraction nlp ocr python text-classification text-processing

Last synced: 08 Aug 2025

https://github.com/xyntopia/pydoxtools

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python

Last synced: 11 May 2025

https://github.com/kreuzberg-dev/kreuzberg-cloud

Cloud-native document extraction platform — SaaS at kreuzberg.dev or self-host on any Kubernetes cluster. 90+ formats, REST API, webhooks. Built on Kreuzberg.

api axum busl cloud-native document-extraction document-processing helm kreuzberg kubernetes microservices nats nextjs ocr pdf postgresql rust saas self-hosted text-extraction

Last synced: 05 Jun 2026

https://github.com/aws-samples/sample-badgers

Guidance on deploying generative AI document analysis with Amazon Bedrock AgentCore. Auto-classifies, enhances, and aggregates multi-type documents using Gestalt-informed vision prompts. Custom analyzer creation wizard. Scripted CDK deployment. Gradio frontend included.

agentcore agentcore-sdk agentic-ai agentic-workflow amazon-nova badgers cdk claude composable-prompts document-extraction document-intelligence document-vision full-text-extraction gestalt prompt-engineering strands-agent-sdk strands-agents vision-models

Last synced: 14 Apr 2026

https://github.com/tammilore/ai-contract-analyzer

AI-powered contract analysis tool

ai document-extraction llms open-source

Last synced: 17 Jul 2025

https://github.com/jamesmcroft/ai-document-data-extraction-evaluation

This project demonstrates how to evaluate the use of LLMs and SLMs for extracting structured data from documents using .NET

azure document-extraction gpt llms openai phi slms

Last synced: 28 Oct 2025

https://github.com/jamesmcroft/document-data-extraction-prompt-flow-evaluation

This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents and evaluate them with Azure AI Studio and Prompt Flow

azure document-extraction evaluation gpt-4o llms openai prompt-flow

Last synced: 07 Aug 2025

https://github.com/jamesmcroft/azure-ai-document-pipeline-python-sample

Python sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.

ai-services azure container-apps document-extraction durable-functions gpt-4o openai

Last synced: 28 Oct 2025

https://github.com/dashroshan/data-extractor

Extract and download key-value pairs, tables, and paragraphs from your scanned PDF, JPG, and PNG documents as CSV files.

document-extraction form-analysis key-value-pairs ocr-python table-extraction

Last synced: 06 Apr 2025

https://github.com/ilejuxepwaduzd/structured-data-extractor

🛠️ Extract structured data from messy texts using Chain-of-Thought prompting to improve processing of customer support and technical issues.

cdp chrome-fetcher data document-extraction ecommerce golang-library headless metadata-extraction ocr open-source pdf pdf-converter pdf-extractor ruby scraper shopify spider structured-data

Last synced: 10 Apr 2026

https://github.com/jamesmcroft/azure-ai-document-pipeline-sample

.NET sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.

ai-services azure container-apps document-extraction durable-functions gpt-4o openai

Last synced: 10 May 2026

https://github.com/subratamondal1/document-extraction

Document extraction from PDFs and images with OpenCV.

computer-vision document-extraction image-processing opencv py python3 pytorch

Last synced: 24 Jan 2026

https://github.com/hreikin/pdf-toolbox

Extract content from PDFs and convert or create new documents from the content in multiple output formats.

adobe document-conversion document-converter document-creation document-creator document-extraction image-extraction pandoc pymupdf pypandoc python python3 scrapy text-extraction

Last synced: 09 Jul 2025

https://github.com/agxp/docpulse

Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema, get back structured JSON with per-field confidence scores. Go, PostgreSQL, GPT

async document-extraction document-processing go gpt-4o json-schema llm multi-tenant ocr openai pdf postgresql rest-api structured-data tesseract worker

Last synced: 13 Mar 2026