An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with document-extraction

A curated list of projects in awesome lists tagged with document-extraction.

https://github.com/harishdeivanayagam/rowfill

Open-source platform for processing unstructured data (PDFs, images, audio files), built for knowledge workers.

document document-extraction document-parsing image-ocr langgraph llama llm nextjs ocr ocr-javascript ollama openai pdf pdfs unstructured unstructured-data vision vision-api

Last synced: 13 Apr 2025

https://github.com/alephdata/ingest-file

Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.

document-extraction documents email-forensics excel forensics forensics-investigations metadata-extraction ocr

Last synced: 07 May 2025

https://github.com/konfuzio-ai/konfuzio-sdk

Run OCR, extract information from documents, and classify them. In addition, annotate documents and build custom NLP and computer vision models tailored to your specific use cases. Find examples with code in the Tutorials section of dev.konfuzio.com and get inspiration from the Use Cases section of our blog: https://konfuzio.com/en/category/marketplace

computer-vision document-annotate document-annotation document-annotation-tool document-extraction nlp ocr python text-classification text-processing

Last synced: 08 Aug 2025

https://github.com/xyntopia/pydoxtools

Extract information from unstructured data with this library, which uses AI techniques. Compose AI components into customizable pipelines that draw on diverse sources for your projects.

chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python

Last synced: 11 May 2025

https://github.com/tammilore/ai-contract-analyzer

AI-powered contract analysis tool

ai document-extraction llms open-source

Last synced: 17 Jul 2025

https://github.com/jamesmcroft/ai-document-data-extraction-evaluation

This project demonstrates how to evaluate the use of LLMs and SLMs for extracting structured data from documents using .NET

azure document-extraction gpt llms openai phi slms

Last synced: 28 Oct 2025

https://github.com/jamesmcroft/document-data-extraction-prompt-flow-evaluation

This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents and evaluate them with Azure AI Studio and Prompt Flow

azure document-extraction evaluation gpt-4o llms openai prompt-flow

Last synced: 07 Aug 2025

https://github.com/jamesmcroft/azure-ai-document-pipeline-python-sample

Python sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.

ai-services azure container-apps document-extraction durable-functions gpt-4o openai

Last synced: 28 Oct 2025

https://github.com/dashroshan/data-extractor

Extract and download key-value pairs, tables, and paragraphs from your scanned PDF, JPG, and PNG documents as CSV files.

document-extraction form-analysis key-value-pairs ocr-python table-extraction

Last synced: 06 Apr 2025

https://github.com/jamesmcroft/azure-ai-document-pipeline-sample

.NET sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.

ai-services azure container-apps document-extraction durable-functions gpt-4o openai

Last synced: 17 Jul 2025

https://github.com/ilejuxepwaduzd/structured-data-extractor

🛠️ Extract structured data from messy texts using Chain-of-Thought prompting to improve processing of customer support and technical issues.

cdp chrome-fetcher data document-extraction ecommerce golang-library headless metadata-extraction ocr open-source pdf pdf-converter pdf-extractor ruby scraper shopify spider structured-data

Last synced: 09 Oct 2025

https://github.com/subratamondal1/document-extraction

Document extraction from PDFs and images with OpenCV.

computer-vision document-extraction image-processing opencv py python3 pytorch

Last synced: 24 Jan 2026

https://github.com/hreikin/pdf-toolbox

Extract content from PDFs and convert it, or create new documents from it, in multiple output formats.

adobe document-conversion document-converter document-creation document-creator document-extraction image-extraction pandoc pymupdf pypandoc python python3 scrapy text-extraction

Last synced: 09 Jul 2025