Projects in Awesome Lists tagged with document-extraction
A curated list of projects in awesome lists tagged with document-extraction.
https://github.com/harishdeivanayagam/rowfill
Open-source platform for processing unstructured data (PDFs, images, audio files), built for knowledge workers
document document-extraction document-parsing image-ocr langgraph llama llm nextjs ocr ocr-javascript ollama openai pdf pdfs unstructured unstructured-data vision vision-api
Last synced: 13 Apr 2025
https://github.com/alephdata/ingest-file
Ingestors extract the contents of mixed unstructured documents into structured (followthemoney) data.
document-extraction documents email-forensics excel forensics forensics-investigations metadata-extraction ocr
Last synced: 07 May 2025
https://github.com/konfuzio-ai/konfuzio-sdk
Run OCR, extract information from documents, and classify them. In addition, annotate documents and build custom NLP and computer vision models tailored to your specific use cases. Find examples with code in the Tutorials section of dev.konfuzio.com and get inspiration from the Use Cases section of our blog: https://konfuzio.com/en/category/marketplace
computer-vision document-annotate document-annotation document-annotation-tool document-extraction nlp ocr python text-classification text-processing
Last synced: 08 Aug 2025
https://github.com/xyntopia/pydoxtools
Effortlessly extract information from unstructured data with this library, which uses advanced AI techniques. Compose AI components into customizable pipelines over diverse sources for your projects.
chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python
Last synced: 11 May 2025
https://github.com/kreuzberg-dev/kreuzberg-cloud
Cloud-native document extraction platform — SaaS at kreuzberg.dev or self-host on any Kubernetes cluster. 90+ formats, REST API, webhooks. Built on Kreuzberg.
api axum busl cloud-native document-extraction document-processing helm kreuzberg kubernetes microservices nats nextjs ocr pdf postgresql rust saas self-hosted text-extraction
Last synced: 05 Jun 2026
https://github.com/aws-samples/sample-badgers
Guidance on deploying generative AI document analysis with Amazon Bedrock AgentCore. Auto-classifies, enhances, and aggregates multi-type documents using Gestalt-informed vision prompts. Custom analyzer creation wizard. Scripted CDK deployment. Gradio frontend included.
agentcore agentcore-sdk agentic-ai agentic-workflow amazon-nova badgers cdk claude composable-prompts document-extraction document-intelligence document-vision full-text-extraction gestalt prompt-engineering strands-agent-sdk strands-agents vision-models
Last synced: 14 Apr 2026
https://github.com/tammilore/ai-contract-analyzer
AI-powered contract analysis tool
ai document-extraction llms open-source
Last synced: 17 Jul 2025
https://github.com/jamesmcroft/ai-document-data-extraction-evaluation
This project demonstrates how to evaluate the use of LLMs and SLMs for extracting structured data from documents using .NET
azure document-extraction gpt llms openai phi slms
Last synced: 28 Oct 2025
https://github.com/jamesmcroft/document-data-extraction-prompt-flow-evaluation
This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents and evaluate them with Azure AI Studio and Prompt Flow
azure document-extraction evaluation gpt-4o llms openai prompt-flow
Last synced: 07 Aug 2025
https://github.com/jamesmcroft/azure-ai-document-pipeline-python-sample
Python sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.
ai-services azure container-apps document-extraction durable-functions gpt-4o openai
Last synced: 28 Oct 2025
https://github.com/dashroshan/data-extractor
Extract and download key-value pairs, tables, and paragraphs from your scanned PDF, JPG, and PNG documents as CSV files.
document-extraction form-analysis key-value-pairs ocr-python table-extraction
Last synced: 06 Apr 2025
https://github.com/ilejuxepwaduzd/structured-data-extractor
🛠️ Extract structured data from messy texts using Chain-of-Thought prompting to improve processing of customer support and technical issues.
cdp chrome-fetcher data document-extraction ecommerce golang-library headless metadata-extraction ocr open-source pdf pdf-converter pdf-extractor ruby scraper shopify spider structured-data
Last synced: 10 Apr 2026
https://github.com/jamesmcroft/azure-ai-document-pipeline-sample
.NET sample project for building a scalable document data extraction pipeline with containerized Durable Functions and Azure AI Services on Azure Container Apps.
ai-services azure container-apps document-extraction durable-functions gpt-4o openai
Last synced: 10 May 2026
https://github.com/subratamondal1/document-extraction
Document extraction from PDFs and images with OpenCV.
computer-vision document-extraction image-processing opencv py python3 pytorch
Last synced: 24 Jan 2026
https://github.com/sensible-hq/tutorial-pdf-to-excel
Converts a PDF file to Excel.
document-extraction excel extraction pdf python
Last synced: 03 Apr 2025
https://github.com/hreikin/pdf-toolbox
Extract content from PDFs and convert it, or create new documents from it, in multiple output formats.
adobe document-conversion document-converter document-creation document-creator document-extraction image-extraction pandoc pymupdf pypandoc python python3 scrapy text-extraction
Last synced: 09 Jul 2025
https://github.com/pmthetechguy/document-entity-extractor
AI-powered document extractor for names, emails, and organizations.
ai automation data-extraction document-extraction entity-recognition fastapi gpt openai pandas portfolio-project python uvicorn web-app
Last synced: 16 Apr 2026
https://github.com/agxp/docpulse
Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema, get back structured JSON with per-field confidence scores. Go, PostgreSQL, GPT
async document-extraction document-processing go gpt-4o json-schema llm multi-tenant ocr openai pdf postgresql rest-api structured-data tesseract worker
Last synced: 13 Mar 2026