Projects in Awesome Lists tagged with document-intelligence
A curated list of projects in awesome lists tagged with document-intelligence.
https://github.com/paddlepaddle/paddlenlp
Easy-to-use and powerful LLM and SLM library with awesome model zoo.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Last synced: 09 Sep 2025
https://github.com/PaddlePaddle/PaddleNLP
Easy-to-use and powerful NLP and LLM library with an awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including Text Classification, Neural Search, Question Answering, Information Extraction, Document Intelligence, Sentiment Analysis, etc.
bert compression distributed-training document-intelligence embedding ernie information-extraction llama llm neural-search nlp paddlenlp pretrained-models question-answering search-engine semantic-analysis sentiment-analysis transformers uie
Last synced: 18 Mar 2025
https://github.com/kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 50+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm
Last synced: 07 Feb 2026
https://github.com/Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction
Last synced: 21 Oct 2025
https://github.com/alibabaresearch/advancedliteratemachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 28 Mar 2025
https://github.com/AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 10 Apr 2025
https://github.com/enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python
Last synced: 04 Apr 2025
https://github.com/enoch3712/extractthinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python
Last synced: 14 May 2025
https://github.com/azure/ai-in-a-box
AI-in-a-Box leverages the expertise of Microsoft across the globe to develop and provide AI and ML solutions to the technical community. Our intent is to present a curated collection of solution accelerators that can help engineers establish their AI/ML environments and solutions rapidly and with minimal friction.
ai azd azd-templates azure chat-bot chatbot chatgpt custom-vision document-intelligence edge-ai edge-computing langchain machine-learning openai semantic-kernel
Last synced: 14 Apr 2025
https://github.com/Azure/AI-in-a-Box
AI-in-a-Box leverages the expertise of Microsoft across the globe to develop and provide AI and ML solutions to the technical community. Our intent is to present a curated collection of solution accelerators that can help engineers establish their AI/ML environments and solutions rapidly and with minimal friction.
ai azd azd-templates azure chat-bot chatbot chatgpt custom-vision document-intelligence edge-ai edge-computing langchain machine-learning openai semantic-kernel
Last synced: 25 Mar 2025
https://github.com/shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
ai contract-analysis data-extraction document-intelligence docx docx2md docx2txt generative-ai legaltech llm llm-extraction llm-framework llm-pipeline llms nlp prompt-engineering text-analysis unstructured-data
Last synced: 13 May 2025
https://github.com/doc-analysis/readingbank
ReadingBank: A Benchmark Dataset for Reading Order Detection
document-ai document-intelligence document-understanding natural-language-processing nlp ocr
Last synced: 04 Jan 2026
https://github.com/jamesmcroft/azure-document-intelligence-markdown-to-openai-data-extraction-sample
This sample demonstrates how to use Document Intelligence's Layout model to convert a PDF document, such as invoices, into Markdown, then use GPT-3.5 Turbo to extract structured JSON data using the Azure OpenAI Service.
azure document-intelligence gpt openai
Last synced: 26 Oct 2025
https://github.com/jamesmcroft/document-intelligence-user-feedback-processor
An experiment to provide the capabilities of Azure AI Document Intelligence Studio template training for a feedback loop.
ai azure document-intelligence mlops
Last synced: 14 Jun 2025
https://github.com/msusazureaccelerators/southreusableassets
IP and use case assets for CSU
azure-ai-search azure-ai-services azure-ml azure-openai csharp-code csharp-docs document-intelligence json-files knowledge-graph mlflow openai-assistant-api openai-chat-api prompt-flow python-code python-docs semantic-kernel yaml-files
Last synced: 06 Feb 2026
https://github.com/yuvaraj3855/preocr
Fast document classification and OCR detection. Analyzes any file type to determine if OCR is needed, saving time and money on unnecessary processing.
computer-vision document-analysis document-classification document-intelligence document-processing document-understanding file-analysis image-processing layout-analysis ocr ocr-detection opencv pdf pdf-analysis pdf-parsing preprocessing python python-library text-detection text-extraction
Last synced: 06 Feb 2026
https://github.com/ks6088ts-labs/azure-ai-services-solutions
A collection of solutions that leverage Azure AI services.
azure azure-ai-services azure-event-grid azure-functions azure-storage cosmosdb document-intelligence fastapi langchain langgraph openai poetry python streamlit typer
Last synced: 04 Jan 2026
https://github.com/joinalahmed/invoiceparsingwithaoai
Using Azure Document Intelligence and Azure OpenAI services to automatically extract data from invoices.
aoai azure document-intelligence invoice-parser
Last synced: 20 Aug 2025
https://github.com/jasjeev013/neuroquery-chroma-rag
NeuroQuery is an AI-powered PDF question-answering system that lets you upload and interact with documents using natural language. Built with LangChain, Gemini AI, and Chroma, it delivers fast, context-aware answers from your files.
ai chromadb document-intelligence gemini langchain multi-pdf-processing nlp pdf-analysis-python pdf-question-answering streamlit vector-search
Last synced: 01 Aug 2025
https://github.com/msaleh1888/azure-serverless-invoice-extraction
Serverless invoice extraction API using Azure Document Intelligence and Azure Functions. Upload a PDF invoice and receive normalized JSON output including line items, totals, dates, and vendor details.
ai-engineering architecture azure-ai azure-document-intelligence azure-functions backend cloud-engineering cloud-functions document-intelligence form-recognizer http-trigger invoice-processing microservice ocr pdf-processing pdf-to-json python rest-api serverless serverless-architecture
Last synced: 13 Jan 2026