An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with document-analysis

A curated list of projects in awesome lists tagged with document-analysis.

https://github.com/opendatalab/mineru

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.

ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python

Last synced: 06 Jan 2026

https://github.com/opendatalab/MinerU

A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.

ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python

Last synced: 24 Mar 2025

https://github.com/Yuliang-Liu/Curve-Text-Detector

This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning document-analysis object-detection scene-text

Last synced: 02 Apr 2025

https://github.com/yuliang-liu/curve-text-detector

This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning document-analysis object-detection scene-text

Last synced: 04 Apr 2025

https://github.com/wenwenyu/PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

document-analysis document-understanding graph-convolutional-network graph-learning graph-neural-networks key-information-extraction

Last synced: 28 Apr 2025

https://github.com/lazyFrogLOL/llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

chunking document-analysis llm nlp ocr pdf-parser pdfparser rag text-chunking

Last synced: 01 Apr 2025

https://github.com/ispras/dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

doc document-analysis document-content-extraction documents docx docx-parser excel html html-parser logical-structure-extraction ocr odt pdf pdf-parser scanned-documents table-of-contents table-recognition txt

Last synced: 15 May 2025

https://github.com/mirabdullahyaser/retrieval-augmented-generation-engine-with-langchain-and-streamlit

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

artificial-intelligence chat-application document-analysis generative-ai gpt-3 langchain large-language-models natural-language-processing openai-chatgpt question-answering retrieval-augmented-generation streamlit

Last synced: 17 Aug 2025

https://github.com/xyntopia/pydoxtools

Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.

chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python

Last synced: 11 May 2025

https://github.com/zeninglin/vibertgrid-pytorch

An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"

document-ai document-analysis information-extraction key-information-extraction visual-information-extraction

Last synced: 07 Oct 2025

https://github.com/aws-solutions/enhanced-document-understanding-on-aws

Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.

document-analysis document-processing

Last synced: 17 Jul 2025

https://github.com/microsoft/synthetic-rag-index

Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.

azure document-analysis few-shot-learning large-language-model llm rag retrieval-augmented-generation serverless

Last synced: 20 Jun 2025

https://github.com/muhd-umer/pyramidtabnet

Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents

computer-vision deep-learning document-analysis implementation pytorch table-detection table-structure-recognition

Last synced: 13 Oct 2025

https://github.com/ad-freiburg/pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.

c-plus-plus cli document-analysis metadata-extraction pdf text-extraction

Last synced: 16 May 2025

https://github.com/omni-us/research-contentdistillation-htr

Source code for ICFHR20 "Distilling Content from Style for Handwritten Word Recognition"

document-analysis generative-adversarial-network handwriting-recognition

Last synced: 23 Apr 2025

https://github.com/miku/grobidclient

A Go (golang) client for GROBID.

cli document-analysis golang grobid

Last synced: 11 Apr 2025

https://github.com/arsath-eng/rag1-nvidia-genai

A powerful Retrieval Augmented Generation (RAG) application built with NVIDIA AI endpoints and Streamlit. This solution enables intelligent document analysis and question-answering using state-of-the-art language models, featuring multi-PDF processing, FAISS vector store integration, and advanced prompt engineering.

document-analysis embeddings faiss langchain llama-models llm nvidia-ai-faundry pdf-processing question-answering rag streamlit vector-store

Last synced: 27 Oct 2025

https://github.com/bx0-0/cybervisionai

Cyber Vision AI is an award-winning, open-source AI assistant for cybersecurity, document analysis, and knowledge management. Built with advanced RAG, MindMap, and multi-agent AI, it empowers security professionals and researchers with unrestricted, ethical, and insightful tools.

ai chatbot cybersecurity django document-analysis gpt graduation-project llama llm markmap mindmap nlp ollama python rag speech-to-text streamlit text-to-speech x

Last synced: 29 Jun 2025

https://github.com/ksm26/dr-x-nlp-pipeline

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.

chromadb document-analysis llm local-llm modular-ai multilingual-ai-model nlp offlineai ollama opensource-ai rag textsummarization

Last synced: 23 Apr 2025

https://github.com/x1ao4/doc-merger

Merge two relatively incomplete documents into one complete document via a Python script.

data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents

Last synced: 28 Jun 2025

https://github.com/acsenrafilho/cucaracha

A bureaucratic cockroach (cucaracha) assistant to help with document processing and analysis

document-analysis document-classification document-processing optical-character-recognition python3

Last synced: 28 Oct 2025

https://github.com/diocrafts/ai-book-summarizer

📚 AI-Powered Book PDF Knowledge Extractor & Summarizer. Transform your PDF books into structured knowledge effortlessly! This tool leverages AI to analyze books page by page, extracting key insights, definitions, and concepts, and organizes them into Markdown summaries for easier study

ai ai-powered-tools automation book-summary document-analysis educational-tools knowledge-extraction machine-learning markdown natural-language-processing openai pdf pdf-processing pdf-summarization pymupdf python study-materials text-analysis text-summarization

Last synced: 14 Jul 2025

https://github.com/rk-vashista/pitch

A modern web application that analyzes pitch decks using multi-agent AI technology. Upload your pitch deck and get comprehensive feedback on structure, content, and potential improvements!

ai ai-feedback crewai document-analysis document-analysis-tool fastapi langchain multi-agent nlp pitch-deck pitch-deck-analyzer pitch-evaluation python startup-tools websockets

Last synced: 03 Jul 2025

https://github.com/leg0shii/smart-documents

A web application that enables users to upload documents and utilize AI techniques like semantic search and text summarization for efficient analysis. Built with Python, FastAPI, Svelte, PostgreSQL, and LangChain.

ai document-analysis fastapi langchain semantic-search

Last synced: 26 Oct 2025

https://github.com/techycsr/ai-powered-document-insight-tool

AI-powered document analysis platform specializing in resume processing.

document-analysis gemmini resume-analysis typescript-application

Last synced: 22 Oct 2025

https://github.com/alinababer/data-science-and-insight-agent-rag-llama3-lava-llm-django-api

Data-Science-and-Insight-Agent-RAG-LLama3-Lava-LLM-Django-WebApplication is an advanced AI-driven chatbot designed to assist in data science, document analysis, and image interpretation. This repository contains the Django-based REST APIs of this project.

chatbot django document-analysis image-analysis large-language-models lava llama python redis-server rest-api retrival-augmented-generation visual-large-language-models

Last synced: 09 Nov 2025

https://github.com/coditheck/imgext

Image extraction from documents.

document-analysis image-extractor python

Last synced: 25 Mar 2025

https://github.com/kazkozdev/researchify

🔬 Scientific chatbot that instantly searches arXiv.org papers, transforming an ocean of preprints into clear research insights. Powered by local LLMs from Ollama.

academic-tools api artificial-intelligence arxiv chatbot document-analysis document-processing llm machine-learning nlp nlp-machine-learning ollama paper-search rag research-assistant research-tools scientific scientific-computing scientific-papers

Last synced: 05 Apr 2025

https://github.com/alinababer/document-analysis-identification-with-rag-vector-database-and-mistral-llm

This Document Analysis pipeline is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning & AI models like RAG (Retrieval Augmented Generation) & Mistral LLM to efficiently extract, match, enrich, process document

document-analysis document-analysis-recognition document-pipeline document-uploader llm mistral paddleocr python rag tesseract

Last synced: 03 Apr 2025

https://github.com/edummorenolp/mindmanagerproject-ia

Intelligent software project management system with generative AI. Full-stack platform for automatic document analysis, generation of technical studies, and project lifecycle management using React + Node.js + PostgreSQL + Google Gemini.

ai-powered artificial-intelligence document-analysis generative-ai github-pages google-gemini javascript llm-integration project-management project-planning reactjs software-development software-engineering vite workflow-automation

Last synced: 24 Dec 2025

https://github.com/ttwjoe/dr-x-nlp-pipeline

A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.

chromadb document-analysis llm local-llm modular-ai multilingual-ai-model nlp offlineai ollama opensource-ai rag textsummarization

Last synced: 30 Apr 2025

https://github.com/uzairsayyed-005/docuchat-ai

DocuChat-AI is an AI-powered document interaction assistant that transforms static PDFs into conversational partners. It leverages Retrieval-Augmented Generation (RAG), history-aware memory, and advanced NLP to enable natural language Q&A, contextual dialogue, and secure local document processing.

ai conversational-ai document-analysis fiass groq history-aware huggingface langchain local-processing pdf-processing rag streamlit

Last synced: 28 Mar 2025

https://github.com/giarcheuli/docparser

DocParser v2.0 - Project-Aware Document Analysis Tool with AI Integration

ai cli document-analysis llama2 markdown project-management python replicate

Last synced: 14 Oct 2025

https://github.com/jltk/briefgeist

Privacy-first desktop app for scanning, understanding and replying to letters.

automation document-analysis local-llm ocr python tesseract

Last synced: 27 Jun 2025

https://github.com/edummorenolp/projectmanagermind-ia

Intelligent software project management system with generative AI. Full-stack platform for automatic document analysis, generation of technical studies, and project lifecycle management using React + Node.js + PostgreSQL + Google Gemini.

ai-powered artificial-intelligence document-analysis generative-ai github-pages google-gemini javascript llm-integration project-management project-planning reactjs software-development software-engineering vite workflow-automation

Last synced: 14 Oct 2025

https://github.com/veydantkatyal/doc-analysis

Automatically extracts, summarizes, and analyzes PDF documents using Large Language Models (LLMs). It generates relevant questions and answers based on the document content for smarter understanding.

document-analysis huggingface-transformers llm

Last synced: 12 Apr 2025

https://github.com/dito97/neural-deskew

Toolkit for learning efficient document image skew estimation (DISE)

deskewing document-analysis pytorch-2 self-supervised-learning

Last synced: 15 Oct 2025

https://github.com/shijincai/fast360

The industry's first "Open Source OCR Arena," a free, no-login utility for one-click benchmarking of 7 top-tier models (Marker, MinerU, MonkeyOCR, Docling, Dolphin, OCRFlux, PP-StructureV3) on your PDF/image files, specializing in PDF-to-Markdown conversion.

benchmark computer-vision data-extraction docling document-analysis document-parser evaluation latex latex-document machine-learning markdown-converter marker monkeyocr ocr ocr-service paddleocr pdf-converter pdf-to-markdown rag

Last synced: 30 Aug 2025

https://github.com/jamezycesar-collab/credit-agreement-chatbot

RAG-based chatbot for analyzing LSTA credit agreements using LangChain, OpenAI, and intelligent document processing. Automates covenant analysis and compliance checking.

ai chatbot credit-analysis document-analysis financial-analysis langchain llm openai python rag

Last synced: 11 Nov 2025