Projects in Awesome Lists tagged with document-analysis
A curated list of projects in awesome lists tagged with document-analysis.
https://github.com/opendatalab/mineru
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source data extraction tool that converts PDFs into Markdown and JSON formats.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 06 Jan 2026
https://github.com/opendatalab/MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source data extraction tool that converts PDFs into Markdown and JSON formats.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 24 Mar 2025
https://github.com/uglytoad/pdfpig
Read and extract text and other content from PDFs in C# (port of PDFBox)
alto-xml csharp document-analysis hocr layout-analysis netstandard page-xml pdf pdf-document pdf-document-processor pdf-extractor pdf-files pdf-generation pdfbox
Last synced: 10 May 2025
https://github.com/UglyToad/PdfPig
Read and extract text and other content from PDFs in C# (port of PDFBox)
alto-xml csharp document-analysis hocr layout-analysis netstandard page-xml pdf pdf-document pdf-document-processor pdf-extractor pdf-files pdf-generation pdfbox
Last synced: 24 Mar 2025
https://github.com/AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 10 Apr 2025
https://github.com/alibabaresearch/advancedliteratemachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 28 Mar 2025
https://github.com/Yuliang-Liu/Curve-Text-Detector
This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
deep-learning document-analysis object-detection scene-text
Last synced: 02 Apr 2025
https://github.com/yuliang-liu/curve-text-detector
This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
deep-learning document-analysis object-detection scene-text
Last synced: 04 Apr 2025
https://github.com/wenwenyu/PICK-pytorch
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
document-analysis document-understanding graph-convolutional-network graph-learning graph-neural-networks key-information-extraction
Last synced: 28 Apr 2025
https://github.com/cybercentrecanada/assemblyline
AssemblyLine 4: File triage and malware analysis
assemblyline automation-framework cert cyber-security cybersecurity document-analysis file-analysis framework incident-response infosec malware malware-analysis malware-analyzer malware-detection malware-research python3 security-automation security-automation-framework security-tools
Last synced: 06 Jan 2026
https://github.com/lazyFrogLOL/llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
chunking document-analysis llm nlp ocr pdf-parser pdfparser rag text-chunking
Last synced: 01 Apr 2025
https://github.com/CybercentreCanada/assemblyline
AssemblyLine 4: File triage and malware analysis
assemblyline automation-framework cert cyber-security cybersecurity document-analysis file-analysis framework incident-response infosec malware malware-analysis malware-analyzer malware-detection malware-research python3 security-automation security-automation-framework security-tools
Last synced: 14 Mar 2025
https://github.com/ispras/dedoc
Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
doc document-analysis document-content-extraction documents docx docx-parser excel html html-parser logical-structure-extraction ocr odt pdf pdf-parser scanned-documents table-of-contents table-recognition txt
Last synced: 15 May 2025
https://github.com/masyagin1998/robin
RObust document image BINarization
computer-vision deep-learning document-analysis document-binarization keras neural-networks ocr opencv python u-net
Last synced: 10 Apr 2025
https://github.com/mirabdullahyaser/retrieval-augmented-generation-engine-with-langchain-and-streamlit
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
artificial-intelligence chat-application document-analysis generative-ai gpt-3 langchain large-language-models natural-language-processing openai-chatgpt question-answering retrieval-augmented-generation streamlit
Last synced: 17 Aug 2025
https://github.com/xyntopia/pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python
Last synced: 11 May 2025
https://github.com/zeninglin/vibertgrid-pytorch
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
document-ai document-analysis information-extraction key-information-extraction visual-information-extraction
Last synced: 07 Oct 2025
https://github.com/abdur75648/utrnet-high-resolution-urdu-text-recognition
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
computer-vision deep-learning document-analysis high-resolution hrnet icdar icdar2023 machine-learning ocr pytorch scene-text-recognition text-detection text-recognition unet urdu urdu-nlp urdu-ocr urdu-synth utrnet
Last synced: 24 Aug 2025
https://github.com/jpleorx/detectron2-publaynet
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
artificial-intelligence computer-vision deep-learning detectron2 document-analysis document-classification document-layout document-layout-analysis faster-rcnn instance-segmentation layout-analysis machine-learning neural-network neural-networks object-detection publaynet python python3 pytorch
Last synced: 10 May 2025
https://github.com/aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
document-analysis document-processing
Last synced: 17 Jul 2025
https://github.com/microsoft/synthetic-rag-index
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.
azure document-analysis few-shot-learning large-language-model llm rag retrieval-augmented-generation serverless
Last synced: 20 Jun 2025
https://github.com/muhd-umer/pyramidtabnet
Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents
computer-vision deep-learning document-analysis implementation pytorch table-detection table-structure-recognition
Last synced: 13 Oct 2025
https://github.com/ad-freiburg/pdftotext-plus-plus
A fast and accurate command line tool for extracting text from PDF files.
c-plus-plus cli document-analysis metadata-extraction pdf text-extraction
Last synced: 16 May 2025
https://github.com/ethanhezhao/MetaLDA
The code for MetaLDA in ICDM 2017
document-analysis icdm java machine-learning mallet metadata topic-modeling
Last synced: 03 May 2025
https://github.com/aidayang/mineru-oneclick
A no-install, one-click launch bundle for deploying MinerU.
ai4science document-analysis extract-data layout-analysis markdown mineru ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser pdftojson pdftomarkdown python
Last synced: 12 Jul 2025
https://github.com/omni-us/research-contentdistillation-htr
Source code for ICFHR20 "Distilling Content from Style for Handwritten Word Recognition"
document-analysis generative-adversarial-network handwriting-recognition
Last synced: 23 Apr 2025
https://github.com/abdur75648/urdu-synth
High-quality synthetic text data generation for Urdu Text Recognition
computer-vision data-generation dataset deep-learning document-analysis icdar icdar2023 machine-learning ocr synthetic-data text-recognition urdu urdu-datasets urdu-nlp urdu-ocr urdu-synth utrnet
Last synced: 30 Apr 2025
https://github.com/miku/grobidclient
A Go (golang) client for GROBID.
cli document-analysis golang grobid
Last synced: 11 Apr 2025
https://github.com/arsath-eng/rag1-nvidia-genai
A powerful Retrieval Augmented Generation (RAG) application built with NVIDIA AI endpoints and Streamlit. This solution enables intelligent document analysis and question-answering using state-of-the-art language models, featuring multi-PDF processing, FAISS vector store integration, and advanced prompt engineering.
document-analysis embeddings faiss langchain llama-models llm nvidia-ai-faundry pdf-processing question-answering rag streamlit vector-store
Last synced: 27 Oct 2025
https://github.com/bx0-0/cybervisionai
Cyber Vision AI is an award-winning, open-source AI assistant for cybersecurity, document analysis, and knowledge management. Built with advanced RAG, MindMap, and multi-agent AI, it empowers security professionals and researchers with unrestricted, ethical, and insightful tools.
ai chatbot cybersecurity django document-analysis gpt graduation-project llama llm markmap mindmap nlp ollama python rag speech-to-text streamlit text-to-speech x
Last synced: 29 Jun 2025
https://github.com/ksm26/dr-x-nlp-pipeline
A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.
chromadb document-analysis llm local-llm modular-ai multilingual-ai-model nlp offlineai ollama opensource-ai rag textsummarization
Last synced: 23 Apr 2025
https://github.com/x1ao4/doc-merger
Merge two relatively incomplete documents into one complete document via a Python script.
data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents
Last synced: 28 Jun 2025
https://github.com/acsenrafilho/cucaracha
A bureaucratic cockroach (cucaracha) assistant to help with document processing and analysis
document-analysis document-classification document-processing optical-character-recognition python3
Last synced: 28 Oct 2025
https://github.com/diocrafts/ai-book-summarizer
📚 AI-Powered Book PDF Knowledge Extractor & Summarizer. Transform your PDF books into structured knowledge effortlessly! This tool leverages AI to analyze books page by page, extracting key insights, definitions, and concepts, and organizes them into Markdown summaries for easier study
ai ai-powered-tools automation book-summary document-analysis educational-tools knowledge-extraction machine-learning markdown natural-language-processing openai pdf pdf-processing pdf-summarization pymupdf python study-materials text-analysis text-summarization
Last synced: 14 Jul 2025
https://github.com/rk-vashista/pitch
A modern web application that analyzes pitch decks using multi-agent AI technology. Upload your pitch deck and get comprehensive feedback on structure, content, and potential improvements!
ai ai-feedback crewai document-analysis document-analysis-tool fastapi langchain multi-agent nlp pitch-deck pitch-deck-analyzer pitch-evaluation python startup-tools websockets
Last synced: 03 Jul 2025
https://github.com/leg0shii/smart-documents
A web application that enables users to upload documents and utilize AI techniques like semantic search and text summarization for efficient analysis. Built with Python, FastAPI, Svelte, PostgreSQL, and LangChain.
ai document-analysis fastapi langchain semantic-search
Last synced: 26 Oct 2025
https://github.com/techycsr/ai-powered-document-insight-tool
AI-powered document analysis platform specializing in resume processing.
document-analysis gemmini resume-analysis typescript-application
Last synced: 22 Oct 2025
https://github.com/alinababer/data-science-and-insight-agent-rag-llama3-lava-llm-django-api
Data-Science-and-Insight-Agent-RAG-LLama3-Lava-LLM-Django-WebApplication is an advanced AI-driven chatbot designed to assist in data science, document analysis, and image interpretation. This repository contains the Django-based REST APIs of this project.
chatbot django document-analysis image-analysis large-language-models lava llama python redis-server rest-api retrival-augmented-generation visual-large-language-models
Last synced: 09 Nov 2025
https://github.com/bylickilabs/pdfanalyzer
PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.
automation cli document-analysis document-processing file-analyzer file-inspector metadata open-source pdf pdf-analysis pdf-extraction python reporting streamlit text-mining
Last synced: 04 Jul 2025
https://github.com/sourceduty/opinionated_analysis_report
📄 Create an opinionated analysis report of a document, influenced by personality traits.
concept document-analysis document-opinion idea opinion opinionated opinionated-analysis personality personality-profile personality-traits programmed-personality psychological-programming
Last synced: 08 Aug 2025
https://github.com/coditheck/imgext
Image extraction from documents.
document-analysis image-extractor python
Last synced: 25 Mar 2025
https://github.com/kazkozdev/researchify
🔬 Scientific chatbot that instantly searches arXiv.org papers, transforming an ocean of preprints into clear research insights. Powered by local LLMs from Ollama.
academic-tools api artificial-intelligence arxiv chatbot document-analysis document-processing llm machine-learning nlp nlp-machine-learning ollama paper-search rag research-assistant research-tools scientific scientific-computing scientific-papers
Last synced: 05 Apr 2025
https://github.com/alinababer/document-analysis-identification-with-rag-vector-database-and-mistral-llm
This Document Analysis pipeline is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning & AI models like RAG (Retrieval Augmented Generation) & Mistral LLM to efficiently extract, match, enrich, process document
document-analysis document-analysis-recognition document-pipeline document-uploader llm mistral paddleocr python rag tesseract
Last synced: 03 Apr 2025
https://github.com/edummorenolp/mindmanagerproject-ia
Intelligent software project management system with generative AI. Full-stack platform for automatic document analysis, generation of technical studies, and project lifecycle management using React + Node.js + PostgreSQL + Google Gemini.
ai-powered artificial-intelligence document-analysis generative-ai github-pages google-gemini javascript llm-integration project-management project-planning reactjs software-development software-engineering vite workflow-automation
Last synced: 24 Dec 2025
https://github.com/ttwjoe/dr-x-nlp-pipeline
A fully offline NLP pipeline for extracting, chunking, embedding, querying, summarizing, and translating research documents using local LLMs. Inspired by the fictional mystery of Dr. X, the system supports multi-format files, local RAG-based Q&A, Arabic translation, and ROUGE-based summarization — all without cloud dependencies.
chromadb document-analysis llm local-llm modular-ai multilingual-ai-model nlp offlineai ollama opensource-ai rag textsummarization
Last synced: 30 Apr 2025
https://github.com/uzairsayyed-005/docuchat-ai
DocuChat-AI is an AI-powered document interaction assistant that transforms static PDFs into conversational partners. It leverages Retrieval-Augmented Generation (RAG), history-aware memory, and advanced NLP to enable natural language Q&A, contextual dialogue, and secure local document processing.
ai conversational-ai document-analysis fiass groq history-aware huggingface langchain local-processing pdf-processing rag streamlit
Last synced: 28 Mar 2025
https://github.com/giarcheuli/docparser
DocParser v2.0 - Project-Aware Document Analysis Tool with AI Integration
ai cli document-analysis llama2 markdown project-management python replicate
Last synced: 14 Oct 2025
https://github.com/jltk/briefgeist
Privacy-first desktop app for scanning, understanding and replying to letters.
automation document-analysis local-llm ocr python tesseract
Last synced: 27 Jun 2025
https://github.com/edummorenolp/projectmanagermind-ia
Intelligent software project management system with generative AI. Full-stack platform for automatic document analysis, generation of technical studies, and project lifecycle management using React + Node.js + PostgreSQL + Google Gemini.
ai-powered artificial-intelligence document-analysis generative-ai github-pages google-gemini javascript llm-integration project-management project-planning reactjs software-development software-engineering vite workflow-automation
Last synced: 14 Oct 2025
https://github.com/veydantkatyal/doc-analysis
Automatically extracts, summarizes, and analyzes PDF documents using Large Language Models (LLMs). It generates relevant questions and answers based on the document content for smarter understanding.
document-analysis huggingface-transformers llm
Last synced: 12 Apr 2025
https://github.com/dito97/neural-deskew
Toolkit for learning efficient document image skew estimation (DISE)
deskewing document-analysis pytorch-2 self-supervised-learning
Last synced: 15 Oct 2025
https://github.com/shijincai/fast360
The industry's first "Open Source OCR Arena," a free, no-login utility for one-click benchmarking of 7 top-tier models (Marker, MinerU, MonkeyOCR, Docling, Dolphin, OCRFlux, PP-StructureV3) on your PDF/image files, specializing in PDF-to-Markdown conversion.
benchmark computer-vision data-extraction docling document-analysis document-parser evaluation latex latex-document machine-learning markdown-converter marker monkeyocr ocr ocr-service paddleocr pdf-converter pdf-to-markdown rag
Last synced: 30 Aug 2025
https://github.com/jamezycesar-collab/credit-agreement-chatbot
RAG-based chatbot for analyzing LSTA credit agreements using LangChain, OpenAI, and intelligent document processing. Automates covenant analysis and compliance checking.
ai chatbot credit-analysis document-analysis financial-analysis langchain llm openai python rag
Last synced: 11 Nov 2025