Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with document-analysis
A curated list of projects in awesome lists tagged with document-analysis.
https://github.com/opendatalab/mineru
A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 16 Dec 2024
https://github.com/opendatalab/MinerU
A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
ai4science document-analysis extract-data layout-analysis ocr parser pdf pdf-converter pdf-extractor-llm pdf-extractor-pretrain pdf-extractor-rag pdf-parser python
Last synced: 29 Oct 2024
https://github.com/uglytoad/pdfpig
Read and extract text and other content from PDFs in C# (port of PDFBox)
alto-xml csharp document-analysis hocr layout-analysis netstandard page-xml pdf pdf-document pdf-document-processor pdf-extractor pdf-files pdf-generation pdfbox
Last synced: 18 Dec 2024
https://github.com/UglyToad/PdfPig
Read and extract text and other content from PDFs in C# (port of PDFBox)
alto-xml csharp document-analysis hocr layout-analysis netstandard page-xml pdf pdf-document pdf-document-processor pdf-extractor pdf-files pdf-generation pdfbox
Last synced: 29 Oct 2024
https://github.com/alibabaresearch/advancedliteratemachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 31 Oct 2024
https://github.com/AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
artificial-intelligence computer-vision document document-analysis document-intelligence document-recognition document-understanding documentai end-to-end-ocr multimodal multimodal-deep-learning ocr scene-text-detection scene-text-detection-recognition scene-text-recognition text-detection text-recognition vision-language vision-language-model vision-language-transformer
Last synced: 07 Nov 2024
https://github.com/yuliang-liu/curve-text-detector
This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
deep-learning document-analysis object-detection scene-text
Last synced: 21 Dec 2024
https://github.com/Yuliang-Liu/Curve-Text-Detector
This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
deep-learning document-analysis object-detection scene-text
Last synced: 03 Nov 2024
https://github.com/wenwenyu/PICK-pytorch
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
document-analysis document-understanding graph-convolutional-network graph-learning graph-neural-networks key-information-extraction
Last synced: 11 Nov 2024
https://github.com/CybercentreCanada/assemblyline
AssemblyLine 4: File triage and malware analysis
assemblyline automation-framework cert cyber-security cybersecurity document-analysis file-analysis framework incident-response infosec malware malware-analysis malware-analyzer malware-detection malware-research python3 security-automation security-automation-framework security-tools
Last synced: 25 Oct 2024
https://github.com/cybercentrecanada/assemblyline
AssemblyLine 4: File triage and malware analysis
assemblyline automation-framework cert cyber-security cybersecurity document-analysis file-analysis framework incident-response infosec malware malware-analysis malware-analyzer malware-detection malware-research python3 security-automation security-automation-framework security-tools
Last synced: 20 Dec 2024
https://github.com/ispras/dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
doc document-analysis document-content-extraction documents docx docx-parser excel html html-parser logical-structure-extraction ocr odt pdf pdf-parser scanned-documents table-of-contents table-recognition txt
Last synced: 15 Dec 2024
https://github.com/masyagin1998/robin
RObust document image BINarization
computer-vision deep-learning document-analysis document-binarization keras neural-networks ocr opencv python u-net
Last synced: 21 Dec 2024
https://github.com/mirabdullahyaser/retrieval-augmented-generation-engine-with-langchain-and-streamlit
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
artificial-intelligence chat-application document-analysis generative-ai gpt-3 langchain large-language-models natural-language-processing openai-chatgpt question-answering retrieval-augmented-generation streamlit
Last synced: 15 Dec 2024
https://github.com/xyntopia/pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
chatgpt document-analysis document-extraction extraction information-retrieval llm nlp pdf python
Last synced: 17 Nov 2024
https://github.com/zeninglin/vibertgrid-pytorch
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
document-ai document-analysis information-extraction key-information-extraction visual-information-extraction
Last synced: 30 Oct 2024
https://github.com/aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
document-analysis document-processing
Last synced: 16 Nov 2024
https://github.com/abdur75648/utrnet-high-resolution-urdu-text-recognition
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
computer-vision deep-learning document-analysis high-resolution hrnet icdar icdar2023 machine-learning ocr pytorch scene-text-recognition text-detection text-recognition unet urdu urdu-nlp urdu-ocr urdu-synth utrnet
Last synced: 13 Dec 2024
https://github.com/microsoft/synthetic-rag-index
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.
azure document-analysis few-shot-learning large-language-model llm rag retrieval-augmented-generation serverless
Last synced: 04 Dec 2024
https://github.com/muhd-umer/pyramidtabnet
Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents
computer-vision deep-learning document-analysis implementation pytorch table-detection table-structure-recognition
Last synced: 08 Nov 2024
https://github.com/ethanhezhao/MetaLDA
The code for MetaLDA in ICDM 2017
document-analysis icdm java machine-learning mallet metadata topic-modeling
Last synced: 13 Nov 2024
https://github.com/omni-us/research-contentdistillation-htr
Source code for ICFHR20 "Distilling Content from Style for Handwritten Word Recognition"
document-analysis generative-adversarial-network handwriting-recognition
Last synced: 08 Nov 2024
https://github.com/abdur75648/urdu-synth
High-quality synthetic text data generation for Urdu Text Recognition
computer-vision data-generation dataset deep-learning document-analysis icdar icdar2023 machine-learning ocr synthetic-data text-recognition urdu urdu-datasets urdu-nlp urdu-ocr urdu-synth utrnet
Last synced: 13 Dec 2024
https://github.com/arsath-eng/rag1-nvidia-genai
A powerful Retrieval Augmented Generation (RAG) application built with NVIDIA AI endpoints and Streamlit. This solution enables intelligent document analysis and question-answering using state-of-the-art language models, featuring multi-PDF processing, FAISS vector store integration, and advanced prompt engineering.
document-analysis embeddings faiss langchain llama-models llm nvidia-ai-faundry pdf-processing question-answering rag streamlit vector-store
Last synced: 20 Dec 2024
https://github.com/miku/grobidclient
A Go (golang) client for GROBID.
cli document-analysis golang grobid
Last synced: 24 Nov 2024
https://github.com/leg0shii/smart-documents
A web application that enables users to upload documents and utilize AI techniques like semantic search and text summarization for efficient analysis. Built with Python, FastAPI, Svelte, PostgreSQL, and LangChain.
ai document-analysis fastapi langchain semantic-search
Last synced: 29 Oct 2024
https://github.com/x1ao4/doc-merger
Merge two relatively incomplete documents into one complete document via a Python script.
data-analysis data-merging document-analysis document-comparison document-processing documents filtering filtering-data merge merge-documents
Last synced: 08 Nov 2024
https://github.com/coditheck/imgext
Image extraction from document.
document-analysis image-extractor python
Last synced: 03 Dec 2024
https://github.com/alinababer/document-analysis-identification-with-rag-vector-database-and-mistral-llm
This Document Analysis pipeline is a comprehensive document analysis system, designed to automate the processing and analysis of documents from acquisition to consumption. It integrates advanced machine learning & AI models like RAG (Retrieval Augmented Generation) & Mistral LLM to efficiently extract, match, enrich, process document
document-analysis document-analysis-recognition document-pipeline document-uploader llm mistral paddleocr python rag tesseract
Last synced: 16 Dec 2024
https://github.com/dito97/neural-deskew
Toolkit for learning efficient document image skew estimation (DISE)
deskewing document-analysis pytorch-2 self-supervised-learning
Last synced: 06 Dec 2024
https://github.com/alinababer/data-science-and-insight-agent-rag-llama3-lava-llm-django-api
Data-Science-and-Insight-Agent-RAG-LLama3-Lava-LLM-Django-WebApplication is an advanced AI-driven chatbot designed to assist in data science, document analysis, and image interpretation. This repository contains the Django-based REST APIs of this project.
chatbot django document-analysis image-analysis large-language-models lava llama python redis-server rest-api retrival-augmented-generation visual-large-language-models
Last synced: 19 Nov 2024