Projects in Awesome Lists tagged with document-parser
A curated list of projects in awesome lists tagged with document-parser.
https://github.com/infiniflow/ragflow
RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
agent agents ai-search chatbot chatgpt deep-learning deepseek deepseek-r1 document-parser document-understanding genai graphrag llm nlp ollama pdf-to-text rag retrieval-augmented-generation table-structure-recognition text2sql
Last synced: 12 May 2025
https://docling-project.github.io/docling/
Get your documents ready for gen AI
ai convert document-parser document-parsing documents docx html markdown pdf pdf-converter pdf-to-json pdf-to-text pptx tables xlsx
Last synced: 26 Jun 2025
https://github.com/docling-project/docling
Get your documents ready for gen AI
ai convert document-parser document-parsing documents docx html markdown pdf pdf-converter pdf-to-json pdf-to-text pptx tables xlsx
Last synced: 09 Sep 2025
https://github.com/ds4sd/docling
Get your documents ready for gen AI
ai convert document-parser document-parsing documents docx html markdown pdf pdf-converter pdf-to-json pdf-to-text pptx tables xlsx
Last synced: 08 Mar 2025
https://ds4sd.github.io/docling/
Get your documents ready for gen AI
ai convert document-parser document-parsing documents docx html markdown pdf pdf-converter pdf-to-json pdf-to-text pptx tables xlsx
Last synced: 07 Sep 2025
https://github.com/unstructured-io/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
Last synced: 12 Jan 2026
https://github.com/Unstructured-IO/unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
data-pipelines deep-learning document-image-analysis document-image-processing document-parser document-parsing docx donut information-retrieval langchain llm machine-learning ml natural-language-processing nlp ocr pdf pdf-to-json pdf-to-text preprocessing
Last synced: 26 Mar 2025
https://github.com/run-llama/llama_cloud_services
Knowledge Agents and Management in the Cloud
document document-parser document-parsing docx-to-markdown parsing pdf pdf-document-processor pdf-to-excel pdf-to-json pdf-to-markdown pdf-to-text ppt-to-json ppt-to-markdown pptx structured-data tables
Last synced: 15 May 2025
https://github.com/marker-inc-korea/autorag
AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation
analysis automl benchmarking document-parser embeddings evaluation llm llm-evaluation llm-ops open-source ops optimization pipeline python qa rag rag-evaluation retrieval-augmented-generation
Last synced: 12 May 2025
https://github.com/filimoa/open-parse
Improved file parsing for LLMs
document-parser document-structure layout-parsing table-detection
Last synced: 19 Oct 2025
https://github.com/Filimoa/open-parse
Improved file parsing for LLMs
document-parser document-structure layout-parsing table-detection
Last synced: 04 Apr 2025
https://github.com/deepdoctection/deepdoctection
A Repo For Document AI
document-ai document-image-analysis document-layout-analysis document-parser document-understanding layoutlm nlp ocr publaynet pubtabnet python pytorch table-detection table-recognition tensorflow
Last synced: 04 Jan 2026
https://github.com/opendataloader-project/opendataloader-pdf
PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU
ai dataloader document-parser document-parsing documents html json markdown ocr-recognition pdf pdf-converter pdf-parser pdf-to-html pdf-to-json pdf-to-markdown rag recognition sdk tables
Last synced: 20 Jan 2026
https://github.com/iamarunbrahma/vision-parse
Parse PDFs into markdown using Vision LLMs
document-parser pdf-parser pdf-to-markdown text-extraction
Last synced: 13 Dec 2025
https://github.com/jpleorx/opencv-text-deskew
Tutorial on how to deskew (straighten) text images
computer-vision deskew document-parser image-processing opencv opencv-python python tutorial
Last synced: 10 May 2025
https://github.com/lianjiatech/bella-domify
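Document Parser supporting PDF, TXT, DOC, DOCX, Markdown, and other file formats. Efficiently extracts and parses content and generates a standard document tree structure. Built-in PDF Parser, Text Parser, and Word Parser support RAG, knowledge bases, full-text search, and other intelligent applications.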
document-parser parser pdf-parser
Last synced: 04 Oct 2025
https://github.com/decisionfacts/semantic-ai
An open-source framework for Retrieval-Augmented System (RAG) that uses semantic search to retrieve the expected results and generates human-readable conversational responses with the help of an LLM (Large Language Model).
approximate-nearest-neighbor-search deep-neural-networks document-parser docx fastapi inference-api llama2 llm machine-learning ocr openai openai-api pdf rag retrieval-augmented-generation semantic-search vector-database
Last synced: 27 Jul 2025
https://github.com/urbanclap-engg/smart-docs-parser
An OCR-based document parser to extract information from identity document images
aadhaar auto-fill document-parser google-vision nodejs ocr pancard typescript user-onboarding
Last synced: 05 Oct 2025
https://github.com/decisionfacts/df-extract
DF Extract Lib
asyncio document-parser docx extraction jpeg jpg pdf png pptx python3
Last synced: 24 Apr 2025
https://github.com/clearedge-ai/clearedge
Build a RAG preprocessing pipeline
document-parser haystack langchain llamaindex llm ocr pdf pdf-ocr-extraction pdf-to-json pdf-to-text rag-pipeline retrieval-augmented-generation table-detection table-recognition
Last synced: 24 Mar 2025
https://github.com/has-abi/docparser
Extract text from your DOCX documents.
doc-parser document-parser docx-parser text-parser
Last synced: 14 Dec 2025
https://github.com/hrbrmstr/docparser
🧰 Tools to Upload/Parse Documents to 'docparser' and Retrieve Extracted Results
docparser document-parser r rstats
Last synced: 29 Oct 2025
https://github.com/gyanvir/drparser
Dr.Parser 🩸📊 – AI-powered blood report parser that extracts and analyzes medical data from images/PDFs. Built with React, FastAPI, EasyOCR, and Gemini AI. 🚀 🔹 Local Setup Available | 🔹 Future Enhancements Planned | 🔹 Hackathon Project 👉 Clone, run, and explore the future of AI-driven healthcare!
ai-ml blood-report-analysis document-parser easyocr fastapi hackathon-project healthcare medical-ai ocr reactjs team-euphoria
Last synced: 06 Oct 2025
https://github.com/vetrivel07/ai-powered-resume-evaluator
An AI-powered resume evaluation app that compares a candidate’s resume with a job description using Google’s Gemini 1.5 Flash model to provide HR-style feedback and an ATS-style match scoring through a simple and interactive Streamlit interface.
ats document-parser evaluator gemini-api gemini-flash genai python-library resume-analysis streamlit streamlit-application
Last synced: 01 Sep 2025
https://github.com/coderosh/docpa
A simple library that I use for web scraping. Uses htmlparser2 to parse the DOM.
docpa document-parser dom html-parser
Last synced: 12 Jul 2025
https://github.com/cr4yfish/docling-js
Parsing Documents to one datatype (TypeScript port of Docling)
document-parser document-parsing genai pdf-converter pdf-to-text
Last synced: 31 Aug 2025
https://github.com/connectaman/deepseek-ocr-multigpu-infer
Efficient multi-GPU OCR inference framework leveraging parallel processes for accelerated token throughput and faster batch processing. Designed for scalable, high-performance optical character recognition workloads using PyTorch. Supports dynamic GPU assignment, optimized resource utilization, and easy integration for large-scale image datasets.
agentic-extraction data deepseek document-parser extraction extractor gpu image-parser llm multigpu nvidia ocr parallel-computing parser pdf-parser vlm
Last synced: 22 Jan 2026
https://github.com/revankumard/llamarker
Your ultimate tool for effortlessly converting and parsing documents into clean, well-structured Markdown—fast, reliable, and 100% local! 💻✨
document-parser llama-ai llamarker local-parsing-tool marker
Last synced: 22 Mar 2025
https://github.com/docling-project/docling4j
Docling4j brings the functionalities of Docling in document understanding to Java® projects
ai docling document-parser document-parsing document-understanding documents java pdf pdf-converter pdf-to-json
Last synced: 15 Jun 2025
https://github.com/shijincai/fast360
The industry's first "Open Source OCR Arena," a free, no-login utility for one-click benchmarking of 7 top-tier models (Marker, MinerU, MonkeyOCR, Docling, Dolphin, OCRFlux, PP-StructureV3) on your PDF/image files, specializing in PDF-to-Markdown conversion.
benchmark computer-vision data-extraction docling document-analysis document-parser evaluation latex latex-document machine-learning markdown-converter marker monkeyocr ocr ocr-service paddleocr pdf-converter pdf-to-markdown rag
Last synced: 30 Aug 2025
https://github.com/buren/document_parser
Small Rails API app to parse documents.
document-parser rails-api yomu
Last synced: 31 Aug 2025
https://github.com/midhunterx/scholar-cap
🎓 Set of powerful tools designed to streamline the extraction, parsing, and clean-up of data from docx and pdf forms. Saves time and eliminates manual data entry by automating the processing of structured data.
bank-details-validtion bulk-neft-generator dbms document-parser form-management multithreading
Last synced: 05 Mar 2025
https://github.com/setiaafandi/anyparser_crewai
Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.
anyparser cache-augmented-generation cag crew-ai crew-ai-rag crewai-rag document-parser document-parsing kag knowledge-graph python rag retrieval-augmented-generation typescript
Last synced: 07 Mar 2025
https://github.com/dills122/shamwow
Who likes lawyers? Me either; scrub your PII with ShamWow
attributes document-parser document-scrubber pii poco reflection scrub scrubber verify
Last synced: 31 Mar 2025
https://github.com/akandindajunior/cloud-services
If it’s not documented, it never happened. 📝 Please check my README.md for more details. 🔍
alibaba aws-amplify azure cats-over-dogs docker document document-parser google hacktoberfest java parsing pdf pdf-to-text rocket-ships
Last synced: 23 Jul 2025
https://github.com/anyparser/anyparser_crewai
Supercharge your AI workflows by combining Anyparser’s advanced content extraction with Crew AI. With this integration, you can effortlessly leverage Anyparser’s document processing and data extraction tools within your Crew AI applications.
anyparser artificial-intelligence cache-augmented-generation cag crew-ai crew-ai-rag crewai crewai-rag document-parser document-parsing kag knowledge-graph python rag retrieval-augmented-generation typescript
Last synced: 04 Oct 2025