Projects in Awesome Lists tagged with table-extraction
A curated list of projects in awesome lists tagged with table-extraction.
https://github.com/xberg-io/xberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
bun csharp document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm
Last synced: 26 Jun 2026
https://github.com/kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.
bun csharp document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm
Last synced: 05 Jun 2026
https://github.com/pymupdf/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
Last synced: 01 Apr 2026
https://github.com/jsvine/pdfplumber
Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
pdf pdf-parsing table-extraction
Last synced: 12 May 2025
https://github.com/pymupdf/PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
Last synced: 08 Apr 2025
https://github.com/microsoft/table-transformer
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
table-detection table-extraction table-functional-analysis table-structure-recognition
Last synced: 14 May 2025
https://github.com/microsoft/table-transformer?tab=readme-ov-file
Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.
table-detection table-extraction table-functional-analysis table-structure-recognition
Last synced: 08 Apr 2025
https://github.com/Goldziher/kreuzberg
Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.
async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction
Last synced: 21 Oct 2025
https://github.com/xavctn/img2table
img2table is a Python library for table identification and extraction from PDFs and images, based on OpenCV image processing.
image-processing opencv python table-extraction
Last synced: 14 May 2025
https://github.com/BobLd/DocumentLayoutAnalysis
Document Layout Analysis resources repos for development with PdfPig.
alto alto-xml csharp docstrum document-layout-analysis hocr hocr-documents layout-analysis page-segmentation page-xml pdf pdfpig recursive-xy-cut table-extraction tei xy-cut xycut
Last synced: 10 May 2025
https://github.com/bobld/documentlayoutanalysis
Document Layout Analysis resources repos for development with PdfPig.
alto alto-xml csharp docstrum document-layout-analysis hocr hocr-documents layout-analysis page-segmentation page-xml pdf pdfpig recursive-xy-cut table-extraction tei xy-cut xycut
Last synced: 04 Apr 2025
https://github.com/ExtractTable/ExtractTable-py
Python library to extract tabular data from images and scanned PDFs
extracttable image-table-recognition ocr pdf-table-extract table-extraction tabular-data
Last synced: 02 Apr 2025
https://github.com/bobld/tabula-sharp
Extract tables from PDF files (port of tabula-java)
csharp dotnet extract extract-table extracting-tables extraction extraction-engine netstandard pdf-table-extract pdf-table-extraction pdfparser pdfpig pdfs table table-extraction tabula tabula-java tabula-sharp
Last synced: 15 May 2025
https://github.com/hrbrmstr/docxtractr
✂️ Extract Tables from Microsoft Word Documents with R
docx extract-tables microsoft-word r rstats table-extraction
Last synced: 05 Mar 2026
https://github.com/bzsanti/oxidizePdf
A PDF library for Rust
crates-io data-extraction digital-signatures document-processing encryption invoice ocr pdf pdf-generation pdf-library pdf-manipulation pdf-parser pdf-reader pdfa rust rust-library table-extraction text-extraction
Last synced: 29 Apr 2026
https://github.com/tfmorris/pdf2table
PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz
information-extraction pdf-analysis table-extraction
Last synced: 22 Mar 2025
https://github.com/phamquiluan/go5-project
Extracting Tabular Data from Image to Excel files
excel-export image-processing table-extraction table-recognition
Last synced: 08 Oct 2025
https://github.com/bobld/camelot-sharp
A C# library to extract tabular data from PDFs (port of camelot Python version using PdfPig).
camelot camelot-sharp csharp dotnet extract-table extracting-tables extraction extraction-engine netstandard opencv pdf-table-extract pdf-table-extraction pdfparser pdfpig pdfs table table-extraction
Last synced: 14 Jun 2025
https://github.com/sergiocorreia/quipucamayoc
dev repo for article
ocr ocr-post-processing ocr-python poppler table-extraction table-ocr textract
Last synced: 12 Apr 2025
https://github.com/coregx/gxpdf
GxPDF - Enterprise-grade PDF library for Go. Table extraction, text parsing, encryption, document creation.
go golang open-source pdf pdf-encryption pdf-generation pdf-library pdf-parser table-extraction text-extraction
Last synced: 02 Apr 2026
https://github.com/randomstate/camelot-php
Camelot PDF table extraction library wrapper for PHP
Last synced: 15 Apr 2025
https://github.com/inquilabee/tablecv
TableCV: Table extraction from images made easy.
opencv opencv-python opencv-table opencv-table-extraction python table table-extract table-extract-python table-extraction
Last synced: 29 Jul 2025
https://github.com/os-climate/crrf-det
A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.
annotation data-extraction layout-analysis pdf table-extraction
Last synced: 12 Apr 2025
https://github.com/monchin/tablers
A blazingly fast PDF table extraction library with a Python API, powered by Rust
pdf python rust table-extraction
Last synced: 01 Mar 2026
https://github.com/pavansomisetty21/extraction-of-tables-from-pdf
In this project, we extract tables from PDFs using fitz and PyMuPDF
pdf table-detection table-extraction table-recognition table-structure-recognition table-to-excel table2cs table2excel tables tables-content
Last synced: 08 Jan 2026
https://github.com/dashroshan/data-extractor
Extract and download key-value pairs, tables, and paragraphs from your scanned PDF, JPG, and PNG documents as CSV files.
document-extraction form-analysis key-value-pairs ocr-python table-extraction
Last synced: 06 Apr 2025
https://github.com/philgooch/pdftable
A fork of Kyle Cronan's Python 2.5 pdftable library, now updated for Python 3
csv pdf python python3-library table-extraction textmining
Last synced: 14 Jan 2026
https://github.com/mixpeek/multimodal-benchmarks
Open evaluation suite for multimodal retrieval systems with benchmarks for financial documents, medical devices, and educational content
benchmark document-retrieval embeddings evaluation hybrid-search information-retrieval multimodal-retrieval nlp ocr rag semantic-search table-extraction vector-search
Last synced: 12 Mar 2026
https://github.com/myatmyintzuthin/extract-table
Table Cell Coordinate Extraction From Image
image-processing table-extraction
Last synced: 29 Mar 2025
https://github.com/shahin-ro/table-detection
Python tool for table extraction & Persian OCR. Uses OpenCV for table detection, Tesseract for text extraction, & Pandas for data output. Visualizes cells & text. Ideal for Persian documents! 📄✨
colab computer-vision data-extraction data-visualization document-processing image-analysis image-processing machine-learning matplotlib numpy ocr opencv pandas persian-ocr persian-text python table-detection table-extraction tesseract text-recognition
Last synced: 08 Apr 2026
https://github.com/maxinexiong/acme-work-items-rpa
This repository contains a robust UiPath automation solution that uses the UiPath REFramework to fulfill the specified requirements, which include automating data scraping from acme-test.com, filtering specific records, and appending the results to an Excel worksheet.
acme-challenge data-scraping datatable excel-operations reframework robotic-enterprise-framework robotic-process-automation rpa table-extraction uipath uipath-modern-design uipath-reframework uipath-studio web-scraping
Last synced: 21 Jan 2026
https://github.com/maxinexiong/web-scraping-rpa
This repository contains an RPA robot designed to scrape up to 500 pieces of property information for a given location from a real estate website. The extracted data is then organized, filtered, and sorted according to user-defined criteria, and written to the Excel file output.xlsx.
data-scraping data-table excel-processing robotic-process-automation rpa table-extraction uipath uipath-classic-design uipath-modern-design uipath-studio web-scraping
Last synced: 20 Mar 2026
https://github.com/timothy-bartlett/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction text-processing text-shaping xps
Last synced: 18 May 2026
https://github.com/jrodal98/paginated-table-extractor
A Python script that automates the extraction of data from paginated tables.
data-extraction selenium-python selenium-webdriver table-extraction webscraping
Last synced: 11 Apr 2026
https://github.com/harubi/bolivar
High-performance PDF table extraction library. Bindings for Python and JVM.
jvm pdf pdf-parsing python rust table-extraction text-extraction
Last synced: 22 Feb 2026
https://github.com/maxinexiong/acme-dispatcher-performer-invoice-check-bot-rpa
This repository hosts a UiPath automation solution with separate Dispatcher and Performer sub-processes. The Dispatcher bot adds queue items to Orchestrator Queue, while the Performer bot searches invoices, extracts and compares data. Leveraging UiPath REFramework, this workflow provides a robust scalable solution for invoice checking tasks.
acme-challenge data-scraping dispatcher invoice-checker performer reframework robotic-enterprise-framework robotic-process-automation rpa table-extraction uipath uipath-automation-cloud uipath-modern-design uipath-orchestrator uipath-orchestrator-queue uipath-queue uipath-reframework uipath-studio web-scraping
Last synced: 20 Mar 2026
https://github.com/maxinexiong/acme-vendor-check-bot-rpa
This repository contains a robust UiPath automation solution built on the REFramework to fulfill the specified requirements, including extracting a data table from acme-test.com, comparing vendor information, handling various business exceptions, and appending the results to an Excel worksheet.
acme-challenge data-scraping datatable excel-operations reframework robotic-enterprise-framework robotic-process-automation rpa table-extraction uipath uipath-modern-design uipath-reframework uipath-studio vendor-checker web-scraping
Last synced: 20 Jan 2026
https://github.com/maxinexiong/coronavirus-stat-alert-bot-rpa
An automation solution built for the challenge of creating a Coronavirus stat-alert bot. The bot scrapes Coronavirus statistics for a user-specified country and emails an update with the collected data to specified recipients.
covid19-data data-scraping data-table email-automation email-sender outlook-email robotic-process-automation rpa table-extraction uipath uipath-studio
Last synced: 20 Mar 2026