An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with table-extraction

A curated list of projects in awesome lists tagged with table-extraction.

https://github.com/xberg-io/xberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

bun csharp document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm

Last synced: 26 Jun 2026

https://github.com/kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

bun csharp document-intelligence elixir ffi golang java metadata-extraction node pdf-extraction pdfium php python rag ruby rust table-extraction tesseract text-extraction wasm

Last synced: 05 Jun 2026

https://github.com/pymupdf/pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps

Last synced: 01 Apr 2026

https://github.com/jsvine/pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

pdf pdf-parsing table-extraction

Last synced: 12 May 2025

https://github.com/pymupdf/PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps

Last synced: 08 Apr 2025

https://github.com/microsoft/table-transformer

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

table-detection table-extraction table-functional-analysis table-structure-recognition

Last synced: 14 May 2025

https://github.com/microsoft/table-transformer?tab=readme-ov-file

Table Transformer (TATR) is a deep learning model for extracting tables from unstructured documents (PDFs and images). This is also the official repository for the PubTables-1M dataset and GriTS evaluation metric.

table-detection table-extraction table-functional-analysis table-structure-recognition

Last synced: 08 Apr 2025

https://github.com/Goldziher/kreuzberg

Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

async document-intelligence mcp metadata-extraction ocr pandoc pdf-extraction pdfium python rag table-extraction tesseract text-extraction

Last synced: 21 Oct 2025

https://github.com/xavctn/img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing

image-processing opencv python table-extraction

Last synced: 14 May 2025

https://github.com/ExtractTable/ExtractTable-py

Python library to extract tabular data from images and scanned PDFs

extracttable image-table-recognition ocr pdf-table-extract table-extraction tabular-data

Last synced: 02 Apr 2025

https://github.com/hrbrmstr/docxtractr

Extract Tables from Microsoft Word Documents with R

docx extract-tables microsoft-word r rstats table-extraction

Last synced: 05 Mar 2026

https://github.com/tfmorris/pdf2table

PDF Table Extractor - repository to hold revisable version of code from https://www.cvast.tuwien.ac.at/projects/pdf2table by Burcu Yildiz

information-extraction pdf-analysis table-extraction

Last synced: 22 Mar 2025

https://github.com/phamquiluan/go5-project

Extracting Tabular Data from Image to Excel files

excel-export image-processing table-extraction table-recognition

Last synced: 08 Oct 2025

https://github.com/coregx/gxpdf

GxPDF - Enterprise-grade PDF library for Go. Table extraction, text parsing, encryption, document creation.

go golang open-source pdf pdf-encryption pdf-generation pdf-library pdf-parser table-extraction text-extraction

Last synced: 02 Apr 2026

https://github.com/randomstate/camelot-php

Camelot PDF table extraction library wrapper for PHP

pdf table-extraction

Last synced: 15 Apr 2025

https://github.com/os-climate/crrf-det

A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.

annotation data-extraction layout-analysis pdf table-extraction

Last synced: 12 Apr 2025

https://github.com/monchin/tablers

A blazingly fast PDF table extraction library with a Python API, powered by Rust

pdf python rust table-extraction

Last synced: 01 Mar 2026

https://github.com/dashroshan/data-extractor

Extract and download key-value pairs, tables, and paragraphs from your scanned pdf, jpg, and png documents as CSV files.

document-extraction form-analysis key-value-pairs ocr-python table-extraction

Last synced: 06 Apr 2025

https://github.com/philgooch/pdftable

A fork of Kyle Cronan's Python 2.5 pdftable library, now updated for Python 3

csv pdf python python3-library table-extraction textmining

Last synced: 14 Jan 2026

https://github.com/mixpeek/multimodal-benchmarks

Open evaluation suite for multimodal retrieval systems with benchmarks for financial documents, medical devices, and educational content

benchmark document-retrieval embeddings evaluation hybrid-search information-retrieval multimodal-retrieval nlp ocr rag semantic-search table-extraction vector-search

Last synced: 12 Mar 2026

https://github.com/myatmyintzuthin/extract-table

Table Cell Coordinate Extraction From Image

image-processing table-extraction

Last synced: 29 Mar 2025

https://github.com/shahin-ro/table-detection

Python tool for table extraction & Persian OCR. Uses OpenCV for table detection, Tesseract for text extraction, & Pandas for data output. Visualizes cells & text. Ideal for Persian documents! 📄✨

colab computer-vision data-extraction data-visualization document-processing image-analysis image-processing machine-learning matplotlib numpy ocr opencv pandas persian-ocr persian-text python table-detection table-extraction tesseract text-recognition

Last synced: 08 Apr 2026

https://github.com/maxinexiong/acme-work-items-rpa

This repository contains a robust UiPath automation solution that utilises the UiPath REFramework to fulfill the specified requirements, which include automating data scraping from acme-test.com, filtering specific records, and appending the results into an Excel worksheet.

acme-challenge data-scraping datatable excel-operations reframework robotic-enterprise-framework robotic-process-automation rpa table-extraction uipath uipath-modern-design uipath-reframework uipath-studio web-scraping

Last synced: 21 Jan 2026

https://github.com/maxinexiong/web-scraping-rpa

This repository contains an RPA robot that was designed to scrape up to 500 pieces of property information for a given location from a real estate website. The extracted data is then intelligently organized, filtered, and sorted according to user-defined criteria, and integrated into the Excel file, output.xlsx.

data-scraping data-table excel-processing robotic-process-automation rpa table-extraction uipath uipath-classic-design uipath-modern-design uipath-studio web-scraping

Last synced: 20 Mar 2026

https://github.com/timothy-bartlett/pymupdf

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

data-science extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction text-processing text-shaping xps

Last synced: 18 May 2026

https://github.com/jrodal98/paginated-table-extractor

A Python script that automates the extraction of data from paginated tables.

data-extraction selenium-python selenium-webdriver table-extraction webscraping

Last synced: 11 Apr 2026

https://github.com/harubi/bolivar

High-performance PDF table extraction library. Bindings for Python and JVM.

jvm pdf pdf-parsing python rust table-extraction text-extraction

Last synced: 22 Feb 2026

https://github.com/maxinexiong/acme-dispatcher-performer-invoice-check-bot-rpa

This repository hosts a UiPath automation solution with separate Dispatcher and Performer sub-processes. The Dispatcher bot adds queue items to Orchestrator Queue, while the Performer bot searches invoices, extracts and compares data. Leveraging UiPath REFramework, this workflow provides a robust scalable solution for invoice checking tasks.

acme-challenge data-scraping dispatcher invoice-checker performer reframework robotic-enterprise-framework robotic-process-automation rpa table-extraction uipath uipath-automation-cloud uipath-modern-design uipath-orchestrator uipath-orchestrator-queue uipath-queue uipath-reframework uipath-studio web-scraping

Last synced: 20 Mar 2026

https://github.com/maxinexiong/acme-vendor-check-bot-rpa

This repository contains a robust UiPath automation solution utilising the REFramework, crafted to fulfill the specified requirements, including extracting a data table from acme-test.com, comparing vendor information, handling various business exceptions, and appending the results into an Excel worksheet.

acme-challenge data-scraping datatable excel-operations reframework robotic-enterprise-framework robotic-process-automation rpa table-extraction uipath uipath-modern-design uipath-reframework uipath-studio vendor-checker web-scraping

Last synced: 20 Jan 2026

https://github.com/maxinexiong/coronavirus-stat-alert-bot-rpa

An automation solution designed to meet the challenge of creating a Coronavirus stat-alert bot. This bot is capable of scraping Coronavirus statistics for a user-specified country and sending an email update with the collected data to specified recipients.

covid19-data data-scraping data-table email-automation email-sender outlook-email robotic-process-automation rpa table-extraction uipath uipath-studio

Last synced: 20 Mar 2026