Extract PDF Tables
https://github.com/alexcg1/executor-pdf-table-extractor

Host: GitHub
URL: https://github.com/alexcg1/executor-pdf-table-extractor
Owner: alexcg1
Created: 2022-07-27T13:49:45.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-08-03T11:07:25.000Z (over 2 years ago)
Last Synced: 2024-10-17T08:15:35.557Z (about 1 month ago)
Language: Python
Size: 6.68 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# PDFTableExtractor

This Executor uses docs2info's [Table Extraction service](https://docs2info.com/services/tables/) to extract tables from PDF files and store them as chunks. Each chunk holds one table as a CSV string in `chunk.text`, so an encoder can process it as plain text and a front-end can render it as a table.
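As a rough illustration of the front-end side, the sketch below parses a CSV string like the one stored in `chunk.text` back into rows using only the Python standard library. The helper name `csv_text_to_rows` and the sample text are hypothetical; the actual output of the service may be shaped differently.

```python
import csv
import io


def csv_text_to_rows(csv_text: str) -> list[list[str]]:
    """Parse a CSV string (as stored in chunk.text) into a list of rows."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    # Skip blank lines, which csv.reader yields as empty lists.
    return [row for row in reader if row]


# Hypothetical chunk text; real output from the service may differ.
sample_text = 'Item,Price\n"Widget, large",4.50\nGadget,12.00\n'

rows = csv_text_to_rows(sample_text)
header, body = rows[0], rows[1:]

# Pair each cell with its column name, as a table renderer might.
for row in body:
    print(dict(zip(header, row)))
```

Treating the first row as the header is an assumption made here; a table without a header row would need different handling.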

If you have a better idea for storing tabular data in a form that an encoder can still recognize, open an issue in the [repo](https://github.com/alexcg1/executor-pdf-table-extractor/issues).

## 🚨🚨 IMPORTANT WARNING 🚨🚨

Do not use this with confidential or sensitive documents. The table extraction service is still under heavy development, and uploaded PDFs are used to train its model.

## License

The code is adapted from [`extract_tables.py`](https://docs2info.com/extract_tables.py) with the kind permission of the [author](https://twitter.com/docs2info). No license is specified.