https://github.com/alexcg1/executor-pdf-table-extractor
Extract PDF Tables
- Host: GitHub
- URL: https://github.com/alexcg1/executor-pdf-table-extractor
- Owner: alexcg1
- Created: 2022-07-27T13:49:45.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-08-03T11:07:25.000Z (over 2 years ago)
- Last Synced: 2024-10-17T08:15:35.557Z (about 1 month ago)
- Language: Python
- Size: 6.68 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PDFTableExtractor
This Executor uses docs2info's [Table Extraction service](https://docs2info.com/services/tables/) to extract tables from PDF files and store them as chunks. Each chunk holds a table as a CSV string in `chunk.text`, so an encoder can parse it and a front-end can display it by rendering the CSV as a table.
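To make the chunk format concrete, here is a minimal sketch of the round trip between a table and the CSV string kept in `chunk.text`. It uses only Python's standard `csv` module and does not call the Executor or the docs2info service. The helper names `table_to_csv_text` and `csv_text_to_table` and the sample table are hypothetical, chosen for illustration.

```python
import csv
import io


def table_to_csv_text(rows):
    """Serialize a table (a list of rows) into a CSV string, the shape stored in chunk.text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def csv_text_to_table(text):
    """Parse a CSV string (for example, the value of chunk.text) back into a list of rows."""
    return list(csv.reader(io.StringIO(text)))


# Hypothetical table standing in for one extracted from a PDF.
table = [["item", "price"], ["Widget, large", "9.99"]]

text = table_to_csv_text(table)      # what would be saved in chunk.text
restored = csv_text_to_table(text)   # what a front-end could render as a table
```

Going through the `csv` module rather than splitting on commas keeps cells that contain commas or quotes intact, which matters for tables pulled from real documents.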
If anyone has a better idea of how to store tabular data in a way that can still be recognized by an encoder, open an issue in the [repo](https://github.com/alexcg1/executor-pdf-table-extractor/issues).
## 🚨🚨 IMPORTANT WARNING 🚨🚨
Do not use this with confidential or sensitive documents. The table extraction service is still under heavy development, and its operators use uploaded PDFs to train their model.
## License
The code is adapted from [`extract_tables.py`](https://docs2info.com/extract_tables.py) with the kind permission of the [author](https://twitter.com/docs2info). No license is specified.