https://github.com/philgooch/pdftable
A fork of Kyle Cronan's Python 2.5 pdftable library, now updated for Python 3
- Host: GitHub
- URL: https://github.com/philgooch/pdftable
- Owner: philgooch
- License: gpl-3.0
- Fork: true (jeremyjbowers/pdftable)
- Created: 2017-10-30T19:19:41.000Z (about 8 years ago)
- Default Branch: develop
- Last Pushed: 2017-11-04T16:07:36.000Z (about 8 years ago)
- Last Synced: 2025-10-20T12:44:44.788Z (3 months ago)
- Topics: csv, pdf, python, python3-library, table-extraction, textmining
- Language: Python
- Homepage:
- Size: 38.1 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# PDF Tablr
Version: 0.1.0
This is a Python 3 module and command line utility that analyzes XML output from the
program `pdftohtml` in order to extract tables from PDF files and output the data as CSV.
For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h` and http://sourceforge.net/projects/pdftable
Original author: (c) 2009 Kyle Cronan
This Python 3 implementation: (c) 2017 Phil Gooch
As per Kyle's code, this version is licensed under GPLv3. See LICENSE file.
# Installation
Install `pdftohtml` via the `poppler-utils` package (Linux) or `poppler` (macOS).

Then install the module:

    python setup.py install

or

    pip install pdftablr

## Command line usage
Extract each table into a separate CSV file:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

Extract all tabular data into a single CSV file:

    pdftohtml -xml -stdout file.pdf | pdftable -f file.csv

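If you would rather drive the same pipeline from a Python script than from a shell, the sketch below shows one possible way to do it with the standard `subprocess` module. It assumes both `pdftohtml` and `pdftable` are on your `PATH`; the helper names `build_commands` and `run_pipeline` are made up for this example and are not part of the package.

```python
import subprocess


def build_commands(pdf_path, output_pattern):
    """Return the argument lists for both halves of the shell pipeline."""
    producer = ["pdftohtml", "-xml", "-stdout", pdf_path]
    consumer = ["pdftable", "-f", output_pattern]
    return producer, consumer


def run_pipeline(pdf_path, output_pattern):
    """Pipe pdftohtml's XML output into pdftable, like the shell `|`."""
    producer_cmd, consumer_cmd = build_commands(pdf_path, output_pattern)
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so pdftohtml is notified if pdftable exits early.
    producer.stdout.close()
    consumer_status = consumer.wait()
    producer_status = producer.wait()
    if producer_status != 0 or consumer_status != 0:
        raise RuntimeError(
            "pipeline failed: pdftohtml=%d, pdftable=%d"
            % (producer_status, consumer_status)
        )
```

Calling `run_pipeline("file.pdf", "file%d.csv")` should be equivalent to the first command above, and `run_pipeline("file.pdf", "file.csv")` to the second.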
## Module usage
    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'

    # Output CSV file
    output_path = '/path/to/output.csv'

    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)
        with open(input_path) as f:
            table_extractor.read_file(f)
        tables = table_extractor.extract()
        for table in tables:
            table.output(writer=None)

# TODO
- Investigate why Table.columns is sometimes initialised with empty columns
- Refactor all the file handling
- Execute pdftohtml within the code to allow PDF input
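
As a rough idea of what the last item could look like, here is a minimal sketch that runs `pdftohtml` from Python and returns its XML as an in-memory text stream. Nothing here exists in the package: `pdf_to_xml_stream` is a hypothetical helper, the `run` parameter is only a hook for substituting a fake in tests, and UTF-8 output from `pdftohtml` is assumed.

```python
import io
import subprocess


def pdf_to_xml_stream(pdf_path, run=subprocess.run):
    """Run pdftohtml on a PDF and return its XML output as a text stream."""
    result = run(
        ["pdftohtml", "-xml", "-stdout", pdf_path],
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftohtml fails
    )
    # Decode to text (assumed UTF-8) so the stream resembles a file
    # opened in text mode with open(path).
    return io.StringIO(result.stdout.decode("utf-8"))
```

The returned stream could then be handed to `table_extractor.read_file(...)` in place of the XML file opened in the module usage example, on the untested assumption that `read_file` accepts any file-like object.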