https://github.com/philgooch/pdftable

A fork of Kyle Cronan's Python 2.5 pdftable library, now updated for Python 3

csv pdf python python3-library table-extraction textmining

Host: GitHub
URL: https://github.com/philgooch/pdftable
Owner: philgooch
License: gpl-3.0
Fork: true (jeremyjbowers/pdftable)
Created: 2017-10-30T19:19:41.000Z (over 8 years ago)
Default Branch: develop
Last Pushed: 2017-11-04T16:07:36.000Z (over 8 years ago)
Last Synced: 2025-10-20T12:44:44.788Z (5 months ago)
Topics: csv, pdf, python, python3-library, table-extraction, textmining
Language: Python
Homepage:
Size: 38.1 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE


# PDF Tablr

Version: 0.1.0 [![Build Status](https://travis-ci.org/philgooch/pdftable.svg)](https://travis-ci.org/philgooch/pdftable/)

This is a Python 3 module and command line utility that analyzes XML output from the
program `pdftohtml` in order to extract tables from PDF files and output the data as CSV.

For example:

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

See also `pdftable -h` and http://sourceforge.net/projects/pdftable

Like Kyle Cronan's original code, this version is licensed under GPLv3. See the LICENSE file.

# Installation

Install `pdftohtml` via the `poppler-utils` package (Linux) or the `poppler` package (Mac OS X).

Then install the module:

    python setup.py install

or

    pip install pdftablr

## Command line usage

Extract each table into a separate CSV file (note the `%d` in the output file name):

    pdftohtml -xml -stdout file.pdf | pdftable -f file%d.csv

Extract all tabular data into a single CSV file (no `%d` in the output file name):

    pdftohtml -xml -stdout file.pdf | pdftable -f file.csv
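
The two commands differ only in the `%d` in the output file name. The README does not spell out how `%d` is expanded. A plausible reading is printf-style substitution of a per-table number. The sketch below illustrates that assumption only: `output_names` is a hypothetical helper, not part of `pdftable`, and the starting number of 1 is a guess.

```python
def output_names(pattern, table_count, start=1):
    """Return one output file name per table for a printf-style pattern.

    Hypothetical helper: it only illustrates the assumed expansion of
    '%d' and is not part of pdftable. A pattern without '%d' maps every
    table to the same file name.
    """
    names = []
    for number in range(start, start + table_count):
        if "%d" in pattern:
            # Assumed behaviour: substitute the table number for '%d'.
            names.append(pattern % number)
        else:
            # No placeholder: all tables share a single file.
            names.append(pattern)
    return names


print(output_names("file%d.csv", 3))
print(output_names("file.csv", 3))
```

If the real numbering starts at 0 instead, passing `start=0` gives the corresponding names.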

## Module usage

    from pdftablr.table_extractor import Extractor

    # XML file created from pdftohtml
    input_path = '/path/to/file.xml'

    # Output CSV file
    output_path = '/path/to/output.csv'

    # Keep the output file open until every table has been written
    with open(output_path, 'w') as output_file:
        table_extractor = Extractor(output_file=output_file)

        # Load the pdftohtml XML
        with open(input_path) as f:
            table_extractor.read_file(f)

        # Write each extracted table to the output file
        tables = table_extractor.extract()
        for table in tables:
            table.output(writer=None)
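
After the extractor has written the CSV, it can be useful to read the file back and check it. The TODO list in this README mentions that empty columns sometimes appear, so that is one thing worth looking for. The following is a minimal sketch using only the standard `csv` module. It assumes the output is ordinary comma-delimited CSV, and `read_rows` and `empty_columns` are illustrative helper names, not part of `pdftablr`.

```python
import csv


def read_rows(csv_path):
    """Return the rows of a CSV file as a list of lists of strings."""
    # newline='' is what the csv module documentation recommends for
    # files handed to csv.reader, so quoted line breaks are preserved.
    with open(csv_path, newline="") as f:
        return list(csv.reader(f))


def empty_columns(rows):
    """Return the indexes of columns whose cells are all empty or missing."""
    width = max((len(row) for row in rows), default=0)
    return [
        i
        for i in range(width)
        if all(i >= len(row) or row[i].strip() == "" for row in rows)
    ]
```

For example, `empty_columns(read_rows(output_path))` returns a list of column indexes that contain no data in any row.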

# TODO
- Investigate why Table.columns is sometimes initialised with empty columns
- Refactor all the file handling
- Execute pdftohtml within the code to allow PDF input
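
The last item could be approached by calling `pdftohtml` through the standard `subprocess` module. The sketch below is one possible shape, not the project's implementation. It reuses the `-xml -stdout` flags from the command line examples, assumes `pdftohtml` is on the `PATH` and emits UTF-8, and needs Python 3.7 or later for `capture_output`. The function names are hypothetical.

```python
import subprocess


def pdftohtml_command(pdf_path):
    """Build the pdftohtml invocation used in the README examples."""
    return ["pdftohtml", "-xml", "-stdout", pdf_path]


def pdf_to_xml(pdf_path):
    """Run pdftohtml on a PDF and return its XML output as text.

    Raises FileNotFoundError if pdftohtml is not installed, and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    result = subprocess.run(
        pdftohtml_command(pdf_path),
        check=True,           # fail loudly on a non-zero exit status
        capture_output=True,  # collect stdout instead of printing it
    )
    # Assumption: the XML is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned text could then be wrapped in `io.StringIO` and passed to `Extractor.read_file`, assuming that method accepts any text file object, as the module usage example suggests.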
