Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chezou/tabula-py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
https://github.com/chezou/tabula-py
pandas pdf python tabula tabula-java
Last synced: 2 days ago
JSON representation
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
- Host: GitHub
- URL: https://github.com/chezou/tabula-py
- Owner: chezou
- License: mit
- Created: 2016-09-10T08:18:37.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2024-10-17T02:50:01.000Z (about 2 months ago)
- Last Synced: 2024-10-29T11:32:13.790Z (about 1 month ago)
- Topics: pandas, pdf, python, tabula, tabula-java
- Language: Python
- Homepage:
- Size: 42.4 MB
- Stars: 2,182
- Watchers: 47
- Forks: 300
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- stars - chezou/tabula-py - java: extract table from PDF into pandas DataFrame (HarmonyOS / Windows Manager)
README
# tabula-py
[![Build Status](https://github.com/chezou/tabula-py/actions/workflows/pythontest.yml/badge.svg)](https://github.com/chezou/tabula-py/actions/workflows/pythontest.yml)
[![PyPI version](https://badge.fury.io/py/tabula-py.svg)](https://badge.fury.io/py/tabula-py)
[![Documentation Status](https://readthedocs.org/projects/tabula-py/badge/?version=latest)](https://tabula-py.readthedocs.io/en/latest/?badge=latest)
![PyPI - Downloads](https://img.shields.io/pypi/dw/tabula-py)
[![](https://img.shields.io/badge/-Sponsor-fafbfc?logo=GitHub%20Sponsors
)](https://github.com/sponsors/chezou)`tabula-py` is a simple Python wrapper of [tabula-java](https://github.com/tabulapdf/tabula-java), which can read tables in a PDF.
You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV or a JSON file.You can see [the example notebook](https://nbviewer.jupyter.org/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb) and try it on Google Colab, or we highly recommend reading [our documentation](https://tabula-py.readthedocs.io/en/latest/), especially the FAQ section.
![tabula-py example](https://github.com/chezou/tabula-py/raw/master/example.png)
## Requirements
- Java 8+
- Python 3.9+### OS
I confirmed working on macOS and Ubuntu. But some people confirm it works on Windows 10. See also [the documentation for the detailed installation for Windows 10](https://tabula-py.readthedocs.io/en/latest/getting_started.html#get-tabula-py-working-windows-10).
## Usage
- [Documentation](https://tabula-py.readthedocs.io/en/latest/)
- [FAQ](https://tabula-py.readthedocs.io/en/latest/faq.html) would be helpful if you have an issue
- [Example notebook on Google Colaboratory](https://colab.research.google.com/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb)### Install
Ensure you have a Java runtime and set the PATH for it.
```bash
pip install tabula-py
```If you want to leverage faster execution with jpype, install with `jpype` extra.
```sh
pip install tabula-py[jpype]
```### Example
tabula-py enables you to extract tables from a PDF into a DataFrame, or a JSON. It can also extract tables from a PDF and save the file as a CSV, a TSV, or a JSON.
```py
import tabula# Read pdf into list of DataFrame
dfs = tabula.read_pdf("test.pdf", pages='all')# Read remote pdf into list of DataFrame
dfs2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")# convert PDF into CSV file
tabula.convert_into("test.pdf", "output.csv", output_format="csv", pages='all')# convert all PDFs in a directory
tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')
```See [an example notebook](https://nbviewer.jupyter.org/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb) for more details. I also recommend reading [the tutorial article](https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py) written by [@aegis4048](https://github.com/aegis4048), and [another tutorial](https://www.dunderdata.com/blog/read-trapped-tables-within-pdfs-as-pandas-dataframes) written by [@tdpetrou](https://github.com/tdpetrou).
### Contributing
Interested in helping out? I'd love to have your help!
You can help by:
- [Reporting a bug](https://github.com/chezou/tabula-py/issues).
- Adding or editing documentation.
- Contributing code via a Pull Request. See also [for the contribution](docs/contributing.rst)
- Write a blog post or spread the word about `tabula-py` to people who might be able to benefit from using it.#### Contributors
- [@lahoffm](https://github.com/lahoffm)
- [@jakekara](https://github.com/jakekara)
- [@lcd1232](https://github.com/lcd1232)
- [@kirkholloway](https://github.com/kirkholloway)
- [@CurtLH](https://github.com/CurtLH)
- [@nikhilgk](https://github.com/nikhilgk)
- [@krassowski](https://github.com/krassowski)
- [@alexandreio](https://github.com/alexandreio)
- [@rmnevesLH](https://github.com/rmnevesLH)
- [@red-bin](https://github.com/red-bin)
- [@Gallaecio](https://github.com/Gallaecio)
- [@red-bin](https://github.com/red-bin)
- [@alexandreio](https://github.com/alexandreio)
- [@bpben](https://github.com/bpben)
- [@Bueddl](https://github.com/Bueddl)
- [@cjotade](https://github.com/cjotade)
- [@codeboy5](https://github.com/codeboy5)
- [@manohar-voggu](https://github.com/manohar-voggu)
- [@deveshSingh06](https://github.com/deveshSingh06)
- [@grfeller](https://github.com/grfeller)
- [@djbrown](https://github.com/djbrown)
- [@swar](https://github.com/swar)
- [@mvoggu](https://github.com/mvoggu)
- [@tdpetrou](https://github.com/tdpetrou)#### Another support
You can also support our continued work on `tabula-py` with a donation on GitHub Sponsors or [Patreon](https://www.patreon.com/chezou).