https://github.com/alchemine/python-pdf-reader
Example code for reading and processing PDF files with python
- Host: GitHub
- URL: https://github.com/alchemine/python-pdf-reader
- Owner: alchemine
- Created: 2023-11-19T06:36:35.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2023-11-20T04:18:45.000Z (12 months ago)
- Last Synced: 2024-04-18T00:11:31.234Z (7 months ago)
- Topics: pdf, python
- Language: Jupyter Notebook
- Homepage:
- Size: 787 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Python PDF Reader
Example code for reading and processing PDF files with Python.
# 1. Tabular PDF
## 1.1 Download Dataset
- [script/download_tabular_pdf.sh](https://github.com/alchemine/python-pdf-reader/blob/main/script/download_tabular_pdf.sh)
## 1.2 Processing Data
- [python_pdf_reader/read_summary_layout.ipynb](https://github.com/alchemine/python-pdf-reader/blob/main/python_pdf_reader/read_tabular_pdf.ipynb)
### 1.2.1 Split Columns
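- **Manual**
```python
from tabula import read_pdf
# summary_path must hold the path to the PDF; it is not defined in this snippet
columns = [93, 126, 150, 200, 320, 390, 660, 703]  # column boundary x-coordinates, in points
# guess=False disables automatic table detection so the explicit boundaries apply
raw_datas = read_pdf(summary_path, columns=columns, guess=False, pages='all', silent=True)
```
- **Automatic**
```python
from tabula import read_pdf
# guess=True lets tabula detect the table area on each page by itself
raw_datas = read_pdf(summary_path, guess=True, pages='all', silent=True)
```
### 1.2.2 Processing Output DataFrame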
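With its default settings, `read_pdf` returns a list of pandas DataFrames rather than a single table, so a typical first step is to stack them. The following is a minimal sketch of that step, not the code from the repository's notebook. The stand-in `raw_datas`, the column names `id` and `name`, and the helper `combine_tables` are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the tables read_pdf would return;
# the column names and values are invented for this illustration.
raw_datas = [
    pd.DataFrame({"id": ["1", "2"], "name": [" alpha ", "beta"]}),
    pd.DataFrame({"id": ["3", None], "name": ["gamma", None]}),
]


def combine_tables(tables, text_columns):
    """Stack per-table DataFrames into one and tidy the text columns."""
    # Stack all tables and renumber the rows from 0.
    df = pd.concat(tables, ignore_index=True)
    # Drop rows in which every cell is empty.
    df = df.dropna(how="all")
    # Strip stray whitespace in the named text columns; assign returns a new DataFrame.
    df = df.assign(**{col: df[col].str.strip() for col in text_columns})
    return df.reset_index(drop=True)


data = combine_tables(raw_datas, text_columns=["name"])
print(data)
```

In real use, `raw_datas` would come from one of the `read_pdf` calls in section 1.2.1, and the cleaning steps would depend on the layout of the PDF at hand.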