https://github.com/trisongz/pypdf-lib
A (maybe) Better PDF Parsing for Python focused on textual extraction
- Host: GitHub
- URL: https://github.com/trisongz/pypdf-lib
- Owner: trisongz
- License: mit
- Created: 2021-05-10T22:21:21.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-06-02T22:18:58.000Z (over 3 years ago)
- Last Synced: 2024-09-21T10:16:38.723Z (3 months ago)
- Language: Python
- Size: 9.18 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pypdf-lib
A (maybe) Better PDF Parsing for Python focused on textual extraction. WIP.

This library is a Python wrapper around [PdfAct](https://github.com/ad-freiburg/pdfact), which is written in Java.
## Pre-requisites
- `Java`
```bash
# Linux
apt-get update && apt-get install -y default-jre # openjdk-8-jre-headless / openjdk-11-jdk / openjdk-11-jre-headless

# Mac
brew install java

# Windows
# idk
```

## Installation
```python
!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git
!pip install --upgrade pypdf-lib
```
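
Because the library wraps a Java tool, it can help to confirm that a `java` executable is reachable before running an extraction. The following is a minimal sketch using only the Python standard library; the helper name `java_available` is hypothetical and not part of pypdf-lib.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on the PATH and runs successfully.

    Hypothetical helper for illustration; pypdf-lib does not provide it.
    """
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with code 0 on a working installation.
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found")
    else:
        print("Java not found; install a JRE first")
```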
## Usage
```python
from pypdf import PyPDF
from fileio import File

base_dir = '/content/output'
File.mkdirs(base_dir)

# A remap function must return a filename whose extension matches the
# selected format, e.g. a .txt filename if 'txt' is selected.
def remap_fnames(fname):
    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')
    return File.join(base_dir, fname)

converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)

# remap_funct is optional.
for res in converter.extract(remap_funct=remap_fnames):
    print(res)
# > /content/output/your_json_file_1.json

converter.extracted
'''
{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',
 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',
 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',
 'params': {'exclude_roles': None,
            'format': 'json',
            'include_roles': ['title',
                              'body',
                              'appendix',
                              'keywords',
                              'heading',
                              'general_terms',
                              'toc',
                              'caption',
                              'table',
                              'other',
                              'categories',
                              'keywords',
                              'page_header'],
            'units': ['paragraphs', 'blocks'],
            'visualize': True,
            'with_control_characters': False}}
'''
```
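
The `converter.extracted` mapping shown above pairs each input PDF with its output file and also carries a `'params'` entry. As a minimal sketch of how the results might be read back, assuming the output format is `json` as in the example, the hypothetical helper below (`load_outputs`, not part of pypdf-lib) skips `'params'` and parses each output file. It makes no assumption about the structure of the JSON that PdfAct produces.

```python
import json


def load_outputs(extracted):
    """Map each input PDF path to its parsed JSON output.

    `extracted` is assumed to have the shape of `converter.extracted`:
    input path -> output path, plus a 'params' key that is skipped.
    """
    results = {}
    for pdf_path, json_path in extracted.items():
        if pdf_path == "params":
            continue
        with open(json_path, encoding="utf-8") as fh:
            results[pdf_path] = json.load(fh)
    return results
```

After an extraction run, `load_outputs(converter.extracted)` would return a dictionary keyed by input path, so each document's parsed content can be inspected without rebuilding the output filenames.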