Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/trisongz/pypdf-lib

A (maybe) Better PDF Parsing for Python focused on textual extraction
https://github.com/trisongz/pypdf-lib

Last synced: about 2 months ago
JSON representation

A (maybe) Better PDF Parsing for Python focused on textual extraction

Host: GitHub
URL: https://github.com/trisongz/pypdf-lib
Owner: trisongz
License: mit
Created: 2021-05-10T22:21:21.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2021-06-02T22:18:58.000Z (over 3 years ago)
Last Synced: 2024-09-21T10:16:38.723Z (3 months ago)
Language: Python
Size: 9.18 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # pypdf-lib

 

A (maybe) Better PDF Parsing for Python focused on textual extraction. WIP. 

This library is a Python Wrapper built around [PdfAct](https://github.com/ad-freiburg/pdfact), which is built using Java.

## Pre-requisites

- `Java`

```bash

# Linux

apt-get update && apt-get install -y default-jre # openjdk-8-jre-headless / openjdk-11-jdk / openjdk-11-jre-headless

# Mac

brew install java

# Windows

# idk

```

## Installation

```python

!pip install --upgrade git+https://github.com/trisongz/pypdf-lib.git

!pip install --upgrade pypdf-lib

```

## Usage

```python

from pypdf import PyPDF

from fileio import File

base_dir = '/content/output'

File.mkdirs(base_dir)

# Using a remap function expects the file extension to be mapped properly - i.e. if 'txt' is selected, .txt file extension should be returned.

def remap_fnames(fname):

    fname = File.base(fname).replace('- ', '').replace(' ', '_').strip().replace('.pdf', '.json')

    return File.join(base_dir, fname)

converter = PyPDF(input_dir='/content/inputs', output_dir='/content/output', units=['paragraphs', 'blocks'], visualize=True)

# remap_funct is optional. 

for res in converter.extract(remap_funct=remap_fnames):

    print(res)

    # > /content/output/your_json_file_1.json

converter.extracted

'''

{'/content/inputs/input_1.pdf': '/content/output/your_json_file_1.json',

 '/content/inputs/input_2.pdf': '/content/output/your_json_file_2.json',

 '/content/inputs/input_3.pdf': '/content/output/your_json_file_3.json',

 'params': {'exclude_roles': None,

  'format': 'json',

  'include_roles': ['title',

   'body',

   'appendix',

   'keywords',

   'heading',

   'general_terms',

   'toc',

   'caption',

   'table',

   'other',

   'categories',

   'keywords',

   'page_header'],

  'units': ['paragraphs', 'blocks'],

  'visualize': True,

  'with_control_characters': False}}

'''

```