https://github.com/leandroroser/prettyparser

Parallel processing and parsing PDF and TXT files, and Python objects with text (str, list) using rules (regular expressions).

Host: GitHub
URL: https://github.com/leandroroser/prettyparser
Owner: leandroroser
License: apache-2.0
Created: 2021-11-09T02:05:02.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2023-01-29T22:03:01.000Z (about 3 years ago)
Last Synced: 2025-09-22T19:11:32.965Z (7 months ago)
Topics: pdf-parser, regex
Language: Python
Homepage: https://pypi.org/project/prettyparser/
Size: 106 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

![icon](https://user-images.githubusercontent.com/10769732/140857203-e0580717-52c3-4cdd-affc-00ad5bf0a526.png)

prettyparser is a Python library for parallel processing and parsing of PDF/TXT files and Python objects containing text (str, list) using rules (regular expressions).

For PDF files, the package reads the content with pdfplumber and then performs a series of data manipulations to produce higher-quality output, removing the boilerplate code needed to read, process, and write the content of multiple files with multiple pages. A custom processing function that takes a pdfplumber page and returns processed text can also be supplied. Additional processing steps can be added via custom regular expressions, which are compiled for improved speed.
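To make the rule format concrete, here is a minimal sketch of how `[pattern, replacement]` and `[pattern, replacement, flags]` rules could be compiled once and then applied in order. It is an illustration only, not prettyparser's actual implementation: the helper names `compile_rules` and `apply_rules` are hypothetical, and it uses the standard-library `re` module rather than the `regex` package imported in the examples.

```python
import re

def compile_rules(rules):
    """Compile each rule once so the patterns can be reused across many texts."""
    compiled = []
    for rule in rules:
        pattern, replacement = rule[0], rule[1]
        # The third element (regex flags) is optional.
        flags = rule[2] if len(rule) > 2 else 0
        compiled.append((re.compile(pattern, flags), replacement))
    return compiled

def apply_rules(text, compiled):
    """Apply the compiled rules to the text, one after another."""
    for pattern, replacement in compiled:
        text = pattern.sub(replacement, text)
    return text

rules = [
    [r"\n\s*\d+\s*\n", "\n\n"],           # a line holding only a page number
    [r"some header", "", re.IGNORECASE],  # rule with an explicit flag
]
compiled = compile_rules(rules)
cleaned = apply_rules("SOME HEADER\nfirst page\n 12 \nsecond page", compiled)
print(repr(cleaned))
```

Because rules run in order, a later rule sees the output of the earlier ones, so the order of the list matters.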

## Installation

```
$ git clone https://github.com/leandroroser/prettyparser
$ cd prettyparser
$ pip install -e .
```

or

```
$ pip install prettyparser
```

## Example: processing a series of PDF files

```Python
import regex as re
from prettyparser import PrettyParser

files = ["./BOOKS/PDF/PDF1.pdf", "./BOOKS/PDF/PDF2.pdf"]
output = "./BOOKS/TXT"

# Each rule is [pattern, replacement] or [pattern, replacement, flags].
parser = PrettyParser(files, None, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],  # lines holding only a number
                              [r"\n\s*-\d-\s*\n", r'\n\n'],                  # markers such as -3-
                              [r"\n\s*(\* *)+\s*\n", r'\n\n'],               # separator lines of asterisks
                              [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
parser.run()
```

## Example: processing a folder with multiple PDF files

```Python
import regex as re
from prettyparser import PrettyParser

directory = "./BOOKS/PDF"
output = "./BOOKS/TXT"

# Pass None as the file list and give the directory as the second argument.
parser = PrettyParser(None, directory, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
                              [r"\n\s*-\d-\s*\n", r'\n\n'],
                              [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                              [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
parser.run()
```

## Example: processing a folder with multiple TXT files

Let's assume that the previous output isn't good enough and needs additional corrections. 

A quicker way for testing additional corrections can be implemented by using the previous TXT output:

```Python
import regex as re
from prettyparser import PrettyParser

directory = "./BOOKS/TXT"
output = "./BOOKS/TXT_REPARSED"

parser = PrettyParser(None, directory, output, mode = 'txt',
                      args = [[r"some other header.*\d+", r''],
                              [r"^\d+.*", r'', re.MULTILINE],
                              [r"([A-Z]+)( *\n)([A-Z]+)", r'\1\3']],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
parser.run()
```

## Example: processing a Python str for a quick test of the app

```Python
import regex as re
from prettyparser import PrettyParser

txt = """

header to remove

This is a text with multiple problems. For exam-

ple the latter word can be joined. 

The portions of this line can be

joined

in a single line.

HERE ALSO IS SOME

UPPERCASE TEXT

TO JOIN

Some Other Ugly Stuff To Remove IGNORING Case. 

Remove the line below:

* * * 

Remove empty lines and finally separate paragraphs with a blank line.

Below is the page number->.

99

"""

parser = PrettyParser(txt, mode = "pyobj",
                      args = [[r"\s*header to remove\s*\n", r""],
                              [r"(\n\s*\d+\s*\n)", r'\n\n'],
                              [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                              [r"\n.*some other ugly stuff.*", r'\n\n', re.IGNORECASE]],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
output = parser.run()
print(output[0])
```

```
This is a text with multiple problems. For example the latter word can be joined.

The portions of this line can be joined in a single line.

HERE ALSO IS SOME UPPERCASE TEXT TO JOIN

Remove the line below:

Remove empty lines and finally separate paragraphs with a blank line.

Below is the page number->.
```
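The hyphen merging and line joining visible in this output can be approximated with two regular expressions. The sketch below is a naive illustration under stated assumptions, not the library's default parser: `merge_hyphen_eol` and `join_wrapped_lines` are hypothetical helper names, and the hyphen rule would also merge genuine compounds that happen to break at a line end.

```python
import re

def merge_hyphen_eol(text):
    """Join a word split by an end-of-line hyphen, e.g. "exam-" + "ple"."""
    return re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)

def join_wrapped_lines(text):
    """Turn single newlines inside a paragraph into spaces.

    Blank-line paragraph breaks (two consecutive newlines) are left alone.
    """
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

sample = "For exam-\nple the latter word\ncan be joined.\n\nNext paragraph."
result = join_wrapped_lines(merge_hyphen_eol(sample))
print(result)
```

Running the hyphen merge first matters: once the newline has been replaced by a space, the split word can no longer be told apart from a dash followed by a space.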

## Running from the command line

```
prettyparser --directories /home/BOOKS --output /home/BOOKS_PARSED --mode 'pdf'
```

Arguments
---------

- **files (list or str)**: paths to parse in 'pdf' or 'txt' mode. If a string is passed, it is treated as a directory when mode is 'pdf' or 'txt'. When mode is 'pyobj', a str or list is treated as text (or a list of texts) already loaded in memory
- **output (str)**: output directory
- **args (list)**: list of rules of the form (regex, replacement, flags). The flags element is optional
- **mode (str)**: 'pdf', 'txt' or 'pyobj' (the latter for Python lists and strings)
- **default (bool)**: if True (the default), perform several default cleanup operations
- **remove_whitelines (bool)**: if True, remove blank (whitespace-only) lines
- **paragraphs_spacing (int)**: number of newlines between paragraphs
- **page_spacing (str)**: string to insert between pages
- **remove_hyphen_eol (bool)**: if True, remove end-of-line hyphens and merge the word parts
- **custom_pdf_fun (Callable)**: custom function to parse PDF files. It must accept a pdfplumber page as its argument and return text to be joined with the previous pages
- **overwrite (bool)**: overwrite the output file if it exists. Default: False
- **n_jobs (int)**: number of jobs. Default: number of cores - 1
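As an illustration of the `custom_pdf_fun` contract described above (one pdfplumber page in, one string out), the following sketch crops fixed header and footer bands before extracting text. The function name and the margin values are hypothetical choices for this example. It relies on `page.bbox`, `page.crop()` and `extract_text()`, which pdfplumber pages provide.

```python
def crop_margins_and_extract(page, top_margin=50, bottom_margin=50):
    """Return the text of a pdfplumber page, skipping header and footer bands.

    The margins are in PDF points and are arbitrary example values.
    """
    # pdfplumber bounding boxes are (x0, top, x1, bottom), measured from the top-left.
    x0, top, x1, bottom = page.bbox
    body = page.crop((x0, top + top_margin, x1, bottom - bottom_margin))
    # Depending on the pdfplumber version, a page without text may yield None.
    return body.extract_text() or ""
```

Following the constructor calls shown in the examples, such a function would presumably be passed as `custom_pdf_fun = crop_margins_and_extract`. Cropping is only one option; any function that honors the same page-in, text-out contract should fit.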

Current language support for the default parser
------------------------------------------------

English, Spanish, German, French, Portuguese

License
-------

© Leandro Roser, 2023. Licensed under an [Apache-2](https://github.com/leandroroser/prettyparser/blob/main/LICENSE.txt) license.
