https://github.com/leandroroser/prettyparser
Parallel processing and parsing PDF and TXT files, and Python objects with text (str, list) using rules (regular expressions).
- Host: GitHub
- URL: https://github.com/leandroroser/prettyparser
- Owner: leandroroser
- License: apache-2.0
- Created: 2021-11-09T02:05:02.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-01-29T22:03:01.000Z (about 3 years ago)
- Last Synced: 2025-09-22T19:11:32.965Z (6 months ago)
- Topics: pdf-parser, regex
- Language: Python
- Homepage: https://pypi.org/project/prettyparser/
- Size: 106 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README

prettyparser is a Python library for processing and parsing PDF and TXT files, as well as Python objects containing text (str, list), in parallel using rules (regular expressions).
For PDF files, the package reads the content with pdfplumber and then applies a series of data manipulations to produce higher-quality output, removing the boilerplate code needed to read, process, and write the content of multiple files with multiple pages. A custom processing function that takes a pdfplumber page and returns processed text can also be supplied. Additional processing steps can be added via custom regular expressions, which are compiled for improved speed.
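To make the idea of rules concrete, here is a minimal sketch of how a list of (pattern, replacement, optional flags) entries could be compiled once and applied in order. It uses only the standard `re` module; the helper name `apply_rules` and the sample text are assumptions for illustration and are not taken from prettyparser's source.

```python
import re


def apply_rules(text, rules):
    """Apply (pattern, replacement[, flags]) rules to text, in order.

    Hypothetical helper for illustration; not prettyparser's implementation.
    """
    compiled = []
    for rule in rules:
        pattern, replacement = rule[0], rule[1]
        flags = rule[2] if len(rule) > 2 else 0  # flags are optional
        compiled.append((re.compile(pattern, flags), replacement))
    # Each rule sees the output of the previous one
    for regex, replacement in compiled:
        text = regex.sub(replacement, text)
    return text


rules = [
    [r"\n\s*\d+\s*\n", "\n\n"],                  # a page number alone on a line
    [r"\n\s*(\* *)+\s*\n", "\n\n"],              # a separator line such as "* * *"
    [r"__some_header_text", "", re.IGNORECASE],  # a header, ignoring case
]
sample = "First page.\n 12 \nSecond page.\n* * *\n__SOME_HEADER_TEXT\nEnd."
print(apply_rules(sample, rules))
```

Because the rules run sequentially, their order matters: a later rule only matches what earlier rules left behind.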
## Installation
```
$ git clone https://github.com/leandroroser/prettyparser
$ cd prettyparser
$ pip install -e .
```
or
```
$ pip install prettyparser
```
## Example: processing a series of PDF files
```Python
import regex as re
from prettyparser import PrettyParser
files = ["./BOOKS/PDF/PDF1.pdf", "./BOOKS/PDF/PDF2.pdf"]
output = "./BOOKS/TXT"
parser = PrettyParser(files, None, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
                              [r"\n\s*-\d-\s*\n", r'\n\n'],
                              [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                              [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
parser.run()
```
## Example: processing a folder with multiple PDF files
```Python
import regex as re
from prettyparser import PrettyParser
directory = "./BOOKS/PDF"
output = "./BOOKS/TXT"
parser = PrettyParser(None, directory, output, mode = 'pdf',
                      args = [[r"(\n\s*\d+\s*\n)|(\n\s*\d+\s*$)", r'\n\n'],
                              [r"\n\s*-\d-\s*\n", r'\n\n'],
                              [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                              [r"__some_header_text", r'\n\n', re.IGNORECASE]],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
parser.run()
```
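When only some of the files in a folder should be processed, one option is to build the list of paths yourself and pass it as `files`, as in the first example. The sketch below uses only the standard library; the helper name `list_pdfs` and its `recursive` parameter are hypothetical and not part of prettyparser.

```python
from pathlib import Path


def list_pdfs(directory, recursive=False):
    """Return sorted paths of the PDF files found in a directory.

    Hypothetical helper for illustration; not part of prettyparser.
    """
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Compare the suffix case-insensitively so "A.PDF" is also picked up
    return sorted(
        str(path)
        for path in candidates
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Possible usage, following the first example above:
# files = list_pdfs("./BOOKS/PDF")
# parser = PrettyParser(files, None, "./BOOKS/TXT", mode = 'pdf')
```

The returned list can be filtered further (for example by name) before it is handed to the parser.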
## Example: processing a folder with multiple TXT files
Suppose the previous output still needs corrections.
A quicker way to test additional rules is to reparse the TXT output from the previous step instead of the original PDF files:
```Python
import regex as re
from prettyparser import PrettyParser

# Reparse the TXT files produced in the previous step
directory = "./BOOKS/TXT"
output = "./BOOKS/TXT_REPARSED"
parser = PrettyParser(None, directory, output, mode = 'txt',
                      args = [[r"some other header.*\d+", r''],
                              [r"^\d+.*", r'', re.MULTILINE],
                              [r"([A-Z]+)( *\n)([A-Z]+)", r'\1\3']],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
parser.run()
```
## Example: processing a Python str for a quick test of the app
```Python
import regex as re
from prettyparser import PrettyParser
txt = """
header to remove
This is a text with multiple problems. For exam-
ple the latter word can be joined.
The portions of this line can be
joined
in a single line.
HERE ALSO IS SOME
UPPERCASE TEXT
TO JOIN
Some Other Ugly Stuff To Remove IGNORING Case.
Remove the line below:
* * *
Remove empty lines and finally separate paragraphs with a blank line.
Below is the page number->.
99
"""
parser = PrettyParser(txt, mode = "pyobj",
                      args = [[r"\s*header to remove\s*\n", r""],
                              [r"(\n\s*\d+\s*\n)", r'\n\n'],
                              [r"\n\s*(\* *)+\s*\n", r'\n\n'],
                              [r"\n.*some other ugly stuff.*", r'\n\n', re.IGNORECASE]],
                      remove_whitelines = True,
                      paragraphs_spacing = 1,
                      remove_hyphen_eol = True)
output = parser.run()
print(output[0])
```
```
This is a text with multiple problems. For example the latter word can be joined.
The portions of this line can be joined in a single line.
HERE ALSO IS SOME UPPERCASE TEXT TO JOIN
Remove the line below:
Remove empty lines and finally separate paragraphs with a blank line.
Below is the page number->.
```
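The joining of "exam-" and "ple" in the output above is the kind of end-of-line hyphen handling that `remove_hyphen_eol` refers to. A rough approximation with a single regular expression is sketched below; the function name `join_hyphenated_words` and the exact pattern are assumptions, and the library's own rule may differ.

```python
import re


def join_hyphenated_words(text):
    """Merge words split by a hyphen at the end of a line.

    Illustrative approximation; not prettyparser's implementation.
    """
    # A word character, a hyphen, a line break (optionally padded with
    # spaces or tabs), then a lowercase letter that continues the word
    return re.sub(r"(\w)-[ \t]*\n[ \t]*([a-z])", r"\1\2", text)


sample = "For exam-\nple the latter word can be joined."
print(join_hyphenated_words(sample))
```

A pattern this simple cannot tell a line-break hyphen from a genuine compound such as "well-" followed by "known" on the next line, so it would drop that hyphen too.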
## Running from the command line
```
prettyparser --directories /home/BOOKS --output /home/BOOKS_PARSED --mode 'pdf'
```
Arguments
---------
- **files (list or str)**: paths to parse when mode is 'pdf' or 'txt'. A string is treated as a directory in those modes. When mode is 'pyobj', a str or list is treated as text already loaded in memory
- **output (str)**: output directory
- **args (list)**: list of rules of the form (regex, replacement, flags). The flags element is optional
- **mode (str)**: 'pdf', 'txt' or 'pyobj' (the latter for Python lists and strings)
- **default (bool)**: if True, perform several default cleanup operations (default)
- **remove_whitelines (bool)**: if True, remove whitespaces
- **paragraphs_spacing (int)**: number of newlines between paragraphs
- **page_spacing (str)**: string to insert between pages
- **remove_hyphen_eol (bool)**: if True, remove end of line hyphens and merge subwords
- **custom_pdf_fun (Callable)**: custom function to parse PDF files. It must accept a pdfplumber page as its argument and return text to be joined with the previous pages
- **overwrite (bool)**: overwrite the output file if it exists. Default: False
- **n_jobs (int)**: number of jobs. Default: number of cores - 1
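As an illustration of the `custom_pdf_fun` contract described above (a pdfplumber page in, text out), the following sketch crops away the top and bottom of each page before extracting text, which is one way to drop running headers and page numbers. The function name and the `margin_ratio` default are hypothetical; `page.bbox`, `page.crop` and `extract_text` are pdfplumber page methods.

```python
def extract_without_margins(page, margin_ratio=0.08):
    """Return the text of a pdfplumber page, skipping top and bottom margins.

    margin_ratio is an assumed tuning value: the fraction of the page
    height ignored at each end. Adjust it to the layout of your files.
    """
    x0, top, x1, bottom = page.bbox
    margin = (bottom - top) * margin_ratio
    # Keep only the band between the two margins
    body = page.crop((x0, top + margin, x1, bottom - margin))
    # extract_text() may return None for an empty page in some versions
    return body.extract_text() or ""


# Possible usage, based on the argument list above:
# parser = PrettyParser(files, None, output, mode = 'pdf',
#                       custom_pdf_fun = extract_without_margins)
```

The function only relies on the page object it receives, so it can be tried on a single page opened with pdfplumber before running a whole batch.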
Current language support for the default parser
------------------------------------------------
English, Spanish, German, French, Portuguese
License
-------
© Leandro Roser, 2023. Licensed under an [Apache-2](https://github.com/leandroroser/prettyparser/blob/main/LICENSE.txt) license.