https://github.com/MaxAFriedrich/pdfParser

# pdfParser

This program converts one or more PDFs to readable plain text. Unlike tools such as [Poppler's pdftotext](https://en.wikipedia.org/wiki/Poppler_(software)#poppler-utils), it does more than extract the text: it also improves readability by removing unnecessary newlines, spaces, headers, and footers.

## Quick start

NOTE: This program is designed for Python 3.10 and later.

Download this repository:

``` bash
git clone https://github.com/MaxAFriedrich/pdfParser
cd pdfParser
```

Then run the program, passing one or more PDF files as arguments.

``` bash
python pdfParser.py /location/of/pdf/filename.pdf
```

It may be useful to alias this program so you can run it from other locations in your environment.
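
One way to set up such an alias in Bash is sketched below. The alias name `pdfparser` and the clone location `$HOME/pdfParser` are assumptions for illustration, not something this repository defines; adjust both to match your setup.

```bash
# Assumption: the repository was cloned to $HOME/pdfParser.
# The alias name "pdfparser" is arbitrary; pick any name you like.
alias pdfparser='python "$HOME/pdfParser/pdfParser.py"'
```

Adding this line to `~/.bashrc` makes the alias available in new interactive shells, after which `pdfparser /location/of/pdf/filename.pdf` should work from any directory.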

### TODO

- Automatic built-in OCR scanning
- Remove tables and diagrams
- Output options
- Convert the PDF to Markdown

## License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.