https://github.com/MaxAFriedrich/pdfParser
This program converts one or multiple PDFs to easily readable plain text. Unlike other programs like Poppler's pdftotext, it not only converts the PDF to plain text, but also improves readability by removing unnecessary new lines, spaces, headers, and footers.
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/MaxAFriedrich/pdfParser
- Owner: MaxAFriedrich
- License: other
- Created: 2023-01-04T22:16:51.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2023-01-15T10:27:53.000Z (almost 2 years ago)
- Last Synced: 2024-05-30T02:45:45.246Z (6 months ago)
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# pdfParser
This program converts one or multiple PDFs to easily readable plain text. Unlike other programs like [Poppler's pdftotext](https://en.wikipedia.org/wiki/Poppler_(software)#poppler-utils), it not only converts the PDF to plain text, but also improves readability by removing unnecessary new lines, spaces, headers, and footers.
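To make the kind of cleanup described above concrete, here is a minimal sketch. It is not taken from pdfParser's source and may differ from what the program actually does. It assumes the text of each page has already been extracted into a string, and the function names (`drop_repeated_edges`, `unwrap`, `clean`) and the `min_share` threshold are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Mask digits so that "Page 1" and "Page 2" compare as the same line.
    return re.sub(r"\d+", "#", line.strip())


def drop_repeated_edges(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove a page's first/last line when it recurs on most pages.

    Assumption: headers and footers are the first and last non-blank
    lines of a page and repeat (up to digits) across pages.
    """
    split = [page.strip().splitlines() for page in pages]
    split = [lines for lines in split if lines]  # skip empty pages
    threshold = max(2, round(len(split) * min_share))
    firsts = Counter(_key(lines[0]) for lines in split)
    lasts = Counter(_key(lines[-1]) for lines in split)

    cleaned = []
    for lines in split:
        if firsts[_key(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and lasts[_key(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


def unwrap(text: str) -> str:
    """Join hard-wrapped lines within paragraphs and collapse extra spaces."""
    paragraphs = re.split(r"\n\s*\n", text)  # blank lines separate paragraphs
    joined = (" ".join(p.split()) for p in paragraphs)
    return "\n\n".join(p for p in joined if p)


def clean(pages: list[str]) -> str:
    """Strip likely headers/footers, then reflow the remaining text."""
    return unwrap("\n".join(drop_repeated_edges(pages)))


if __name__ == "__main__":
    pages = [
        "My Report\n\nThe quick  brown\nfox jumps.\n\nPage 1",
        "My Report\n\nOver the lazy\ndog.\n\nPage 2",
    ]
    print(clean(pages))
```

On the two sample pages, the repeated "My Report" and "Page N" lines are dropped and each wrapped paragraph is joined into a single line. A heuristic like this can misfire on short documents or on pages whose first line genuinely repeats, so treat it as a starting point only.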
## Quick start
NOTE: This program was designed for Python 3.10 and later.
Download this repository:
``` bash
git clone https://github.com/MaxAFriedrich/pdfParser
cd pdfParser
```
Then run the program, providing files as arguments.
``` bash
python pdfParser.py /location/of/pdf/filename.pdf
```
It may be useful to alias this program so you can run it from other locations in your environment.
### TODO
- Automatic built-in OCR scanning
- Remove tables and diagrams
- Output options
- Convert the PDF to Markdown
## License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.