https://github.com/nlpatvcu/pdftotextextractor



install.sh is for Debian Linux.
Prerequisites: apt and a Python version above 3.8.

The project includes 3 main parts:

PDF Text Extractor - extracts text from a PDF
Image Extractor from PDF - extracts images and saves them to a folder
Text Visualizer - visualizes the text so you can see what the computer recognizes
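
As a rough illustration of the first part, OCR-based text extraction can be sketched as below. This is not the project's actual code: the libraries `pdf2image` and `pytesseract`, the function names `extract_text` and `join_pages`, the `dpi` default, and the form-feed page separator are all assumptions chosen for the example. The only thing taken from this README is that Tesseract is the OCR engine being installed.

```python
def join_pages(page_texts):
    """Join per-page text with a form feed so page boundaries stay visible."""
    return "\f".join(text.strip() for text in page_texts)


def extract_text(pdf_path, dpi=300):
    """Render each PDF page to an image and run Tesseract OCR on it.

    Hypothetical helper: the real project may use different libraries.
    """
    # Imported lazily so this module still loads when the OCR stack is absent.
    from pdf2image import convert_from_path  # third-party, needs poppler
    import pytesseract  # third-party, needs the tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return join_pages(pytesseract.image_to_string(page) for page in pages)
```

Rendering at a higher `dpi` generally helps recognition at the cost of speed and memory.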

On Debian Linux, run:

`sudo bash install.sh`

Steps:
1. Install tesseract-ocr and libtesseract-dev using your OS package manager.
2. Create a virtual environment: `python3 -m venv venv`
3. Activate it: `source venv/bin/activate`
4. Install the required libraries: `pip install -r requirments.txt`
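
After completing the steps, a quick sanity check along the following lines may help confirm the environment is ready. It is a minimal sketch using only the standard library. The function name `check_environment` is hypothetical, and treating Python 3.8 itself as acceptable is an assumption about what "above 3.8" means.

```python
import shutil
import sys


def check_environment(min_version=(3, 8)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # Assumption: 3.8 itself counts as acceptable; raise min_version if it does not.
    if sys.version_info[:2] < min_version:
        problems.append("Python %d.%d or newer is required" % min_version)
    # The tesseract-ocr package provides an executable named "tesseract".
    if shutil.which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    # Inside a venv, sys.prefix differs from sys.base_prefix.
    if sys.prefix == sys.base_prefix:
        problems.append("not running inside a virtual environment")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("warning:", problem)
```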

Depending on your workload, use **main.py** if you want a graphical interface or **maincli.py** if you want to use command-line arguments.

For __mainCLI.py__ you can use either syntax:

`python3 mainCLI.py PDFfile`

or

`python3 mainCLI.py PDFfile -o outputFileName`
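
The two command-line forms above (a required PDF path plus an optional `-o` output name) map naturally onto `argparse`. The sketch below shows one way such an interface could be parsed. It is not taken from the project: `build_parser`, `resolve_output`, the `--output` long flag, and the default of reusing the PDF name with a `.txt` suffix are all assumptions.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: script.py PDFfile [-o outputFileName]."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF file.")
    parser.add_argument("pdf", type=Path, help="path to the input PDF")
    # "--output" is a hypothetical long form; the README only documents "-o".
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="name of the output file")
    return parser


def resolve_output(args: argparse.Namespace) -> Path:
    """Pick the output path; the .txt fallback is an assumed default."""
    if args.output is not None:
        return args.output
    return args.pdf.with_suffix(".txt")


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("input:", args.pdf)
    print("output:", resolve_output(args))
```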

For __visualizer.py__ the syntax is:

`python3 visualizer.py PDFfile`