https://github.com/nlpatvcu/pdftotextextractor


Host: GitHub
URL: https://github.com/nlpatvcu/pdftotextextractor
Owner: NLPatVCU
Created: 2023-04-21T19:42:18.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-11-30T22:46:17.000Z (about 1 year ago)
Last Synced: 2023-11-30T23:30:40.219Z (about 1 year ago)
Language: Python
Size: 27.3 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.MD

README

`install.sh` is for Debian Linux.

Prerequisites: `apt` and Python above 3.8

The project includes 3 main parts:

PDF Text Extractor - extracts text from a PDF

Image Extractor from PDF - extracts images and saves them to a folder

Text Visualizer - visualizes the text to show what the computer recognizes

On Debian Linux, run `sudo bash install.sh`

Steps:

1. Install tesseract-ocr and libtesseract-dev using your OS package manager

2. Create a virtual environment: `python3 -m venv venv`

3. Activate it: `source venv/bin/activate`

4. Install the required libraries: `pip install -r requirments.txt`
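
Before running the tools, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of the project: the helper names `python_ok` and `tesseract_path` are hypothetical, and it assumes "above 3.8" means Python 3.9 or newer (change `>` to `>=` if 3.8 itself is acceptable).

```python
import shutil
import sys

# Assumption: "above 3.8" is read as strictly newer than 3.8.
MIN_PYTHON = (3, 8)


def python_ok(version=None):
    """Return True if the given (or running) Python version is newer than 3.8."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) > MIN_PYTHON


def tesseract_path():
    """Return the path of the tesseract binary, or None if it is not on PATH."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("tesseract found at:", tesseract_path())
```

If `tesseract_path()` returns `None`, step 1 has most likely not been completed or the binary is not on `PATH`.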

Depending on your workflow, use **main.py** if you want a graphical interface or **mainCLI.py** to use command-line arguments.

For __mainCLI.py__ you can use either syntax:

`python3 mainCLI.py PDFfile`

or

`python3 mainCLI.py PDFfile -o outputFileName`
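
The two syntaxes above (one required PDF path, one optional `-o` output name) could be parsed with the standard library's `argparse`. This is an illustrative sketch only, not the project's actual code: `build_parser` and the attribute names `pdf_file` and `output` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for: PDFfile [-o outputFileName] (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF file (illustrative sketch)."
    )
    # Required positional argument: the PDF to process.
    parser.add_argument("pdf_file", metavar="PDFfile", help="path to the input PDF")
    # Optional output name; None means the caller picks a default.
    parser.add_argument(
        "-o",
        dest="output",
        metavar="outputFileName",
        default=None,
        help="name of the output file",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["report.pdf", "-o", "report.txt"])
print(args.pdf_file, args.output)
```

When `-o` is omitted, `args.output` is `None`, so the calling code has to decide on a default output name.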

For __visualizer.py__ the syntax is:

`python3 visualizer.py PDFfile`
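
To process several PDFs, the command syntaxes above can be assembled from a small driver script. The sketch below only builds the argument lists and is not part of the project: `build_command` and `commands_for_folder` are hypothetical helpers, the script name is supplied by the caller, and the `.txt` output name is an assumption about what `-o` expects.

```python
import sys
from pathlib import Path


def build_command(script, pdf, output=None):
    """Return the argument list for: <python> script PDFfile [-o outputFileName]."""
    # sys.executable is used in place of "python3" so the active venv is reused.
    cmd = [sys.executable, script, str(pdf)]
    if output is not None:
        cmd += ["-o", str(output)]
    return cmd


def commands_for_folder(script, folder):
    """Build one command per *.pdf file in folder, in sorted order."""
    return [
        # Assumption: the output is named after the PDF with a .txt suffix.
        build_command(script, pdf, pdf.with_suffix(".txt"))
        for pdf in sorted(Path(folder).glob("*.pdf"))
    ]


# The visualizer syntax takes only the PDF path, so no output is passed.
print(build_command("visualizer.py", "PDFfile"))
```

Each returned list can be handed to `subprocess.run(cmd, check=True)` to run the tool.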