https://github.com/nlpatvcu/pdftotextextractor
- Host: GitHub
- URL: https://github.com/nlpatvcu/pdftotextextractor
- Owner: NLPatVCU
- Created: 2023-04-21T19:42:18.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-11-30T22:46:17.000Z (over 2 years ago)
- Last Synced: 2024-11-16T17:28:48.315Z (over 1 year ago)
- Language: Python
- Size: 27.3 KB
- Stars: 1
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.MD
README
install.sh is for Debian Linux
Prerequisites: apt and Python above 3.8
The project includes 3 main parts:
PDF Text Extractor - extracts text from a PDF
Image Extractor from PDF - extracts images and saves them to a folder
Text Visualizer - visualizes the text to show what the computer recognizes
On Debian Linux, run:
`sudo bash install.sh`
Steps:
1. Install tesseract-ocr and libtesseract-dev using your OS package manager
2. Create a virtual environment: `python3 -m venv venv`
3. Activate it: `source venv/bin/activate`
4. Install all required libraries: `pip install -r requirments.txt`
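Before running the tools, it can help to confirm that the prerequisites are in place. The following is a minimal sketch and not part of the repository: the function name `check_prerequisites` is hypothetical, it assumes the Tesseract binary is on the PATH under its usual name `tesseract`, and it reads "above 3.8" as a minimum of 3.8.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 8), binary="tesseract"):
    """Return a list of human-readable problems; the list is empty if none are found."""
    problems = []

    # Compare only (major, minor) of the running interpreter.
    # Assumption: 3.8 itself is treated as acceptable.
    if sys.version_info[:2] < min_version:
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted} or newer is needed, found {found}")

    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which(binary) is None:
        problems.append(f"'{binary}' was not found on the PATH")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

An empty result only means that the interpreter version and the binary look fine; it does not check the packages installed by pip.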
Depending on your workflow, use **main.py** if you want a graphical interface or **mainCLI.py** to use command-line arguments.
For __mainCLI.py__ you can use either syntax:
`python3 mainCLI.py PDFfile`
or
`python3 mainCLI.py PDFfile -o outputFileName`
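The two invocations above correspond to a required positional PDF path and an optional `-o` output name. As an illustration only, an interface of that shape could be declared with `argparse` as sketched below. The function name `build_parser`, the attribute names `pdf_file` and `output`, and the `None` default are assumptions made for this example, not the script's actual code.

```python
import argparse


def build_parser():
    """Build a parser with one positional PDF path and an optional -o name."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF (illustrative sketch)"
    )
    # Required positional argument: the PDF to read.
    parser.add_argument("pdf_file", help="path to the input PDF")
    # Optional flag: where to write the extracted text.
    # Assumption: when omitted, the value is None and the caller picks a default.
    parser.add_argument("-o", dest="output", default=None, help="output file name")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["PDFfile", "-o", "outputFileName"])
print(args.pdf_file, args.output)
```

Leaving `-o` off the argument list leaves `args.output` as `None` in this sketch, which matches the shorter of the two invocations.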
For __visualizer.py__ the syntax is:
`python3 visualizer.py PDFfile`