https://github.com/nlpatvcu/pdftotextextractor
- Host: GitHub
- URL: https://github.com/nlpatvcu/pdftotextextractor
- Owner: NLPatVCU
- Created: 2023-04-21T19:42:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-30T22:46:17.000Z (about 1 year ago)
- Last Synced: 2023-11-30T23:30:40.219Z (about 1 year ago)
- Language: Python
- Size: 27.3 KB
- Stars: 1
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.MD
README
install.sh is for Debian Linux.
Prerequisites: apt and Python above 3.8

The project includes 3 main parts:
PDF Text Extractor - extracts text from a PDF
Image Extractor from PDF - extracts images and saves them to a folder
Text Visualizer - visualizes the text to show what the computer recognizes

If on Debian Linux, run
`sudo bash install.sh`
Steps:
1. Install tesseract-ocr and libtesseract-dev using your OS package installer
2. Create a virtual env: `python3 -m venv venv`
3. Activate it: `source venv/bin/activate`
4. Install all required libraries: `pip install -r requirments.txt`

Depending on your workload, use either **main.py** if you want a graphical interface or **maincli.py** to use command line arguments.
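Before running either script, the prerequisites listed above can be checked with a few lines of Python. This is a minimal sketch and not part of the project: the function name `check_prerequisites` is hypothetical, "above 3.8" is read here as "3.8 or newer", and the check assumes the tesseract-ocr package puts a `tesseract` executable on your PATH.

```python
import shutil
import sys

# Assumption: the README's "python above 3.8" is treated as 3.8 or newer.
MIN_PYTHON = (3, 8)


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed.

    `version_info` and `which` are injectable so the checks can be exercised
    without changing the real environment.
    """
    if version_info is None:
        version_info = sys.version_info
    problems = []
    # Compare only (major, minor) against the minimum version.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    # The tesseract-ocr package provides the `tesseract` command.
    if which("tesseract") is None:
        problems.append("tesseract executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

If the script prints nothing, both checks passed; it does not verify that libtesseract-dev or the Python libraries from the requirements file are installed.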
For __mainCLI.py__ you can use either syntax
`python3 main.py PDFfile`
or
`python3 main.py PDFfile -o outputFileName`

For __visualizer.py__ the syntax is
`python3 visualizer.py PDFfile`
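
For readers curious how a command line of the form `PDFfile` plus an optional `-o outputFileName` can be handled, here is a minimal sketch using Python's standard `argparse` module. It is an illustration only, not the project's actual code: the names `build_parser` and `resolve_output`, the long option `--output`, and the fallback of swapping the PDF's extension for `.txt` are all assumptions.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser for: PDFfile [-o outputFileName] (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF (illustrative sketch)."
    )
    parser.add_argument("pdf_file", help="path to the input PDF")
    # `--output` is a hypothetical long form; the README only shows `-o`.
    parser.add_argument("-o", "--output", default=None,
                        help="name of the output file")
    return parser


def resolve_output(pdf_file, output=None):
    """Pick the output path.

    Assumption: when -o is omitted, reuse the PDF's name with a .txt suffix.
    The real scripts may choose a different default.
    """
    if output:
        return Path(output)
    return Path(pdf_file).with_suffix(".txt")


# Parse an explicit argument list instead of sys.argv so the sketch can be
# run as-is; "report.pdf" and "report_text" are placeholder names.
args = build_parser().parse_args(["report.pdf", "-o", "report_text"])
print(resolve_output(args.pdf_file, args.output))
```

Calling `parse_args()` with no list would read the real command line instead, which is how a script invoked as shown above would normally use it.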