Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nikhil-swamix/pdf2text
- Host: GitHub
- URL: https://github.com/nikhil-swamix/pdf2text
- Owner: nikhil-swamix
- License: gpl-3.0
- Created: 2020-06-17T14:09:48.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-08-08T16:25:01.000Z (over 4 years ago)
- Last Synced: 2024-04-23T02:31:03.796Z (9 months ago)
- Topics: ocr-python, pdf, pdf-converter, pdf-reader, python, text-mining
- Language: Roff
- Size: 10.8 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PDF2Text
# Setup:
If on Linux, install Tesseract (either command works, they are equivalent):
sudo apt install tesseract-ocr
sudo apt-get install tesseract-ocr
If on Windows, download and run this installer:
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe
Then add "C:\Program Files\Tesseract-OCR" to your PATH. Without this, nothing will work!
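A quick way to confirm that Tesseract is reachable from your PATH is a small check like the one below. This is only a sketch and not part of the repository; the helper name `find_tesseract` is made up for illustration, and it assumes the executable is called `tesseract`.

```python
import shutil


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in PATH, like the shell does.
    return shutil.which("tesseract")


path = find_tesseract()
if path is None:
    print("tesseract was not found on PATH, check the setup steps above")
else:
    print("tesseract found at", path)
```

If this reports that Tesseract was not found on Windows, open a new terminal after editing PATH, since terminals that are already open keep the old value.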
Next, install the Python libraries the script uses.
Paste these into a terminal (note that the PIL module is installed through the Pillow package):
pip3 install Pillow
pip3 install pytesseract
pip3 install pdf2image
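To see whether the three libraries installed correctly, you can run a small import check like this one. It is a sketch, not something shipped with the project; `missing_modules` is a hypothetical helper name. Note that Pillow is imported under the module name `PIL`.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the libraries used by this project (assumed from the pip commands).
required = ["PIL", "pytesseract", "pdf2image"]
missing = missing_modules(required)
if missing:
    print("missing modules:", ", ".join(missing))
else:
    print("all required modules are installed")
```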
Once everything is set up, just run pdfreader.py.
You will see the text of the PDF printed to the console/output.
That is done by the call image_to_string(Image.open('pdfimg.jpg')).
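The overall flow (PDF pages to images, then images to text) can be sketched roughly as below. This is not the actual contents of pdfreader.py, which is not shown here; the function name `pdf_to_text` and its `convert` and `ocr` parameters are assumptions made for illustration. Unlike the call above, this sketch passes the page images straight to the OCR step instead of saving an intermediate 'pdfimg.jpg'. It relies on `pdf2image.convert_from_path` and `pytesseract.image_to_string`, and pdf2image in turn needs the poppler utilities installed on the system.

```python
def pdf_to_text(pdf_path, convert=None, ocr=None):
    """Return the OCR text of every page in pdf_path, joined by blank lines.

    convert and ocr are optional hooks (hypothetical, for illustration) so the
    two steps can be swapped out; by default the real libraries are used.
    """
    if convert is None:
        # Imported lazily so the function can be defined without the library.
        from pdf2image import convert_from_path  # requires poppler on the system
        convert = convert_from_path
    if ocr is None:
        from pytesseract import image_to_string  # requires tesseract on PATH
        ocr = image_to_string

    pages = convert(pdf_path)  # with pdf2image: a list of PIL images, one per page
    return "\n\n".join(ocr(page) for page in pages)


# Example usage (the file name is a placeholder):
# print(pdf_to_text("sample.pdf"))
```

Keeping the two steps as separate parameters is only a convenience: it lets you try the function with stand-in steps before Tesseract and poppler are installed.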