Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/nikhil-swamix/pdf2text

ocr-python pdf pdf-converter pdf-reader python text-mining


Host: GitHub
URL: https://github.com/nikhil-swamix/pdf2text
Owner: nikhil-swamix
License: gpl-3.0
Created: 2020-06-17T14:09:48.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2020-08-08T16:25:01.000Z (over 4 years ago)
Last Synced: 2024-04-23T02:31:03.796Z (9 months ago)
Topics: ocr-python, pdf, pdf-converter, pdf-reader, python, text-mining
Language: Roff
Size: 10.8 MB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# PDF2Text

# Setup:

On Linux (Debian/Ubuntu), install Tesseract with either of these equivalent commands:

sudo apt install tesseract-ocr

sudo apt-get install tesseract-ocr

On Windows, install Tesseract with this installer:

https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe

Then add "C:\Program Files\Tesseract-OCR" to your PATH. Without this, nothing will work!
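
A quick way to confirm the PATH step worked is to ask Python where it finds the `tesseract` executable. This is a minimal sketch using only the standard library; the helper name `find_tesseract` is made up for illustration and is not part of this project.

```python
import shutil


def find_tesseract(which=shutil.which):
    """Return the full path of the tesseract executable, or None if it is not on PATH.

    `which` is injectable only so the lookup can be tested without Tesseract installed.
    """
    return which("tesseract")
```

Calling `print(find_tesseract())` in a fresh terminal should print a path inside the Tesseract install folder. If it prints `None`, the PATH entry is missing or the terminal was opened before PATH was changed.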

Next, install the Python libraries the script needs. Paste these commands into a terminal (the PIL module is provided by the Pillow package, so install Pillow rather than PIL):

pip3 install Pillow

pip3 install pytesseract

pip3 install pdf2image
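
To check that the three libraries installed correctly, something like the following can help. It is a rough sketch, not part of this project: the names `REQUIRED` and `missing_packages` are invented here. Note also that pdf2image relies on the Poppler utilities being installed separately (`poppler-utils` on Debian/Ubuntu), which this check does not cover.

```python
import importlib.util

# Maps import name -> pip name. They differ for one package: Pillow is imported as PIL.
REQUIRED = {"PIL": "Pillow", "pytesseract": "pytesseract", "pdf2image": "pdf2image"}


def missing_packages(required=REQUIRED, find_spec=importlib.util.find_spec):
    """Return the pip names of the required packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if find_spec(module) is None]
```

An empty list from `missing_packages()` means all three imports should succeed. Any names it returns can be passed straight to `pip3 install`.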

Once everything is set up, just run pdfreader.py.

The text in the PDF is printed to the console/output.
That is done by the command image_to_string(Image.open('pdfimg.jpg')).
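
For orientation, the overall flow (PDF pages to images, then images to text) might look roughly like the sketch below. This is an assumption about the approach, not the actual contents of pdfreader.py: the function name `pdf_to_text` is hypothetical, and unlike the command above it passes page images to Tesseract in memory instead of saving `pdfimg.jpg` to disk first. It uses `pdf2image.convert_from_path` and `pytesseract.image_to_string`, imported lazily so the function can also be exercised with stand-in callables.

```python
def pdf_to_text(pdf_path, convert=None, ocr=None):
    """OCR every page of a PDF and return the recognized text as one string.

    `convert` and `ocr` default to the real library functions; they are
    parameters only so the pipeline can be tried without the libraries installed.
    """
    if convert is None:
        # Requires the Poppler utilities to be installed on the system.
        from pdf2image import convert_from_path
        convert = convert_from_path
    if ocr is None:
        # Requires the tesseract executable to be on PATH.
        import pytesseract
        ocr = pytesseract.image_to_string

    pages = convert(pdf_path)  # one image per PDF page
    # Run OCR page by page and join the results with newlines.
    return "\n".join(ocr(page) for page in pages)
```

With everything installed, `print(pdf_to_text("sample.pdf"))` would print the recognized text, where `sample.pdf` is a placeholder file name.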