https://github.com/adilsezer/pdfocr
An OCR project to read PDF files with Tesseract library
- Host: GitHub
- URL: https://github.com/adilsezer/pdfocr
- Owner: adilsezer
- Created: 2020-08-14T16:43:58.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-08-17T21:55:22.000Z (over 4 years ago)
- Last Synced: 2024-01-07T01:57:22.950Z (12 months ago)
- Language: Python
- Size: 1.67 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# PDFOCR
An OCR project that reads every PDF file in a folder with the Tesseract library and writes the recognized text to a txt file.

## Requirements
* Pillow 7.2.0
* PyMuPDF 1.17.5
* PyPDF2 1.26.0
* pytesseract 0.3.5

## External Dependencies
Tesseract-OCR: https://github.com/tesseract-ocr/tesseract/wiki

## Running from source
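
Tesseract-OCR is a separate program that `pip install` does not provide, so a quick check before running can save a confusing error later. The following is a minimal sketch using only the standard library. It assumes the executable is named `tesseract` and is expected on `PATH`; the helper name `find_tesseract` is hypothetical and not part of PDFOCR.

```python
import shutil
from typing import Optional


def find_tesseract(binary: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if it is not on PATH.

    Assumption: the executable is called "tesseract" (the default name).
    """
    return shutil.which(binary)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract-OCR was not found on PATH; install it before running PDFOCR.py")
    else:
        print(f"Tesseract-OCR found at {path}")
```

If Tesseract is installed somewhere that is not on `PATH`, pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.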

    $ git clone https://github.com/sezerad/PDFOCR.git
    $ cd PDFOCR
    $ pip install -r requirements.txt
    $ python PDFOCR.py

## Screenshots
### Example PDF Document
![Alt text](https://github.com/sezerad/PDFOCR/blob/master/screenshots/ExamplePDF.png?raw=true "PDF OCR")
### Extracted Text
![Alt text](https://github.com/sezerad/PDFOCR/blob/master/screenshots/ExampleResult.png?raw=true "PDF OCR")
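
Text like that shown in the screenshot could be produced by a pipeline along the following lines: find the PDFs in a folder, render each page to an image, pass it to Tesseract, and append the result to one text file. This is only a sketch under stated assumptions, not the actual contents of `PDFOCR.py`. The function names (`find_pdfs`, `ocr_pdf`, `ocr_folder`), the 2x render zoom, and the per-file separator line are all hypothetical. It uses PyMuPDF, Pillow, and pytesseract from the requirements list; it does not use PyPDF2, whose role in the real script is not described here.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def ocr_pdf(pdf_path):
    """OCR every page of one PDF (an assumed approach, not PDFOCR's own code)."""
    # Imported lazily so the folder-walking logic works without the OCR stack.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    doc = fitz.open(str(pdf_path))
    try:
        for page in doc:
            # PyMuPDF 1.17.x calls this getPixmap; newer releases use get_pixmap.
            render = getattr(page, "get_pixmap", None) or page.getPixmap
            # Assumption: a 2x zoom gives Tesseract more pixels to work with.
            pix = render(matrix=fitz.Matrix(2, 2), alpha=False)
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            texts.append(pytesseract.image_to_string(image))
    finally:
        doc.close()
    return "\n".join(texts)


def ocr_folder(folder, out_path, ocr=ocr_pdf):
    """Write the OCR text of each PDF in `folder` to `out_path`; return the PDF count."""
    pdfs = find_pdfs(folder)
    with open(out_path, "w", encoding="utf-8") as out:
        for pdf in pdfs:
            # Hypothetical separator so each file's text can be told apart.
            out.write(f"===== {pdf.name} =====\n")
            out.write(ocr(pdf).rstrip() + "\n\n")
    return len(pdfs)
```

Calling `ocr_folder("pdfs", "output.txt")` (both names are placeholders) would then write one block of text per PDF. The `ocr` parameter exists so the folder logic can be exercised with a stand-in function when Tesseract is not installed.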