https://github.com/aryaminus/saram
Get OCR in txt form from an image or pdf extension supporting multiple files from directory using pytesseract with auto rotation for wrong orientation. PYPI:
- Host: GitHub
- URL: https://github.com/aryaminus/saram
- Owner: aryaminus
- License: mit
- Created: 2018-02-09T17:01:54.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-27T15:33:41.000Z (about 2 years ago)
- Last Synced: 2024-04-27T23:33:05.820Z (9 months ago)
- Topics: character-recognition, chmod, image, ocr, orientation-detection, pdf, pillow, pyocr, pytesseract, python, tesseract, wand
- Language: Python
- Homepage: https://pypi.python.org/pypi/saram
- Size: 34.2 KB
- Stars: 51
- Watchers: 8
- Forks: 18
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Saram - Image/PDF OCR detection system
Extract OCR text in txt form from images or PDFs, including multiple files from a directory, using `pytesseract`, with automatic rotation when the orientation is wrong.

**Currently in beta state**
Follow: Demo run
[![Saram features](https://i.imgur.com/M9dAwPq.gif)](https://youtu.be/YF6Tf7qOXU4)
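As a rough illustration of the image path described above, the sketch below collects image files from a directory, asks Tesseract for the page orientation, rotates the image if needed, and writes the recognized text next to the source file. It is not taken from Saram's source: the helper names (`find_images`, `rotation_from_osd`, `ocr_image_to_txt`), the extension list, and the use of `pytesseract.image_to_osd` for orientation detection are assumptions made for this example.

```python
import re
from pathlib import Path

# Assumption for this sketch: which extensions count as images.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def find_images(directory):
    """Return image files directly inside `directory`, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def rotation_from_osd(osd_text):
    """Parse the 'Rotate: N' line of Tesseract's OSD report; 0 if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def ocr_image_to_txt(image_path):
    """OCR one image, correcting orientation first, and write `<name>.txt`."""
    # Imported lazily so the helpers above work without the OCR stack.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        try:
            angle = rotation_from_osd(pytesseract.image_to_osd(image))
        except pytesseract.TesseractError:
            angle = 0  # orientation detection can fail on sparse pages
        # Assumption: OSD reports the clockwise correction, while Pillow
        # rotates counter-clockwise, hence the negated angle.
        upright = image.rotate(-angle, expand=True) if angle else image
        text = pytesseract.image_to_string(upright)

    txt_path = Path(image_path).with_suffix(".txt")
    txt_path.write_text(text, encoding="utf-8")
    return txt_path
```

Calling `ocr_image_to_txt(path)` for each entry of `find_images(".")` would process the current directory; doing so requires `pytesseract`, `Pillow`, and a working `tesseract` install.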
**Note:**
Make sure you have an OCR tool such as `tesseract` installed, together with the language data it needs for recognition, e.g. `tesseract-data-eng`. `Pillow` and `Wand`, used for image loading and conversion, will be fetched during pip install.

**For using in Python**:
Refer to the py-module branch

## Installation
Install using PIP:
```
$ pip install saram
$ saram
```
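To confirm that the prerequisites from the note above are in place, a small check along these lines may help. It is a sketch and not part of Saram: `check_prerequisites` is a hypothetical helper, and it assumes the usual import names `PIL` for Pillow and `wand` for Wand.

```python
import importlib.util
import shutil


def check_prerequisites(binaries=("tesseract",),
                        modules=("PIL", "wand", "pytesseract")):
    """Map each prerequisite name to True if it looks available, else False."""
    # Executables are looked up on PATH.
    status = {name: shutil.which(name) is not None for name in binaries}
    # Python packages are probed without actually importing them.
    for module in modules:
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

This only checks that the pieces can be located; it does not verify that the Tesseract language data is installed.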
***else***

Clone the source locally:
```
$ git clone https://github.com/aryaminus/saram
$ cd saram
$ git checkout py-module
$ python main.py
```

## Todo
- [x] Add support for PDF by PDF -> Image -> Txt with converted image deletion after processing
- [x] Double check for orientation in case of image and PDF
- [x] Make a PIP package
- [ ] Add NLP to process the most frequently repeated characters to filter content
- [ ] Add Cloud Vision support for effective character recognition
- [ ] Support for GUI using tkinter

## Reference
1. pdf-to-txt
2. ocr-convert-image-to-text
3. fix-image-rotation
4. python-packaging

-----
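The PDF -> Image -> Txt flow with converted image deletion, listed under Todo, could be sketched roughly as follows. This is an illustration under stated assumptions rather than Saram's actual code: `pdf_to_txt` and `render_pages_with_wand` are hypothetical names, the 300 dpi resolution is an arbitrary choice, and the OCR step is passed in as a callable (`ocr_page`) so the flow does not depend on a particular OCR library.

```python
import tempfile
from pathlib import Path


def pdf_to_txt(pdf_path, ocr_page, render_pages=None):
    """Render each PDF page to a temporary PNG, OCR it, write `<name>.txt`.

    `ocr_page(png_path) -> str` and `render_pages(pdf_path, out_dir) -> list`
    are injected so the flow can be exercised without Wand or Tesseract.
    """
    render_pages = render_pages or render_pages_with_wand
    # The temporary directory, and every page image in it, is deleted on exit.
    with tempfile.TemporaryDirectory() as tmp_dir:
        page_paths = render_pages(pdf_path, Path(tmp_dir))
        text = "\n".join(ocr_page(page) for page in page_paths)
    txt_path = Path(pdf_path).with_suffix(".txt")
    txt_path.write_text(text, encoding="utf-8")
    return txt_path


def render_pages_with_wand(pdf_path, out_dir, resolution=300):
    """Save each page of the PDF as `page-NNNN.png` in `out_dir` using Wand."""
    from wand.image import Image as WandImage  # lazy: needs ImageMagick

    paths = []
    with WandImage(filename=str(pdf_path), resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            with WandImage(image=page) as page_image:
                page_image.format = "png"
                path = out_dir / f"page-{index:04d}.png"
                page_image.save(filename=str(path))
                paths.append(path)
    return paths
```

Keeping the intermediate images in a `TemporaryDirectory` means they are removed even if OCR raises partway through, which matches the "deletion after processing" goal without explicit cleanup code.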
## Contributing
1. Fork it ()
2. Create your feature branch (`git checkout -b feature/fooBar`)
3. Commit your changes (`git commit -am 'Add some fooBar'`)
4. Push to the branch (`git push origin feature/fooBar`)
5. Create a new Pull Request

**Enjoy!**