https://github.com/aryaminus/saram

Get OCR output in txt form from an image or PDF, with support for multiple files in a directory, using pytesseract with automatic rotation for wrongly oriented input.


Host: GitHub
URL: https://github.com/aryaminus/saram
Owner: aryaminus
License: mit
Created: 2018-02-09T17:01:54.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2022-12-27T15:33:41.000Z (about 2 years ago)
Last Synced: 2024-04-27T23:33:05.820Z (9 months ago)
Topics: character-recognition, chmod, image, ocr, orientation-detection, pdf, pillow, pyocr, pytesseract, python, tesseract, wand
Language: Python
Homepage: https://pypi.python.org/pypi/saram
Size: 34.2 KB
Stars: 51
Watchers: 8
Forks: 18
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Saram - Image/PDF OCR detection system
Get OCR output in txt form from an image or PDF, with support for multiple files in a directory, using `pytesseract`, with automatic rotation when the input is wrongly oriented.

**Currently in beta state**

Demo run:

[![Saram features](https://i.imgur.com/M9dAwPq.gif)](https://youtu.be/YF6Tf7qOXU4)

**Note:**
Make sure you have an OCR tool such as `tesseract` and the language data it needs for recognition, e.g. `tesseract-data-eng`. `Pillow` and `Wand`, used for image loading and conversion, are fetched during pip install.
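
A quick way to confirm these requirements before running is sketched below. It is not part of Saram: `missing_requirements` is a hypothetical helper, and the module names assume the usual import names (`PIL` for Pillow, `wand` for Wand, `pytesseract`).

```python
import importlib.util
import shutil


def missing_requirements(binary="tesseract", modules=("PIL", "wand", "pytesseract")):
    """Return descriptions of requirements that cannot be found (hypothetical helper)."""
    missing = []
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which(binary) is None:
        missing.append(f"executable: {binary}")
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(f"module: {name}")
    return missing
```

Calling `missing_requirements()` returns an empty list when everything is found. It does not check for language data such as `tesseract-data-eng`.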

**To use from Python:**
Refer to the `py-module` branch.
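
For orientation, here is a minimal sketch of how rotation correction and OCR can be combined directly with `pytesseract` and `Pillow`. It is not the code from the `py-module` branch: `parse_rotation` and `ocr_with_rotation` are hypothetical names, and the sketch assumes that the `Rotate:` value in Tesseract's OSD output is the clockwise correction in degrees, which is worth verifying against your Tesseract version.

```python
import re


def parse_rotation(osd_text):
    """Extract the 'Rotate: N' value (degrees) from Tesseract OSD output; 0 if absent."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) % 360 if match else 0


def ocr_with_rotation(image_path, lang="eng"):
    """Open an image, undo the rotation Tesseract detects, and return the OCR text."""
    # Imported lazily so parse_rotation stays usable without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        try:
            angle = parse_rotation(pytesseract.image_to_osd(image))
        except pytesseract.TesseractError:
            # OSD can fail on images with too little text; fall back to no rotation.
            angle = 0
        if angle:
            # Assumption: "Rotate" is a clockwise correction, while PIL's rotate()
            # turns counter-clockwise, hence the negated angle.
            image = image.rotate(-angle, expand=True)
        return pytesseract.image_to_string(image, lang=lang)
```

To process multiple files from a directory, `ocr_with_rotation` can be called once per image path and the returned text written to a `.txt` file next to each input.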

## Installation

Install using pip:
```
$ pip install saram
$ saram
```
***else***

Clone the source locally:
```
$ git clone https://github.com/aryaminus/saram
$ cd saram
$ git checkout py-module
$ python main.py
```

## Todo
- [x] Add support for PDF by PDF -> Image -> Txt with converted image deletion after processing
- [x] Double check for orientation in case of image and PDF
- [x] Make a PIP package
- [ ] Add NLP to process the most frequently repeated characters to filter content
- [ ] Add Cloud Vision support for more effective character recognition
- [ ] Support for GUI using tkinter
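
The first completed item (PDF -> Image -> Txt with converted image deletion after processing) could be structured roughly as follows. This is a sketch under stated assumptions, not Saram's actual implementation: `pdf_to_text` and `render_with_wand` are hypothetical names, and the renderer and OCR step are passed in as callables so that cleanup stays independent of the libraries used.

```python
import tempfile
from pathlib import Path


def pdf_to_text(pdf_path, render_pages, ocr_image):
    """Render a PDF to temporary images, OCR each one, then delete the images.

    render_pages(pdf_path, out_dir) -> list of image paths (hypothetical callable)
    ocr_image(image_path) -> str                           (hypothetical callable)
    """
    # TemporaryDirectory removes the rendered images on exit, even if OCR raises.
    with tempfile.TemporaryDirectory() as out_dir:
        pages = render_pages(pdf_path, Path(out_dir))
        return "\n".join(ocr_image(page) for page in pages)


def render_with_wand(pdf_path, out_dir, resolution=300):
    """One possible render_pages: write each PDF page as a PNG using Wand."""
    # Lazy import; Wand needs ImageMagick, and reading PDFs also needs Ghostscript.
    from wand.image import Image as WandImage

    paths = []
    with WandImage(filename=str(pdf_path), resolution=resolution) as pdf:
        for index, page in enumerate(pdf.sequence):
            with WandImage(image=page) as page_image:
                page_image.format = "png"
                target = out_dir / f"page-{index:04d}.png"
                page_image.save(filename=str(target))
                paths.append(target)
    return paths
```

A call such as `pdf_to_text("scan.pdf", render_with_wand, my_ocr)` would return the concatenated page text, where `my_ocr` stands for whatever image-to-text function is in use and `scan.pdf` is a placeholder path.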

-----------------------------------------------------------------------------------------------------------

## Contributing

1. Fork it
2. Create your feature branch (`git checkout -b feature/fooBar`)
3. Commit your changes (`git commit -am 'Add some fooBar'`)
4. Push to the branch (`git push origin feature/fooBar`)
5. Create a new Pull Request

**Enjoy!**