https://github.com/macabeus/pyslibtesseract
✏️ Integration of Tesseract for Python using a shared library
- Host: GitHub
- URL: https://github.com/macabeus/pyslibtesseract
- Owner: macabeus
- Created: 2015-12-20T05:54:06.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2016-03-25T17:40:49.000Z (over 10 years ago)
- Last Synced: 2025-08-18T12:54:49.445Z (10 months ago)
- Topics: hocr, ocr, tesseract
- Language: Python
- Homepage: https://pypi.org/project/pyslibtesseract/
- Size: 33.2 KB
- Stars: 12
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# pyslibtesseract
Integration of Tesseract for Python using a shared library
## To install
From PyPI:

    sudo pip3 install pyslibtesseract

From GitHub:

    sudo apt-get install libtesseract-dev
    sudo apt-get install libleptonica-dev
    git clone https://github.com/brunomacabeusbr/pyslibtesseract.git
    cd pyslibtesseract
    cd src/cppcode/ && cmake . && make && cd ../.. && sudo python3 setup.py install
## To use
### Start
First, create a `TesseractConfig` object:

    config_single_char = TesseractConfig(psm=PageSegMode.PSM_SINGLE_CHAR)
    config_line = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE)
    config_line_portuguese_brazilian = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE, lang='pt-br')
The available page segmentation modes (PSM) are:

    PSM_OSD_ONLY
    PSM_AUTO_OSD
    PSM_AUTO_ONLY
    PSM_AUTO
    PSM_SINGLE_COLUMN
    PSM_SINGLE_BLOCK_VERT_TEX
    PSM_SINGLE_BLOCK
    PSM_SINGLE_LINE
    PSM_SINGLE_WORD
    PSM_CIRCLE_WORD
    PSM_SINGLE_CHAR
    PSM_SPARSE_TEXT
    PSM_SPARSE_TEXT_OSD
    PSM_COUNT
You can also set Tesseract variables on a config:

    config_single_char.add_variable('tessedit_char_whitelist', 'QWERTYUIOPASDFGHJKLZXCVBNM')
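
The whitelist value is a plain string of allowed characters, so it can be built programmatically rather than typed by hand. A minimal sketch using only the standard library; `build_whitelist` is a hypothetical helper for illustration and is not part of pyslibtesseract:

```python
import string


def build_whitelist(*groups: str) -> str:
    """Join character groups, dropping duplicates but keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so it works as an ordered set.
    return "".join(dict.fromkeys("".join(groups)))


uppercase_only = build_whitelist(string.ascii_uppercase)
alphanumeric = build_whitelist(string.ascii_uppercase, string.digits)

# Same set of characters as the hand-typed whitelist, in alphabetical order.
assert set(uppercase_only) == set("QWERTYUIOPASDFGHJKLZXCVBNM")
```

The resulting string can then be passed as the second argument of `add_variable('tessedit_char_whitelist', ...)`.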
### Read
The first parameter is always the configuration and the second is always the image path.

Read a phrase:

    >>> LibTesseract.simple_read(config_line, 'phrase.png')
    the book is on the table
Read a phrase and get the confidence for each word:

    >>> LibTesseract.read_and_get_confidence_word(config_line, 'phrase.png')
    [('he', 82.19984436035156), ('is', 84.98550415039062), ('readlnq', 75.25213623046875), ('the', 74.60755157470703), ('book', 85.8053207397461)]
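
Because the result is a plain list of `(word, confidence)` tuples, it can be post-processed with ordinary Python. The sketch below reuses the sample output shown above and separates words that fall under a confidence threshold. `split_by_confidence` is a hypothetical helper, and the threshold of 80 is an arbitrary value chosen for illustration:

```python
from statistics import mean

# Sample result copied from the README output above.
words = [
    ('he', 82.19984436035156),
    ('is', 84.98550415039062),
    ('readlnq', 75.25213623046875),
    ('the', 74.60755157470703),
    ('book', 85.8053207397461),
]


def split_by_confidence(results, threshold):
    """Return (accepted, rejected) lists of (word, confidence) pairs."""
    accepted = [(w, c) for w, c in results if c >= threshold]
    rejected = [(w, c) for w, c in results if c < threshold]
    return accepted, rejected


# The threshold is an assumption; tune it for your own images.
accepted, rejected = split_by_confidence(words, threshold=80.0)
print("accepted:", [w for w, _ in accepted])
print("needs review:", [w for w, _ in rejected])
print("mean confidence:", round(mean(c for _, c in words), 2))
```

With this sample, the misread word `readlnq` ends up in the "needs review" group, which is the kind of case a threshold is meant to catch.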
Read a single character and get the confidence for it and for other candidate characters:

    >>> LibTesseract.read_and_get_confidence_char(config_single_char, 'char.png')
    [('E', 58.27500915527344), ('Y', 56.93630599975586), ('F', 56.4453125), ('T', 51.12168884277344), ('Q', 47.19916534423828), ('W', 46.1181640625), ('V', 45.31656265258789), ('G', 43.49636459350586)]
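
When several candidates have similar confidence, the top result may not be trustworthy. The sketch below reuses the sample output shown above, picks the best candidate, and measures its lead over the runner-up. `best_with_margin` is a hypothetical helper, and the 5-point cutoff is an assumption. The list is sorted explicitly so the code does not rely on the order returned by the library:

```python
# Sample result copied from the README output above.
candidates = [
    ('E', 58.27500915527344), ('Y', 56.93630599975586),
    ('F', 56.4453125), ('T', 51.12168884277344),
    ('Q', 47.19916534423828), ('W', 46.1181640625),
    ('V', 45.31656265258789), ('G', 43.49636459350586),
]


def best_with_margin(results):
    """Return (best_char, best_confidence, margin_over_runner_up)."""
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    best_char, best_conf = ranked[0]
    # With a single candidate there is no runner-up, so the margin is the full score.
    runner_up_conf = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_char, best_conf, best_conf - runner_up_conf


char, conf, margin = best_with_margin(candidates)
is_ambiguous = margin < 5.0  # hypothetical cutoff, not a library default
print(char, round(conf, 2), round(margin, 2), is_ambiguous)
```

Here `E` wins by a little over one point, so the read would be flagged as ambiguous under this cutoff.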
### hOCR
To get the result in hOCR format, create a config with `hocr=True`:

    >>> config_line_with_hocr = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE, hocr=True)

or enable it on an existing config:

    >>> config_line.hocr = True

Then call `simple_read`:

    >>> LibTesseract.simple_read(config_line_with_hocr, 'phrase.png')
    the book is on the table
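
hOCR is HTML with layout data stored in `title` attributes, so the words and their bounding boxes can be extracted with the standard library. The sketch below assumes the returned string follows the usual hOCR conventions (`ocrx_word` spans with a `bbox x0 y0 x1 y1` property). The `sample_hocr` string is hand-written for illustration and is not actual output from pyslibtesseract; `HocrWordParser` and `parse_bbox` are hypothetical helpers:

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR `ocrx_word` spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False


# Hand-written sample, NOT real output from the library.
sample_hocr = """
<div class='ocr_page' title='bbox 0 0 400 60'>
 <span class='ocr_line' title='bbox 10 12 190 40'>
  <span class='ocrx_word' title='bbox 10 12 48 40; x_wconf 90'>the</span>
  <span class='ocrx_word' title='bbox 56 12 110 40; x_wconf 88'>book</span>
 </span>
</div>
"""

parser = HocrWordParser()
parser.feed(sample_hocr)
print(parser.words)
```

In real use, the string returned by `simple_read` with `hocr=True` would take the place of `sample_hocr`.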