https://github.com/macabeus/pyslibtesseract

✏️ Integration of Tesseract for Python using a shared library
https://github.com/macabeus/pyslibtesseract

hocr ocr tesseract


Host: GitHub
URL: https://github.com/macabeus/pyslibtesseract
Owner: macabeus
Created: 2015-12-20T05:54:06.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2016-03-25T17:40:49.000Z (over 10 years ago)
Last Synced: 2025-08-18T12:54:49.445Z (11 months ago)
Topics: hocr, ocr, tesseract
Language: Python
Homepage: https://pypi.org/project/pyslibtesseract/
Size: 33.2 KB
Stars: 12
Watchers: 2
Forks: 2
Open Issues: 1
Metadata Files:
- Readme: README.md

README

# pyslibtesseract

Integration of Tesseract for Python using a shared library

## To install

From PyPI:

    sudo pip3 install pyslibtesseract

From GitHub:

    sudo apt-get install libtesseract-dev
    sudo apt-get install libleptonica-dev
    git clone https://github.com/brunomacabeusbr/pyslibtesseract.git
    cd pyslibtesseract
    cd src/cppcode/ && cmake . && make && cd ../.. && sudo python3 setup.py install

## To use

### Start

You must first create a `TesseractConfig` object:

    config_single_char = TesseractConfig(psm=PageSegMode.PSM_SINGLE_CHAR)
    config_line = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE)
    config_line_portuguese_brazilian = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE, lang='pt-br')

The possible PSM (page segmentation mode) values are:

    PSM_OSD_ONLY
    PSM_AUTO_OSD
    PSM_AUTO_ONLY
    PSM_AUTO
    PSM_SINGLE_COLUMN
    PSM_SINGLE_BLOCK_VERT_TEX
    PSM_SINGLE_BLOCK
    PSM_SINGLE_LINE
    PSM_SINGLE_WORD
    PSM_CIRCLE_WORD
    PSM_SINGLE_CHAR
    PSM_SPARSE_TEXT
    PSM_SPARSE_TEXT_OSD
    PSM_COUNT

You can also set Tesseract variables:

    config_single_char.add_variable('tessedit_char_whitelist', 'QWERTYUIOPASDFGHJKLZXCVBNM')
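
The whitelist value is just a string of allowed characters, so it can be built from the standard library rather than typed by hand. The sketch below assumes nothing about pyslibtesseract itself; the commented-out call at the end only repeats the `add_variable` usage shown above.

```python
import string

# Characters the OCR engine should be allowed to return: A-Z only.
whitelist = string.ascii_uppercase

# Sorting both strings shows this is the same character set as the
# keyboard-order literal used in the example above.
same_set = sorted(whitelist) == sorted('QWERTYUIOPASDFGHJKLZXCVBNM')
print(same_set)  # True

# Usage with the config object from the section above (not executed here):
# config_single_char.add_variable('tessedit_char_whitelist', whitelist)
```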

### Read

The first parameter is always the configuration and the second parameter is always the image path.

Read a phrase:

    >>> LibTesseract.simple_read(config_line, 'phrase.png')
    the book is on the table

Read a phrase and get the confidence of each word:

    >>> LibTesseract.read_and_get_confidence_word(config_line, 'phrase.png')
    [('he', 82.19984436035156), ('is', 84.98550415039062), ('readlnq', 75.25213623046875), ('the', 74.60755157470703), ('book', 85.8053207397461)]
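
The list of `(word, confidence)` tuples lends itself to simple post-processing. As a minimal sketch, assuming confidences are on a 0 to 100 scale as the sample output suggests, the helpers below flag low-confidence words and compute an average. `low_confidence_words`, `mean_confidence` and the threshold of 80 are hypothetical and not part of pyslibtesseract.

```python
from statistics import mean

# Sample data copied from the output above; in practice this would be the
# return value of LibTesseract.read_and_get_confidence_word(...).
words = [
    ('he', 82.19984436035156),
    ('is', 84.98550415039062),
    ('readlnq', 75.25213623046875),
    ('the', 74.60755157470703),
    ('book', 85.8053207397461),
]


def low_confidence_words(results, threshold=80.0):
    """Return the words whose confidence is below `threshold` (hypothetical helper)."""
    return [word for word, conf in results if conf < threshold]


def mean_confidence(results):
    """Return the average confidence over all words, or 0.0 for an empty list."""
    return mean(conf for _, conf in results) if results else 0.0


suspicious = low_confidence_words(words)
print(suspicious)  # ['readlnq', 'the']
print(round(mean_confidence(words), 2))
```

Words below the threshold are candidates for manual review or for a second pass with a different configuration.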

Read a single character and get the confidence for it and for other candidate characters:

    >>> LibTesseract.read_and_get_confidence_char(config_single_char, 'char.png')
    [('E', 58.27500915527344), ('Y', 56.93630599975586), ('F', 56.4453125), ('T', 51.12168884277344), ('Q', 47.19916534423828), ('W', 46.1181640625), ('V', 45.31656265258789), ('G', 43.49636459350586)]
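
When several candidates have similar confidence, as in the output above, it can help to check how far the best candidate is ahead of the runner-up. The following is a sketch only: `best_candidate` and the 5-point margin are hypothetical choices, not something pyslibtesseract provides.

```python
# Sample data copied (and shortened) from the output above; in practice this
# would be the return value of LibTesseract.read_and_get_confidence_char(...).
candidates = [
    ('E', 58.27500915527344),
    ('Y', 56.93630599975586),
    ('F', 56.4453125),
    ('T', 51.12168884277344),
]


def best_candidate(results, min_margin=5.0):
    """Return (char, is_ambiguous) for a list of (char, confidence) tuples.

    The result is ambiguous when the runner-up is within `min_margin`
    confidence points of the best candidate. Hypothetical helper.
    """
    if not results:
        return None, True
    # Sort explicitly instead of relying on the input order.
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    best_char, best_conf = ranked[0]
    if len(ranked) == 1:
        return best_char, False
    margin = best_conf - ranked[1][1]
    return best_char, margin < min_margin


char, ambiguous = best_candidate(candidates)
print(char, ambiguous)  # E True
```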

### hOCR

If you want the result in hOCR format, create a config with `hocr=True`:

    >>> config_line_with_hocr = TesseractConfig(psm=PageSegMode.PSM_SINGLE_LINE, hocr=True)

or edit an existing config:

    >>> config_line.hocr = True

Then use the `simple_read` method:

    >>> LibTesseract.simple_read(config_line_with_hocr, 'phrase.png')

      


       

        

         the book is on the table
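
hOCR is HTML-based, so the returned string can be parsed with the standard library. The sketch below assumes the markup follows the usual hOCR conventions, with each word in a `span` of class `ocrx_word` and a `bbox` entry in its `title` attribute. The `sample_hocr` string and the `HocrWordParser` class are hypothetical illustrations, not actual pyslibtesseract output or API.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word being read, or None outside a word span

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get('class') or '').split()
        if tag == 'span' and 'ocrx_word' in classes:
            self._current = {'text': '', 'bbox': None}
            # The title holds ';'-separated properties, e.g. "bbox 5 5 40 35; x_wconf 90".
            for prop in (attr.get('title') or '').split(';'):
                parts = prop.split()
                if len(parts) == 5 and parts[0] == 'bbox':
                    self._current['bbox'] = tuple(int(v) for v in parts[1:])

    def handle_data(self, data):
        if self._current is not None:
            self._current['text'] += data

    def handle_endtag(self, tag):
        if tag == 'span' and self._current is not None:
            self._current['text'] = self._current['text'].strip()
            self.words.append(self._current)
            self._current = None


# Hand-written sample for illustration; real output will differ in detail.
sample_hocr = """
<div class='ocr_page' title='bbox 0 0 200 40'>
 <span class='ocr_line' title='bbox 5 5 195 35'>
  <span class='ocrx_word' title='bbox 5 5 40 35; x_wconf 90'>the</span>
  <span class='ocrx_word' title='bbox 45 5 95 35; x_wconf 88'>book</span>
 </span>
</div>
"""

parser = HocrWordParser()
parser.feed(sample_hocr)
print([w['text'] for w in parser.words])  # ['the', 'book']
```

In practice, the string returned by `simple_read` with `hocr=True` would be passed to `parser.feed` in place of `sample_hocr`.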
