https://github.com/meh/ruby-tesseract-ocr
A Ruby wrapper library to the tesseract-ocr API.
ruby rubynlp tesseract-ocr wrapper
- Host: GitHub
- URL: https://github.com/meh/ruby-tesseract-ocr
- Owner: meh
- Created: 2011-11-26T16:56:35.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2017-07-02T12:07:20.000Z (almost 7 years ago)
- Last Synced: 2024-01-07T10:54:36.560Z (6 months ago)
- Topics: ruby, rubynlp, tesseract-ocr, wrapper
- Language: Ruby
- Size: 562 KB
- Stars: 629
- Watchers: 27
- Forks: 74
- Open Issues: 24
Metadata Files:
- Readme: README.md
Lists
- nlp-with-ruby - tesseract-ocr - (Optical Character Recognition / Text-to-Speech-to-Text)
- awesome-ocr - ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby (Software / OCR libraries by programming language)
README
ruby-tesseract - Ruby bindings and wrapper
==========================================
This wrapper binds the TessBaseAPI object through ffi-inline (which means it
will work on JRuby too) and then proceeds to wrap said API in a more ruby-esque
Engine class.

Making it work
--------------
To make this library work you need the tesseract-ocr and leptonica libraries and
headers, and a C++ compiler.

The gem is called `tesseract-ocr`.

If you're on a distribution that separates the libraries from headers, remember
to install the *-dev* package.

On Debian you will need to install `libleptonica-dev` and `libtesseract-dev`.

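
Once the libraries and the gem are installed, a quick way to confirm that the
bindings load is sketched below. The helper name `tesseract_available?` is
hypothetical and not part of the gem; it only assumes that `require 'tesseract'`
raises `LoadError` when the gem cannot be found. Failures while ffi-inline
compiles the bindings may raise other errors, which this sketch does not rescue.

```ruby
# Hypothetical helper, not part of the gem.
# Returns true when the bindings load, false when the gem cannot be found.
def tesseract_available?
  require 'tesseract'
  true
rescue LoadError => e
  # LoadError is not a StandardError, so it has to be rescued explicitly.
  warn "tesseract bindings not loadable: #{e.message}"
  false
end

if tesseract_available?
  puts 'tesseract-ocr is ready'
else
  puts 'install the gem and the -dev packages first'
end
```
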
Examples
--------
The following examples show the functionality provided by tesseract-ocr.

### Basic functionality of tesseract

```ruby
require 'tesseract'

e = Tesseract::Engine.new {|e|
  e.language  = :eng # recognize English text
  e.blacklist = '|'  # characters the engine must not output
}

e.text_for('test/first.png').strip # => 'ABC'
```

You can pass to `#text_for` either a path, an IO object, a string containing
the image or an object that responds to `#to_blob` (for example
`Magick::Image`). Keep in mind that the format has to be supported by leptonica.

### Accessing advanced features

With advanced features you get access to blocks, paragraphs, lines, words and
symbols.

Replace **level** in method names with either `block`, `paragraph`, `line`,
`word` or `symbol`.

The following accessors need a block to be passed, and they pass each `Element`
object to that block. The `Element` object has various getters to access
certain features; I'll talk about them later.

The methods are:

* `each_level`
* `each_level_for`
* `each_level_at`

The following accessors instead return an `Array` of `Element`s with cached
getters. The getters are cached because the values accessible in the `Element`
are linked to the state of the internal API, and that state changes if you
access something else.

The methods are:

* `levels`
* `levels_for`
* `levels_at`

Again, to `*_for` methods you can pass whatever you can pass to `#text_for`.

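
As a minimal sketch of the two accessor styles, the helpers below take any
object that responds to `each_word_for` and `words_for`, which is what the
**level** substitution rule gives for `word`. The helper names
`word_confidences` and `low_confidence_words` are hypothetical and not part of
the gem, and the threshold scale is left to the caller because this README does
not state the range of `confidence`. The `text` and `confidence` getters are
among the `Element` getters listed below.

```ruby
# Hypothetical helpers, not part of the gem. `engine` is expected to behave
# like Tesseract::Engine; `image` is anything accepted by #text_for.

# Block style: copy the values out while the element is being yielded,
# because the live Element reflects the engine's current internal state.
def word_confidences(engine, image)
  pairs = []
  engine.each_word_for(image) do |word|
    pairs << [word.text, word.confidence]
  end
  pairs
end

# Array style: #words_for returns Elements with cached getters, so they can
# still be inspected after recognition has moved on.
def low_confidence_words(engine, image, threshold)
  engine.words_for(image)
        .select { |word| word.confidence < threshold }
        .map(&:text)
end

# Usage with the real engine (needs the gem and its native libraries):
#
#   require 'tesseract'
#   engine = Tesseract::Engine.new { |e| e.language = :eng }
#   word_confidences(engine, 'test/first.png')
#   low_confidence_words(engine, 'test/first.png', 60) # 60 is an arbitrary cutoff
```
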

Each `Element` object has the following getters:

* `bounding_box`, this will return the box in which the element is confined
* `binary_image`, this will return the bichromatic image of the element
* `image`, this will return the image of the element
* `baseline`, this will return the line where the text is, as a pair of
  coordinates
* `orientation`, this will return the orientation of the element
* `text`, this will return the text of the element
* `confidence`, this will return the confidence of correctness for the element

`Block` elements also have `type` accessors that specify the type of the block.

`Word` elements also have `font_attributes`, `from_dictionary?` and `numeric?`
getters.

`Symbol` elements also have `superscript?`, `subscript?` and `dropcap?`
getters.

### hOCR

```ruby
require 'tesseract'

e = Tesseract::Engine.new {|e|
  e.language  = :eng
  e.blacklist = '|'
}

puts e.hocr_for('test/first.png')
```

You can pass to `#hocr_for` either a path, an IO object, a string containing
the image or an object that responds to `#to_blob` (for example
`Magick::Image`). Keep in mind that the format has to be supported by leptonica.

Please note you have to pass `#hocr_for` the page you want to get the output of
as well.

Using the binary
----------------
You can also use the shipped executable in the following way:

```bash
> tesseract.rb -h
Usage: tesseract [options]
        --path PATH                  datapath to set
    -l, --language LANGUAGE          language to use
    -m, --mode MODE                  mode to use
    -p, --psm MODE                   page segmentation mode to use
    -u, --unlv                       output in UNLV format
    -c, --confidence                 output the mean confidence of the recognition
    -C, --config PATH...             config files to load
    -b, --blacklist LIST             blacklist the following chars
    -w, --whitelist LIST             whitelist the following chars

> tesseract.rb test/first.png
ABC

> tesseract.rb -c test/first.png
86
```

License
-------
The license is BSD one clause.