https://github.com/lucidprogrammer/ocraccuracyreporter

ocr python text-analysis

Last synced: 5 months ago
JSON representation

Host: GitHub
URL: https://github.com/lucidprogrammer/ocraccuracyreporter
Owner: lucidprogrammer
Created: 2018-02-13T12:56:53.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-02-16T02:27:00.000Z (over 8 years ago)
Last Synced: 2025-08-29T23:30:20.124Z (10 months ago)
Topics: ocr, python, text-analysis
Language: Python
Size: 4.88 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.rst

Awesome Lists containing this project

README

          ============

Overview

============

Your OCR pipeline may have various stages and may use various tools.

You need a simple way to run sample/s as a whole or piece by piece and have a way to say that the OCR accuracy is say 98%.

=========

Usage

=========

>>> pip install ocraccuracyreporter

>>> from ocraccuracyreporter.oar import oar

.. topic:: initialising the reporter

>>> oreport = oar(expected='john', given='joh', label='name')

>>> print(oreport)

>>> name,john,joh,86,100,86,86,94,1

or you may have various ocr results for the same item, so you may want to initialise the expected alone

with or without a label

>>> oreport = oar(expected='john', label='name')

>>> oreport.given = 'joh'

>>> repr(oreoprt)

if you are creating a csv report with header info

>>>label,expected,given,ratio,partial_ratio,token_sort_ratio,token_set_ratio,jaro_winkler,distance

  name,john,joh,86,100,86,86,94,1

  .. topic:: Items in the report

  ratio - uses pure Levenshtein Distance based matching

          (100 - means perfect match)

  partial_ratio - matches based on best substrings

  token_sort_ratio - tokenizes the strings and sorts them alphabetically

  token_set_ratio - tokenizes the strings and compared the intersection

  jaro_winkler - this algorithm giving more weight to common prefix

                 (for example, some parts are good, missing others)

  distance - this shows how many characters are really different in given

             compared to expected

=========

Class variables

=========

label  - a meaningful name for the ocr string.

expected - expected result

given - result you got out of ocr pipeline

total_expected_char_count - calculated expected char count

total_expected_word_count - calculated expected word count

total_given_char_count - calculated given char count

total_given_word_count - calculated given word count

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/lucidprogrammer/ocraccuracyreporter

Awesome Lists containing this project

README