https://github.com/shreeshrii/kraken_devanagari
Kraken models for Devanagari
- Host: GitHub
- URL: https://github.com/shreeshrii/kraken_devanagari
- Owner: Shreeshrii
- Created: 2020-02-24T14:39:12.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-03-03T04:39:51.000Z (about 5 years ago)
- Last Synced: 2025-01-26T02:32:02.523Z (4 months ago)
- Topics: devanagari, kraken, ocr, sanskrit, training-data
- Language: Shell
- Size: 44.1 MB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# kraken_devanagari
Experimental Devanagari Recognition model for [kraken](https://github.com/mittagessen/kraken).
## devanew_best.mlmodel
Recognizer for Devanagari script for kraken (uses the old `bbox` type segmentation).
### Training
The model was trained using `kraken version 2.0.8` on synthetic training data (line images from ground truth text files and fonts) generated using [tesseract's text2image](https://github.com/tesseract-ocr/tesseract) and [kraken's linegen](https://github.com/mittagessen/kraken/blob/master/kraken/linegen.py). See [log](https://github.com/Shreeshrii/kraken_devanagari/blob/master/devanew.log) for details of training.
* Training set: 38761 lines
* Validation set: 4307 lines
* Alphabet: 133 symbols
* Accuracy on validation set: 0.9795386542342217

A sample of the training data used is available in the `devatrain` and `legacytrain` directories.
Complete manifest of training data is available in [devanew-manifest.txt](https://github.com/Shreeshrii/kraken_devanagari/blob/master/devanew-manifest.txt).
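
The line counts above (38761 training, 4307 validation, 43068 in total) are consistent with a 90/10 split of the ground truth. The sketch below shows how such a split could be produced. It is only an illustration: the helper `partition_lines`, the file names, the seed, and the 0.9 ratio are assumptions made for this example, not something taken from the training log.

```python
import random


def partition_lines(lines, ratio=0.9, seed=0):
    """Shuffle line identifiers and split them into (train, validation).

    `ratio` is the fraction assigned to the training set. The value 0.9 is
    an assumption chosen because it reproduces the counts in this README.
    """
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the real line images.
lines = [f"line_{i:05d}.png" for i in range(38761 + 4307)]
train, val = partition_lines(lines)
print(len(train), len(val))
```

With 43068 lines and a ratio of 0.9 the cut falls at 38761, leaving 4307 lines for validation.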
### Evaluation
The model was evaluated on line images similar to the training data and achieved an average accuracy of approximately 95%.
* devatest: 95.48% accuracy
* legacytest: 95.48% accuracy
### Conclusions
The segmentation algorithm of kraken is suited for Latin script and fails for certain types of Devanagari page images.
The accuracy on page images with typefaces unlike those in the training data will be lower.
The model can be fine-tuned further for specific requirements, e.g. for one particular font or one particular scanned book, which should give better accuracy on that material.
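
When fine-tuning for a particular font or book, it helps to measure accuracy on a few held-out lines before and after. The sketch below computes a character accuracy as one minus the edit distance divided by the reference length, counted over Unicode code points. This is an assumed definition for illustration and is not kraken's own evaluation code. The function names and the sample strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def character_accuracy(pairs):
    """Accuracy over (reference, recognized) line pairs."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    total = sum(len(ref) for ref, _ in pairs)
    return 1.0 - errors / total


# Hypothetical example: the recognizer dropped one vowel sign.
pairs = [("देवनागरी", "देवनगरी")]
print(character_accuracy(pairs))
```

Note that Devanagari vowel signs and the virama are separate code points, so a single misread conjunct can count as more than one error under this measure.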