https://github.com/shreeshrii/tesstrain-xsa

Finetune Training and OCR evaluation of Tesseract for Sabaean language in Ancient South Arabian script
https://github.com/shreeshrii/tesstrain-xsa

Last synced: 2 months ago
JSON representation

Finetune Training and OCR evaluation of Tesseract for Sabaean language in Ancient South Arabian script

Host: GitHub
URL: https://github.com/shreeshrii/tesstrain-xsa
Owner: Shreeshrii
License: apache-2.0
Created: 2020-03-07T17:30:39.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2020-03-23T13:23:49.000Z (about 5 years ago)
Last Synced: 2025-01-26T01:46:41.185Z (4 months ago)
Size: 72 MB
Stars: 3
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

        # tesstrain-xsa

Finetune Training and OCR evaluation of Tesseract 5.0.0 Alpha for Ancient South Arabian script using

 [tesstrain Training workflow for Tesseract 4 as a Makefile](https://github.com/tesseract-ocr/tesstrain). Certain file locations and scripts have been modified compared to source repos.

OCR evaluation is done using [ISRI Analytic Tools for OCR Evaluation with UTF-8 support](https://github.com/eddieantonio/ocreval) and  and [The ocrevalUAtion tool](https://sites.google.com/site/textdigitisation/ocrevaluation).

## [best_xsa1 - Ancient South Arabian script - Version 1](https://github.com/Shreeshrii/tesstrain-xsa/releases/tag/best_xsa1)

Replace the top layer training was done using two fonts. The sample training text was scraped via google search.

### Training Steps (links to files as of Version 1)

* Make [training text](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/langdata/xsa.txt)

* List [available fonts that can render the training text](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/langdata/xsa.fontslist.txt)

* Update fonts directory unicodefontdir in [txt2lstmf.sh](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/txt2lstmf.sh)

* Run [txt2lstmf.sh](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/txt2lstmf.sh) to create the images, ground truth and lstmf files in [gt/xsa](https://github.com/Shreeshrii/tesstrain-xsa/tree/best_xsa1/gt/xsa)

* Run [trainlayer.sh](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/trainlayer.sh) to download the starting ara.traineddata and other files and start the training via makefile

* Run [checkpointeval.sh](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/checkpointeval.sh) to evaluate the accuracy of different checkpoints.

* Resulting traineddata file, which can be used as starting model for further training, is at [best_xsa1. traineddata](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1. traineddata).

### Evaluation Results

| Font  	|  Accuracy 	|

|---	|---	|

|Quivira|   83.30%	|

|Segoe_UI_Historic|   81.60% 	|

See [reports](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa1/reports/checkpointeval.txt) for more details. 

## best_xsa2 - Ancient South Arabian script - Version 2

Replace the top layer training was done using four Unicode fonts. The training text was scraped via google search. A small subset was created by copying Latin transcription text from [CSAI Inscriptions](http://dasi.cnr.it/index.php?id=79&prjId=1&corId=5&colId=0&navId=522207406&recId=2149) and converting to Unicode via [a sed script](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa2/langdata/latin2unicode.sh). 

Qataban, one of the fonts used, was [rendering space as a square box with 00 20](langdata/nospace.Qataban.png) in it. 

So, line images for it were created with a wordlist type of [training text with no spaces](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa2/langdata/nospace.training_text) in it. [Training text with spaces](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa2/langdata/xsa.training_text) was used for the [other three fonts](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa2/langdata/xsa.fontslist.txt).

### Evaluation Results

| Font  	|  Accuracy 	|

|---	|---	|

|  Noto_Sans_Old_South_Arabian 	|  95.21% 	|

| Qataban  	| 72.63%  	|

|Quivira|   95.87%	|

|Segoe_UI_Historic|  97.81% 	|

See [reports](https://github.com/Shreeshrii/tesstrain-xsa/blob/best_xsa2/reports/checkpointeval.txt) for more details.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/shreeshrii/tesstrain-xsa

Awesome Lists containing this project

README