https://github.com/amirabbasasadi/shotor
Free Persian Word Level OCR Dataset
dataset image-processing machine-learning nlp ocr
- Host: GitHub
- URL: https://github.com/amirabbasasadi/shotor
- Owner: amirabbasasadi
- Created: 2020-07-25T14:07:13.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2020-08-01T18:44:13.000Z (almost 6 years ago)
- Last Synced: 2023-03-08T10:03:28.360Z (over 3 years ago)
- Topics: dataset, image-processing, machine-learning, nlp, ocr
- Homepage:
- Size: 54.9 MB
- Stars: 15
- Watchers: 2
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Shotor
## Word Level OCR Dataset for Persian Language
Shotor (which means "camel" in Persian) is a free synthetic dataset for word-level OCR.

The current version contains 120,000 grayscale 50×100 images and their corresponding words. The words contain only alphabetic characters.
**Note: To train a robust model, apply augmentations such as scaling, translation, additive noise, and so on to the images.**
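
As a rough illustration of the augmentations mentioned in the note, here is a minimal NumPy sketch that applies random scaling, translation, and additive Gaussian noise to one image. It is not part of the dataset or the linked notebook. The function name `augment`, its parameters, the white fill value of 255, and the reading of 50×100 as height 50 and width 100 are all assumptions made for the example.

```python
import numpy as np


def augment(image, rng, max_shift=4, scale_range=(0.9, 1.1),
            noise_std=8.0, fill=255):
    """Return a randomly scaled, shifted, and noised copy of a 2-D uint8 image.

    Hypothetical helper: parameter names and defaults are illustrative only.
    `fill` is the value used for pixels that fall outside the source image
    (assumed here to be a white background).
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0

    # Draw one random scale factor and one random integer shift per axis.
    scale = rng.uniform(*scale_range)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)

    # Inverse mapping: for each output pixel, find the source pixel
    # (nearest-neighbour sampling, scaling about the image centre).
    src_y = np.round((np.arange(h) - dy - cy) / scale + cy).astype(int)
    src_x = np.round((np.arange(w) - dx - cx) / scale + cx).astype(int)

    # Mark output pixels whose source coordinates lie inside the image.
    valid = ((src_y >= 0) & (src_y < h))[:, None] & \
            ((src_x >= 0) & (src_x < w))[None, :]
    sampled = image[np.clip(src_y, 0, h - 1)[:, None],
                    np.clip(src_x, 0, w - 1)[None, :]]
    out = np.where(valid, sampled, fill).astype(np.float32)

    # Additive Gaussian noise, then clip back to the valid 8-bit range.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)


# Stand-in for one dataset image: white background with a dark block.
rng = np.random.default_rng(0)
image = np.full((50, 100), 255, dtype=np.uint8)
image[15:35, 30:70] = 0
augmented = augment(image, rng)
```

Drawing fresh random parameters on every call means each training epoch sees a slightly different version of the same word image. In practice a library such as OpenCV or torchvision would likely offer smoother interpolation than the nearest-neighbour sampling used here.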
For an example of using the Shotor dataset, see this notebook:
[A simple word level OCR for Persian Language using Pytorch and OpenCV](https://github.com/amirabbasasadi/PersianOCR)
I used these resources to create the word lists:
- Persian Wikipedia
- [Ganjoor Website](https://ganjoor.net/)
- [Persian Spell Checking Data](https://github.com/reza1615/Persian-Spell-checker) by [reza1615](https://github.com/reza1615)
The images have been generated using multiple fonts:
- a few fonts from https://rastikerdar.github.io/
- and some fonts from https://www.fontirani.ir/
Created by: Amirabbas Asadi (amir137825@gmail.com)