https://github.com/hiyali/pilgen
The aim of this repository is to generate datasets (image & its label) for OCR training.
- Host: GitHub
- URL: https://github.com/hiyali/pilgen
- Owner: hiyali
- Created: 2020-02-21T12:36:40.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-22T10:27:39.000Z (about 4 years ago)
- Last Synced: 2024-03-01T04:38:04.123Z (10 months ago)
- Topics: dataset, generate, image-label, ocr, uighur, uyghur
- Language: Python
- Homepage:
- Size: 506 KB
- Stars: 7
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Pilgen - (WIP: optimizing)
Python Image-Label dataset Generator for OCR
## Generate
```bash
python3 generate.py --lang ug --count 100 --out-dir data/
```
This command writes `100` images to the folder `data/images/`. The filename pattern is `'word_{}.jpg'.format(line_num)`, for example:
```
data/images/word_1.jpg
data/images/word_2.jpg
...
data/images/word_100.jpg
```
The command also writes a `gt.txt` file whose lines follow the pattern `'{}\t{}'.format(filepath, word)`, like below:
```
data/images/word_1.jpg ئانا
data/images/word_2.jpg تىلىم
...
data/images/word_100.jpg گۈللە
```
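To consume the generated labels in a training script, the `gt.txt` pattern above can be read back into `(filepath, word)` pairs. The following is a minimal sketch, not part of pilgen itself: the helper names `parse_gt_lines` and `load_gt`, the default location `data/gt.txt`, and the UTF-8 encoding are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def parse_gt_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split '{filepath}\\t{word}' lines into (filepath, word) pairs.

    Hypothetical helper: it assumes one tab between path and label,
    as described by the gt.txt pattern in this README.
    """
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so the label is kept whole.
        filepath, sep, word = line.partition("\t")
        if not sep:
            raise ValueError(f"line {number}: expected a tab between path and word")
        pairs.append((filepath, word))
    return pairs


def load_gt(path: str = "data/gt.txt") -> List[Tuple[str, str]]:
    """Read a label file from disk. The default path and the UTF-8
    encoding are assumptions, not documented pilgen behavior."""
    with open(path, encoding="utf-8") as handle:
        return parse_gt_lines(handle)
```

Calling `load_gt()` after running the generate command should then return one pair per generated image, provided the file really is tab-separated and lives at the assumed path.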
## Supported languages
* [x] ug - Uyghur (Uighur)
* [ ] other languages may come later
## FAQ
1. How do I use my own corpus?
Ref: [#2](https://github.com/hiyali/pilgen/issues/2)
2. Why are Uyghur words separated in the image?
Ref: [#2](https://github.com/hiyali/pilgen/issues/2)
## Test
```bash
python3 test.py
```
## Development environment
* Ubuntu 18.04.1
* Python 3.6.9
## Author
Salam Hiyali
## Contribute
> Feel free
## License
MIT