https://github.com/hiyali/pilgen

The aim of this repository is to generate datasets (image & its label) for OCR training.

Host: GitHub
URL: https://github.com/hiyali/pilgen
Owner: hiyali
Created: 2020-02-21T12:36:40.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-11-22T10:27:39.000Z (almost 5 years ago)
Last Synced: 2024-03-01T04:38:04.123Z (over 1 year ago)
Topics: dataset, generate, image-label, ocr, uighur, uyghur
Language: Python
Homepage:
Size: 506 KB
Stars: 7
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

# Pilgen - (WIP: optimizing)

Python Image-Label dataset Generator for OCR

## Generate

```bash
python3 generate.py --lang ug --count 100 --out-dir data/
```

This command writes `100` images to the folder `data/images/`, with filenames following the pattern `'word_{}.jpg'.format(line_num)`, for example:

```
data/images/word_1.jpg
data/images/word_2.jpg
...
data/images/word_100.jpg
```

It also writes a `gt.txt` file in which each line follows the pattern `'{}\t{}'.format(filepath, word)`, like below:

```
data/images/word_1.jpg	ئانا
data/images/word_2.jpg	تىلىم
...
data/images/word_100.jpg	گۈللە
```
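
For training code that consumes this output, the two patterns above can be read back with a few lines of Python. This is only a minimal sketch and not part of Pilgen: the helper names `read_gt` and `expected_image_path` are hypothetical, and it assumes `gt.txt` is UTF-8 encoded with one tab-separated `filepath`/`word` pair per line.

```python
import os
import tempfile


def expected_image_path(out_dir, line_num):
    """Build an image path following the 'word_{}.jpg' pattern (hypothetical helper)."""
    return "{}/images/word_{}.jpg".format(out_dir.rstrip("/"), line_num)


def read_gt(gt_path):
    """Read a gt.txt-style file into a list of (filepath, word) pairs.

    Assumes UTF-8 text with one '{filepath}\t{word}' entry per line.
    """
    pairs = []
    with open(gt_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore blank lines
            # Split on the first tab only: the path comes first, the label second.
            filepath, word = line.split("\t", 1)
            pairs.append((filepath, word))
    return pairs


# Small self-contained demo: write a two-line sample file, then read it back.
sample_words = ["ئانا", "تىلىم"]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_gt = os.path.join(tmp_dir, "gt.txt")
    with open(sample_gt, "w", encoding="utf-8") as f:
        for i, word in enumerate(sample_words, start=1):
            f.write("{}\t{}\n".format(expected_image_path("data/", i), word))
    pairs = read_gt(sample_gt)

for filepath, word in pairs:
    print(filepath, word)
```

Splitting on the first tab only keeps the label intact even if a word ever contained a tab itself.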

## Supported languages

* [x] ug - Uyghur (Uighur)

* [ ] other languages may come later

## FAQ

1. How do I use my own corpus?

Ref: [#2](https://github.com/hiyali/pilgen/issues/2)

2. Why do Uyghur words appear separated in the image?

Ref: [#2](https://github.com/hiyali/pilgen/issues/2)

## Test

```bash
python3 test.py
```

## Development environment

* Ubuntu 18.04.1

* Python 3.6.9

## Author

Salam Hiyali

## Contribute

> Feel free

## License

MIT
