https://github.com/hiyali/pilgen

The aim of this repository is to generate datasets (image & its label) for OCR training.
dataset generate image-label ocr uighur uyghur


# Pilgen - (WIP: optimizing)

Python Image-Label dataset Generator for OCR

## Generate

```bash
python3 generate.py --lang ug --count 100 --out-dir data/
```

This command writes `100` images to the `data/images/` folder. File names follow the pattern `'word_{}.jpg'.format(line_num)`, for example:
```
data/images/word_1.jpg
data/images/word_2.jpg
...
data/images/word_100.jpg
```

It also writes a `gt.txt` file in which each line follows the pattern `'{}\t{}'.format(filepath, word)` (the file path and the word, separated by a tab), for example:

```
data/images/word_1.jpg ئانا
data/images/word_2.jpg تىلىم
...
data/images/word_100.jpg گۈللە
```
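
To feed the generated data into an OCR training pipeline, the `gt.txt` file has to be read back into (image path, label) pairs. The snippet below is a minimal sketch of one way to do that, not code from this repository. It assumes the file is UTF-8 encoded and tab-separated, as the pattern above describes. The names `parse_gt_line`, `load_ground_truth` and `has_expected_name` are hypothetical helpers introduced here for illustration.

```python
import re
from pathlib import Path

# Assumption taken from the example listing: image names look like word_<number>.jpg.
IMAGE_NAME = re.compile(r"word_(\d+)\.jpg")


def parse_gt_line(line):
    """Split one ground-truth line into (filepath, word) on the first tab."""
    filepath, word = line.rstrip("\r\n").split("\t", 1)
    return filepath, word


def load_ground_truth(gt_path):
    """Read a gt.txt file and return a list of (filepath, word) pairs."""
    pairs = []
    with open(gt_path, encoding="utf-8") as gt_file:
        for line in gt_file:
            if line.strip():  # skip blank lines
                pairs.append(parse_gt_line(line))
    return pairs


def has_expected_name(filepath):
    """Check that the file name matches the word_<number>.jpg pattern."""
    return IMAGE_NAME.fullmatch(Path(filepath).name) is not None
```

A call such as `load_ground_truth("data/gt.txt")` would then return the pairs in file order. The location of `gt.txt` inside the output folder is an assumption here, so adjust the path to wherever the file is actually written.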

## Supported languages

* [x] ug - Uyghur (Uighur)
* [ ] Other languages may be added later

## FAQ

1. How do I use my own corpus?

Ref: [#2](https://github.com/hiyali/pilgen/issues/2)

2. Why are Uyghur words separated in the image?

Ref: [#2](https://github.com/hiyali/pilgen/issues/2)

## Test

```bash
python3 test.py
```

## Development environment

* Ubuntu 18.04.1
* Python 3.6.9

## Author

Salam Hiyali

## Contribute

> Feel free

## License

MIT