https://github.com/tomahim/py-image-dataset-generator
Get a large image dataset with minimal effort by grabbing image through the web and generate new ones by image augmentation.
- Host: GitHub
- URL: https://github.com/tomahim/py-image-dataset-generator
- Owner: tomahim
- License: mit
- Created: 2018-02-08T12:34:11.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-06-29T17:30:56.000Z (over 4 years ago)
- Last Synced: 2024-08-02T15:38:02.250Z (3 months ago)
- Language: Python
- Homepage:
- Size: 62.5 KB
- Stars: 213
- Watchers: 8
- Forks: 41
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Image dataset generator for Deep learning projects
[![Join the chat at https://gitter.im/py-image-dataset-generator/Lobby](https://badges.gitter.im/py-image-dataset-generator/Lobby.svg)](https://gitter.im/py-image-dataset-generator/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
### Get a large image dataset with minimal effort
This tool **automatically collects images** from Google or Bing and optionally resizes them.
```
python download.py "funny cats" -limit=100 -dest=folder_name -resize=250x250
```

Then you can **randomly generate new images** with image augmentation from an existing folder. It will add noise and apply rotations, transformations, flips, and blur to random images.
```
python augmentation.py -folder=my_folder/funny_cats -limit=10000
```

TADA! In a few seconds you will get 10,000 different images of funny cats to train your favorite deep learning algorithm!
### Table of contents
* [Pre-requirements](#pre-requirements)
* [Installation](#installation)
* [Run unit tests](#run-unit-tests)
* [Usage](#usage)
* [Download images](#download-images-from-the-web)
* [Image augmentation](#image-augmentation)
* [Create a custom image augmentation pipeline](#create-a-custom-image-augmentation-pipeline)
* [Common issues](#common-issues)
* [Acknowledgments](#acknowledgments)

### Pre-requirements
This project is tested with Python 3.6.4 and above.
*Linux*
- chromium-browser package (`sudo apt-get install chromium-browser`)
*Windows*
- Chrome should be installed
- [Microsoft Visual C++ Build Tools](https://www.scivision.co/python-windows-visual-c++-14-required/) (scikit-image dependency, [see here for more info](https://www.scivision.co/python-windows-visual-c++-14-required/))

### Installation
Clone the project with Git.
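For example, using the repository URL shown above:
```
git clone https://github.com/tomahim/py-image-dataset-generator.git
cd py-image-dataset-generator
```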
Then install the Python dependencies:
```
pip install -r requirements.txt
```

### Run unit tests
```
python -m unittest discover
```

### Usage
#### Download images from the web
```
python download.py "red car" -limit=150 -dest=folder_name -resize=250x250
```
After running this command, you will have 150 images of *red cars* (resized to 250px by 250px) in the /folder_name/red_car folder.

You can find all possible parameters in the table below (also available with the `--help` parameter):
Parameters | Description
--- | ---
Keyword *(required)* | The first parameter should be a keyword describing the images to search for.<br>`python download.py "red car"`
Destination folder<br>*-dest or -d* | Specify the destination folder to save files (default: images/).<br>`python download.py "red car" -dest=your_folder`
Limit number<br>*-limit or -l* | Specify the number of files to download (default: 50). See the note below for the maximum limit.<br>`python download.py "red car" -limit=200`
Thumbnail only<br>*-thumbnail or -thumb* | Download the thumbnail instead of the full original image.<br>`python download.py "red car" -thumbnail`
Resize image<br>*-resize* | Resize downloaded images on the fly so the whole dataset has the same size (default: no resizing). The parameter should be two numbers representing the width and height (32x32 will output 32px x 32px image files).<br>`python download.py "red car" -resize=32x32`
Grab source<br>*-source, -src or -allsources* | Choose the website(s) to grab images from: Google and/or Bing (default: Google). The *-allsources* parameter can be used to mix image files equally from all available sources.<br>`python download.py "red car" -source=Google` (single source)<br>`python download.py "red car" -source=Google -source=Bing` (multiple sources)<br>`python download.py "red car" -allsources` (all sources)

Note: there are known limitations on the total number of images you can download in one run of the `download.py` script. Bing and Google won't let you download more than 800 images each, so the maximum for one download is around 1,600 images when using the `-allsources` parameter.
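For instance, several of the parameters above can be combined in a single call; the values below are only illustrative:
```
python download.py "red car" -limit=800 -dest=dataset -resize=64x64 -allsources
```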
#### Image augmentation
```
python augmentation.py -folder=your_folder -limit=10000
```

The 10,000 augmented images will be written by default to the "output" folder inside your image folder.
By default, this command will randomly apply these image transformations :
- Blur image (with a probability of 10%)
- Add random noise (with a probability of 50%)
- Horizontal flip (with a probability of 30%)
- Left or right rotation between 0 and 25 degrees (with a probability of 50%)
- *... to be completed*
You can customize these default values by editing the `augmentation_config.py` file or by making [your own image augmentation pipeline](#create-a-custom-image-augmentation-pipeline).
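To make these defaults concrete, here is a rough conceptual sketch of the same kinds of transformations written with scikit-image (already a project dependency) and NumPy. This is not the repo's internal code, just an illustration of what each step does:
```python
# Conceptual sketch only (not this repo's implementation): apply the default
# transformations with their documented probabilities using scikit-image.
import random

import numpy as np
from skimage import filters, transform, util


def augment_once(image):
    """Randomly blur, add noise, flip, or rotate a single image (H x W x C array)."""
    if random.random() < 0.10:  # blur image (10%)
        image = filters.gaussian(image, sigma=1, channel_axis=-1)
    if random.random() < 0.50:  # add random noise (50%)
        image = util.random_noise(image)
    if random.random() < 0.30:  # horizontal flip (30%)
        image = np.fliplr(image)
    if random.random() < 0.50:  # left or right rotation up to 25 degrees (50%)
        image = transform.rotate(image, angle=random.uniform(-25, 25), mode="edge")
    return image
```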
You can find all possible parameters in the table below (also available with the `--help` parameter):
Parameters | Description
--- | ---
Folder *(required)*<br>*-folder* | Input folder path containing the images that will be augmented.
Destination folder<br>*-dest or -d* | Specify the destination folder to save augmented files (default: /your_folder/output).<br>`python augmentation.py -folder=your_folder -limit=50 -dest=other_folder`
Limit number<br>*-limit or -l* | Number of images to generate by augmentation (default: 50).

#### Create a custom image augmentation pipeline
```python
from augmentation.augmentation import DatasetGenerator

pipeline = DatasetGenerator(
folder_path="images/red_car/",
num_files=5000,
save_to_disk=True,
folder_destination="images/red_car/results"
)
pipeline.rotate(probability=0.5, max_left_degree=25, max_right_degree=25)
pipeline.random_noise(probability=0.5)
pipeline.blur(probability=0.5)
pipeline.vertical_flip(probability=0.1)
pipeline.horizontal_flip(probability=0.2)
pipeline.resize(probability=1, width=20, height=20)
pipeline.execute()
```

That's it!
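As a usage note, the default behaviour described earlier can also be reproduced with this pipeline API; the method names and parameters below come from the example above, while the folder paths are placeholders:
```python
from augmentation.augmentation import DatasetGenerator

# Reproduce the documented default transformations with the pipeline API.
# Folder paths are placeholders; probabilities match the defaults listed above.
pipeline = DatasetGenerator(
    folder_path="images/funny_cats/",
    num_files=10000,
    save_to_disk=True,
    folder_destination="images/funny_cats/output"
)
pipeline.blur(probability=0.1)
pipeline.random_noise(probability=0.5)
pipeline.horizontal_flip(probability=0.3)
pipeline.rotate(probability=0.5, max_left_degree=25, max_right_degree=25)
pipeline.execute()
```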
### Common issues
**WebDriverException: Message: unknown error: cannot find Chrome binary**
Make sure chromedriver is installed and available on your PATH (on Linux, run `which chromedriver` and check `echo $PATH`). Chrome must also be installed on your machine (or the `chromium-browser` package on Linux).
You can install the chromedriver with this command ([more information here](https://pypi.python.org/pypi/chromedriver_installer)):
`pip install chromedriver_installer --install-option="--chromedriver-version=2.35"`

**error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": [http://landinghub.visualstudio.com/visual-cpp-build-tools](https://www.scivision.co/python-windows-visual-c++-14-required/)**
As this repo uses scikit-image for image processing, on Windows you need the Microsoft Visual C++ Build Tools, which are provided with Visual Studio (be sure to check the C++ options during installation). You can install them with the link above.
### Acknowledgments
- This repo is *largely inspired* by the work of Marcus Bloice on his [Augmentor](https://arxiv.org/abs/1708.04680) project. Many thanks for the great work and the useful documentation.
- I also picked up some ideas from [this great series of articles](https://www.pyimagesearch.com/2017/12/11/image-classification-with-keras-and-deep-learning/) for the *automatic* image-grabbing part.
The goal of this repo is mainly to provide as small a Python library as possible for generating an image dataset, without a big framework like Keras, TFLearn, etc., which can be hard to configure and install for people new to Data Science / AI.