https://github.com/tomahim/py-image-dataset-generator
Get a large image dataset with minimal effort by grabbing image through the web and generate new ones by image augmentation.
- Host: GitHub
- URL: https://github.com/tomahim/py-image-dataset-generator
- Owner: tomahim
- License: mit
- Created: 2018-02-08T12:34:11.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-06-29T17:30:56.000Z (over 4 years ago)
- Last Synced: 2024-08-02T15:38:02.250Z (3 months ago)
- Language: Python
- Homepage:
- Size: 62.5 KB
- Stars: 213
- Watchers: 8
- Forks: 41
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Image dataset generator for Deep learning projects
[![Join the chat at https://gitter.im/py-image-dataset-generator/Lobby](https://badges.gitter.im/py-image-dataset-generator/Lobby.svg)](https://gitter.im/py-image-dataset-generator/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
### Get a large image dataset with minimal effort
This tool **automatically collects images** from Google or Bing and optionally resizes them.
```
python download.py "funny cats" -limit=100 -dest=folder_name -resize=250x250
```

Then you can **randomly generate new images** with image augmentation from an existing folder. It will add noise and apply rotations, transformations, flips, and blur to random images.
```
python augmentation.py -folder=my_folder/funny_cats -limit=10000
```

TADA! In a few seconds you will get 10,000 different images of funny cats to train your favorite deep learning algorithm!
### Table of contents
* [Pre-requirements](#pre-requirements)
* [Installation](#installation)
* [Run unit tests](#run-unit-tests)
* [Usage](#usage)
* [Download images](#download-images-from-the-web)
* [Image augmentation](#image-augmentation)
* [Create a custom image augmentation pipeline](#create-a-custom-image-augmentation-pipeline)
* [Common issues](#common-issues)
* [Acknowledgments](#acknowledgments)

### Pre-requirements
This project is tested with Python 3.6.4 and above.
*Linux*
- chromium-browser package (`sudo apt-get install chromium-browser`)
*Windows*
- Chrome should be installed
- [Microsoft Visual C++ Build Tools](https://www.scivision.co/python-windows-visual-c++-14-required/) (scikit-image dependency, [see here for more info](https://www.scivision.co/python-windows-visual-c++-14-required/))

### Installation
Clone the project with Git.
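For example, using the repository URL shown above:
```
git clone https://github.com/tomahim/py-image-dataset-generator.git
cd py-image-dataset-generator
```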
Then install the Python dependencies:
```
pip install -r requirements.txt
```

### Run unit tests
```
python -m unittest discover
```

### Usage
#### Download images from the web
```
python download.py "red car" -limit=150 -dest=folder_name -resize=250x250
```
After running this command, you will have 150 images of *red cars* (resized to 250px by 250px) in the /folder_name/red_car folder.

You can find all possible parameters in the table below (also available with the `--help` parameter):
Parameters | Description
--- | ---
Keyword *(required)* | The first parameter should be a keyword describing the images to search for.<br>`python download.py "red car"`
Destination folder<br>*-dest or -d* | Specify the destination folder to save files (default: images/).<br>`python download.py "red car" -dest=your_folder`
Limit number<br>*-limit or -l* | Specify the number of files to download (default: 50). See the note below for the maximum limit.<br>`python download.py "red car" -limit=200`
Thumbnail only<br>*-thumbnail or -thumb* | Download the thumbnail instead of the full original image.<br>`python download.py "red car" -thumbnail`
Resize image<br>*-resize* | Resize downloaded images on the fly so the whole dataset has the same size (default: no resizing). The parameter should be two numbers representing the width and height (32x32 will output 32px x 32px image files).<br>`python download.py "red car" -resize=32x32`
Grab source<br>*-source, -src or -allsources* | Choose the website(s) to grab images from: Google and/or Bing (default: Google). The *-allsources* parameter can be used to mix image files equally from all available sources.<br>`python download.py "red car" -source=Google` (single source)<br>`python download.py "red car" -source=Google -source=Bing` (multiple sources)<br>`python download.py "red car" -allsources` (all sources)

Note: there are known limitations on the total number of images you can download in one run of the `download.py` script. Bing and Google won't let you download more than 800 images each, so the maximum for one download is around 1,600 images when using the `-allsources` parameter.
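For instance, several of the parameters above can be combined in a single call; the values below are only illustrative:
```
python download.py "red car" -limit=800 -dest=dataset -resize=64x64 -allsources
```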
#### Image augmentation
```
python augmentation.py -folder=your_folder -limit=10000
```

The 10,000 augmented images will be written by default to the "output" folder inside your image folder.
By default, this command will randomly apply these image transformations :
- Blur image (with a probability of 10%)
- Add random noise (with a probability of 50%)
- Horizontal flip (with a probability of 30%)
- Left or right rotation between 0 and 25 degrees (with a probability of 50%)
- *... to be completed*
You can customize these default values by editing the `augmentation_config.py` file or by making [your own image augmentation pipeline](#create-a-custom-image-augmentation-pipeline).
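To make these defaults concrete, here is a rough conceptual sketch of the same kinds of transformations written with scikit-image (already a project dependency) and NumPy. This is not the repo's internal code, just an illustration of what each step does:
```python
# Conceptual sketch only (not this repo's implementation): apply the default
# transformations with their documented probabilities using scikit-image.
import random

import numpy as np
from skimage import filters, transform, util


def augment_once(image):
    """Randomly blur, add noise, flip, or rotate a single image (H x W x C array)."""
    if random.random() < 0.10:  # blur image (10%)
        image = filters.gaussian(image, sigma=1, channel_axis=-1)
    if random.random() < 0.50:  # add random noise (50%)
        image = util.random_noise(image)
    if random.random() < 0.30:  # horizontal flip (30%)
        image = np.fliplr(image)
    if random.random() < 0.50:  # left or right rotation up to 25 degrees (50%)
        image = transform.rotate(image, angle=random.uniform(-25, 25), mode="edge")
    return image
```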
You can find all possible parameters in the table below (also available with the `--help` parameter):
Parameters | Description
--- | ---
Folder *(required)*<br>*-folder* | Input folder path containing the images that will be augmented.
Destination folder<br>*-dest or -d* | Specify the destination folder to save augmented files (default: /your_folder/output).<br>`python augmentation.py -folder=your_folder -limit=50 -dest=other_folder`
Limit number<br>*-limit or -l* | Number of images to generate by augmentation (default: 50).

#### Create a custom image augmentation pipeline
```python
from augmentation.augmentation import DatasetGenerator

pipeline = DatasetGenerator(
folder_path="images/red_car/",
num_files=5000,
save_to_disk=True,
folder_destination="images/red_car/results"
)
pipeline.rotate(probability=0.5, max_left_degree=25, max_right_degree=25)
pipeline.random_noise(probability=0.5)
pipeline.blur(probability=0.5)
pipeline.vertical_flip(probability=0.1)
pipeline.horizontal_flip(probability=0.2)
pipeline.resize(probability=1, width=20, height=20)
pipeline.execute()
```

That's it!
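As a usage note, the default behaviour described earlier can also be reproduced with this pipeline API; the method names and parameters below come from the example above, while the folder paths are placeholders:
```python
from augmentation.augmentation import DatasetGenerator

# Reproduce the documented default transformations with the pipeline API.
# Folder paths are placeholders; probabilities match the defaults listed above.
pipeline = DatasetGenerator(
    folder_path="images/funny_cats/",
    num_files=10000,
    save_to_disk=True,
    folder_destination="images/funny_cats/output"
)
pipeline.blur(probability=0.1)
pipeline.random_noise(probability=0.5)
pipeline.horizontal_flip(probability=0.3)
pipeline.rotate(probability=0.5, max_left_degree=25, max_right_degree=25)
pipeline.execute()
```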
### Common issues
**WebDriverException: Message: unknown error: cannot find Chrome binary**
Make sure chromedriver is installed and available on your PATH (on Linux, run `which chromedriver` and check `echo $PATH`). Chrome must also be installed on your machine (or the `chromium-browser` package on Linux).
You can install the chromedriver with this command ([more information here](https://pypi.python.org/pypi/chromedriver_installer)):
`pip install chromedriver_installer --install-option="--chromedriver-version=2.35"`

**error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": [http://landinghub.visualstudio.com/visual-cpp-build-tools](https://www.scivision.co/python-windows-visual-c++-14-required/)**
As this repo uses scikit-image for image processing, on Windows you need the Microsoft Visual C++ Build Tools, which are provided with Visual Studio (be sure to check the C++ options during installation). You can install them with the link above.
### Acknowledgments
- This repo is *largely inspired* by the work of Marcus Bloice on his [Augmentor](https://arxiv.org/abs/1708.04680) project. Many thanks for the great work and the useful documentation.
- I also picked up some ideas from [this great series of articles](https://www.pyimagesearch.com/2017/12/11/image-classification-with-keras-and-deep-learning/) for the *automatic* image-grabbing part.
The goal of this repo is mainly to provide as small a Python library as possible for generating an image dataset, without a big framework like Keras, TFLearn, etc., which can be hard to configure and install for people new to Data Science / AI.