https://github.com/stratospark/keras-multiprocess-image-data-generator

Accelerating Deep Learning with Multiprocess Image Augmentation in Keras
Host: GitHub
URL: https://github.com/stratospark/keras-multiprocess-image-data-generator
Owner: stratospark
License: mit
Created: 2017-02-01T00:02:50.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2018-06-19T22:14:54.000Z (about 7 years ago)
Last Synced: 2025-03-29T04:16:30.810Z (3 months ago)
Topics: deep-learning, keras, multiprocessing, tensorflow
Language: Jupyter Notebook
Homepage: http://blog.stratospark.com/multiprocess-image-augmentation-keras.html
Size: 13 MB
Stars: 317
Watchers: 7
Forks: 67
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

        
# Accelerating Deep Learning with Multiprocess Image Augmentation in Keras

![Benchmark Results](./results.png)

**Code available @ https://github.com/stratospark/keras-multiprocess-image-data-generator**

* [Introduction](#Introduction)

* [Benchmark: CIFAR10 - In Memory Performance, Image Generation Only](#Benchmark:-CIFAR10---In-Memory-Performance,-Image-Generation-Only)

* [Benchmark: CIFAR10 - In Memory Performance, Image Generation with GPU Training](#Benchmark:-CIFAR10---In-Memory-Performance,-Image-Generation-with-GPU-Training)

* [Benchmark: Dogs vs. Cats - On Disk Performance, Image Generation with GPU Training](#Benchmark:-Dogs-vs.-Cats---On-Disk-Performance,-Image-Generation-witih-GPU-Training)

## Introduction

**TLDR: By adding multiprocessing support to Keras ImageDataGenerator, benchmarking on a 6-core i7-6850K and 12GB TITAN X Pascal: 3.5x speedup of training with image augmentation on in memory datasets, 3.9x speedup of training with image augmentation on datasets streamed from disk.**

When exploring deep learning models, fast training matters for more than the final training run. Accelerating training speed means more network models can be tried and more hyperparameter settings can be explored in the same amount of time. **The more that we can experiment, the better our results can become.**

In my experience with [training a moderately sized network](http://blog.stratospark.com/deep-learning-applied-food-classification-deep-learning-keras.html) on my home desktop, I found one bottleneck to be creating additional images to augment my dataset. Keras provides an [ImageDataGenerator](https://keras.io/preprocessing/image/) class that can take images, in memory or on disk, and create many different variations based on a set of parameters: rotations, flips, zooms, altering colors, etc. For reference, here is a [great tutorial](https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html) on improving network accuracy with image augmentation.

![cat images augmented](https://blog.keras.io/img/imgclf/cat_data_augmentation.png)

While training my initial models, I was waiting upwards of an entire day to see enough results to decide what to change. I saw that I was taking nowhere near full advantage of my CPU or GPU. As a result, I decided to add some Python multiprocessing support to a fork of ImageDataGenerator. I was able to drastically cut my training time and was finally able to steer my experiments in the right direction!

For reference, I am using:

* Intel Core i7-6850K

* NVIDIA TITAN X Pascal 12GB

* 96GB RAM

* 64-bit Ubuntu 16.04

* Python 2.7.13 :: Continuum Analytics, Inc.

* Keras 1.2.1

* Tensorflow 0.12.1

You can use the multiprocessing-enabled ImageDataGenerator that is included with this repo as a drop-in replacement for the version that currently ships with Keras. If it makes sense, the code may get incorporated into the main branch at some point.

```python
import numpy as np
import pandas as pd
import keras as K
import matplotlib.pyplot as plt
import multiprocessing
import time
import collections
import sys
import signal
%matplotlib inline
```

    Using TensorFlow backend.

```python
# The original class can be imported like this:
# from keras.preprocessing.image import ImageDataGenerator

# We access the modified version through T.ImageDataGenerator
import tools.image as T

# Useful for checking the output of the generators after code change
try:
    # Python 3: reload lives in importlib
    from importlib import reload
except ImportError:
    # Python 2: reload is a builtin
    pass
reload(T)
```

These are helper methods used throughout the notebook.

```python
def preprocess_img(img):
    # Scale pixel values from [0, 255] to [-1, 1]
    img = img.astype(np.float32) / 255.0
    img -= 0.5
    return img * 2
```
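As a quick sanity check on `preprocess_img`, the sketch below repeats the helper so it runs on its own and applies it to a tiny array. The 2x2 `sample` array is a made-up input for illustration, not data from the notebook.

```python
import numpy as np

def preprocess_img(img):
    # Same arithmetic as the helper above: [0, 255] -> [0, 1] -> [-0.5, 0.5] -> [-1, 1]
    img = img.astype(np.float32) / 255.0
    img -= 0.5
    return img * 2

# Hypothetical 2x2 single-channel "image" covering both ends of the uint8 range
sample = np.array([[0, 255], [51, 204]], dtype=np.uint8)
scaled = preprocess_img(sample)

print(scaled.dtype)
print(scaled.min(), scaled.max())
```

Because `astype` returns a copy, the input array is left untouched, which matters when the same in-memory dataset is reused across benchmark rounds.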

```python
def plot_images(img_gen, title):
    fig, ax = plt.subplots(6, 6, figsize=(10, 10))
    plt.suptitle(title, size=32)
    plt.setp(ax, xticks=[], yticks=[])
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    # Only the first batch is plotted; the default batch holds 32 images
    for (imgs, labels) in img_gen:
        for i in range(6):
            for j in range(6):
                if i*6 + j < 32:
                    ax[i][j].imshow(imgs[i*6 + j])
        break
```

## Benchmark: CIFAR10 - In Memory Performance, Image Generation Only

[CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) is a toy dataset that includes 50,000 training images and 10,000 test images of shape 32x32x3.

It includes the following 10 classes: **airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck**

```python
from keras.datasets.cifar10 import load_data
from keras.utils.np_utils import to_categorical

(X_train, y_train), (X_test, y_test) = load_data()

# One-hot encode the integer class labels
y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)
```

Here is an example of how to set up a `multiprocessing.Pool` and pass it as an argument to the ImageDataGenerator constructor. This is the only change to the class's public interface. If you leave out the `pool` parameter or set it to `None`, the generator will operate in its original single-process mode.

```python
# Clean up a pool left over from a previous run of this cell, if any
try:
    pool.terminate()
except (NameError, AttributeError):
    pass

n_process = 4

pool = multiprocessing.Pool(processes=n_process)
start = time.time()
gen = T.ImageDataGenerator(
     featurewise_center=False,
     samplewise_center=False,
     featurewise_std_normalization=False,
     samplewise_std_normalization=False,
     zca_whitening=False,
     rotation_range=45,
     width_shift_range=.1,
     height_shift_range=.1,
     shear_range=0.,
     zoom_range=0,
     channel_shift_range=0,
     fill_mode='nearest',
     cval=0.,
     horizontal_flip=True,
     vertical_flip=False,
     rescale=1/255.,
     #preprocessing_function=preprocess_img, # disable for nicer visualization
     dim_ordering='default',
     pool=pool # <-------------- Only change needed!
)

gen.fit(X_train)
X_train_aug = gen.flow(X_train, y_train_cat, seed=0)

print('{} process, duration: {}'.format(n_process, time.time() - start))
plot_images(X_train_aug, 'Augmented Images generated with {} processes'.format(n_process))

pool.terminate()
```

    4 process, duration: 0.0404160022736

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_18_1.png)
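For intuition about what a pool-backed generator does, here is a minimal sketch. It is not the code in `tools/image.py`: the names `augment_one` and `augment_batch`, the flip-only transform, and the per-image seeds are all assumptions made for illustration. It only shows the general pattern of handing per-image work to a `multiprocessing.Pool` and falling back to a plain loop when `pool` is `None`.

```python
import multiprocessing
import numpy as np

def augment_one(args):
    # Each task carries its own seed, so the result does not depend on
    # which worker process happens to pick it up
    img, seed = args
    rng = np.random.RandomState(seed)
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]  # horizontal flip along the width axis
    return img

def augment_batch(images, seeds, pool=None):
    tasks = list(zip(images, seeds))
    if pool is None:
        results = [augment_one(t) for t in tasks]  # single-process path
    else:
        results = pool.map(augment_one, tasks)     # work split across workers
    return np.stack(results)

if __name__ == '__main__':
    # Random stand-in batch with the CIFAR10 image shape
    batch = np.random.RandomState(0).randint(0, 256, size=(8, 32, 32, 3)).astype(np.uint8)
    seeds = list(range(8))

    serial = augment_batch(batch, seeds)
    pool = multiprocessing.Pool(processes=2)
    try:
        parallel = augment_batch(batch, seeds, pool=pool)
    finally:
        pool.terminate()

    # Both paths should agree because the randomness is tied to the seeds
    print(np.array_equal(serial, parallel))
```

Passing an explicit seed with every task is one way to keep augmentation reproducible across processes. Forked workers would otherwise start from a copy of the parent's random state.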

Now that we have verified that the images are being properly generated with multiple processes, we want to benchmark how the number of processes affects performance. Ideally, we would like to see speedups scale linearly with the number of processes added. However, as explained by [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law), the achievable speedup is limited by the portion of the work that cannot be parallelized, and the overhead of coordinating processes adds further diminishing returns.
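To make the diminishing returns concrete, the sketch below evaluates Amdahl's formula, S(n) = 1 / ((1 - p) + p / n), for a few worker counts. The parallel fraction `p = 0.9` is an illustrative assumption, not a value measured in this notebook, and the formula ignores inter-process communication cost.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    # Amdahl's Law: S(n) = 1 / ((1 - p) + p / n)
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# p = 0.9 is an assumed value chosen only to illustrate the shape of the curve
for n in (1, 2, 4, 8, 12):
    print('{:2d} workers -> {:.2f}x'.format(n, amdahl_speedup(0.9, n)))
```

With 10% of the work serial, the speedup can never exceed 10x no matter how many workers are added, which is why the curve flattens well before the core count is reached.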

The following benchmark will first test image augmentation without multiprocessing, then do a test for an increasing number of processes, up to a max of the number of logical CPUs your system has. It does multiple rounds of these tests so that we may average the results.

```python
durs = collections.defaultdict(list)

# Fall back to 2 if the number of logical CPUs cannot be determined
num_cores = 2
try:
    num_cores = multiprocessing.cpu_count()
except NotImplementedError:
    pass

for j in range(10):
    print('Round', j)

    # num_p == 0 means no pool, i.e. the original single-process mode
    for num_p in range(0, num_cores + 1):
        pool = None
        if num_p > 0:
            pool = multiprocessing.Pool(processes=num_p)

        start = time.time()
        gen = T.ImageDataGenerator(
             featurewise_center=False,
             samplewise_center=False,
             featurewise_std_normalization=False,
             samplewise_std_normalization=False,
             zca_whitening=False,
             rotation_range=45,
             width_shift_range=.1,
             height_shift_range=.1,
             shear_range=0.,
             zoom_range=0,
             channel_shift_range=0,
             fill_mode='nearest',
             cval=0.,
             horizontal_flip=True,
             vertical_flip=False,
             rescale=None,
             preprocessing_function=preprocess_img,
             dim_ordering='default',
             pool=pool
        )

        gen.fit(X_train)
        X_train_aug = gen.flow(X_train, y_train_cat, seed=0)

        # Pull a fixed number of batches so every configuration does the same work
        for i, (imgs, labels) in enumerate(X_train_aug):
            if i == 1000:
                break

        dur = time.time() - start
        #print(num_p, dur)
        sys.stdout.write('{}: {} ... '.format(num_p, dur))
        sys.stdout.flush()

        durs[num_p].append(dur)

        if pool:
            pool.terminate()
```

    ('Round', 0)

    0: 6.84576511383 ... 1: 9.6486890316 ... 2: 6.03799390793 ... 3: 4.88081693649 ... 4: 4.66870999336 ... 5: 3.70913481712 ... 6: 3.27630805969 ... 7: 3.48509907722 ... 8: 3.64657878876 ... 9: 3.74150896072 ... 10: 3.57441878319 ... 11: 3.60130214691 ... 12: 3.47499299049 ... ('Round', 1)

    0: 6.75701498985 ... 1: 9.94960093498 ... 2: 5.64250087738 ... 3: 5.06900811195 ... 4: 4.61409282684 ... 5: 4.57506585121 ... 6: 3.48270392418 ... 7: 3.51494693756 ... 8: 3.88235402107 ... 9: 3.62926697731 ... 10: 3.91224503517 ... 11: 3.59025716782 ... 12: 3.5045068264 ... ('Round', 2)

    0: 6.90472793579 ... 1: 9.55179905891 ... 2: 6.57418012619 ... 3: 5.2566280365 ... 4: 4.55560803413 ... 5: 4.45380306244 ... 6: 3.54513192177 ... 7: 3.21149206161 ... 8: 3.78789710999 ... 9: 3.67751908302 ... 10: 3.74882698059 ... 11: 3.98700881004 ... 12: 3.64187002182 ... ('Round', 3)

    0: 6.82807612419 ... 1: 9.48674917221 ... 2: 5.57596802711 ... 3: 4.74470591545 ... 4: 4.18711090088 ... 5: 3.89195489883 ... 6: 3.22924613953 ... 7: 3.17622900009 ... 8: 4.07523298264 ... 9: 3.59954690933 ... 10: 3.7366130352 ... 11: 3.52489495277 ... 12: 3.82451415062 ... ('Round', 4)

    0: 6.73704409599 ... 1: 9.2156291008 ... 2: 6.23566198349 ... 3: 5.13580393791 ... 4: 4.71229195595 ... 5: 3.35283398628 ... 6: 3.24846291542 ... 7: 3.79010605812 ... 8: 3.74294400215 ... 9: 3.76095604897 ... 10: 3.7142059803 ... 11: 3.54178500175 ... 12: 3.72024703026 ... ('Round', 5)

    0: 6.75245904922 ... 1: 10.7912859917 ... 2: 6.79878306389 ... 3: 4.67795395851 ... 4: 4.7692129612 ... 5: 3.99766302109 ... 6: 3.45177388191 ... 7: 3.30268979073 ... 8: 3.92767882347 ... 9: 3.69342398643 ... 10: 3.52480602264 ... 11: 3.46998000145 ... 12: 3.60531187057 ... ('Round', 6)

    0: 6.94973492622 ... 1: 9.72229290009 ... 2: 6.76698184013 ... 3: 5.28792905807 ... 4: 4.44634389877 ... 5: 4.34274101257 ... 6: 3.94904899597 ... 7: 3.34885692596 ... 8: 3.69488501549 ... 9: 3.87995219231 ... 10: 3.78279495239 ... 11: 3.49752092361 ... 12: 3.56351184845 ... ('Round', 7)

    0: 6.71522402763 ... 1: 10.2026801109 ... 2: 6.04175400734 ... 3: 5.20836210251 ... 4: 4.35653805733 ... 5: 4.39560294151 ... 6: 3.74392104149 ... 7: 3.19262504578 ... 8: 3.89874505997 ... 9: 3.41301083565 ... 10: 3.79124188423 ... 11: 3.90449810028 ... 12: 3.74271798134 ... ('Round', 8)

    0: 6.8355588913 ... 1: 9.49789810181 ... 2: 5.33640003204 ... 3: 5.41973185539 ... 4: 4.42942810059 ... 5: 4.30604100227 ... 6: 3.22810721397 ... 7: 3.24005103111 ... 8: 3.61394405365 ... 9: 3.50949716568 ... 10: 3.62207698822 ... 11: 3.84033894539 ... 12: 3.85311603546 ... ('Round', 9)

    0: 6.74057507515 ... 1: 10.3358399868 ... 2: 6.02810311317 ... 3: 5.41968894005 ... 4: 4.69001197815 ... 5: 3.6060628891 ... 6: 3.84348988533 ... 7: 3.67217493057 ... 8: 4.02522802353 ... 9: 3.74887800217 ... 10: 4.08099198341 ... 11: 3.81078886986 ... 12: 3.46359109879 ... 

```python

df = pd.DataFrame(durs)

df

```



  

    

      

| Round | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.845765 | 9.648689 | 6.037994 | 4.880817 | 4.668710 | 3.709135 | 3.276308 | 3.485099 | 3.646579 | 3.741509 | 3.574419 | 3.601302 | 3.474993 |
| 1 | 6.757015 | 9.949601 | 5.642501 | 5.069008 | 4.614093 | 4.575066 | 3.482704 | 3.514947 | 3.882354 | 3.629267 | 3.912245 | 3.590257 | 3.504507 |
| 2 | 6.904728 | 9.551799 | 6.574180 | 5.256628 | 4.555608 | 4.453803 | 3.545132 | 3.211492 | 3.787897 | 3.677519 | 3.748827 | 3.987009 | 3.641870 |
| 3 | 6.828076 | 9.486749 | 5.575968 | 4.744706 | 4.187111 | 3.891955 | 3.229246 | 3.176229 | 4.075233 | 3.599547 | 3.736613 | 3.524895 | 3.824514 |
| 4 | 6.737044 | 9.215629 | 6.235662 | 5.135804 | 4.712292 | 3.352834 | 3.248463 | 3.790106 | 3.742944 | 3.760956 | 3.714206 | 3.541785 | 3.720247 |
| 5 | 6.752459 | 10.791286 | 6.798783 | 4.677954 | 4.769213 | 3.997663 | 3.451774 | 3.302690 | 3.927679 | 3.693424 | 3.524806 | 3.469980 | 3.605312 |
| 6 | 6.949735 | 9.722293 | 6.766982 | 5.287929 | 4.446344 | 4.342741 | 3.949049 | 3.348857 | 3.694885 | 3.879952 | 3.782795 | 3.497521 | 3.563512 |
| 7 | 6.715224 | 10.202680 | 6.041754 | 5.208362 | 4.356538 | 4.395603 | 3.743921 | 3.192625 | 3.898745 | 3.413011 | 3.791242 | 3.904498 | 3.742718 |
| 8 | 6.835559 | 9.497898 | 5.336400 | 5.419732 | 4.429428 | 4.306041 | 3.228107 | 3.240051 | 3.613944 | 3.509497 | 3.622077 | 3.840339 | 3.853116 |
| 9 | 6.740575 | 10.335840 | 6.028103 | 5.419689 | 4.690012 | 3.606063 | 3.843490 | 3.672175 | 4.025228 | 3.748878 | 4.080992 | 3.810789 | 3.463591 |

    

  



```python
# Average duration per process count across all rounds
df_mean = pd.DataFrame(df.mean(axis=0))

plt.figure(figsize=(10,5))
plt.plot(df_mean, marker='o')
plt.xlabel('# Processes')
plt.ylabel('Seconds')
plt.title('Image Augmentation time vs. # Processes')
```

    

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_22_1.png)

```python
# Speedup relative to the single-process baseline (0 processes)
speedups = 1 / (df_mean / df_mean[0][0])

plt.figure(figsize=(10,5))
plt.plot(speedups, marker='o')
plt.xlabel('# Processes')
plt.ylabel('Speedup')
plt.hlines(1, -1, df_mean.shape[0], colors='red', linestyles='dashed')
plt.title('Image Augmentation speedup vs. # Processes')
```

    

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_23_1.png)

```python
best_ix = np.argmax(speedups.values)
print('Best speedup: {0:.2f}x with {1} processes.'.format(speedups.values[best_ix][0], best_ix))
```

    Best speedup: 2.01x with 7 processes.

As we can see, we are able to cut image generation time in half. However, does the speedup remain when we are also sending the images to the GPU for network training?
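As a side note, the same timings can be read as parallel efficiency, meaning speedup divided by the number of processes. The sketch below uses a handful of Round 0 durations copied from the benchmark output above. The helper name `scaling_table` is made up for this example, and a single round is noisier than the averaged figures.

```python
# Round 0 durations in seconds, copied from the benchmark output
# (key = number of processes, 0 = no pool)
round0 = {0: 6.845765, 1: 9.648689, 2: 6.037994, 4: 4.668710, 6: 3.276308, 12: 3.474993}

def scaling_table(durations):
    # Returns (processes, speedup, efficiency) tuples relative to the 0-process baseline
    baseline = durations[0]
    rows = []
    for n in sorted(durations):
        if n == 0:
            continue
        speedup = baseline / durations[n]
        rows.append((n, speedup, speedup / n))  # efficiency 1.0 = perfect linear scaling
    return rows

for n, speedup, efficiency in scaling_table(round0):
    print('{:2d} processes: speedup {:.2f}x, efficiency {:.2f}'.format(n, speedup, efficiency))
```

In this round a single worker process is slower than no pool at all, which suggests that the cost of moving data between processes is not negligible for images this small.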

## Benchmark: CIFAR10 - In Memory Performance, Image Generation with GPU Training

```python

import tools.sysmonitor as SM

reload(SM)

```

    

Let us take a model from one of the [Keras examples](https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py):

```python
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()

# First convolutional block: 32 filters
model.add(Conv2D(32, 3, 3, border_mode='same',
                 input_shape=(32, 32, 3)))
model.add(Activation('relu'))
model.add(Conv2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# Second convolutional block: 64 filters
model.add(Conv2D(64, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# Classifier head: 10 CIFAR10 classes
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))

model.summary()

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

    ____________________________________________________________________________________________________

    Layer (type)                     Output Shape          Param #     Connected to                     

    ====================================================================================================

    convolution2d_8 (Convolution2D)  (None, 32, 32, 32)    896         convolution2d_input_3[0][0]      

    ____________________________________________________________________________________________________

    activation_12 (Activation)       (None, 32, 32, 32)    0           convolution2d_8[0][0]            

    ____________________________________________________________________________________________________

    convolution2d_9 (Convolution2D)  (None, 30, 30, 32)    9248        activation_12[0][0]              

    ____________________________________________________________________________________________________

    activation_13 (Activation)       (None, 30, 30, 32)    0           convolution2d_9[0][0]            

    ____________________________________________________________________________________________________

    maxpooling2d_6 (MaxPooling2D)    (None, 15, 15, 32)    0           activation_13[0][0]              

    ____________________________________________________________________________________________________

    dropout_5 (Dropout)              (None, 15, 15, 32)    0           maxpooling2d_6[0][0]             

    ____________________________________________________________________________________________________

    convolution2d_10 (Convolution2D) (None, 15, 15, 64)    18496       dropout_5[0][0]                  

    ____________________________________________________________________________________________________

    activation_14 (Activation)       (None, 15, 15, 64)    0           convolution2d_10[0][0]           

    ____________________________________________________________________________________________________

    convolution2d_11 (Convolution2D) (None, 13, 13, 64)    36928       activation_14[0][0]              

    ____________________________________________________________________________________________________

    activation_15 (Activation)       (None, 13, 13, 64)    0           convolution2d_11[0][0]           

    ____________________________________________________________________________________________________

    maxpooling2d_7 (MaxPooling2D)    (None, 6, 6, 64)      0           activation_15[0][0]              

    ____________________________________________________________________________________________________

    dropout_6 (Dropout)              (None, 6, 6, 64)      0           maxpooling2d_7[0][0]             

    ____________________________________________________________________________________________________

    flatten_3 (Flatten)              (None, 2304)          0           dropout_6[0][0]                  

    ____________________________________________________________________________________________________

    dense_5 (Dense)                  (None, 512)           1180160     flatten_3[0][0]                  

    ____________________________________________________________________________________________________

    activation_16 (Activation)       (None, 512)           0           dense_5[0][0]                    

    ____________________________________________________________________________________________________

    dropout_7 (Dropout)              (None, 512)           0           activation_16[0][0]              

    ____________________________________________________________________________________________________

    dense_6 (Dense)                  (None, 10)            5130        dropout_7[0][0]                  

    ____________________________________________________________________________________________________

    activation_17 (Activation)       (None, 10)            0           dense_6[0][0]                    

    ====================================================================================================

    Total params: 1,250,858

    Trainable params: 1,250,858

    Non-trainable params: 0

    ____________________________________________________________________________________________________

When we are running lengthier training sessions, we may want to interrupt training to try a different approach: tweak hyperparameters, choose a different optimizer, adjust the network architecture, etc. In order to handle this gracefully with multiprocessing, we need to tell the child processes to ignore the interrupt signals. The parent process will catch the KeyboardInterrupt exception and allow us to continue working interactively in the notebook. Without this infrastructure, the child processes will remain in limbo, as detailed [here](http://noswap.com/blog/python-multiprocessing-keyboardinterrupt).

```python
pool = None

def init_worker():
    # Runs in each child process: ignore SIGINT so only the parent handles Ctrl-C
    signal.signal(signal.SIGINT, signal.SIG_IGN)
```
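The interrupt-handling pattern can also be seen in isolation. The following is a minimal sketch, independent of Keras. The `square` workload and the `run` helper are placeholders invented for this example; only `init_worker` mirrors the function defined above.

```python
import multiprocessing
import signal

def init_worker():
    # Runs once in each child process: ignore Ctrl-C (SIGINT) there,
    # so only the parent process raises KeyboardInterrupt
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def square(x):
    # Placeholder workload; stands in for per-image augmentation
    return x * x

def run(values, processes=2):
    pool = multiprocessing.Pool(processes=processes, initializer=init_worker)
    try:
        return pool.map(square, values)
    finally:
        # Reached on success and on KeyboardInterrupt alike: stop the workers
        pool.terminate()
        pool.join()

if __name__ == '__main__':
    print(run(range(5)))
```

Putting the cleanup in `finally` means that an interrupt raised in the parent still tears the pool down, so no worker processes are left running afterwards.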

```python
def setup_generator(processes=None, batch_size=32):
    global pool

    # Shut down the pool from a previous call, if there was one
    try:
        pool.terminate()
    except AttributeError:
        # pool is None when no pool was created
        pass

    if processes:
        pool = multiprocessing.Pool(processes=processes, initializer=init_worker)
    else:
        pool = None

    gen = T.ImageDataGenerator(
         featurewise_center=False,
         samplewise_center=False,
         featurewise_std_normalization=False,
         samplewise_std_normalization=False,
         zca_whitening=False,
         rotation_range=45,
         width_shift_range=.1,
         height_shift_range=.1,
         shear_range=0.,
         zoom_range=[.8, 1],
         channel_shift_range=20,
         fill_mode='nearest',
         cval=0.,
         horizontal_flip=True,
         vertical_flip=False,
         rescale=None,
         preprocessing_function=preprocess_img,
         dim_ordering='default',
         pool=pool
    )

    # The test generator only rescales; it applies no augmentation
    test_gen = T.ImageDataGenerator(
        preprocessing_function=preprocess_img,
        pool=pool
    )

    gen.fit(X_train)
    test_gen.fit(X_train)

    X_train_aug = gen.flow(X_train, y_train_cat, seed=0, batch_size=batch_size)
    X_test_aug = test_gen.flow(X_test, y_test_cat, seed=0, batch_size=batch_size)

    return X_train_aug, X_test_aug
```

```python
def run_benchmark(processes=None, batch_size=32, vert=True, plot=True):
    X_train_aug, X_test_aug = setup_generator(processes=processes, batch_size=batch_size)

    # Record CPU/GPU utilization while training runs
    sys_mon = SM.SysMonitor()
    sys_mon.start()

    try:
        model.fit_generator(X_train_aug, samples_per_epoch=50000, nb_epoch=5,
                            validation_data=X_test_aug, nb_val_samples=10000)
    except KeyboardInterrupt:
        # print() with a single argument behaves the same on Python 2 and 3
        print('\n\nTraining Interrupted\n')
        return None

    sys_mon.stop()

    title = None
    if not processes:
        title = '{0:.2f} seconds of computation, no multiprocessing, batch size = {1}'.format(sys_mon.duration, batch_size)
    else:
        title = '{0:.2f} seconds of computation, using {1} processes, batch size = {2}'.format(sys_mon.duration, processes, batch_size)

    if plot:
        sys_mon.plot(title, vert)

    if not processes:
        processes = 0

    return {
        'processes': processes,
        'batch_size': batch_size,
        'duration': sys_mon.duration,
        'title': title
    }
```

```python

run_benchmark(processes=None, batch_size=32)

```

    Epoch 1/5

     3552/50000 [=>............................] - ETA: 29s - loss: 2.1171 - acc: 0.2030

    

    Training Interrupted

    

```python

run_benchmark(processes=7, batch_size=32)

```

    Epoch 1/5

    11136/50000 [=====>........................] - ETA: 8s - loss: 1.8752 - acc: 0.3081

    

    Training Interrupted

    

Now let's try a variety of different test scenarios:

```python

runs = []

```

```python

runs.append(run_benchmark(processes=None, batch_size=32))

```

    Epoch 1/5

    50000/50000 [==============================] - 22s - loss: 1.1598 - acc: 0.5941 - val_loss: 0.8368 - val_acc: 0.7077

    Epoch 2/5

    50000/50000 [==============================] - 21s - loss: 1.1457 - acc: 0.6003 - val_loss: 0.8865 - val_acc: 0.6907

    Epoch 3/5

    50000/50000 [==============================] - 21s - loss: 1.1311 - acc: 0.6031 - val_loss: 0.8255 - val_acc: 0.7190

    Epoch 4/5

    50000/50000 [==============================] - 21s - loss: 1.1232 - acc: 0.6060 - val_loss: 0.8367 - val_acc: 0.7142

    Epoch 5/5

    50000/50000 [==============================] - 22s - loss: 1.1075 - acc: 0.6116 - val_loss: 0.8358 - val_acc: 0.7054

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_38_1.png)

```python

runs.append(run_benchmark(processes=7, batch_size=32))

```

    Epoch 1/5

    50000/50000 [==============================] - 11s - loss: 1.0912 - acc: 0.6165 - val_loss: 0.8329 - val_acc: 0.7103

    Epoch 2/5

    50000/50000 [==============================] - 11s - loss: 1.0838 - acc: 0.6232 - val_loss: 0.8299 - val_acc: 0.7053

    Epoch 3/5

    50000/50000 [==============================] - 11s - loss: 1.0736 - acc: 0.6245 - val_loss: 0.8385 - val_acc: 0.7092

    Epoch 4/5

    50000/50000 [==============================] - 11s - loss: 1.0671 - acc: 0.6258 - val_loss: 0.7994 - val_acc: 0.7238

    Epoch 5/5

    50000/50000 [==============================] - 11s - loss: 1.0670 - acc: 0.6283 - val_loss: 0.8347 - val_acc: 0.7133

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_39_1.png)

```python

runs[0]['duration'] / runs[1]['duration']

```

    1.8832975152491378

As we can see, we can get a **1.88x speedup by using 7 processes**. GPU and CPU utilization are markedly higher and more consistent.

Let's see if batch size affects the outcome:

```python

runs.append(run_benchmark(processes=None, batch_size=256))

```

    Epoch 1/5

    50000/50000 [==============================] - 19s - loss: 1.0319 - acc: 0.6400 - val_loss: 0.7463 - val_acc: 0.7389

    Epoch 2/5

    50000/50000 [==============================] - 17s - loss: 1.0013 - acc: 0.6495 - val_loss: 0.7436 - val_acc: 0.7416

    Epoch 3/5

    50000/50000 [==============================] - 17s - loss: 0.9910 - acc: 0.6537 - val_loss: 0.7253 - val_acc: 0.7484

    Epoch 4/5

    50000/50000 [==============================] - 17s - loss: 0.9824 - acc: 0.6582 - val_loss: 0.7271 - val_acc: 0.7499

    Epoch 5/5

    50000/50000 [==============================] - 17s - loss: 0.9752 - acc: 0.6600 - val_loss: 0.6967 - val_acc: 0.7607

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_42_1.png)

```python

runs.append(run_benchmark(processes=7, batch_size=256))

```

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.9585 - acc: 0.6660 - val_loss: 0.7220 - val_acc: 0.7495

    Epoch 2/5

    50000/50000 [==============================] - 5s - loss: 0.9553 - acc: 0.6671 - val_loss: 0.7071 - val_acc: 0.7546

    Epoch 3/5

    50000/50000 [==============================] - 5s - loss: 0.9502 - acc: 0.6690 - val_loss: 0.6920 - val_acc: 0.7640

    Epoch 4/5

    50000/50000 [==============================] - 5s - loss: 0.9525 - acc: 0.6687 - val_loss: 0.7103 - val_acc: 0.7558

    Epoch 5/5

    50000/50000 [==============================] - 5s - loss: 0.9452 - acc: 0.6713 - val_loss: 0.6999 - val_acc: 0.7565

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_43_1.png)

```python

runs[2]['duration'] / runs[3]['duration']

```

    3.3318531284663795

With a batch size of 256, we get an **even larger speedup of 3.3x**

```python

runs.append(run_benchmark(processes=None, batch_size=1024))

```

    Epoch 1/5

    50000/50000 [==============================] - 18s - loss: 0.9383 - acc: 0.6709 - val_loss: 0.6876 - val_acc: 0.7634

    Epoch 2/5

    50000/50000 [==============================] - 15s - loss: 0.9310 - acc: 0.6733 - val_loss: 0.6851 - val_acc: 0.7626

    Epoch 3/5

    50000/50000 [==============================] - 16s - loss: 0.9226 - acc: 0.6794 - val_loss: 0.6783 - val_acc: 0.7701

    Epoch 4/5

    50000/50000 [==============================] - 15s - loss: 0.9230 - acc: 0.6785 - val_loss: 0.6884 - val_acc: 0.7651

    Epoch 5/5

    50000/50000 [==============================] - 15s - loss: 0.9152 - acc: 0.6809 - val_loss: 0.6682 - val_acc: 0.7695

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_46_1.png)

```python

runs.append(run_benchmark(processes=7, batch_size=1024))

```

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.9137 - acc: 0.6815 - val_loss: 0.6798 - val_acc: 0.7661

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.9161 - acc: 0.6814 - val_loss: 0.6771 - val_acc: 0.7649

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.9125 - acc: 0.6812 - val_loss: 0.6759 - val_acc: 0.7691

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.9133 - acc: 0.6814 - val_loss: 0.6786 - val_acc: 0.7673

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.9139 - acc: 0.6812 - val_loss: 0.6574 - val_acc: 0.7707

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_47_1.png)

```python

runs[4]['duration'] / runs[5]['duration']

```

    3.4816410549495163

With a batch size of 1024, we get **a speedup of 3.48x**. We also notice an interesting phenomenon. Without multiprocessing, GPU utilization intermittently drops to 0. However, with 7 processes, we can see consistent >60% GPU utilization with a long initial period of >80%. Notice that with this batch size, we are able to get to lower losses a lot quicker than with lower batch sizes. This pattern will not necessarily continue with additional epochs, but it may be promising in some cases.
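For context on the batch sizes benchmarked in this section, the arithmetic below shows how many batches the generator has to deliver per epoch at each setting. It is a sketch only: `batches_per_epoch` is a helper name made up for this example, and whether the drop in batch count explains the larger speedups was not measured here.

```python
import math

def batches_per_epoch(samples_per_epoch, batch_size):
    # Number of generator batches needed to cover one epoch (last batch may be partial)
    return int(math.ceil(samples_per_epoch / float(batch_size)))

samples_per_epoch = 50000  # value passed to fit_generator in run_benchmark

for batch_size in (32, 256, 1024, 4096):
    n = batches_per_epoch(samples_per_epoch, batch_size)
    print('batch_size={:4d} -> {:4d} batches per epoch'.format(batch_size, n))
```

Going from a batch size of 32 to 4096 cuts the number of hand-offs between the generator and the training loop by roughly two orders of magnitude.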

```python

runs.append(run_benchmark(processes=None, batch_size=4096))

```

    Epoch 1/5

    50000/50000 [==============================] - 19s - loss: 0.9059 - acc: 0.6812 - val_loss: 0.6704 - val_acc: 0.7696

    Epoch 2/5

    50000/50000 [==============================] - 16s - loss: 0.9116 - acc: 0.6829 - val_loss: 0.6654 - val_acc: 0.7666

    Epoch 3/5

    50000/50000 [==============================] - 14s - loss: 0.9002 - acc: 0.6867 - val_loss: 0.6626 - val_acc: 0.7719

    Epoch 4/5

    50000/50000 [==============================] - 16s - loss: 0.8984 - acc: 0.6863 - val_loss: 0.6678 - val_acc: 0.7688

    Epoch 5/5

    50000/50000 [==============================] - 15s - loss: 0.9041 - acc: 0.6847 - val_loss: 0.6647 - val_acc: 0.7663

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_50_1.png)

```python

runs.append(run_benchmark(processes=7, batch_size=4096))

```

    Epoch 1/5

    50000/50000 [==============================] - 6s - loss: 0.9057 - acc: 0.6823 - val_loss: 0.6678 - val_acc: 0.7680

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.9003 - acc: 0.6881 - val_loss: 0.6596 - val_acc: 0.7687

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8993 - acc: 0.6866 - val_loss: 0.6560 - val_acc: 0.7734

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.9034 - acc: 0.6857 - val_loss: 0.6641 - val_acc: 0.7713

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.9023 - acc: 0.6862 - val_loss: 0.6670 - val_acc: 0.7653

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_51_1.png)

```python

runs[6]['duration'] / runs[7]['duration']

```

    3.4145451599892525

A batch size of 4096 will not be a good choice in every case, but as a measure of system performance it is revealing: GPU usage is inconsistent in the single-process case, whereas with 7 processes we get between 80% and 100% GPU utilization.

Let's do a final experiment with this dataset to see how Image Augmentation + GPU Training time scales with process count:

```python

processes_counts = [None]

processes_counts.extend(range(1, 13))

results = []

for pc in processes_counts:

    print('process count', pc)

    results.append(run_benchmark(processes=pc, batch_size=4096, plot=False))

```

    ('process count', None)

    Epoch 1/5

    50000/50000 [==============================] - 19s - loss: 0.8994 - acc: 0.6885 - val_loss: 0.6619 - val_acc: 0.7704

    Epoch 2/5

    50000/50000 [==============================] - 15s - loss: 0.9035 - acc: 0.6864 - val_loss: 0.6609 - val_acc: 0.7706

    Epoch 3/5

    50000/50000 [==============================] - 15s - loss: 0.8930 - acc: 0.6883 - val_loss: 0.6613 - val_acc: 0.7730

    Epoch 4/5

    50000/50000 [==============================] - 17s - loss: 0.8894 - acc: 0.6879 - val_loss: 0.6648 - val_acc: 0.7705

    Epoch 5/5

    50000/50000 [==============================] - 14s - loss: 0.8942 - acc: 0.6870 - val_loss: 0.6639 - val_acc: 0.7706

    ('process count', 1)

    Epoch 1/5

    50000/50000 [==============================] - 26s - loss: 0.8941 - acc: 0.6873 - val_loss: 0.6590 - val_acc: 0.7727

    Epoch 2/5

    50000/50000 [==============================] - 19s - loss: 0.8953 - acc: 0.6900 - val_loss: 0.6597 - val_acc: 0.7690

    Epoch 3/5

    50000/50000 [==============================] - 18s - loss: 0.8889 - acc: 0.6888 - val_loss: 0.6528 - val_acc: 0.7775

    Epoch 4/5

    50000/50000 [==============================] - 20s - loss: 0.8924 - acc: 0.6879 - val_loss: 0.6627 - val_acc: 0.7714

    Epoch 5/5

    50000/50000 [==============================] - 21s - loss: 0.8962 - acc: 0.6873 - val_loss: 0.6599 - val_acc: 0.7704

    ('process count', 2)

    Epoch 1/5

    50000/50000 [==============================] - 15s - loss: 0.8916 - acc: 0.6884 - val_loss: 0.6598 - val_acc: 0.7725

    Epoch 2/5

    50000/50000 [==============================] - 11s - loss: 0.8925 - acc: 0.6888 - val_loss: 0.6544 - val_acc: 0.7716

    Epoch 3/5

    50000/50000 [==============================] - 11s - loss: 0.8869 - acc: 0.6898 - val_loss: 0.6505 - val_acc: 0.7768

    Epoch 4/5

    50000/50000 [==============================] - 11s - loss: 0.8917 - acc: 0.6895 - val_loss: 0.6578 - val_acc: 0.7735

    Epoch 5/5

    50000/50000 [==============================] - 11s - loss: 0.8890 - acc: 0.6888 - val_loss: 0.6614 - val_acc: 0.7701

    ('process count', 3)

    Epoch 1/5

    50000/50000 [==============================] - 10s - loss: 0.8788 - acc: 0.6938 - val_loss: 0.6580 - val_acc: 0.7730

    Epoch 2/5

    50000/50000 [==============================] - 7s - loss: 0.8887 - acc: 0.6901 - val_loss: 0.6564 - val_acc: 0.7694

    Epoch 3/5

    50000/50000 [==============================] - 7s - loss: 0.8817 - acc: 0.6919 - val_loss: 0.6488 - val_acc: 0.7756

    Epoch 4/5

    50000/50000 [==============================] - 7s - loss: 0.8852 - acc: 0.6923 - val_loss: 0.6549 - val_acc: 0.7731

    Epoch 5/5

    50000/50000 [==============================] - 7s - loss: 0.8833 - acc: 0.6904 - val_loss: 0.6574 - val_acc: 0.7729

    ('process count', 4)

    Epoch 1/5

    50000/50000 [==============================] - 8s - loss: 0.8780 - acc: 0.6919 - val_loss: 0.6539 - val_acc: 0.7742

    Epoch 2/5

    50000/50000 [==============================] - 6s - loss: 0.8839 - acc: 0.6914 - val_loss: 0.6511 - val_acc: 0.7696

    Epoch 3/5

    50000/50000 [==============================] - 5s - loss: 0.8782 - acc: 0.6936 - val_loss: 0.6481 - val_acc: 0.7741

    Epoch 4/5

    50000/50000 [==============================] - 5s - loss: 0.8792 - acc: 0.6940 - val_loss: 0.6529 - val_acc: 0.7736

    Epoch 5/5

    50000/50000 [==============================] - 6s - loss: 0.8844 - acc: 0.6907 - val_loss: 0.6586 - val_acc: 0.7713

    ('process count', 5)

    Epoch 1/5

    50000/50000 [==============================] - 7s - loss: 0.8776 - acc: 0.6930 - val_loss: 0.6514 - val_acc: 0.7749

    Epoch 2/5

    50000/50000 [==============================] - 5s - loss: 0.8779 - acc: 0.6919 - val_loss: 0.6521 - val_acc: 0.7697

    Epoch 3/5

    50000/50000 [==============================] - 5s - loss: 0.8692 - acc: 0.6957 - val_loss: 0.6453 - val_acc: 0.7769

    Epoch 4/5

    50000/50000 [==============================] - 5s - loss: 0.8792 - acc: 0.6944 - val_loss: 0.6520 - val_acc: 0.7753

    Epoch 5/5

    50000/50000 [==============================] - 5s - loss: 0.8804 - acc: 0.6926 - val_loss: 0.6561 - val_acc: 0.7738

    ('process count', 6)

    Epoch 1/5

    50000/50000 [==============================] - 6s - loss: 0.8702 - acc: 0.6945 - val_loss: 0.6512 - val_acc: 0.7739

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8708 - acc: 0.6949 - val_loss: 0.6470 - val_acc: 0.7715

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8686 - acc: 0.6964 - val_loss: 0.6417 - val_acc: 0.7766

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8683 - acc: 0.6966 - val_loss: 0.6495 - val_acc: 0.7763

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8692 - acc: 0.6997 - val_loss: 0.6525 - val_acc: 0.7752

    ('process count', 7)

    Epoch 1/5

    50000/50000 [==============================] - 6s - loss: 0.8676 - acc: 0.6960 - val_loss: 0.6477 - val_acc: 0.7746

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8634 - acc: 0.6985 - val_loss: 0.6442 - val_acc: 0.7714

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8656 - acc: 0.6988 - val_loss: 0.6398 - val_acc: 0.7769

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8694 - acc: 0.6967 - val_loss: 0.6495 - val_acc: 0.7749

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8617 - acc: 0.6994 - val_loss: 0.6511 - val_acc: 0.7761

    ('process count', 8)

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.8600 - acc: 0.6989 - val_loss: 0.6462 - val_acc: 0.7754

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8620 - acc: 0.6997 - val_loss: 0.6404 - val_acc: 0.7743

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8563 - acc: 0.6982 - val_loss: 0.6389 - val_acc: 0.7768

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8639 - acc: 0.6981 - val_loss: 0.6457 - val_acc: 0.7772

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8653 - acc: 0.6995 - val_loss: 0.6504 - val_acc: 0.7762

    ('process count', 9)

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.8581 - acc: 0.6996 - val_loss: 0.6442 - val_acc: 0.7769

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8603 - acc: 0.6989 - val_loss: 0.6437 - val_acc: 0.7727

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8557 - acc: 0.7032 - val_loss: 0.6374 - val_acc: 0.7794

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8620 - acc: 0.6998 - val_loss: 0.6439 - val_acc: 0.7776

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8594 - acc: 0.6969 - val_loss: 0.6474 - val_acc: 0.7774

    ('process count', 10)

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.8528 - acc: 0.7036 - val_loss: 0.6420 - val_acc: 0.7759

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8518 - acc: 0.7028 - val_loss: 0.6378 - val_acc: 0.7756

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8491 - acc: 0.7034 - val_loss: 0.6332 - val_acc: 0.7793

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8555 - acc: 0.7023 - val_loss: 0.6421 - val_acc: 0.7777

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8520 - acc: 0.7011 - val_loss: 0.6459 - val_acc: 0.7750

    ('process count', 11)

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.8518 - acc: 0.7010 - val_loss: 0.6389 - val_acc: 0.7795

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8506 - acc: 0.7038 - val_loss: 0.6398 - val_acc: 0.7746

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8438 - acc: 0.7056 - val_loss: 0.6339 - val_acc: 0.7812

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8553 - acc: 0.7020 - val_loss: 0.6393 - val_acc: 0.7784

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8475 - acc: 0.7057 - val_loss: 0.6449 - val_acc: 0.7785

    ('process count', 12)

    Epoch 1/5

    50000/50000 [==============================] - 5s - loss: 0.8450 - acc: 0.7028 - val_loss: 0.6371 - val_acc: 0.7784

    Epoch 2/5

    50000/50000 [==============================] - 4s - loss: 0.8444 - acc: 0.7047 - val_loss: 0.6353 - val_acc: 0.7773

    Epoch 3/5

    50000/50000 [==============================] - 4s - loss: 0.8418 - acc: 0.7074 - val_loss: 0.6290 - val_acc: 0.7809

    Epoch 4/5

    50000/50000 [==============================] - 4s - loss: 0.8447 - acc: 0.7049 - val_loss: 0.6392 - val_acc: 0.7783

    Epoch 5/5

    50000/50000 [==============================] - 4s - loss: 0.8457 - acc: 0.7011 - val_loss: 0.6417 - val_acc: 0.7781

```python

durs_4096 = pd.DataFrame([x['duration'] for x in results])

plt.figure(figsize=(10,5))

plt.plot(durs_4096, marker='o')

plt.xlabel('# Processes')

plt.ylabel('Seconds')

plt.title('Image Augmentation + GPU Training time vs. # Processes')

```

    

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_55_1.png)

```python

speedups_4096 = 1 / (durs_4096 / durs_4096.iloc[0])

plt.figure(figsize=(10,5))

plt.plot(speedups_4096, marker='o')

plt.xlabel('# Processes')

plt.ylabel('Speedup')

plt.hlines(1, -1, speedups_4096.shape[0], colors='red', linestyles='dashed')

plt.title('Image Augmentation + GPU Training speedup vs. # Processes')

```

    

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_56_1.png)

```python

best_ix = np.argmax(speedups_4096.values)

print('Best speedup: {0:.2f}x with {1} processes.'.format(speedups_4096.values[best_ix][0], best_ix))

```

    Best speedup: 3.51x with 9 processes.
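
The cell above prints `best_ix` as the process count. That works here because `processes_counts` starts with `None` (the baseline) followed by 1 to 12, so the index equals the count. A small sketch that looks the label up explicitly, in case the list is ever reordered; `best_speedup` is a hypothetical helper and the numbers are placeholders, not the measured durations:

```python
def best_speedup(process_counts, durations):
    """Return (speedup, process_count) of the fastest run relative to the first entry."""
    baseline = durations[0]
    speedups = [baseline / float(d) for d in durations]
    best_ix = max(range(len(speedups)), key=lambda i: speedups[i])
    return speedups[best_ix], process_counts[best_ix]


# Placeholder numbers for illustration only; NOT the measured durations.
example_counts = [None, 1, 2, 4]
example_durations = [80.0, 100.0, 60.0, 32.0]

speedup, count = best_speedup(example_counts, example_durations)
print('Best speedup: {0:.2f}x with {1} processes.'.format(speedup, count))
```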

## Benchmark: Dogs vs. Cats - On Disk Performance, Image Generation with GPU Training

Using the images in the [dogs vs. cats dataset](https://www.kaggle.com/c/dogs-vs-cats/data) provided by Kaggle, we can test the performance of image augmentation on images loaded from disk on the fly.

To follow along, unzip the downloaded training zip file, then create the `data/train/cat`, `data/train/dog`, `data/validation/cat`, and `data/validation/dog` folders.

Then move the images whose indices start with 8 into the appropriate validation folders, and the rest into the matching train folders.

```bash

mv cat.8* data/validation/cat/

mv dog.8* data/validation/dog/

```
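
If you would rather do the whole split from Python, here is a rough sketch. It assumes the unzipped files are named like `cat.123.jpg` and `dog.123.jpg`; the function name `split_dogs_vs_cats` and its arguments are my own and not part of the notebook:

```python
import os
import shutil


def split_dogs_vs_cats(src_dir, dest_dir='data'):
    """Sort files named like 'cat.123.jpg' into train/validation class folders."""
    counts = {'train': 0, 'validation': 0}
    for name in sorted(os.listdir(src_dir)):
        parts = name.split('.')
        if len(parts) != 3 or parts[0] not in ('cat', 'dog'):
            continue  # skip anything that is not '<label>.<index>.<ext>'
        label, index = parts[0], parts[1]
        # Indices starting with '8' form the validation split, as in the mv commands.
        subset = 'validation' if index.startswith('8') else 'train'
        target = os.path.join(dest_dir, subset, label)
        if not os.path.isdir(target):
            os.makedirs(target)
        shutil.move(os.path.join(src_dir, name), os.path.join(target, name))
        counts[subset] += 1
    return counts
```

Calling `split_dogs_vs_cats('train')` from the folder that contains the unzipped `train` directory (an assumed location) would return the number of images placed in each split.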

```python

import os

paths = sorted(os.listdir('./data/train/cat'))

fig, ax = plt.subplots(5, 5, figsize=(15, 15))

for i in range(5):

    for j in range(5):

        ix = i*5 + j

        img = plt.imread('./data/train/cat/' + paths[ix])

        ax[i][j].imshow(img)

```

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_60_0.png)

```python

paths = sorted(os.listdir('./data/train/dog'))

fig, ax = plt.subplots(5, 5, figsize=(15, 15))

for i in range(5):

    for j in range(5):

        ix = i*5 + j

        img = plt.imread('./data/train/dog/' + paths[ix])

        ax[i][j].imshow(img)

```

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_61_0.png)

```python

gen = T.ImageDataGenerator(

     featurewise_center=False,

     samplewise_center=False,

     featurewise_std_normalization=False,

     samplewise_std_normalization=False,

     zca_whitening=False,

     rotation_range=45,

     width_shift_range=.1,

     height_shift_range=.1,

     shear_range=0.,

     zoom_range=[.8, 1],

     channel_shift_range=0,

     fill_mode='nearest',

     cval=0.,

     horizontal_flip=True,

     vertical_flip=False,

     rescale=1/255.,

#      preprocessing_function=preprocess_img,

     #dim_ordering='default',

#      pool=None

)

test_gen = T.ImageDataGenerator(

    preprocessing_function=preprocess_img,

#     pool=None

)

train_generator = gen.flow_from_directory(

    'data/train',

    target_size=(150, 150),

    batch_size=32,

    class_mode='binary')

test_generator = gen.flow_from_directory(

    'data/validation',

    target_size=(150, 150),

    batch_size=32,

    class_mode='binary')

fig, ax = plt.subplots(6, 6, figsize=(15, 15))

for (imgs, labels) in train_generator:

    for i in range(6):

        for j in range(6):

            if i*6 + j < 32:

                ax[i][j].imshow(imgs[i*6 + j])

    break

```

    Found 22778 images belonging to 2 classes.

    Found 2222 images belonging to 2 classes.

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_62_1.png)

```python

from keras.models import Sequential

from keras.layers import Convolution2D, MaxPooling2D

from keras.layers import Activation, Dropout, Flatten, Dense

model = Sequential()

model.add(Convolution2D(32, 3, 3, input_shape=(299, 299, 3)))

model.add(Activation('relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(32, 3, 3))

model.add(Activation('relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(64, 3, 3))

model.add(Activation('relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Flatten())  # this converts our 3D feature maps to 1D feature vectors

model.add(Dense(64))

model.add(Activation('relu'))

model.add(Dropout(0.5))

model.add(Dense(1))

model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',

              optimizer='rmsprop',

              metrics=['accuracy'])

```
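
To get a feel for how large the flattened feature vector in the model above is, here is a quick back-of-the-envelope sketch. It assumes the Keras 1 defaults used above: unpadded (`valid`) 3x3 convolutions with stride 1, and 2x2 max pooling that drops any leftover pixel. The helper names are my own:

```python
def conv_valid(size, kernel=3):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1


def pool_valid(size, pool=2):
    """Output width of non-overlapping max pooling; leftover pixels are dropped."""
    return size // pool


size, channels = 299, 3
for filters in (32, 32, 64):
    # Each stage is Convolution2D(filters, 3, 3) followed by MaxPooling2D((2, 2)).
    size = pool_valid(conv_valid(size))
    channels = filters

flat_features = size * size * channels
dense_params = flat_features * 64 + 64  # weights plus biases of Dense(64)
print(size, flat_features, dense_params)
```

Under these assumptions the feature maps shrink from 299 to 35 pixels per side, so `Flatten()` yields 78,400 features and the first `Dense(64)` layer holds about 5 million parameters.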

```python

def setup_cat_dog_generator(processes=None, batch_size=32):

    global pool

    try:

        pool.terminate()

    except (NameError, AttributeError):  # pool was never created, or is None

        pass

    if processes: 

        pool = multiprocessing.Pool(processes=processes, initializer=init_worker)

    else:

        pool = None

    gen = T.ImageDataGenerator(

         featurewise_center=False,

         samplewise_center=False,

         featurewise_std_normalization=False,

         samplewise_std_normalization=False,

         zca_whitening=False,

         rotation_range=45,

         width_shift_range=.1,

         height_shift_range=.1,

         shear_range=0.,

         zoom_range=[.8, 1],

         channel_shift_range=20,

         fill_mode='nearest',

         cval=0.,

         horizontal_flip=True,

         vertical_flip=False,

         rescale=None,

         preprocessing_function=preprocess_img,

         dim_ordering='default',

         pool=pool

    )

    test_gen = T.ImageDataGenerator(

        preprocessing_function=preprocess_img,

        pool=pool

    )

    

    gen.fit(X_train)

    test_gen.fit(X_train)

    

    X_train_aug = gen.flow_from_directory(

        'data/train',

        target_size=(299, 299),

        batch_size=batch_size,

        class_mode='binary')

    X_test_aug = gen.flow_from_directory(

        'data/validation',

        target_size=(299, 299),

        batch_size=batch_size,

        class_mode='binary')

    

    return X_train_aug, X_test_aug

```

```python

def run_cat_dog_benchmark(processes=None, batch_size=32, vert=True, plot=True):

    

    X_train_aug, X_test_aug = setup_cat_dog_generator(processes=processes, batch_size=batch_size)

    sys_mon = SM.SysMonitor()

    sys_mon.start()

    

    try:        

        model.fit_generator(

                X_train_aug,

                samples_per_epoch=22778,

                nb_epoch=2,

                validation_data=X_test_aug,

                nb_val_samples=2222)

    except KeyboardInterrupt:

        print('\n\nTraining Interrupted\n')

        return None

    sys_mon.stop()

    

    title = None

    if not processes:

        title = '{0:.2f} seconds of computation, no multiprocessing, batch size = {1}'.format(sys_mon.duration, batch_size)

    else:

        title = '{0:.2f} seconds of computation, using {1} processes, batch size = {2}'.format(sys_mon.duration, processes, batch_size)

    

    if plot:

        sys_mon.plot(title, vert)

    

    if not processes:

        processes = 0

        

    return {

        'processes': processes,

        'batch_size': batch_size,

        'duration': sys_mon.duration,

        'title': title

    }

```

In the following benchmark runs, you can see how inconsistently the GPU is used without multiprocessing. Even with multiprocessing, the CPU struggles to feed data to the GPU fast enough to keep GPU utilization stable. However, average utilization is much higher than before.

Before running each benchmark, I run: 

`sync; echo 3 > /proc/sys/vm/drop_caches` 

in the shell (as root). This clears any files cached in memory that could skew the benchmark results.
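
The same step can be scripted so it is not forgotten between runs. This is only a sketch: `drop_page_cache` is a hypothetical helper, it is Linux-specific, and writing to the default path needs root privileges.

```python
import os


def drop_page_cache(path='/proc/sys/vm/drop_caches'):
    """Flush dirty pages, then ask the Linux kernel to drop its caches.

    Returns True on success and False if the write is not permitted.
    """
    if hasattr(os, 'sync'):
        os.sync()  # Python 3.3+; plays the role of the shell `sync`
    try:
        with open(path, 'w') as handle:
            handle.write('3\n')
    except (IOError, OSError):
        return False
    return True
```

One way to use it would be to call `drop_page_cache()` right before each `run_cat_dog_benchmark(...)` call and skip the run if it returns `False`.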

```python

runs = []

```

```python

runs.append(run_cat_dog_benchmark(processes=None, batch_size=64))

```

    Found 22778 images belonging to 2 classes.

    Found 2222 images belonging to 2 classes.

    Epoch 1/2

    22778/22778 [==============================] - 326s - loss: 0.6311 - acc: 0.6492 - val_loss: 0.5449 - val_acc: 0.7151

    Epoch 2/2

    22778/22778 [==============================] - 313s - loss: 0.5782 - acc: 0.7043 - val_loss: 0.5174 - val_acc: 0.7480

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_68_1.png)

```python

runs.append(run_cat_dog_benchmark(processes=7, batch_size=64))

```

    Found 22778 images belonging to 2 classes.

    Found 2222 images belonging to 2 classes.

    Epoch 1/2

    22778/22778 [==============================] - 90s - loss: 0.5475 - acc: 0.7274 - val_loss: 0.4989 - val_acc: 0.7610

    Epoch 2/2

    22778/22778 [==============================] - 87s - loss: 0.5318 - acc: 0.7417 - val_loss: 0.4973 - val_acc: 0.7610

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_69_1.png)

```python

runs.append(run_cat_dog_benchmark(processes=11, batch_size=64))

```

    Found 22778 images belonging to 2 classes.

    Found 2222 images belonging to 2 classes.

    Epoch 1/2

    22778/22778 [==============================] - 81s - loss: 0.5181 - acc: 0.7514 - val_loss: 0.5052 - val_acc: 0.7520

    Epoch 2/2

    22778/22778 [==============================] - 80s - loss: 0.5086 - acc: 0.7591 - val_loss: 0.4665 - val_acc: 0.7691

![png](Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_files/Accelerating%20Deep%20Learning%20with%20Multiprocess%20Image%20Augmentation%20in%20Keras_70_1.png)

```python

runs[0]['duration'] / runs[2]['duration']

```

    3.9467718654410233

As we can see, the performance gain is even bigger when streaming from disk. Using 11 processes, we get a 3.95x speedup over the run without multiprocessing. This helps a lot when working with larger-than-memory datasets.