https://github.com/brookisme/dfgen
Keras Image Generator from Dataframes
https://github.com/brookisme/dfgen
deeplearning keras machine-learning tensorflow theano
Last synced: 2 months ago
JSON representation
Keras Image Generator from Dataframes
- Host: GitHub
- URL: https://github.com/brookisme/dfgen
- Owner: brookisme
- Created: 2017-07-02T23:24:02.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2017-07-18T16:03:53.000Z (almost 9 years ago)
- Last Synced: 2025-08-07T18:38:51.228Z (10 months ago)
- Topics: deeplearning, keras, machine-learning, tensorflow, theano
- Language: Python
- Size: 68.4 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## DFGen
**Keras Image Generator from Dataframes**
Creates generator from csv or dataframe.
Optional Features:
1. convert "tag" list to binary valued label vector for predictions
2. save to train/test split files
3. easy configuration with [yaml](#yaml) file
##### INSTALL
```bash
cd ~/
git clone https://github.com/brookisme/dfgen.git
cd dfgen
pip install -e .
```
---
##### USAGE
In the examples below we have used the `dfg_config.yaml` file located [here](https://github.com/brookisme/dfgen/blob/master/example.dfg_config.yaml).
- [Init|Train|Test](#traintest)
- [Reduce Columns](#reduce_columns)
- [DFGen.require_label](#require_label)
- [Generator and Lambda](#lambda)
---
###### save (processed) data to train and test csvs
```bash
# bash
$ head data.csv
image_name,tags
train_0,haze primary
train_1,agriculture clear primary water
train_2,clear primary
# python
>>> from dfgen import DFGen
>>> gen=DFGen(csv_file='data.csv',csv_sep=',')
>>> gen.dataframe.sample(2)
image_name tags \
7901 train_7901 clear primary water
38214 train_38214 clear primary
labels \
7901 [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
38214 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
paths
7901 images/tif/train_7901.tif
38214 images/tif/train_38214.tif
# save as train/test split
>>> gen.save(path='train.csv',split_path='test.csv')
# or save the processed data (with labels, paths, require's)
>>> gen.save(path='processed_data.csv')
# if you want small datasets for developement you can use limit
>>> gen=DFGen(csv_file='train.csv',csv_sep=',')
>>> gen.size
40479
>>> gen.limit(400)
>>> gen.size
400
>>> gen.save(path='dev_train.csv',split_path='dev_test.csv',split=100)
# side note: dfg_config file specifies tif but we could have loaded JPGs
>>> gen=DFGen(csv_file='data.csv',csv_sep=',',image_ext='jpg')
>>> gen.dataframe.paths.sample(2)
21628 images/jpg/train_21628.jpg
7955 images/jpg/train_7955.jpg
Name: paths, dtype: object
```
###### load data to train and test generators
```bash
# bash (note we have the label and path columns)
$ head train.csv
image_name,tags,labels,paths
train_12485,agriculture clear primary,"[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",images/tif/train_12485.tif
train_3535,clear cultivation primary,"[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",images/tif/train_3535.tif
train_4857,agriculture cultivation habitation partly_cloudy primary road,"[1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]",images/tif/train_4857.tif
# python
>>> train_gen=DFGen(csv_file='train.csv',csv_sep=',')
>>> test_gen=DFGen(csv_file='test.csv',csv_sep=',')
>>> train_gen.size/gen.size
0.8000197633340744
>>> test_gen.size/gen.size
0.19998023666592554
```
---
```bash
>>> gen=DFGen(csv_file='train.csv',csv_sep=',',image_ext='tif')
>>> gen.size
40479
>>> gen.tags
['primary', 'clear', 'agriculture', 'road', 'water', 'partly_cloudy', 'cultivation', 'habitation', 'haze', 'cloudy', 'bare_ground', 'selective_logging', 'artisinal_mine', 'blooming', 'slash_burn', 'conventional_mine', 'blow_down']
>>> gen.dataframe_with_tags('blow_down','cultivation').size
32
>>> gen.reduce_columns('blow_down','cultivation')
>>> gen.tags
['blow_down', 'cultivation', 'others']
>>> gen.dataframe.sample(2)
image_name tags labels \
6550 train_6550 clear primary [0, 0, 1]
30966 train_30966 agriculture clear primary road water [0, 0, 1]
paths
6550 images/tif/train_6550.tif
30966 images/tif/train_30966.tif
>>> gen.reduce_columns('blow_down','cultivation',others=False)
>>> gen.tags
['blow_down', 'cultivation']
>>> gen.dataframe.sample(2)
image_name tags labels paths
31901 train_31901 partly_cloudy primary [0, 1] images/tif/train_31901.tif
14158 train_14158 partly_cloudy primary [0, 1] images/tif/train_14158.tif
```
---
###### using require_label to reduce dataset
```bash
>>> from dfgen import DFGen
>>> gen=DFGen(csv_file='data.csv',csv_sep=',')
>>> gen.size
40479
>>> gen.dataframe.head(2)
image_name tags \
16452 train_16452 agriculture clear habitation primary road
20043 train_20043 clear primary
labels \
16452 [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...
20043 [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
paths
16452 images/tif/train_16452.tif
20043 images/tif/train_20043.tif
#
# REQUIRE_LABEL:
#
>>> gen.require_label('blow_down',70)
>>> gen.size
140
>>> gen.tags
['primary', 'clear', 'agriculture', 'road', 'water', 'partly_cloudy', 'cultivation', 'habitation', 'haze', 'cloudy', 'bare_ground', 'selective_logging', 'artisinal_mine', 'blooming', 'slash_burn', 'conventional_mine', 'blow_down']
>>> gen.dataframe.sample(2)
image_name tags \
55 train_23025 blow_down clear cultivation habitation primary
101 train_20618 clear cultivation primary
labels \
55 [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...
101 [1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
paths
55 images/tif/train_23025.tif
101 images/tif/train_20618.tif
#
# REQUIRE_LABEL: reduce_to_others=True
# - this is the same as:
# gen.require_label('blow_down',70)
# gen.reduce_columns('blow_down')
#
>>> gen.require_label('blow_down',70,reduce_to_others=True)
>>> gen.size
140
>>> gen.tags
['blow_down', 'others']
>>> gen.dataframe.sample(2)
image_name tags labels \
12 train_38607 agriculture blow_down partly_cloudy primary [1, 1]
24 train_31495 blow_down clear primary blow_down [1, 1]
paths
12 images/tif/train_38607.tif
109 images/tif/train_10679.tif
#
# COMBINING REQUIRE LABELs
#
>>> from dfgen import DFGen
>>> gen=DFGen(csv_file='data.csv',csv_sep=',',image_ext='tif')
>>> gen.size
183
# You can fetch the rows with specific tags
>>> gen.dataframe_with_tags('blow_down','cultivation').size
32
>>> gen.dataframe_with_tags('blow_down','cultivation').head(2)
image_name tags \
25950 train_25950 agriculture blooming blow_down clear cultivati...
9961 train_9961 agriculture blow_down clear cultivation primary
labels \
25950 [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
9961 [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...
paths
25950 images/tif/train_25950.tif
9961 images/tif/train_9961.tif
# RequireLabel and check percentages
>>> gen.require_label('blow_down',10)
>>> gen.dataframe_with_tags('blow_down').shape[0]/gen.size
0.1
>>> gen.require_label('cultivation',60)
>>> gen.dataframe_with_tags('cultivation').shape[0]/gen.size
0.6010928961748634
# NOTE: The second require label effects the first.
# We no longer have exactly 10% blow_down.
>>> gen.dataframe_with_tags('blow_down').shape[0]/gen.size
0.07650273224043716
```
---
###### generator and lambda
```bash
>>> from dfgen import DFGen
>>> gen=DFGen(csv_file='data.csv',csv_sep=',')
# returns first batch tuple (images,labels)
>>> batch=next(gen)
# so batch[0][0] is the np.array for the first image in the batch
# in this case the image has 4 bands: [blue, green, red, nir]
#
# LETS PREPROCESS THE IMAGES
#
def ndvi(img):
r=img[:,:,2]
nir=img[:,:,3]
return (nir-r)/(nir+r)
def ndvi_img(img):
ndvi_band=_ndvi(img)
img[:,:,3]=ndvi_band
return img
>>> gen=DFGen(csv_file='data.csv',csv_sep=',',lambda_func=ndvi_img)
# returns first batch tuple (ndvi-images,labels)
>>> batch=next(gen)
# now batch[0][0] is the np.array for the first image in the
# preprocessed-batch. its the same image as above but it has been
# passed through the 'ndvi_image' method.
# The 4 bands are now: [blue, green, red, ndvi]
```
---
##### COMMENT-DOCS
```
""" CREATES GENERATOR FROM DATAFRAME
create generator from existing dataframe or from a csv
Methods:
.require_label: ensure a min percentage of a particular label
.save: save processed csv to csv or as train/test-split csvs
.__next__: generator method, batchwise return tuple of (images,labels)
Args:
* image_column (column with image path or name) is required
* label_column column with label "vectors" is required
- if the label_column already exists the dataframe will contain the labels
- if the label_column does not exsit and both tags and tags_to_labels_column
are specified the tags will be converted to binary valued vectors
* tags: optional list of tags in corresponding to places in the label vectors
* tags_to_labels_column: name of a column that contain a space seperated
string of tags. these strings will be converted to the binary label vectors
* image_dir: root path for image_paths given in "image_column"
* image_ext:
- append to image_column values when loading images
- if using dfg_config file image_ext can determine image_dir
* lambda_func: function that acts on image data before returned to user
* batch_size: batch_size
"""
...
def require_label(self,label_index_or_tag,pct,exact=False,reduce_to_others=False):
"""
Warning: Ordering matters
.require_label(1,40)
.require_label(2,20)
may not equal:
.require_label(2,20)
.require_label(1,40)
Args:
* label_index_or_tag:
- (label_index): index of the label of interest
- (tag): if "tags": the name of the tag of interest
* pct: percentage required for label
* exact:
if False and there is the label already has >= pct of dataset
return full-dataset
else: remove data so that label is pct of dataset
* reduce to others.
return labels as 2 vectors [label,others]
"""
...
def reduce_columns(self,*indices_or_tags,others=True):
""" Keep passed columns and optional "others"
Usage:
gen.reduce_to_others('blow_down','cultivation')
Args:
* str or int arguments: label indices or tag names
* others:
- if falsey: do not include "others column"
- else:
include "others"
- if others arg is : use others arg as column name
- else: use "others" as column name
"""
...
def limit(self,nb_rows):
""" limit number of rows in dataframe
Use to create dev training sets
"""
...
def dataframe_with_tags(self,*tags):
""" return dataframe rows containing certain tags
Args: strings of tag names
ie. gen.dataframe_with_tags('blow_down','clear')
"""
...
def save(self,path,split_path=None,split=0.2,sep=None):
""" save dataframe to csv(s)
usually save after processing (ie: tags->labels and/or require_label),
so you wont need to process again.
if split_path and split:
- split dataframe into 2 csvs (path and save path)
- if split is int: split = number of lines in split_csv
else: split = % of full dataframe
"""
...
```
---
##### EXAMPLE CONFIG (in directory with .py or ipynb file)
[dfg_config.yaml](https://github.com/brookisme/dfgen/blob/master/example.dfg_config.yaml)
```
# COLUMN NAMES
image_column: image_name
label_column: labels
tags_column: tags
# IMAGE EXT
image_ext: tif
# IMAGE DIR BY EXT
image_dirs:
tif: images/tif
jpg: images/jpg
# BACKUP IMAGE DIR
image_dir: images/other
# TAGS
tags:
- primary
- clear
- agriculture
- road
- water
- partly_cloudy
- cultivation
- habitation
- haze
- cloudy
- bare_ground
- selective_logging
- artisinal_mine
- blooming
- slash_burn
- conventional_mine
- blow_down
```