Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/overlordgolddragon/see-rnn

RNN and general weights, gradients, & activations visualization in Keras & TensorFlow
- Host: GitHub
- URL: https://github.com/overlordgolddragon/see-rnn
- Owner: OverLordGoldDragon
- License: mit
- Created: 2019-11-24T10:37:11.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-02-13T23:13:28.000Z (10 months ago)
- Last Synced: 2024-12-22T21:09:01.773Z (about 11 hours ago)
- Topics: deep-learning, gru, keras, lstm, rnn, tensorflow, visualization
- Language: Python
- Size: 257 KB
- Stars: 179
- Watchers: 3
- Forks: 21
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# See RNN
[![Build Status](https://travis-ci.com/OverLordGoldDragon/see-rnn.svg?token=dGKzzAxzJjaRLzddNsCd&branch=master)](https://travis-ci.com/OverLordGoldDragon/see-rnn)
[![Coverage Status](https://coveralls.io/repos/github/OverLordGoldDragon/see-rnn/badge.svg?branch=master&service=github&kill_cache=1)](https://coveralls.io/github/OverLordGoldDragon/see-rnn?branch=master)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/e15b1b772c3f4dc9ba7988784a2b9bf6)](https://www.codacy.com/manual/OverLordGoldDragon/see-rnn?utm_source=github.com&utm_medium=referral&utm_content=OverLordGoldDragon/see-rnn&utm_campaign=Badge_Grade)
[![PyPI version](https://badge.fury.io/py/see-rnn.svg)](https://badge.fury.io/py/see-rnn)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5080359.svg)](https://doi.org/10.5281/zenodo.5080359)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
![](https://img.shields.io/badge/keras-tensorflow-blue.svg)
![](https://img.shields.io/badge/keras-tf.keras-blue.svg)
![](https://img.shields.io/badge/keras-tf.keras/eager-blue.svg)
![](https://img.shields.io/badge/keras-tf.keras/2.0-blue.svg)

RNN weights, gradients, & activations visualization in Keras & TensorFlow (LSTM, GRU, SimpleRNN, CuDNN, & all others)
## Features
- **Weights, gradients, activations** visualization
- **Kernel visuals**: kernel, recurrent kernel, and bias shown explicitly
- **Gate visuals**: gates in gated architectures (LSTM, GRU) shown explicitly
- **Channel visuals**: cell units (feature extractors) shown explicitly
- **General visuals**: methods also applicable to CNNs & others
- **Weight norm tracking**: useful for analyzing weight decay

## Why use?
Introspection is a powerful tool for debugging, regularizing, and understanding neural networks; this repo's methods enable:
- Monitoring **weights & activations progression** - how each changes epoch-to-epoch, iteration-to-iteration
- Evaluating **learning effectiveness** - how well gradient backpropagates layer-to-layer, timestep-to-timestep
- Assessing **layer health** - what percentage of neurons are "dead" or "exploding"
- Tracking **weight decay** - how various schemes (e.g. l2 penalty) affect weight norms

It enables answering questions such as:
- Is my RNN learning **long-term dependencies**? >> Monitor gradients: if a non-zero gradient flows through every timestep, then _every timestep contributes to learning_ - i.e., resultant gradients stem from accounting for every input timestep, so the _entire sequence influences weight updates_. Hence, an RNN _no longer ignores portions of long sequences_, and is forced to _learn from them_
- Is my RNN learning **independent representations**? >> Monitor activations: if each channel's outputs are distinct and decorrelated, then the RNN extracts richly diverse features.
- Why do I have **validation loss spikes**? >> Monitor all: val. spikes may stem from sharp changes in layer weights due to large gradients, which will visibly alter activation patterns; seeing the details can help inform a correction
- Is my **weight decay excessive** or insufficient? >> Monitor weight norms: if norms drop to a small fraction of their usual values, decay may be excessive; if no effect is seen, increase decay

For further info on potential uses, see [this SO](https://stackoverflow.com/questions/48714407/rnn-regularization-which-component-to-regularize/58868383#58868383). A minimal gradient-monitoring sketch follows.
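The sketch below assumes the `model`, `x`, and `y` of the Usage example further down; `check_gradient_flow` is an illustrative helper, not part of see-rnn:

```python
from see_rnn import get_gradients, features_1D

def check_gradient_flow(model, x, y, layer_idx=1):
    # Per-timestep gradients of a return_sequences=True layer; near-zero
    # values at early (leftmost) timesteps suggest weak long-term
    # dependency learning (see EX 2-4 below)
    grads = get_gradients(model, layer_idx, x, y)
    features_1D(grads, n_rows=2, title_mode='grads')

# e.g. inside a training loop, every 50 iterations:
# if i % 50 == 0:
#     check_gradient_flow(model, x, y)
```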
## Installation
`pip install see-rnn`. Or, for the latest version (most likely stable):
`pip install git+https://github.com/OverLordGoldDragon/see-rnn`
## To-do
Will possibly implement:
- [x] Weight norm inspection (all layers); see [here](https://stackoverflow.com/q/61481921/10133797)
- [ ] Pytorch support
- [ ] Interpretability visuals (e.g. saliency maps, adversarial attacks)
- [ ] Tools for better probing backprop of `return_sequences=False`
- [ ] Unify `_id` and `layer`? Need duplicates resolution scheme

## Examples
```python
# for all examples
grads = get_gradients(model, 1, x, y) # return_sequences=True, layer index 1
grads = get_gradients(model, 2, x, y) # return_sequences=False, layer index 2
outs = get_outputs(model, 1, x) # return_sequences=True, layer index 1
# all examples use timesteps=100
# NOTE: `title_mode` kwarg below was omitted for simplicity; for Gradient visuals, would set to 'grads'
```
**EX 1: bi-LSTM, 32 units** - activations, `activation='relu'`
`features_1D(outs[:1], share_xy=False)`
`features_1D(outs[:1], share_xy=True, y_zero=True)`

- Each subplot is an independent RNN channel's output (`return_sequences=True`)
- In this example, each channel/filter appears to extract complex independent features of varying bias, frequency, and probabilistic distribution
- Note that `share_xy=False` better pronounces features' _shape_, whereas `=True` allows for an even comparison - but may greatly 'shrink' waveforms to appear flatlined (not shown here)
**EX 2: one sample, uni-LSTM, 6 units** - gradients, `return_sequences=True`, trained for 20 iterations
`features_1D(grads[:1], n_rows=2)`

- _Note_: gradients are to be read _right-to-left_, as they're computed (from last timestep to first)
- Rightmost (latest) timesteps consistently have a higher gradient
- **Vanishing gradient**: ~75% of leftmost timesteps have a zero gradient, indicating poor time-dependency learning

[![enter image description here][1]][1]
**EX 3: all (16) samples, uni-LSTM, 6 units** -- `return_sequences=True`, trained for 20 iterations
`features_1D(grads, n_rows=2)`
`features_2D(grads, n_rows=4, norm=(-.01, .01))`

- Each sample shown in a different color (but same color per sample across channels)
- Some samples perform better than the one shown above, but not by much
- The heatmap plots channels (y-axis) vs. timesteps (x-axis); blue = -0.01, red = 0.01, white = 0 (gradient values)

[![enter image description here][2]][2]
[![enter image description here][3]][3]
**EX 4: all (16) samples, uni-LSTM, 6 units** -- `return_sequences=True`, trained for 200 iterations
`features_1D(grads, n_rows=2)`
`features_2D(grads, n_rows=4, norm=(-.01, .01))`

- Both plots show the LSTM performing clearly better after 180 additional iterations
- Gradient still vanishes for about half the timesteps
- All LSTM units better capture time dependencies of one particular sample (blue curve, first plot) - which we can tell from the heatmap to be the first sample. We can plot that sample vs. other samples to try to understand the difference

[![enter image description here][4]][4]
[![enter image description here][5]][5]
**EX 5: 2D vs. 1D, uni-LSTM**: 256 units, `return_sequences=True`, trained for 200 iterations
`features_1D(grads[0, :, :])`
`features_2D(grads[:, :, 0], norm=(-.0001, .0001))`

- 2D is better suited for comparing many channels across few samples
- 1D is better suited for comparing many samples across a few channels

[![enter image description here][6]][6]
**EX 6: bi-GRU, 256 units (512 total)** -- `return_sequences=True`, trained for 400 iterations
`features_2D(grads[0], norm=(-.0001, .0001), reflect_half=True)`

- Backward layer's gradients are flipped for consistency w.r.t. time axis
- Plot reveals a lesser-known advantage of Bi-RNNs - _information utility_: the collective gradient covers about twice the data. _However_, this isn't a free lunch: each layer is an independent feature extractor, so learning isn't really complemented
- A lower `norm` is expected with more units, as approximately the same loss-derived gradient is distributed across more parameters (hence the squared numeric average is smaller)
**EX 7: 0D, all (16) samples, uni-LSTM, 6 units** -- `return_sequences=False`, trained for 200 iterations
`features_0D(grads)`

- `return_sequences=False` utilizes only the last timestep's gradient (which is still derived from all timesteps, unless using truncated BPTT), requiring a new approach
- Plot color-codes each RNN unit consistently across samples for comparison (can use one color instead)
- Evaluating gradient flow is less direct and more theoretically involved. One simple approach is to compare distributions at the beginning vs. later in training: if the difference isn't significant, the RNN is doing poorly at learning long-term dependencies (a comparison is sketched below)
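A minimal sketch of that comparison, using `get_gradients` as elsewhere in this README; the `summarize` helper and its statistics are illustrative, not part of see-rnn:

```python
import numpy as np
from see_rnn import get_gradients

def summarize(grads):
    # Spread of the last-timestep gradients (return_sequences=False layer),
    # summarized so early vs. late training can be compared
    g = np.abs(grads)
    return {'mean': g.mean(), 'std': g.std(), 'max': g.max()}

# early = summarize(get_gradients(model, 2, x, y))  # shortly after training starts
# ...train further...
# late  = summarize(get_gradients(model, 2, x, y))  # later in training
# If `late` barely differs from `early`, long-term dependency learning is likely poor
```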
**EX 8: LSTM vs. GRU vs. SimpleRNN, unidir, 256 units** -- `return_sequences=True`, trained for 250 iterations
`features_2D(grads, n_rows=8, norm=(-.0001, .0001), show_xy_ticks=[0,0], title_mode=False)`

- _Note_: the comparison isn't very meaningful; each network thrives with different hyperparameters, whereas the same ones were used for all. LSTM, for one, bears the most parameters per unit, drowning out SimpleRNN
- In this setup, LSTM definitively stomps GRU and SimpleRNN

[![enter image description here][7]][7]
**EX 9: uni-LSTM, 256 units, weights** -- `batch_shape = (16, 100, 20)` (input)
`rnn_histogram(model, 'lstm', equate_axes=False, bias=False)`
`rnn_histogram(model, 'lstm', equate_axes=True, bias=False)`
`rnn_heatmap(model, 'lstm')`

- Top plot is a histogram subplot grid, showing weight distributions per kernel, and within each kernel, per gate
- Second plot sets `equate_axes=True` for an even comparison across kernels and gates, improving quality of comparison, but potentially degrading visual appeal
- Last plot is a heatmap of the same weights, with gate separations marked by vertical lines, and bias weights also included
- Unlike histograms, the heatmap _preserves channel/context information_: input-to-hidden and hidden-to-hidden transforming matrices can be clearly distinguished
- Note the large concentration of maximal values at the Forget gate; as trivia, in Keras (and usually), gate biases are all initialized to zeros, except the Forget gate bias, which is initialized to ones (a quick check is sketched below)
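This can be checked directly on a freshly built model; a minimal sketch, assuming the standard Keras LSTM weight layout (`[kernel, recurrent_kernel, bias]`, with the bias concatenated per gate as `[i, f, c, o]`):

```python
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

units = 6
ipt = Input(batch_shape=(16, 100, 2 * units))
out = LSTM(units, return_sequences=True)(ipt)
model = Model(ipt, out)

bias = model.layers[1].get_weights()[2]   # shape (4 * units,)
i_b, f_b, c_b, o_b = np.split(bias, 4)    # per-gate bias segments
print(f_b)             # Forget-gate bias: ones (unit_forget_bias=True by default)
print(i_b, c_b, o_b)   # remaining gate biases: zeros by default
```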
**EX 10: bi-CuDNNLSTM, 256 units, weights** -- `batch_shape = (16, 100, 16)` (input)
`rnn_histogram(model, 'bidir', equate_axes=2)`
`rnn_heatmap(model, 'bidir', norm=(-.8, .8))`

- Bidirectional is supported by both; biases included in this example for histograms
- Note again the bias heatmaps; they no longer appear to reside in the same locality as in EX 9. Indeed, `CuDNNLSTM` (and `CuDNNGRU`) biases are defined and initialized differently - something that can't be inferred from histograms
**EX 11: uni-CuDNNGRU, 64 units, weights gradients** -- `batch_shape = (16, 100, 16)` (input)
`rnn_heatmap(model, 'gru', mode='grads', input_data=x, labels=y, cmap=None, absolute_value=True)`

- We may wish to visualize _gradient intensity_, which can be done via `absolute_value=True` and a greyscale colormap
- Gate separations are apparent even without explicit separating lines in this example:
- `New` is the most active kernel gate (input-to-hidden), suggesting more error correction on _permitting information flow_
- `Reset` is the least active recurrent gate (hidden-to-hidden), suggesting least error correction on memory-keeping
**EX 12: NaN detection: LSTM, 512 units, weights** -- `batch_shape = (16, 100, 16)` (input)
- Both the heatmap and the histogram come with built-in NaN detection - kernel-, gate-, and direction-wise
- Heatmap will print NaNs to console, whereas histogram will mark them directly on the plot
- Both will set NaN values to zero before plotting; in the example below, all related non-NaN weights were already zero (a test sketch follows)
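A minimal way to exercise this, assuming a single-LSTM model like the one in EX 9 with the LSTM at `model.layers[1]`; injecting NaNs by hand is purely for illustration:

```python
import numpy as np
from see_rnn import rnn_histogram, rnn_heatmap

# Inject a few NaNs into the LSTM kernel to trigger the built-in detection
lstm = model.layers[1]
weights = lstm.get_weights()
weights[0][:8, :2] = np.nan   # kernel rows 0-7, first 2 units
lstm.set_weights(weights)

rnn_histogram(model, 'lstm')  # marks NaNs directly on the plot
rnn_heatmap(model, 'lstm')    # prints NaN info to console
```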
**EX 13: Sparse Conv1D autoencoder weights** -- `w = layer.get_weights()[0]; w.shape == (16, 64, 128)`
`features_2D(w, n_rows=16, norm=(-.1, .1), tight=True, borderwidth=1, title_mode=title)`
`# title = "((Layer Channels vs. Kernels) vs. Weights) vs. Input Channels -- norm = (-0.1, 0.1)"`

- One of several stacked `Conv1D` sparse autoencoder layers; the network was trained with `Dropout(0.5, noise_shape=(batch_size, 1, channels))` (Spatial Dropout), encouraging sparse features which may benefit classification (a configuration sketch follows below)
- Weights are seen to be 'sparse'; some are uniformly low, others uniformly large, others have bands of large weights among lows
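A minimal sketch of that Spatial-Dropout configuration; layer sizes, kernel size, and structure are illustrative, chosen to match the `(16, 64, 128)` weight shape above, not taken from the repo:

```python
from keras.layers import Input, Conv1D, Dropout
from keras.models import Model

batch_size, timesteps, channels = 16, 100, 64

ipt = Input(batch_shape=(batch_size, timesteps, channels))
x   = Conv1D(128, kernel_size=16, padding='same', activation='relu')(ipt)
# noise_shape=(batch_size, 1, 128): one mask per (sample, channel), shared
# across all timesteps - whole channels are dropped (Spatial Dropout)
x   = Dropout(0.5, noise_shape=(batch_size, 1, 128))(x)
out = Conv1D(channels, kernel_size=16, padding='same')(x)

autoencoder = Model(ipt, out)
autoencoder.compile('adam', 'mse')

w = autoencoder.layers[1].get_weights()[0]  # shape (16, 64, 128), as in EX 13
```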
## Usage
**QUICKSTART**: run [sandbox.py](https://github.com/OverLordGoldDragon/see-rnn/blob/master/sandbox.py), which includes all major examples and allows easy exploration of various plot configs.
_Note_: if using `tensorflow.keras` imports, set `import os; os.environ["TF_KERAS"]='1'`. Minimal example below.
[visuals_gen.py](https://github.com/OverLordGoldDragon/see-rnn/blob/master/see_rnn/visuals_gen.py) functions can also be used to visualize `Conv1D` activations, gradients, or any other meaningfully-compatible data formats. Likewise, [inspect_gen.py](https://github.com/OverLordGoldDragon/see-rnn/blob/master/see_rnn/inspect_gen.py) also works for non-RNN layers.
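For the `tensorflow.keras` path, the flag should presumably be set before `see_rnn` is imported; a sketch mirroring the imports of the example below:

```python
import os
os.environ["TF_KERAS"] = '1'  # set before importing see_rnn when using tf.keras

from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from see_rnn import get_gradients, features_0D, features_1D, features_2D
# ...the rest of the minimal example is unchanged
```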
```python
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model
from keras.optimizers import Adam
from see_rnn import get_gradients, features_0D, features_1D, features_2D


def make_model(rnn_layer, batch_shape, units):
    ipt = Input(batch_shape=batch_shape)
    x   = rnn_layer(units, activation='tanh', return_sequences=True)(ipt)
    out = rnn_layer(units, activation='tanh', return_sequences=False)(x)
    model = Model(ipt, out)
    model.compile(Adam(4e-3), 'mse')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), \
           np.random.uniform(-1, 1, (batch_shape[0], units))

def train_model(model, iterations, batch_shape):
    x, y = make_data(batch_shape)
    for i in range(iterations):
        model.train_on_batch(x, y)
        print(end='.')  # progbar
        if i % 40 == 0:
            x, y = make_data(batch_shape)

units = 6
batch_shape = (16, 100, 2*units)

model = make_model(LSTM, batch_shape, units)
train_model(model, 300, batch_shape)

x, y = make_data(batch_shape)
grads_all  = get_gradients(model, 1, x, y)  # return_sequences=True,  layer index 1
grads_last = get_gradients(model, 2, x, y)  # return_sequences=False, layer index 2

features_1D(grads_all, n_rows=2, show_xy_ticks=[1,1])
features_2D(grads_all, n_rows=8, show_xy_ticks=[1,1], norm=(-.01, .01))
features_0D(grads_last)
```

## How to cite
Short form:
> John Muradeli, see-rnn, 2019. GitHub repository, https://github.com/OverLordGoldDragon/see-rnn/. DOI: 10.5281/zenodo.5080359
BibTeX:
```bibtex
@article{OverLordGoldDragon2019see-rnn,
title={See RNN},
author={John Muradeli},
journal={GitHub. Note: https://github.com/OverLordGoldDragon/see-rnn/},
year={2019},
doi={10.5281/zenodo.5080359},
}
```

[1]: https://i.stack.imgur.com/PVoU0.png
[2]: https://i.stack.imgur.com/OaX6I.png
[3]: https://i.stack.imgur.com/RW24R.png
[4]: https://i.stack.imgur.com/SUIN3.png
[5]: https://i.stack.imgur.com/nsNR1.png
[6]: https://i.stack.imgur.com/Ci2AP.png
[7]: https://i.stack.imgur.com/vWgc8.png