# TailDropout ![example workflow](https://github.com/ragulpr/taildropout/actions/workflows/tests.yml/badge.svg)
**"Improving neural networks by *enforcing* co-adaptation of feature detectors"**

#### Compression as an Optimization Problem

Imagine starting from an arbitrary layer of a neural network with input vector $h$ of dimension $n$:

$$
y = NN(h)
$$

To set "compression" as an optimization problem we could pose it as

> *"Hit the target as close as possible using* either $k=1,2,\dots$ or all $n$ *features"*

I.e., learning a representation that gets incrementally better the more features you add. Let's describe this as explicitly minimizing the sum of the $n$ losses:

$$
\text{loss} = \sum_k^n \left\| y - NN\left(h \odot \mathbf{\vec{1}}_{k}\right) \right\|
$$

where $\mathbf{\vec{1}}_k$ is a binary mask zeroing out the vector "tail" after the $k$'th feature:

$$
\mathbf{\vec{1}}_k =
\begin{pmatrix}
1 & 1 & 1 & 0 & 0 & \cdots & 0
\end{pmatrix}^T
$$
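
As a toy illustration of the explicit objective (one forward pass per $k$; the tiny linear `nn_fn` below just stands in for $NN$ and is not from the library):

```python
import torch

n = 8
h = torch.randn(n)
y = torch.randn(4)
nn_fn = torch.nn.Linear(n, 4)  # toy stand-in for NN(.)

loss = torch.tensor(0.0)
for k in range(1, n + 1):
    mask = (torch.arange(n) < k).float()           # 1_k: ones up to k, zeros after
    loss = loss + torch.linalg.norm(y - nn_fn(h * mask))
```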

This would require a lot of forward passes (one per feature), so what if we instead randomly sample $k$ with probability $p_k$:

$$
\underline{\overline{k}} \sim \left\\{1,2,\dots,n \right\\}
$$

Doing so, we see that in expectation (i.e., for large batch sizes) we approximate the original objective:

$$
\mathbb{E}[\text{loss}] = \mathbb{E}\left[\left\| y - NN\left(h \odot \mathbf{\vec{1}}_{\underline{\overline{k}}}\right) \right\|\right]
$$

$$
= \sum_k^n p_k \left\| y - NN\left(h \odot \mathbf{\vec{1}}_{k}\right) \right\|
$$

**And that's all there is to it!**
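
A minimal sketch of the sampled objective, with a uniform $p_k$ purely for illustration (again, `nn_fn` is a toy stand-in for $NN$):

```python
import torch

n, n_batch = 8, 32
h = torch.randn(n_batch, n)
y = torch.randn(n_batch, 4)
nn_fn = torch.nn.Linear(n, 4)  # toy stand-in for NN(.)

k = torch.randint(1, n + 1, (n_batch, 1))           # one random cutoff per example
mask = (torch.arange(n).unsqueeze(0) < k).float()   # zero out the tail after k
loss = torch.linalg.norm(y - nn_fn(h * mask), dim=-1).mean()
```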

I'll add details on how we sample $k$, but the gist is that we sample from a truncated (censored) exponential distribution, [which I enjoy doing](https://github.com/ragulpr/wtte-rnn) and which is where this idea started.
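
For what it's worth, here is a hedged sketch of what sampling from a censored exponential could look like; the rate and exact parameterization are illustrative assumptions, not TailDropout's actual scheme:

```python
import torch

n, n_batch, rate = 64, 32, 0.1                            # illustrative values
u = torch.rand(n_batch)
k = torch.ceil(-torch.log(u) / rate).long().clamp(max=n)  # exponential, censored at n
```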

## Usage
TailDropout is a `nn.Module` with the same API as `nn.Dropout`, applied to a tensor `x`:
```python
from taildropout import TailDropout
dropout = TailDropout(p=0.5, batch_dim=0, dropout_dim=-1)
y = dropout(x)
```
At training time, a random `k` is sampled and only the first `k` features are kept. The result is as expected: the layer learns features of additive importance, much like PCA.

See [example.ipynb](example.ipynb) for complete examples.

To use it for pruning or for estimating the optimal hidden dimension, compute the loss as a function of the number of features used and make a [scree plot](https://en.wikipedia.org/wiki/Scree_plot):

```python
import matplotlib.pyplot as plt

losses = []
for k in range(n_features):
    model.dropout.set_k(k)
    losses.append(criterion(y, model(x)).item())

plt.plot(range(n_features), losses)
plt.title("Loss vs n_features used")
```

I'm happy to release this since I've found it very useful over the years. I've used it for
* Estimating the optimal \#features per layer
* In place of dropout for regularization
* To be able to choose a model size (after training to overfit!) that generalizes.
* For fiddling with neural networks. (*"mechanistic interpretability"*)

The implementation is faster than `nn.Dropout`, supports multi-GPU, and works with `torch.compile()`.
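
As a sketch, dropping it into an ordinary model in place of `nn.Dropout` looks like this (layer sizes are illustrative):

```python
import torch.nn as nn
from taildropout import TailDropout

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    TailDropout(),                 # in place of nn.Dropout()
    nn.Linear(256, 10),
)
```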

## Matrix multiplication 101
At each layer, a scalar input *feature* `x[j]` of the feature vector `x` decides how far to move along the direction `W[:,j]` in the layer's output space. This is done by `W[:,j] * x[j]`:

![](./_figs/taildropout.gif)
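
A minimal sketch of this view of matrix multiplication (toy sizes, plain PyTorch):

```python
import torch

n_in, n_out = 4, 3
W = torch.randn(n_out, n_in)
x = torch.randn(n_in)

# The matrix product is a sum over column directions W[:, j], each scaled by x[j].
y = W @ x
y_as_columns = sum(W[:, j] * x[j] for j in range(n_in))
torch.testing.assert_close(y, y_as_columns)
```
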
### TailDropout: While training, randomly sample k
Teach the **first k** directions to map the input to the target as well as possible.
![](./_figs/taildropout_random.gif)

Each successive direction has a lower probability of being included.

### Compare to regular dropout
Teach each of the $2^n$ **subsets of directions** to map the input to the target as well as possible.
![](./_figs/dropout.gif)

Each direction in `W` has the same inclusion probability, but there are $2^n$ combinations to learn.

Regular dropout [scales](https://pytorch.org/docs/stable/_modules/torch/nn/modules/dropout.html#Dropout) the input by $\frac{1}{1-p}$ during training (and is the identity in `.eval()` mode), meaning with $p=0.5$ you might train with output magnitudes in, e.g., $[0,2]$ but do inference on, e.g., $[0,1]$ - a cause of much confusion and bugs. TailDropout does not scale differently between train and test.
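
A quick check of this behavior with plain `nn.Dropout` (the surviving entries depend on the random mask):

```python
import torch
import torch.nn as nn

x = torch.ones(1, 8)
drop = nn.Dropout(p=0.5)

drop.train()
print(drop(x))  # surviving entries are scaled to 1/(1-p) = 2.0, the rest are 0
drop.eval()
print(drop(x))  # identity: all entries stay 1.0
```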

### Comparison to PCA
If `W` is a weight matrix, then SVD-based compression (the same machinery as PCA) starts from the factorization
```
U, s, V = SVD(W)
assert W == U @ s @ V
```

```python
import torch

W = torch.randn([2, 10])
U, s, V = torch.linalg.svd(W)  # note: torch returns V^H (real here, so V transposed)

# Pad the diagonal of singular values to shape (2, 10) so U @ s @ V reconstructs W.
s = torch.hstack([torch.diag(s), torch.zeros(2, 8)])

torch.testing.assert_close(W, U @ s @ V)
```

Here `s` contains the singular values of `W`. To represent `W` using only the first `k` factors/components, set `s[k:] = 0`.

![](./_figs/svd.gif)
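
A minimal sketch of that rank-`k` truncation with plain PyTorch (toy sizes, independent of the snippet above):

```python
import torch

W = torch.randn(2, 10)
U, s, Vh = torch.linalg.svd(W, full_matrices=False)  # U: (2,2), s: (2,), Vh: (2,10)

k = 1
s_trunc = s.clone()
s_trunc[k:] = 0                        # keep only the first k singular values
W_k = U @ torch.diag(s_trunc) @ Vh     # best rank-k approximation of W (in L2)
```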

Note that SVD compresses `W` optimally w.r.t. the **Euclidean** (L2) norm for every `k`:

$$
U, s, V = \arg\min_{U, s, V} \left\| Wx - U \, \text{diag}\left(s \odot \mathbf{\vec{1}}_{k}\right) V'x \right\|
$$

*but you want to compress each layer w.r.t the final loss function and lots of non-linearities in between!*

### Example AutoEncoder; Sequential compression.
When using TailDropout on the embedding layer, `k` controls the compression rate:

![TailDropout](./_figs/ae-taildropout.gif)

Here even with `k=1` the resulting 1d-scalar embedding apparently separates shoes and shirts.
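
A hypothetical sketch of this setup; the architecture and layer sizes below are illustrative, not taken from the example notebook:

```python
import torch.nn as nn
from taildropout import TailDropout

class AutoEncoder(nn.Module):
    def __init__(self, n_in=784, n_embedding=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 128), nn.ReLU(), nn.Linear(128, n_embedding)
        )
        self.dropout = TailDropout()  # applied to the embedding
        self.decoder = nn.Sequential(
            nn.Linear(n_embedding, 128), nn.ReLU(), nn.Linear(128, n_in)
        )

    def forward(self, x):
        z = self.encoder(x)
        z = self.dropout(z)  # at inference, set_k(k) controls the compression rate
        return self.decoder(z)
```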

Compare this to how regular dropout behaves: it's considerably more random.
![Regular dropout](./_figs/ae-dropout.gif)

## Details
#### Training vs Inference
```python
dropout = TailDropout()
dropout.train()
dropout(x) # random
dropout.eval()
dropout(x) # Identity function
dropout.set_k(k)
dropout(x) # use first k features
```

#### Images
"2d Dropout" == Keep mask constant over spatial dimension. Popular approach.
```python
import torch
import torch.nn as nn
from taildropout import TailDropout

# Illustrative sizes
n_batch, n_features, n_pixels_x, n_pixels_y, kernel_size = 8, 16, 28, 28, 3
x = torch.randn(n_batch, n_features, n_pixels_x, n_pixels_y)

cnn = nn.Conv2d(n_features, n_features, kernel_size)
taildropout = TailDropout(batch_dim=0, dropout_dim=1)  # mask constant over pixels

x = cnn(x)
x = taildropout(x)
```

##### Compression/regularization ratio is very large!
If you don't care much about regularization, a dropout probability on the order of 1e-5 still
seems to give a good compression effect. I typically use `TailDropout(p=0.001)` to get both.

#### Citation
```
@misc{Martinsson2018,
  author = {Egil Martinsson},
  title = {TailDropout},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/naver/taildropout}},
  commit = {master}
}
```

#### Acknowledgments
This work was open sourced in 2025, but the work was primarily done in 2018 at [Naver Clova/Clair](https://research.clova.ai/). Big thanks to [Minjoon Seo](https://seominjoon.github.io/) for the original inspiration from his work on [Skim-RNN](https://arxiv.org/abs/1711.02085), and to [Ji-Hoon Kim](https://scholar.google.co.kr/citations?user=1KdhN5QAAAAJ&hl=ko), [Adrian Kim](https://scholar.google.co.kr/citations?user=l6lDgpgAAAAJ&hl=ko), [Jaesung Huh](https://scholar.google.com/citations?user=VDMZ-pQAAAAJ&hl=en), [Prof. Jung-Woo Ha](https://scholar.google.com/citations?user=eGj3ay4AAAAJ&hl=en) and [Prof. Sung Kim](https://scholar.google.com/citations?user=JE_m2UgAAAAJ&hl=en) for valuable discussions and feedback.

I'm sure this simple idea has been implemented before 2018 (which I was unaware of at the time) or after (which I have not had time to look for). Please let me know if there's anything relevant I should cite.