{"id":23910056,"url":"https://github.com/ragulpr/taildropout","last_synced_at":"2025-02-23T16:46:31.214Z","repository":{"id":270931815,"uuid":"126116515","full_name":"ragulpr/taildropout","owner":"ragulpr","description":"Improving neural networks by enforcing co-adaptation of feature detectors","archived":false,"fork":false,"pushed_at":"2025-02-19T22:45:31.000Z","size":2205,"stargazers_count":0,"open_issues_count":8,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-02-19T23:22:43.810Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ragulpr.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-21T03:14:25.000Z","updated_at":"2025-02-03T08:11:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"6f5d8186-89ab-453d-b3a0-bb70cc5d62df","html_url":"https://github.com/ragulpr/taildropout","commit_stats":null,"previous_names":["ragulpr/taildropout"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragulpr%2Ftaildropout","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragulpr%2Ftaildropout/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragulpr%2Ftaildropout/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ragulpr%2Ftaildropout/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ragulpr","download_url":"https://co
deload.github.com/ragulpr/taildropout/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240347782,"owners_count":19787231,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-05T06:39:34.843Z","updated_at":"2025-02-23T16:46:31.209Z","avatar_url":"https://github.com/ragulpr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TailDropout ![example workflow](https://github.com/ragulpr/taildropout/actions/workflows/tests.yml/badge.svg)\n**\"Improving neural networks by *enforcing* co-adaptation of feature detectors\"**\n\n#### Compression as an Optimization Problem\n\nImagine starting from an arbitrary layer of a neural network with input vector $h$ of dimension $n$:\n\n$$\ny = NN(h) \n$$\n\nTo frame \"compression\" as an optimization problem, we could pose it as \n\n\u003e *\"Hit the target as closely as possible using* either $k=1,2,\\dots$ or all $n$ *features\"*\n\nI.e., learning a representation that is incrementally better the more features you add. 
Let's describe this as explicitly minimizing the sum of the $n$ losses:\n\n$$\n\\text{loss} = \\sum_{k=1}^n \\left\\| y - NN\\left(h \\odot \\mathbf{\\vec{1}}_{k}\\right) \\right\\|\n$$\n\nwhere $\\mathbf{\\vec{1}}_k$ is a binary mask zeroing out the vector \"tail\" after the $k$-th feature (shown here for $k=3$):\n\n$$\n\\mathbf{\\vec{1}}_k = \n\\begin{pmatrix}\n1 \u0026 1 \u0026 1 \u0026 0 \u0026 0 \u0026 \\cdots \u0026 0\n\\end{pmatrix}^T\n$$\n\nThis would be a lot of forward passes (one per feature), so what if we instead randomly sample $k$ with probability $p_k$:\n\n$$\n\\underline{\\overline{k}} \\sim  \\left\\\\{1,2,\\dots,n \\right\\\\}\n$$\n\nDoing so, we see that in expectation (i.e., with a large batch size) we approximate a $p_k$-weighted version of the original objective:\n\n$$\n\\mathbb{E}[\\text{loss}] = \\mathbb{E}\\left[\\left\\| y - NN\\left(h \\odot \\mathbf{\\vec{1}}_{\\underline{\\overline{k}}}\\right) \\right\\|\\right]\n$$\n\n$$\n = \\sum_{k=1}^n p_k \\left\\| y - NN\\left(h \\odot \\mathbf{\\vec{1}}_{k}\\right) \\right\\|\n$$\n\n\n**And that's all there is to it!** \n\nI'll add details on how we sample $k$, but the gist is that we sample from a truncated (censored) exponential distribution, [which I enjoy](https://github.com/ragulpr/wtte-rnn) doing and which is where this idea started.\n\n## Usage\nTailDropout is an `nn.Module` with the same API as `nn.Dropout`, applied to a tensor `x`: \n```python\nimport torch\nfrom taildropout import TailDropout\n\nx = torch.randn(32, 10)  # example input: (batch, features)\ndropout = TailDropout(p=0.5, batch_dim=0, dropout_dim=-1)\ny = dropout(x)\n```\nAt training time, it keeps the first `k` features for a randomly sampled `k`. As expected, this makes a layer learn features of additive importance, like PCA. \n\nSee [example.ipynb](example.ipynb) for complete examples.\n\u003c!-- If we apply it to a linear network we actually learn PCA [TODO LINK](). 
--\u003e\n\nTo use it for pruning or for estimating the optimal size of a hidden dimension, compute the loss as a function of the number of features used and create a [scree plot](https://en.wikipedia.org/wiki/Scree_plot):\n\n```python\nimport matplotlib.pyplot as plt\n\nlosses = []\nfor k in range(n_features):\n  model.dropout.set_k(k)  # use only the first k features\n  # .item() turns the scalar loss tensor into a plain float for plotting\n  losses.append(criterion(y, model(x)).item())\n\nplt.plot(range(n_features), losses)\nplt.title(\"Loss vs n_features used\")\n```\n\nI'm happy to release this since I've found it very useful over the years. I've used it for:\n* Estimating the optimal number of features per layer\n* Regularization, in place of dropout\n* Choosing a model size (after training to overfit!) that generalizes\n* Fiddling with neural networks (*\"mechanistic interpretability\"*)\n\nThe implementation is faster than `nn.Dropout` and supports multi-GPU training and `torch.compile()`.\n\n## Matrix multiplication 101\nAt each layer, a scalar input *feature* `x[j]` of a feature vector `x` decides how far to map the input in the direction `W[:,j]` of the layer output space. This is done by `W[:,j]*x[j]`:\n\n![](./_figs/taildropout.gif)\n### TailDropout: While training, randomly sample k\nFor each `k`, teach the **first k** directions to map input to target as well as possible.\n![](./_figs/taildropout_random.gif)\n\nEach successive direction has a lower probability of being used.\n\n### Compare to regular dropout\nTeach each of the $2^n$ **subsets of directions** to map input to targets as well as possible.\n![](./_figs/dropout.gif)\n\nEach direction in `W` has the same inclusion probability, but there are $2^n$ combinations to learn.\n\nRegular dropout [scales](https://pytorch.org/docs/stable/_modules/torch/nn/modules/dropout.html#Dropout) input by $\\frac{1}{1-p}$ in `.train()` mode, meaning that with $p=0.5$ we could train on an output magnitude of e.g. $[0,2]$ but do inference on e.g. $[0,1]$ - a cause of much confusion and bugs. 
TailDropout does not scale differently between train / test.\n\n### Comparison to PCA\nIf `W` is a weight matrix, then the SVD compression (same as PCA) is\n```\nU, s, V = SVD(W)\nassert W == U @ diag(s) @ V\n```\n\u003cdetails\u003e \u003csummary\u003e\u003c/summary\u003e\n\n```python\nimport torch\n\nW = torch.randn([2,10])\n# torch.linalg.svd returns V already transposed (often called Vh)\nU, s, V = torch.linalg.svd(W)\n\n# Embed the singular values in a 2x10 rectangular diagonal matrix\ns = torch.hstack([torch.diag(s), torch.zeros(2, 8)])\n\ntorch.testing.assert_close(\n    W,\n    U @ s @ V\n)\n```\n\n\u003c/details\u003e\n\nHere `s` holds the singular values of `W`. To use the first `k` factors/components/singular vectors to represent `W`, set `s[k:]=0`\n\n![](./_figs/svd.gif)\n\nNote that SVD compresses `W` optimally w.r.t. the **Euclidean** (L2) norm for every `k`:\n\u003c!-- ```\n||W - U[:,:k] diag(s[:k]) V[:,:k]'||\n``` --\u003e\n\n$$\nU, s, V = \\arg\\min_{U, s, V} \\left\\| Wx - U \\, \\text{diag}\\left(s \\odot \\mathbf{\\vec{1}}_{k}\\right) V'x \\right\\|\n$$\n\n*but you want to compress each layer w.r.t. the final loss function, with lots of non-linearities in between!*\n\n### Example AutoEncoder; Sequential compression.\nWhen using TailDropout on the embedding layer, `k` controls the compression rate:\n\n![TailDropout](./_figs/ae-taildropout.gif)\n\nHere, even with `k=1`, the resulting scalar (1-d) embedding apparently separates shoes and shirts. \n\nCompare this to how regular dropout works: it's quite a bit more random.\n![Regular dropout](./_figs/ae-dropout.gif)\n\n\n## Details\n#### Training vs Inference\n```python\ndropout = TailDropout()\ndropout.train()\ndropout(x) # random mask\ndropout.eval()\ndropout(x) # identity function\ndropout.set_k(k)\ndropout(x) # use first k features\n```\n\u003c!-- \n#### Sequences\n\"Recurrent dropout\" == Keep mask constant over time. 
Popular approach.\n```python\nx = torch.randn(n_timesteps,n_sequences,n_features)\n\ngru = nn.GRU(n_features,n_features)\ntaildropout = TailDropout(batch_dim = 1, dropout_dim = 2)\n\nx, _ = gru(x)\nx = taildropout(x)\n```\nIf you want to have mask vary for each timestep and sequence\n```python\ntaildropout = TailDropout(batch_dim = [0,1], dropout_dim = 2)\n``` --\u003e\n\n#### Images\n\"2d Dropout\" == Keep mask constant over the spatial dimensions. Popular approach.\n```python\nx = torch.randn(n_batch,n_features,n_pixels_x,n_pixels_y)\n\ncnn = nn.Conv2d(n_features,n_features, kernel_size)\ntaildropout = TailDropout(batch_dim = 0, dropout_dim = 1)\n\nx = cnn(x)\nx = taildropout(x)\n```\n\n\u003c!-- #### BatchNorm\nSame as with regular dropout; batchnorm *before* dropout.\n```python\nlayer = nn.Sequential(\n    nn.Linear(n_features,n_features),\n    nn.BatchNorm1d(n_features),\n    nn.ReLU(),\n    TailDropout()\n    )\n``` --\u003e\n\n##### Compression/regularization ratio is very large!\nIf you don't care much about regularization, a dropout probability on the order of 1e-5 still \nseems to give a good compression effect. I typically use `TailDropout(p=0.001)` to get both. \n\n\u003c!-- #### Math\nTODO\nIntuitively, “earlier” features survive more often, while “later” features get zeroed‐out more often.\nIf we want `mask = dropout(x)` to have `mask.mean() == dropout.p`. It's incentivizing the model to learn an ordering of the features by importance.\n\nThink of\n```\nF(k) = probability k\u003c= F\n``` --\u003e\n\n#### Citation\n```\n@misc{Martinsson2018,\n  author = {Egil Martinsson},\n  title = {TailDropout},\n  year = {2018},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/naver/taildropout}},\n  commit = {master}\n}\n```\n\n#### Acknowledgments\nThis work was open-sourced in 2025, but the work was primarily done in 2018 at [Naver Clova/Clair](https://research.clova.ai/). 
Big thanks to [Minjoon Seo](https://seominjoon.github.io/) for the original inspiration from his work on [Skim-RNN](https://arxiv.org/abs/1711.02085), and to [Ji-Hoon Kim](https://scholar.google.co.kr/citations?user=1KdhN5QAAAAJ\u0026hl=ko), [Adrian Kim](https://scholar.google.co.kr/citations?user=l6lDgpgAAAAJ\u0026hl=ko), [Jaesung Huh](https://scholar.google.com/citations?user=VDMZ-pQAAAAJ\u0026hl=en), [Prof. Jung-Woo Ha](https://scholar.google.com/citations?user=eGj3ay4AAAAJ\u0026hl=en) and [Prof. Sung Kim](https://scholar.google.com/citations?user=JE_m2UgAAAAJ\u0026hl=en) for valuable discussions and feedback.\n\nI'm sure this simple idea has been implemented before 2018 (which I was unaware of at the time) or after (which I have not had time to look for). Please let me know if there's anything relevant I should cite.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fragulpr%2Ftaildropout","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fragulpr%2Ftaildropout","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fragulpr%2Ftaildropout/lists"}