# CRATE (Coding RAte reduction TransformEr)
This repository is the official PyTorch implementation of the papers:

- **White-Box Transformers via Sparse Rate Reduction** [**NeurIPS-2023**, [paper link](https://openreview.net/forum?id=THfl8hdVxH#)]. By [Yaodong Yu](https://yaodongyu.github.io) (UC Berkeley), [Sam Buchanan](https://sdbuchanan.com) (TTIC), [Druv Pai](https://druvpai.github.io) (UC Berkeley), [Tianzhe Chu](https://tianzhechu.com/) (UC Berkeley), [Ziyang Wu](https://robinwu218.github.io/) (UC Berkeley), [Shengbang Tong](https://tsb0601.github.io/petertongsb/) (UC Berkeley), [Benjamin D Haeffele](https://www.cis.jhu.edu/~haeffele/#about) (Johns Hopkins University), and [Yi Ma](http://people.eecs.berkeley.edu/~yima/) (UC Berkeley).
- **Emergence of Segmentation with Minimalistic White-Box Transformers** [**CPAL-2024**, [paper link](https://arxiv.org/abs/2308.16271)].
By [Yaodong Yu](https://yaodongyu.github.io)* (UC Berkeley), [Tianzhe Chu](https://tianzhechu.com/)* (UC Berkeley & ShanghaiTech U), [Shengbang Tong](https://tsb0601.github.io/petertongsb/) (UC Berkeley & NYU), [Ziyang Wu](https://robinwu218.github.io/) (UC Berkeley), [Druv Pai](https://druvpai.github.io) (UC Berkeley), [Sam Buchanan](https://sdbuchanan.com) (TTIC), and [Yi Ma](http://people.eecs.berkeley.edu/~yima/) (UC Berkeley & HKU). 2023. (* equal contribution)
- **Masked Autoencoding via Structured Diffusion with White-Box Transformers** [**ICLR-2024**, [paper link](https://arxiv.org/abs/2404.02446)]. By [Druv Pai](https://druvpai.github.io) (UC Berkeley), [Ziyang Wu](https://robinwu218.github.io/) (UC Berkeley), [Sam Buchanan](https://sdbuchanan.com), [Yaodong Yu](https://yaodongyu.github.io) (UC Berkeley), and [Yi Ma](http://people.eecs.berkeley.edu/~yima/) (UC Berkeley).

We have also released a longer, journal-length overview paper of this line of research, which contains a superset of all the results presented above, along with further results in NLP and vision SSL.
- **White-Box Transformers via Sparse Rate Reduction: Compression is All There Is?** [[paper link](https://arxiv.org/abs/2311.13110)]. By [Yaodong Yu](https://yaodongyu.github.io) (UC Berkeley), [Sam Buchanan](https://sdbuchanan.com) (TTIC), [Druv Pai](https://druvpai.github.io) (UC Berkeley), [Tianzhe Chu](https://tianzhechu.com/) (UC Berkeley), [Ziyang Wu](https://robinwu218.github.io/) (UC Berkeley), [Shengbang Tong](https://tsb0601.github.io/petertongsb/) (UC Berkeley), [Hao Bai](https://www.jackgethome.com/) (UIUC), [Yuexiang Zhai](https://yx-s-z.github.io/) (UC Berkeley), [Benjamin D Haeffele](https://www.cis.jhu.edu/~haeffele/#about) (Johns Hopkins University), and [Yi Ma](http://people.eecs.berkeley.edu/~yima/) (UC Berkeley).

# Table of Contents

* [CRATE (Coding RAte reduction TransformEr)](#crate-coding-rate-reduction-transformer)
    * [Theoretical Background: What is CRATE?](#theoretical-background-what-is-crate)
        * [1. CRATE Architecture overview](#1-crate-architecture-overview)
        * [2. One layer/block of CRATE](#2-one-layerblock-of-crate)
        * [3. Per-layer optimization in CRATE](#3-per-layer-optimization-in-crate)
        * [4. Segmentation visualization of CRATE](#4-segmentation-visualization-of-crate)
    * [Autoencoding](#autoencoding)
* [Implementation and experiments](#implementation-and-experiments)
    * [Constructing a CRATE model](#constructing-a-crate-model)
        * [Pre-trained Checkpoints (ImageNet-1K)](#pre-trained-checkpoints-imagenet-1k)
    * [Training CRATE on ImageNet](#training-crate-on-imagenet)
    * [Finetuning pretrained / training random initialized CRATE on CIFAR10](#finetuning-pretrained--training-random-initialized-crate-on-cifar10)
    * [Demo: Emergent segmentation in CRATE](#demo-emergent-segmentation-in-crate)
    * [Constructing a CRATE autoencoding model](#constructing-a-crate-autoencoding-model)
        * [Pre-trained Checkpoints (ImageNet-1K)](#pre-trained-checkpoints-imagenet-1k-1)
    * [Training/Fine-Tuning CRATE-MAE](#trainingfine-tuning-crate-mae)
    * [Demo: Emergent segmentation in CRATE-MAE](#demo-emergent-segmentation-in-crate-mae)
* [Reference](#reference)

## Theoretical Background: What is CRATE?
CRATE (Coding RAte reduction TransformEr) is a white-box (mathematically interpretable) transformer architecture, where each layer performs a single step of an alternating minimization algorithm to optimize the **sparse rate reduction objective**
<p align="center">
    <img src="figs/fig_objective.png" width="400"/>
</p>

where $R$ and $R^{c}$ are different _coding rates_ for the input representations w.r.t. different codebooks, and the $\ell^{0}$-norm promotes the sparsity of the final token representations $\boldsymbol{Z} = f(\boldsymbol{X})$. The function $f$ is defined as
$$f=f^{L} \circ f^{L-1} \circ \cdots \circ f^{1} \circ f^{\mathrm{pre}},$$
where $f^{\mathrm{pre}}$ is the pre-processing mapping, and $f^{\ell}$ is the $\ell$-th layer forward mapping that transforms the token distribution to optimize the above sparse rate reduction objective incrementally.
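This composition can be sketched in a few lines of Python. The snippet below is a hypothetical toy illustration only: the soft-thresholding layer body is a stand-in for a real incremental update (it merely promotes sparsity), and none of the names (`f_pre`, `make_layer`, `f`) come from this repository.

```python
from functools import reduce

def f_pre(tokens):
    """Stand-in pre-processing map f^pre (identity tokenization)."""
    return list(tokens)

def make_layer(ell, tau=0.1):
    """Build a toy f^ell: one incremental update Z^ell -> Z^{ell+1}.
    Soft-thresholding here only illustrates the sparsity-promoting idea."""
    def f_ell(z):
        return [max(abs(v) - tau, 0.0) * (1.0 if v >= 0 else -1.0) for v in z]
    return f_ell

def f(x, depth=3):
    """f = f^L o ... o f^1 o f^pre, applied left-to-right to the input."""
    layers = [make_layer(ell) for ell in range(1, depth + 1)]
    return reduce(lambda z, g: g(z), layers, f_pre(x))
```

Each call to `make_layer` plays the role of one $f^{\ell}$; stacking them with `reduce` mirrors the composition, with representations growing sparser layer by layer.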
More specifically, $f^{\ell}$ transforms the $\ell$-th layer token representations $\boldsymbol{Z}^{\ell}$ to $\boldsymbol{Z}^{\ell+1}$ via the $\texttt{MSSA}$ (Multi-Head Subspace Self-Attention) block and the $\texttt{ISTA}$ (Iterative Shrinkage-Thresholding Algorithm) block, i.e.,
$$\boldsymbol{Z}^{\ell+1} = f^{\ell}(\boldsymbol{Z}^{\ell}) = \texttt{ISTA}(\boldsymbol{Z}^{\ell} + \texttt{MSSA}(\boldsymbol{Z}^{\ell})).$$

### 1. CRATE Architecture overview

The following figure presents an overview of the pipeline for our proposed **CRATE** architecture:

<p align="center">
    <img src="figs/fig_pipeline.png" width="900"/>
</p>

### 2. One layer/block of CRATE

The following figure shows the overall architecture of one layer of **CRATE** as the composition of $\texttt{MSSA}$ and $\texttt{ISTA}$ blocks.

<p align="center">
    <img src="figs/fig_arch.png" width="900"/>
</p>

### 3. Per-layer optimization in CRATE

In the following figure, we measure the compression term $R^{c}(\boldsymbol{Z}^{\ell+1/2})$ and the sparsity term $\|\boldsymbol{Z}^{\ell+1}\|_0$ defined in the **sparse rate reduction objective**, and we find that each layer of **CRATE** indeed optimizes the targeted objectives, showing that our white-box theoretical design is predictive of practice.

<p align="center">
    <img src="figs/fig_layerwise.png" width="900"/>
</p>

### 4. Segmentation visualization of CRATE
In the following figure, we visualize self-attention maps from a supervised **CRATE** model with 8x8 patches (similar to the ones shown in [DINO](https://github.com/facebookresearch/dino) :t-rex:).
<p align="center">
    <img src="figs/fig_seg.png" width="900"/>
</p>

We also discover a surprising empirical phenomenon where each attention head in **CRATE** retains its own semantics.
<p align="center">
    <img src="figs/fig_seg_headwise.png" width="900"/>
</p>

## Autoencoding

We can also use our theory to build a principled autoencoder, which has the following architecture.
<p align="center">
    <img src="figs/fig_arch_autoencoder.png" width="900"/>
</p>

It has many of the same empirical properties as the base **CRATE** model, such as segmented attention maps and amenability to layer-wise analysis.
We train it on the masked autoencoding task (calling this model **CRATE-MAE**), and it achieves linear probing and reconstruction performance comparable to the base ViT-MAE.

<p align="center">
    <img src="figs/fig_masked_reconstruction.png" width="900"/>
</p>

# Implementation and Experiments

## Constructing a CRATE model
A CRATE model can be defined with the following code (the parameters below specify CRATE-Tiny):
```python
from model.crate import CRATE
dim = 384
n_heads = 6
depth = 12
model = CRATE(image_size=224,
              patch_size=16,
              num_classes=1000,
              dim=dim,
              depth=depth,
              heads=n_heads,
              dim_head=dim // n_heads)
```

### Pre-trained Checkpoints (ImageNet-1K)
| model | `dim` | `n_heads` | `depth` | pre-trained checkpoint |
| -------- | -------- | -------- | -------- | -------- |
| **CRATE-T**(iny) | 384 | 6 | 12 | TODO |
| **CRATE-S**(mall) | 576 | 12 | 12 | [download link](https://drive.google.com/file/d/1hYgDJl4EKHYfKprwhEjmWmWHuxnK6_h8/view?usp=share_link) |
| **CRATE-B**(ase) | 768 | 12 | 12 | TODO |
| **CRATE-L**(arge) | 1024 | 16 | 24 | TODO |

## Training CRATE on ImageNet
As an example, we use the following command to train CRATE-tiny on ImageNet-1K:
```bash
python main.py \
  --arch CRATE_tiny \
  --batch-size 512 \
  --epochs 200 \
  --optimizer Lion \
  --lr 0.0002 \
  --weight-decay 0.05 \
  --print-freq 25 \
  --data DATA_DIR
```
Replace `DATA_DIR` with `[imagenet-folder with train and val folders]`.

## Finetuning pretrained / training random initialized CRATE on CIFAR10

```bash
python finetune.py \
  --bs 256 \
  --net CRATE_tiny \
  --opt adamW \
  --lr 5e-5 \
  --n_epochs 200 \
  --randomaug 1 \
  --data cifar10 \
  --ckpt_dir CKPT_DIR \
  --data_dir DATA_DIR
```
Replace `CKPT_DIR` with the path to the pretrained CRATE weights, and replace `DATA_DIR` with the path to the `CIFAR10` dataset. If `CKPT_DIR` is `None`, this script trains CRATE from random initialization on CIFAR10.

## Demo: Emergent segmentation in CRATE

CRATE models exhibit emergent segmentation in their self-attention maps solely through supervised training.
We provide a Colab Jupyter notebook to visualize the emergent segmentations from a supervised **CRATE** model. The demo provides visualizations which match the segmentation figures above.

Link: [crate-emergence.ipynb](https://colab.research.google.com/drive/1rYn_NlepyW7Fu5LDliyBDmFZylHco7ss?usp=sharing) (in Colab)

<p align="center">
    <img src="figs/fig_seg_headwise.png" width="900"/>
</p>

## Constructing a CRATE autoencoding model
A CRATE autoencoding model (specifically **CRATE-MAE-Base**) can be defined using the following code:
```python
from model.crate_ae.crate_ae import mae_crate_base
model = mae_crate_base()
```
The other sizes in the paper are importable in the same way. Modifying the `model/crate_ae/crate_ae.py` file will let you define and use your own configuration.

### Pre-trained Checkpoints (ImageNet-1K)
| model | `dim` | `n_heads` | `depth` | pre-trained checkpoint |
| -------- | -------- | -------- | -------- | -------- |
| **CRATE-MAE-S**(mall) | 576 | 12 | 12 | TODO |
| **CRATE-MAE-B**(ase) | 768 | 12 | 12 | [link](https://drive.google.com/file/d/11i5BMwymqOsunq44WD3omN5mS6ZREQPO/view?usp=sharing) |

## Training/Fine-Tuning CRATE-MAE
To train or fine-tune a CRATE-MAE model on ImageNet-1K, please refer to the [codebase on MAE training](https://github.com/facebookresearch/mae) from Meta FAIR.
The `models_mae.py` file in that codebase can be replaced with the contents of `model/crate_ae/crate_ae.py`, and the rest of the code should work with minimal alterations.

## Demo: Emergent segmentation in CRATE-MAE

CRATE-MAE models also exhibit emergent segmentation in their self-attention maps.
We provide a Colab Jupyter notebook to visualize the emergent segmentations from a **CRATE-MAE** model. The demo provides visualizations which match the segmentation figures above.

Link: [crate-mae.ipynb](https://colab.research.google.com/drive/1xcD-xcxprfgZuvwsRKuDroH7xMjr0Ad3?usp=sharing) (in Colab)

# Reference
For technical details and full experimental results, please check the [CRATE paper](https://arxiv.org/abs/2306.01129), [CRATE segmentation paper](https://arxiv.org/abs/2308.16271), [CRATE autoencoding paper](https://openreview.net/forum?id=PvyOYleymy), or [the long-form overview paper](https://arxiv.org/abs/2311.13110). Please consider citing our work if you find it helpful to yours:

```
@article{yu2024white,
  title={White-Box Transformers via Sparse Rate Reduction},
  author={Yu, Yaodong and Buchanan, Sam and Pai, Druv and Chu, Tianzhe and Wu, Ziyang and Tong, Shengbang and Haeffele, Benjamin and Ma, Yi},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```
```
@inproceedings{yu2024emergence,
  title={Emergence of Segmentation with Minimalistic White-Box Transformers},
  author={Yu, Yaodong and Chu, Tianzhe and Tong, Shengbang and Wu, Ziyang and Pai, Druv and Buchanan, Sam and Ma, Yi},
  booktitle={Conference on Parsimony and Learning},
  pages={72--93},
  year={2024},
  organization={PMLR}
}
```
```
@inproceedings{pai2024masked,
  title={Masked Completion via Structured Diffusion with White-Box Transformers},
  author={Pai, Druv and Buchanan, Sam and Wu, Ziyang and Yu, Yaodong and Ma, Yi},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
```
```
@article{yu2023white,
  title={White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?},
  author={Yu, Yaodong and Buchanan, Sam and Pai, Druv and Chu, Tianzhe and Wu, Ziyang and Tong, Shengbang and Bai, Hao and Zhai, Yuexiang and Haeffele, Benjamin D and Ma, Yi},
  journal={arXiv preprint arXiv:2311.13110},
  year={2023}
}
```