{"id":18107865,"url":"https://github.com/necrashter/transformers-learnable-memory","last_synced_at":"2026-03-15T06:54:44.458Z","repository":{"id":176049889,"uuid":"654719678","full_name":"necrashter/transformers-learnable-memory","owner":"necrashter","description":"Fine-tuning Image Transformers using Learnable Memory","archived":false,"fork":false,"pushed_at":"2023-06-20T12:54:31.000Z","size":1338,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-10-31T23:42:38.305Z","etag":null,"topics":["computer-vision","deep-learning","fine-tuning","transformers","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/necrashter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-16T19:26:52.000Z","updated_at":"2024-10-21T08:40:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"b88dbb8f-bbb7-49fa-a29d-7ae387539a9f","html_url":"https://github.com/necrashter/transformers-learnable-memory","commit_stats":null,"previous_names":["necrashter/transformers-learnable-memory"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/necrashter%2Ftransformers-learnable-memory","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/necrashter%2Ftransformers-learnable-memory/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/necrashter%2Ftransformers-learnable-m
emory/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/necrashter%2Ftransformers-learnable-memory/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/necrashter","download_url":"https://codeload.github.com/necrashter/transformers-learnable-memory/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230459508,"owners_count":18229441,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","fine-tuning","transformers","vision-transformer"],"created_at":"2024-10-31T23:41:44.784Z","updated_at":"2026-03-15T06:54:39.436Z","avatar_url":"https://github.com/necrashter.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# [Fine-tuning Image Transformers using Learnable Memory](https://arxiv.org/abs/2203.15243)\n\nThis README file is an outcome of the [CENG502 (Spring 2023)](https://ceng.metu.edu.tr/~skalkan/ADL/) project for reproducing a paper without an implementation. See [CENG502 (Spring 2023) Project List](https://github.com/CENG502-Projects/CENG502-Spring2023) for a complete list of all paper reproduction projects.\n\n# 1. 
Introduction\n\n[Fine-tuning Image Transformers using Learnable Memory](https://arxiv.org/abs/2203.15243) is a paper published in [CVPR 2022](https://openaccess.thecvf.com/content/CVPR2022/html/Sandler_Fine-Tuning_Image_Transformers_Using_Learnable_Memory_CVPR_2022_paper.html).\nThe proposed method introduces learnable memory tokens in each self-attention layer of Vision Transformer models, enabling non-destructive fine-tuning and preserving performance on previous tasks while adapting to new ones.\n\nIn this repository, we **implement this paper in PyTorch** and aim to **reproduce the results** with our limited computational resources.\n\n## 1.1. Paper summary\n\nThe main idea in the proposed method is to introduce **learnable memory tokens** in each self-attention layer.\nThese tokens don't attend to other tokens and they are discarded after the self-attention, but the other tokens attend to these tokens.\nFurthermore, the performance of the model on the previous dataset is preserved thanks to the proposed **attention masking strategy**.\nThus, this method increases the capacity of a pre-trained model in a non-destructive manner while avoiding the catastrophic forgetting problem that plagues the fine-tuning approaches.\nFinally, the attention masking allows us to concatenate separately fine-tuned models into a single model which enables the reuse of computation while running these models.\n\n# 2. The method and our interpretation\n\n## 2.1. The original method\n\n### 2.1.1. Memory tokens\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/fig1.png\" width=\"500\"\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003ci\u003eFigure 1. Memory tokens are concatenated to the input before each encoder layer. 
(Borrowed from the paper.)\u003c/i\u003e\u003c/p\u003e\n\nThe method builds on top of a regular transformer encoder layer.\nFirst, let's recall the input to a Vision Transformer (ViT):\n\n```math\n\\mathbf{z}^{vit}_0 := [x_{\\texttt{cls}}, E x_1, \\dots,E x_{N}] + E_{pos}\n```\n\nHere, $x_1 \\dots x_N$ are the flattened image patches, which are projected by a learnable linear transformation $E$. The class token $x_{\\texttt{cls}}$ is a learnable token shared across inputs, and its output value serves as the embedding for the final classification. Finally, $E_{pos}$ is the position embedding.\n\nIn order to enhance the transformer with memory, we introduce $m$ learnable memory embeddings $E_{mem} \\in \\mathbb{R}^{m \\times D}$ where $D$ represents the dimensionality of the input tokens.\nThese tokens are concatenated to the input as follows:\n\n```math\n\\mathbf{z}^{mem}_0 := [\\mathbf{z}^{vit}_0; E^0_{mem}]\n```\n\nAs a result, the transformer now receives a total of $N + 1 + m$ tokens.\nSubsequently, this input is passed through the transformer encoder layer, maintaining the same architecture as ViT.\nHowever, an important distinction is that the updated memory is not propagated, i.e., the output of the self-attention module is truncated to include only the first $N+1$ tokens.\nHence, the output of layer $l$, denoted as $\\mathbf{y}_l$, consists solely of $N+1$ tokens, which correspond to the class token and the image patches.\n\nThe memory is incorporated into the subsequent layers similarly.\nGiven the truncated output of the previous layer $\\mathbf{y}_{l-1}$, the input to the layer $l$ is:\n\n```math\n\\mathbf{z}^{mem}_l = [\\mathbf{y}_{l-1}; E^l_{mem}]\n```\n\nThis process is illustrated in the following figure.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/fig2.png\" width=\"500\"\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003ci\u003eFigure 2a. 
Demonstration of how the encoder layer is modified to implement memory. (Borrowed from the paper.)\u003c/i\u003e\u003c/p\u003e\n\nEach memory token is randomly initialized with samples drawn from a normal distribution with a mean of 0 and a standard deviation of 0.02.\n\n### 2.1.2. Alternative ways of introducing memory\n\n[Previous work in this field](https://arxiv.org/abs/2006.11527) opted for propagating the memory after the self-attention layer instead of discarding it.\nAnother alternative is to propagate the memory and update it additively at each encoder layer.\nHowever, the authors found that propagating memory in either of these ways performs worse than the method proposed in the paper, i.e., concatenating different memory tokens in each layer and discarding them.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/fig2b.png\" width=\"500\"\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003ci\u003eFigure 2b. Demonstration of alternative ways to implement memory. (Borrowed from the paper.)\u003c/i\u003e\u003c/p\u003e\n\nNote that these approaches were not implemented in this repository since they are not introduced by this paper.\n\n\n### 2.1.3. Attention masking\n\nIf we fine-tune a pre-trained transformer model's class token on a new dataset or add memory, there is typically a decrease in performance on the original task.\nA popular way to address this problem is multi-task learning, which carries out the learning process on all datasets simultaneously.\nHowever, this approach is not always feasible due to practical constraints such as data ownership by separate entities.\n\nTo overcome this limitation, the authors propose the following **non-destructive fine-tuning method**:\n1. A new class token and a new per-task head are introduced alongside the memory.\n2. These newly added parameters are fine-tuned without modifying the original model parameters.\n3. 
An attention masking strategy is employed in the self-attention layers, which causes the original class token to remain the same even after the addition of new parameters and fine-tuning.\n\nThus, the fine-tuned model produces two outputs simultaneously: one for the original dataset (on which the model was pre-trained) and one for the new dataset (on which the model was fine-tuned).\nThe output for the original dataset is identical to the output from the unmodified pre-trained model.\nTherefore, this approach allows the reuse of not only parameters but also the computation since the fine-tuned model effectively emulates two models at the same time.\n\nFurthermore, it is possible to concatenate multiple models that are based on the same pre-trained model but fine-tuned separately on different datasets.\nThe output of the concatenated model will contain the output of each fine-tuned model.\nThis enables massive computation reuse at inference time since we only need to run one concatenated model instead of many fine-tuned models.\nModel concatenation process is depicted in the following figure.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/model-concat.png\" width=\"500\"\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003ci\u003eFigure 3. Concatenation of separately fine-tuned models. (Borrowed from the paper.)\u003c/i\u003e\u003c/p\u003e\n\nThe attention masking works by preventing the original model tokens from attending to the newly added tokens, thereby preserving the original model outputs.\nHowever, the new class token can freely attend to the old tokens.\nNote that the memory tokens don't attend to any other token since they are not passed on to the following layers.\nSee the table below for more information.\n\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/attention-mask.png\" width=\"500\"\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003ci\u003eTable 1. Token interactions in attention masking. 
(Borrowed from the paper.)\u003c/i\u003e\u003c/p\u003e\n\nIf the goal is to fine-tune an already fine-tuned model on another dataset, there are two different ways to implement the attention masking:\n1. **Model concatenation:** We can disallow interactions between the tokens added in the first fine-tuning and the second fine-tuning. This is equivalent to fine-tuning two models separately and concatenating them.\n2. **Model extension:** The tokens added in the second fine-tuning can attend to the tokens from the first fine-tuning (but not vice versa, since that would affect the output of the first fine-tuning).\n\n\n\n## 2.2. Our interpretation\n\nWe believe that our implementation is consistent with the method described in the paper, since the method is clearly explained and not much is left to interpretation.\nThis section goes into minor implementation details that were not given in the paper (rightly so, since too much detail can harm brevity) and how we handled them.\n\n### 2.2.1. Background\n\nFirst, recall [how a self-attention layer works](https://arxiv.org/abs/1706.03762).\nInitially, we compute the $q$, $k$, $v$ (query, key, value) vectors by applying three separate linear projections to the input $z$:\n\n```math\nq = Q z\n```\n```math\nk = K z\n```\n```math\nv = V z\n```\n\nAn optional bias term can be added to these equations, which is omitted for brevity.\n\nAfter that, the scaled dot-product attention is applied:\n\n```math\n\\mathrm{Attention}(q, k, v) = \\mathrm{softmax}(\\frac{qk^T}{\\sqrt{d_k}})v\n```\n\nwhere $\\frac{1}{\\sqrt{d_k}}$ is the scaling factor.\n\n### 2.2.2. 
Attention masking\n\nBy default, the Vision Transformer in the HuggingFace library prepends the class token to the sequence.\nWe modified it so that the class tokens are appended to the end, and the memory tokens come after them.\nWith this modification, the attention mask becomes identical to the matrix given in Table 1.\n\nWe apply the masking before the softmax function, because if we applied it after the softmax by multiplying certain elements by 0, the attention scores would no longer add up to 1.\nSince the softmax operation exponentiates its inputs, the mask should be incorporated into the input using addition instead of multiplication.\nTherefore, the masked elements will have a value of `-inf` (which maps to 0 after exponentiation) and the rest will be 0, the additive identity.\n\n`build_attention_mask` in [`vit.py`](vit.py) constructs the attention mask given in Table 1, which is then added to the input of the softmax in the self-attention layer:\n\n```math\n\\mathrm{Attention}(q, k, v) = \\mathrm{softmax}(\\frac{qk^T}{\\sqrt{d_k}} + \\mathrm{mask})v\n```\n\nAlso note that memory tokens are not given in Table 1, since they don't attend to any other token.\nHowever, if we attempt to mask them as we did in the previous equation, all elements in the corresponding rows of the attention mask will be `-inf`, which will cause a division by zero in the softmax.\n\nTo fix this, we simply don't concatenate the memory tokens while calculating the queries $q$.\nFor the given input $z$, $z_{mem}$ is the input concatenated with the memory tokens $E_{mem}$ for this layer:\n\n```math\nz_{mem} :=  [z; E_{mem}]\n```\n\nThen, the $q$, $k$, $v$ vectors are computed as follows:\n\n```math\nq = Q z\n```\n```math\nk = K z_{mem}\n```\n```math\nv = V z_{mem}\n```\n\nWith this change, the product $qk^T$ will yield a matrix that has exactly the same shape as the attention mask given in Table 1.\nThanks to this, the matrix constructed by `build_attention_mask` in [`vit.py`](vit.py) is exactly 
the same as Table 1 in terms of columns and rows: memory tokens are not present in the query rows.\n\nFurthermore, the matrix multiplication of attention scores (softmax output) and $v$ will naturally remove the memory tokens from the output since $q$ doesn't contain the memory tokens.\nThus, we don't need any additional truncation operation.\n\nThese steps are implemented in the `forward` method of `SelfAttentionWithMemory`.\n\n\n# 3. Experiments and results\n\n## 3.1. Experimental setup\n\nIn this section, we will provide information about the base model, the datasets, and the training process.\n\n### 3.1.1. Base model\n\nIn the paper, the authors use the ViT-B/32 base transformer model pre-trained on ImageNet-21K.\nThis is the case for all experiments, with the only exception being the experiment in which they compare different ViT architectures: ViT-B/32, ViT-B/16, and ViT-L/32.\nWe didn't aim to reproduce that particular experiment due to our limited resources; we focused on fine-tuning ViT-B/32.\nConsequently, we used [this ViT-B/32 model pre-trained on ImageNet-21K](https://huggingface.co/google/vit-base-patch32-224-in21k) from HuggingFace.\n\n### 3.1.2. 
Datasets\n\nIn the paper, the experiments were conducted on 4 distinct datasets: CIFAR-100, i-Naturalist, Places-365, and SUN-397.\nThe performance metric in these experiments was the accuracy of the models on the respective validation sets.\nThese 4 datasets can be easily found on the internet, and are also available in PyTorch's dataset module.\n\nHowever, the paper lacks specific implementation details regarding some of these datasets.\nFor example, it does not explain how the i-Naturalist and SUN-397 datasets were split into training and validation sets.\nTo address this, we randomly split these datasets with an 80-20 ratio to create the training and validation sets.\nMoreover, there are multiple versions of the Places-365 and i-Naturalist datasets in the PyTorch dataset package, but the paper does not specify which versions were used.\nFor our experiments, we utilized the standard version of the Places-365 dataset and the 2017 version of the i-Naturalist dataset.\n\n### 3.1.3. Training\n\nFor all fine-tuning experiments in the paper, the authors used SGD with Momentum, along with gradient clipping and a 5-step linear rate warmup.\nWe followed the same hyperparameters and settings in our experiments.\nThe paper adopted standard inception preprocessing for all datasets except for CIFAR-100, where random clipping was used.\nWe also followed these preprocessing steps for consistency.\nSimilarly, we initialized the memory tokens from the distribution of $\\mathcal{N}(0, 0.02)$, as described in the paper.\n\nThere are some differences between our experimentation setup and the paper's setup in terms of batch size and the number of fine-tuning steps.\nThe paper used a batch size of 512 and ran for 20000 steps.\nBecause of memory limitations, we had to use a batch size of 64, which would require us to train for 160000 steps in order to process the same amount of samples.\nHowever, due to limited resources and time constraints, we had to reduce the number of fine-tuning 
steps as well: 15640 steps for CIFAR-100, 84400 steps for i-Naturalist, 13600 steps for SUN-397, and 84560 steps for Places-365.\nAlthough our numbers of steps are significantly smaller, please note that the datasets are huge.\nFor instance, a single epoch on the Places-365 dataset (which constitutes 28180 steps when the batch size is 64) took over 13 hours.\nNevertheless, the paper mentioned that shorter runs generally yielded slightly worse results while preserving the relative relationships, and our results are comparable to those reported in the paper.\n\nIn our experiments, we only tested 1 cell memory token for all datasets and compared the results with the paper's findings using the same configuration.\nThe paper also discussed full fine-tuning and head+class token fine-tuning, but we did not include them in our implementation as they are not the main contributions of the paper.\nHowever, you can explore these types of fine-tuning options in our sample notebook.\n\n\n## 3.2. Running the code\n\n```\nDirectory structure:\n\t├── models\n\t│   ├── CIFAR100_model.pt\n\t│   ├── Places_model.pt\n\t│   ├── INaturalist_model.pt\n\t│   └── Sun_model.pt\n\t├── images\n\t├── mnist\n\t│\n\t├── download_models.sh\n\t├── vit_train.py\n\t├── vit_validation.py\n\t├── vit.py\n\t├── test_vit.py\n\t├── vit.ipynb [*]\n\t├── large-vit.ipynb [*]\n\t├── model_concatenate.ipynb [*]\n\t└── requirements.txt\n\nDataset Directory Structure:\n\t├── ceng502\n\t│   └── models--google--vit-base-patch32-224-in21k\n\t│\n\t└── datasets\n```\n- Files with `[*]` on their right are the notebooks for the examination of the implementation. \n- [`vit_train.py`](./vit_train.py) and [`vit_validation.py`](./vit_validation.py) are the scripts for training and validation respectively. 
Both require some arguments (see section 3.2.3).\n- [`vit.py`](./vit.py) is the module in which the memory token model is implemented, which is an essential part of this repository.\n- [`requirements.txt`](./requirements.txt) contains the list of required Python packages.\n- Unit tests are available in [`test_vit.py`](test_vit.py). See [section 3.2.4](#324-unit-tests).\n- In [the mnist directory](./mnist), there's a minimal ViT implementation in PyTorch from scratch with support for learnable memory tokens.\n  - This implementation is simpler and may be easier to understand.\n- The `models` directory contains the fine-tuned learnable memory models for each dataset.\n  - The training script will save the models to this directory, and the validation script will load the models from here.\n  - The models that we have trained are available on [this HuggingFace repository](https://huggingface.co/necrashter/transformers-learnable-memory).\n  - [`download_models.sh`](./download_models.sh) is a Bash script for downloading our models from HuggingFace. See [section 3.2.2](#322-downloading-the-models) for more info.\n- The dataset directory is where the cached base model and the datasets are stored.\n  - It should be given as an argument to the training and validation scripts. Otherwise, the home directory will be used.\n  - Note that the datasets are quite large. You may want to use an HDD to store them as we did.\n  - The datasets will be downloaded if they cannot be found in the given directory.\n\n\n### 3.2.1. Installing the required packages\n\nBefore executing the scripts or the notebook, the required Python packages in `requirements.txt` should be installed.\nThis can be accomplished by running the following command:\n```bash\npip3 install -r requirements.txt\n```\n\n### 3.2.2. 
Downloading the models\n\nThe models we trained are available on [this HuggingFace repository](https://huggingface.co/necrashter/transformers-learnable-memory).\nSince these models have a large file size, we didn't include them here.\nHowever, we supply [a Bash script](./download_models.sh) for easily downloading the models from HuggingFace.\n\nRun the script using\n```bash\n./download_models.sh\n```\nIf that doesn't work, try:\n```bash\nbash ./download_models.sh\n```\n\nThis script requires the `wget` utility, which is preinstalled by default on many Linux distributions.\n\n### 3.2.3. Training and validation\n\n**Training:**\nTo train memory tokens for a given dataset, run the `vit_train.py` script with the arguments below.\nThe usage is as follows:\n```bash\npython3 vit_train.py --dataset {CIFAR100/INaturalist/Places/Sun}\n                     --directory {directory_for_datasets}\n                     --epochs {number_of_epochs}\n                     --batch_size {batch_size}\n                     --number_of_memory_tokens {number_of_memory_tokens}\n```\n\n`dataset` selects the dataset, and `directory` is where the datasets will be downloaded; keep in mind that the datasets are huge. `epochs`, `batch_size` and `number_of_memory_tokens` are the hyperparameter options.\n\n**Validation:**\nTo validate your model on different datasets, run the `vit_validation.py` script with the arguments below.\nThe usage is as follows:\n```bash\npython3 vit_validation.py --models-list {CIFAR100/INaturalist/Places/Sun}\n                          --directory {directory_for_datasets}\n                          --batch_size {batch_size}\n                          --number_of_memory_tokens {number_of_memory_tokens}\n```\n\n`models_list` is the list of models to validate; you can give all 4 models as `CIFAR100 INaturalist Places Sun`. 
For example, to validate the model that concatenates the memory tokens of all 4 models at once, run the following command.\n\n```bash\npython3 vit_validation.py --models-list CIFAR100 INaturalist Places Sun\n```\n\n`directory` is where the datasets are, or will be, downloaded. `batch_size` and `number_of_memory_tokens` are the hyperparameter options, but keep in mind that `number_of_memory_tokens` must match the value used when training the model.\n\n### 3.2.4. Unit tests\n\n[`test_vit.py`](test_vit.py) contains some unit tests for checking the correctness of our implementation. It's intended to be run using [pytest](https://docs.pytest.org/en/7.1.x/contents.html).\n\nTo get started, make sure that `pytest` is installed and up to date:\n```bash\npip3 install -U pytest\n```\n\nIf the installation was successful, simply running `pytest` should run the unit tests:\n```bash\npytest\n```\nNote that the unit tests need to download [the ViT-B/32 base model](https://huggingface.co/google/vit-base-patch32-224-in21k) like the other scripts.\nHowever, the unit tests will download the model into the `~/ceng502` folder, and unfortunately it's not possible to configure this with an argument as in the other scripts.\nIf needed, you can change the `home_dir` variable directly in `test_vit.py`.\n\n\n## 3.3. Results\n\nAs stated above, the datasets are very large, so we could not reproduce every experiment in the paper. We only tried the 1 memory cell (token) model on the 4 datasets, and we were limited by time and memory constraints. Moreover, some datasets do not have predefined training and validation partitions and the paper does not describe its splitting logic, so we split those datasets randomly.\nDespite these limitations, our results are comparable to those of the paper. 
In addition, we even get better accuracy on the SUN-397 and i-Naturalist datasets, which do not have training and validation sets by default, so one of the reasons for the better accuracy may be the random split of these datasets. The accuracy values reported in the paper can be seen below.\n\u003cp align=\"center\"\u003e\u003cimg src=\"images/results_from_paper.png\" width=\"700\"\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003ci\u003eTable 2. Accuracy for different datasets for each fine-tuning regime of the paper. (Borrowed from the paper.)\u003c/i\u003e\u003c/p\u003e\n\nWe only compare our results with the column for 1 cell, which reports the results of fine-tuning with 1 memory token.\nHowever, anyone can conduct experiments with a different number of memory cells by giving it as an argument to the training and validation scripts.\nThe accuracy values of our implementation can be seen in the table below.\n\u003cp align=\"center\"\u003e\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eDataset\u003c/th\u003e\n    \u003cth\u003eValidation accuracy (%)\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \t\u003ctd\u003eSUN-397\u003c/td\u003e\n    \u003ctd\u003e80.96\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \t\u003ctd\u003eiNaturalist\u003c/td\u003e\n    \u003ctd\u003e58.70\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n  \t\u003ctd\u003eCIFAR-100\u003c/td\u003e\n    \u003ctd\u003e64.08\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003ePlaces-365\u003c/td\u003e\n    \u003ctd\u003e50.47\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\u003ci\u003eTable 3. Validation results for different datasets from our implementation. (Our result.)\u003c/i\u003e\n\u003c/p\u003e\n\nOur results are similar to those reported in the paper. Only the CIFAR-100 result is significantly worse than the paper's. We think the reasons are the number of steps we chose and the relatively small size of the dataset. 
We only trained for 15640 steps because of time constraints, which may be too few.\n\nAn important side note is that our script does not multiply the accuracy values by 100, so you will see floating point numbers for the accuracy values. You have to multiply these values by 100 to get the accuracy percentage.\n\n\n# 4. Conclusion\n\nThe method proposed in the paper integrates learnable memory tokens into each self-attention layer, presenting a novel approach to model fine-tuning. Moreover, the attention masking strategy ensures that the model's performance on previous datasets remains unaffected. In our implementation, we couldn't conduct every experiment that the paper examined due to time constraints, but we were able to get comparable results with 1 memory token, even though we used a smaller number of steps for the datasets. Moreover, we only conducted experiments for 1 memory cell, but anyone who uses our implementation can conduct experiments with the number of memory tokens of their choosing by providing it as an argument to the scripts. We didn't fully reproduce the results as we used a different batch size, random splits of some datasets, and random memory token initializations, but the results showed that our implementation can achieve the benefits of fine-tuning with memory tokens.\n\n# 5. References\n\n- Sandler, M., Zhmoginov, A., Vladymyrov, M., \u0026 Jackson, A. (2022, March 30). [**Fine-tuning image transformers using learnable memory.**](https://arxiv.org/abs/2203.15243) arXiv.org. https://arxiv.org/abs/2203.15243\n- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., \u0026 Polosukhin, I. (2017, December 6). [**Attention is all you need.**](https://arxiv.org/abs/1706.03762) arXiv.org. 
https://arxiv.org/abs/1706.03762\n- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., \u0026 Houlsby, N. (2021, June 3). [**An image is worth 16x16 words: Transformers for image recognition at scale.**](https://arxiv.org/abs/2010.11929) arXiv.org. https://arxiv.org/abs/2010.11929\n- Burtsev, M. S., Kuratov, Y., Peganov, A., \u0026 Sapunov, G. V. (2021, February 16). [**Memory transformer.**](https://arxiv.org/abs/2006.11527) arXiv.org. https://arxiv.org/abs/2006.11527\n- Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., \u0026 Rabinovich, A. (2015). [**Going deeper with convolutions.**](https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Szegedy_Going_Deeper_With_2015_CVPR_paper.html) In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).\n- Krizhevsky, A., \u0026 Hinton, G. (2009). [**Learning multiple layers of features from tiny images.**](https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf) Technical report, University of Toronto.\n- Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., \u0026 Belongie, S. (2018, June). [**The iNaturalist species classification and detection dataset.**](https://arxiv.org/abs/1707.06642) In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).\n- Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., \u0026 Torralba, A. (2017). [**Places: A 10 million image database for scene recognition.**](https://arxiv.org/abs/1610.02055) IEEE transactions on pattern analysis and machine intelligence.\n- Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., \u0026 Torralba, A. (2010, June). [**Sun database: Large-scale scene recognition from abbey to zoo.**](https://vision.princeton.edu/projects/2010/SUN/) In CVPR (pp. 
3485-3492).\n\n\n\n# Contact\n\nYou can contact us by opening an issue in this repository, which is preferred for matters related to this implementation.\nAlternatively, you can send an e-mail to us:\n- İlker Işık, e238051@metu.edu.tr\n- Ege Berk Büyükbaş, ege.buyukbas@metu.edu.tr\n\n# License\n\nAll original code we wrote in this repository is licensed under [the MIT License](./LICENSE).\n\nEverything except for the following is our original work:\n- The implementation in [`vit.py`](vit.py) is based on the [ViT implementation of the HuggingFace library](https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py), which is licensed under [the Apache 2.0 License](http://www.apache.org/licenses/LICENSE-2.0). It contains our contributions as well as the base code from the HuggingFace library.\n- All pictures in the [images](./images) directory are borrowed from the paper. These images belong to their copyright holders. They are provided in this repository for educational purposes only, which constitute fair use under the US copyright law.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnecrashter%2Ftransformers-learnable-memory","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnecrashter%2Ftransformers-learnable-memory","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnecrashter%2Ftransformers-learnable-memory/lists"}