{"id":14970648,"url":"https://github.com/bwconrad/flexivit","last_synced_at":"2025-10-26T13:31:10.289Z","repository":{"id":81283557,"uuid":"606165490","full_name":"bwconrad/flexivit","owner":"bwconrad","description":"PyTorch reimplementation of FlexiViT: One Model for All Patch Sizes","archived":false,"fork":false,"pushed_at":"2024-05-05T14:15:10.000Z","size":463,"stargazers_count":45,"open_issues_count":1,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-09-28T13:22:24.638Z","etag":null,"topics":["computer-vision","pytorch","transformer","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bwconrad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-24T18:47:01.000Z","updated_at":"2024-09-26T09:25:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"5b4cbbb4-5f7e-4119-b564-f0e8ff4bfb72","html_url":"https://github.com/bwconrad/flexivit","commit_stats":{"total_commits":30,"total_committers":2,"mean_commits":15.0,"dds":0.09999999999999998,"last_synced_commit":"77fc29a7bec6d5147ff0c8028e7fddac510cd19b"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwconrad%2Fflexivit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwconrad%2Fflexivit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bwconrad%2Fflexivit/releases","manifests_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/bwconrad%2Fflexivit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bwconrad","download_url":"https://codeload.github.com/bwconrad/flexivit/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219862887,"owners_count":16555951,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","pytorch","transformer","vision-transformer"],"created_at":"2024-09-24T13:43:55.576Z","updated_at":"2025-10-26T13:31:04.919Z","avatar_url":"https://github.com/bwconrad.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# FlexiViT\n\nPyTorch reimplementation of [\"FlexiViT: One Model for All Patch Sizes\"](https://arxiv.org/abs/2212.08013).\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/bwconrad/flexivit/main/assets/flexi.png\" width=\"50%\" style={text-align: center;}/\u003e\n\u003c/p\u003e\n\n## Installation\n\n```\npip install flexivit-pytorch\n```\n\nOr install the entire repo with:\n\n```\ngit clone https://github.com/bwconrad/flexivit\ncd flexivit/\npip install -r requirements.txt\n```\n\n## Usage\n\n#### Basic Usage\n```python\nimport torch\nfrom flexivit_pytorch import FlexiVisionTransformer\n\nnet = FlexiVisionTransformer(\n    img_size=240,\n    base_patch_size=32,\n    patch_size_seq=(8, 10, 12, 15, 16, 20, 24, 30, 40, 48),\n    base_pos_embed_size=7,\n    num_classes=1000,\n    embed_dim=768,\n    depth=12,\n    num_heads=12,\n    mlp_ratio=4,\n)\n\nimg = torch.randn(1, 3, 240, 
240)\npreds = net(img)\n```\n\nYou can also initialize default network configurations:\n\n```python\nfrom flexivit_pytorch import (flexivit_base, flexivit_huge, flexivit_large,\n                              flexivit_small, flexivit_tiny)\n\nnet = flexivit_tiny()\nnet = flexivit_small()\nnet = flexivit_base()\nnet = flexivit_large()\nnet = flexivit_huge()\n```\n\n#### Resizing Pretrained Model Weights\n\nThe patch embedding layer of a standard pretrained vision transformer can be resized to any patch size using the `pi_resize_patch_embed()` function. An example \ndoing this with the `timm` library is the following:\n\n```python\nfrom timm import create_model\nfrom timm.layers.pos_embed import resample_abs_pos_embed\n\nfrom flexivit_pytorch import pi_resize_patch_embed\n\n# Load the pretrained model's state_dict\nstate_dict = create_model(\"vit_base_patch16_224\", pretrained=True).state_dict()\n\n# Resize the patch embedding\nnew_patch_size = (32, 32)\nstate_dict[\"patch_embed.proj.weight\"] = pi_resize_patch_embed(\n    patch_embed=state_dict[\"patch_embed.proj.weight\"], new_patch_size=new_patch_size\n)\n\n# Interpolate the position embedding size\nimage_size = 224\ngrid_size = image_size // new_patch_size[0]\nstate_dict[\"pos_embed\"] = resample_abs_pos_embed(\n    posemb=state_dict[\"pos_embed\"], new_size=[grid_size, grid_size]\n)\n\n# Load the new weights into a model with the target image and patch sizes\nnet = create_model(\n    \"vit_base_patch16_224\", img_size=image_size, patch_size=new_patch_size\n)\nnet.load_state_dict(state_dict, strict=True)\n```\n\n##### Conversion Script\n`convert_patch_embed.py` can similarly do the resizing on any local model checkpoint file. 
For example, to resize to a patch size of 20:\n```\npython convert_patch_embed.py -i vit-16.pt -o vit-20.pt -n patch_embed.proj.weight -ps 20 \n```\nor to a patch size of height 10 and width 15:\n```\npython convert_patch_embed.py -i vit-16.pt -o vit-10-15.pt -n patch_embed.proj.weight -ps 10 15\n```\n- The `-n` argument should correspond to the name of the patch embedding weights in the checkpoint's state dict.\n\n### Evaluating at Different Patch Sizes\n`eval.py` can be used to evaluate pretrained Vision Transformer models at different patch sizes. For example, to evaluate a ViT-B/16 at a patch size of 20 on the ImageNet-1k validation set, you can run:\n```\npython eval.py --accelerator gpu --devices 1 --precision 16 --model.resize_type pi\n--model.weights vit_base_patch16_224.augreg_in21k_ft_in1k --data.root path/to/val/data/\n--data.num_classes 1000 --model.patch_size 20 --data.size 224 --data.crop_pct 0.9 \n--data.mean \"[0.5,0.5,0.5]\" --data.std \"[0.5,0.5,0.5]\" --data.batch_size 256\n```\n- `--model.weights` should correspond to a `timm` model name.\n- The `--data.root` directory should be organized in the [TorchVision ImageFolder](https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html) structure. Alternatively, an LMDB file can be used by setting `--data.is_lmdb True` and having `--data.root` point to the `.lmdb` file.\n- To accurately compare to `timm`'s [baseline results](https://github.com/huggingface/pytorch-image-models/blob/main/results/results-imagenet.csv), make sure that \n`--data.size`, `--data.crop_pct`, `--data.interpolation` (all listed [here](https://github.com/huggingface/pytorch-image-models/blob/main/results/results-imagenet.csv)), `--data.mean`, and `--data.std` (in general found [here](https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/vision_transformer.py#L861)) are correct for the model. 
`--data.mean imagenet` and `--data.mean clip` can be set to use the respective default values (same for `--data.std`).\n- Run `python eval.py --help` for a list and descriptions of all arguments.\n    \n\n## Experiments\nThe following experiments test using PI-resizing to change the patch size of standard ViT models during evaluation. All models have been fine-tuned on ImageNet-1k with a fixed patch size and are evaluated with different patch sizes.\n\n#### Adjusting patch size and freezing image size to 224\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/bwconrad/flexivit/main/assets/ps.png\" width=\"100%\" style={text-align: center;}/\u003e\n\u003c/p\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eNumerical Results\u003c/summary\u003e\n\n| Patch Size | 8     | 10    | 12    | 13    | 14        | 15    | 16    | 18    | 20    | 24    | 28    | 32    | 36    | 40    | 44    | 48   |\n|:----------:|-------|-------|-------|-------|-----------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|------|\n| ViT-T/16   | 64.84 | 72.54 | 75.18 | 75.80 | __76.06__ | 75.30 | 75.46 | 73.41 | 71.67 | 64.26 | 54.48 | 36.10 | 13.58 | 5.09  | 4.93  | 2.70 |\n| ViT-S/16   | 76.31 | 80.24 | 81.56 | 81.76 | __81.93__ | 81.31 | 81.41 | 80.22 | 78.91 | 73.61 | 66.99 | 51.38 | 22.34 | 8.78  | 8.49  | 4.03 |\n| ViT-B/16   | 79.97 | 83.41 | 84.33 | 84.70 | __84.87__ | 84.38 | 84.53 | 83.56 | 82.77 | 78.65 | 73.28 | 58.92 | 34.61 | 14.81 | 14.66 | 5.11 |\n\n\n| Patch Size | 8     | 12    | 16    | 20    | 24    | 28    | 30    | 31    | 32        | 33    | 34    | 36    | 40    | 44    | 48    |\n|:----------:|-------|-------|-------|-------|-------|-------|-------|-------|-----------|-------|-------|-------|-------|-------|-------|\n| ViT-B/32   | 44.06 | 69.65 | 78.16 | 81.42 | 83.06 | 82.98 | 83.00 | 82.86 | __83.30__ | 80.34 | 80.82 | 80.98 | 78.24 | 78.72 | 72.14 | \n\u003c/details\u003e\n\n#### Adjusting patch and 
image size\n- Maintaining the same number of tokens as during training\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/bwconrad/flexivit/main/assets/ps-is.png\" width=\"100%\" style={text-align: center;}/\u003e\n\u003c/p\u003e\n\n\n\u003cdetails\u003e\n\u003csummary\u003eNumerical Results\u003c/summary\u003e\n\n| Patch Size / Image Size | 4 / 56 | 8 / 112 | 16 / 224 | 32 / 448 | 64 / 896 |\n|:-----------------------:|--------|---------|----------|----------|----------|\n| ViT-T/16                | 29.81  | 65.39   | 75.46    | 75.34    | 75.25    |\n| ViT-S/16                | 50.68  | 74.43   | 81.41    | 81.31    | 81.36    |\n| ViT-B/16                | 59.51  | 78.90   | 84.54    | 84.29    | 84.40    |\n| ViT-L/16                | 69.44  | 82.08   | 85.85    | 85.70    | 85.77    | \n\u003c/details\u003e\n\n\n\n## Citation\n```bibtex\n@article{beyer2022flexivit,\n  title={FlexiViT: One Model for All Patch Sizes},\n  author={Beyer, Lucas and Izmailov, Pavel and Kolesnikov, Alexander and Caron, Mathilde and Kornblith, Simon and Zhai, Xiaohua and Minderer, Matthias and Tschannen, Michael and Alabdulmohsin, Ibrahim and Pavetic, Filip},\n  journal={arXiv preprint arXiv:2212.08013},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbwconrad%2Fflexivit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbwconrad%2Fflexivit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbwconrad%2Fflexivit/lists"}