{"id":26911350,"url":"https://github.com/kimrass/vit","last_synced_at":"2025-04-01T14:37:59.958Z","repository":{"id":188503658,"uuid":"678791180","full_name":"KimRass/ViT","owner":"KimRass","description":"PyTorch implementation of 'ViT' (Dosovitskiy et al., 2020) and training it on CIFAR-10 and CIFAR-100","archived":false,"fork":false,"pushed_at":"2024-05-02T08:54:44.000Z","size":41798,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-05-02T22:03:57.094Z","etag":null,"topics":["cifar10","cifar100","cutmix","dropblock","hide-and-seek","imagenet1k","mixup","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KimRass.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-08-15T11:41:59.000Z","updated_at":"2024-05-02T08:54:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"a56d104a-fd33-43e4-9535-0093991771a7","html_url":"https://github.com/KimRass/ViT","commit_stats":null,"previous_names":["kimrass/vit_from_scratch","kimrass/vit"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KimRass%2FViT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KimRass%2FViT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KimRass%2FViT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KimRass%2FViT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/KimRass","download_url":"https://codeload.github.com/KimRass/ViT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246656774,"owners_count":20812884,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cifar10","cifar100","cutmix","dropblock","hide-and-seek","imagenet1k","mixup","vision-transformer"],"created_at":"2025-04-01T14:37:59.258Z","updated_at":"2025-04-01T14:37:59.943Z","avatar_url":"https://github.com/KimRass.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 1. Pre-trained Models\n```python\nDROP_PROB = 0.1\nN_LAYERS = 6\nHIDDEN_SIZE = 384\nMLP_SIZE = 384\nN_HEADS = 12\nPATCH_SIZE = 4\nBASE_LR = 1e-3\nBETA1 = 0.9\nBETA2 = 0.999\nWEIGHT_DECAY = 5e-5\nWARMUP_EPOCHS = 5\nSMOOTHING = 0.1\nCUTMIX = False\nCUTOUT = False\nHIDE_AND_SEEK = False\nBATCH_SIZE = 2048\nN_EPOCHS = 300\n```\n## 1) Trained on CIFAR-10 Dataset for 300 Epochs\n- [vit_cifar10.pth](https://drive.google.com/file/d/1NkMB-WIDIwLIs-DvIxI39-K4TgQFq-nL/view?usp=sharing)\n- Top-1 accuracy 0.864 on validation set\n## 2) Trained on CIFAR-100 Dataset for 256 Epochs\n- [vit_cifar100.pth](https://drive.google.com/file/d/1vxH9c1q2BbHiFRN8JSlu3zj7ZBPvQYR8/view?usp=sharing)\n- Top-1 accuracy 0.447 on validation set\n\n# 2. Implementation Details\n- Changed the architecture so that the order is `F.gelu()` → `nn.Dropout()`. 
I confirmed that with the order reversed the gradients become 0 and no training takes place.\n- On CIFAR-100 with `N_LAYERS = 6, HIDDEN_SIZE = 384, N_HEADS = 6`, performance improved going from `PATCH_SIZE = 16` to `PATCH_SIZE = 8`, and improved again with `PATCH_SIZE = 4`.\n- On both CIFAR-10 and CIFAR-100, models smaller than ViT-Base performed better.\n\n# 3. Studies\n## 1) Attention Map\n- Original image\n    - \u003cimg src=\"https://github.com/KimRass/ViT/assets/67457712/e2088a4c-8a5f-4193-ac72-2f4b2ede2928\" width=\"500\"\u003e\n- head_fusion: \"max\", discard_ratio: 0.85\n    - \u003cimg src=\"https://github.com/KimRass/ViT/assets/67457712/2b3f1ec6-aa2d-4980-b29c-3d90edaa1909\" width=\"500\"\u003e\n## 2) Position Embedding Similarity\n- \u003cimg src=\"https://github.com/KimRass/ViT/assets/67457712/be0efc06-a4d8-4da7-8a11-ed6730da2994\" width=\"500\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkimrass%2Fvit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkimrass%2Fvit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkimrass%2Fvit/lists"}