{"id":22615493,"url":"https://github.com/sovit-123/vision_transformers","last_synced_at":"2025-09-12T12:48:16.689Z","repository":{"id":63773386,"uuid":"570358062","full_name":"sovit-123/vision_transformers","owner":"sovit-123","description":"Vision Transformers for image classification, image segmentation, and object detection.","archived":false,"fork":false,"pushed_at":"2024-10-17T15:21:34.000Z","size":46195,"stargazers_count":50,"open_issues_count":1,"forks_count":9,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-19T07:04:36.761Z","etag":null,"topics":["attention","computer-vision","transformer-models","transformers","vision-transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sovit-123.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-25T01:41:21.000Z","updated_at":"2025-04-23T04:19:32.000Z","dependencies_parsed_at":"2023-02-15T04:16:40.314Z","dependency_job_id":"dbdb98ac-c991-4fb8-90da-3d1ea9cce653","html_url":"https://github.com/sovit-123/vision_transformers","commit_stats":{"total_commits":72,"total_committers":2,"mean_commits":36.0,"dds":0.05555555555555558,"last_synced_commit":"02f349e500ac19ce4e449785fac96c973cefe9ae"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sovit-123/vision_transformers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovit-123%2Fvision_transformers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovi
t-123%2Fvision_transformers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovit-123%2Fvision_transformers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovit-123%2Fvision_transformers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sovit-123","download_url":"https://codeload.github.com/sovit-123/vision_transformers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sovit-123%2Fvision_transformers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274815792,"owners_count":25355211,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-12T02:00:09.324Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","computer-vision","transformer-models","transformers","vision-transformer"],"created_at":"2024-12-08T19:07:43.578Z","updated_at":"2025-09-12T12:48:16.639Z","avatar_url":"https://github.com/sovit-123.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vision_transformers\n\n***A repository for everything Vision Transformers.***\n\n![](readme_images/detr_infer.gif)\n\n## Currently Supported Models\n\n- Image Classification\n\n  - ViT Base Patch 16 | 224x224: Torchvision pretrained weights\n  - ViT Base Patch 32 | 224x224: Torchvision 
pretrained weights\n  - ViT Tiny Patch 16 | 224x224: Timm pretrained weights\n  - ViT Tiny Patch 16 | 384x384: Timm pretrained weights\n  - Swin Transformer Tiny Patch 4 Window 7 | 224x224: Official Microsoft weights\n  - Swin Transformer Small Patch 4 Window 7 | 224x224: Official Microsoft weights\n  - Swin Transformer Base Patch 4 Window 7 | 224x224: Official Microsoft weights\n  - Swin Transformer Large Patch 4 Window 7 | 224x224: No pretrained weights\n  - MobileViT S\n  - MobileViT XS\n  - MobileViT XXS\n- Object Detection\n  - DETR ResNet50 (COCO pretrained)\n  - DETR ResNet50 DC5 (COCO pretrained)\n  - DETR ResNet101 (COCO pretrained)\n  - DETR ResNet101 DC5 (COCO pretrained)\n\n## GO TO\n\n* [Quick Setup](#Quick-Setup)\n* [Importing Models and Usage](#Importing-Models-and-Usage)\n* [DETR Video Inference Commands (COCO pretrained models)](#DETR-Video-Inference-Commands-(COCO-pretrained-models))\n* [Examples](#Examples)\n\n\n## Quick Setup\n\n### Stable PyPI Package\n\n```bash\npip install vision-transformers\n```\n\n### OR\n\n### Latest Git Updates\n\n```bash\ngit clone https://github.com/sovit-123/vision_transformers.git\ncd vision_transformers\n```\n\nInstallation in the environment of your choice:\n\n```bash\npip install .\n```\n\n## Importing Models and Usage\n\n### If you have your own training pipeline and just want the model\n\n**Replace `num_classes=1000`** **with your own number of classes**.\n\n```python\nfrom vision_transformers.models import vit\n\nmodel = vit.vit_b_p16_224(num_classes=1000, pretrained=True)\n# model = vit.vit_b_p32_224(num_classes=1000, pretrained=True)\n# model = vit.vit_ti_p16_224(num_classes=1000, pretrained=True)\n```\n\n```python\nfrom vision_transformers.models import swin_transformer\n\nmodel = swin_transformer.swin_t_p4_w7_224(num_classes=1000, pretrained=True)\n# model = swin_transformer.swin_s_p4_w7_224(num_classes=1000, pretrained=True)\n# model = swin_transformer.swin_b_p4_w7_224(num_classes=1000, pretrained=True)\n# 
model = swin_transformer.swin_l_p4_w7_224(num_classes=1000)\n```\n\n### If you want to use the training pipeline\n\n* Clone the repository:\n\n```bash\ngit clone https://github.com/sovit-123/vision_transformers.git\ncd vision_transformers\n```\n\n* Install\n\n```bash\npip install .\n```\n\nFrom the `vision_transformers` directory:\n\n* If you have no validation split\n\n```bash\npython tools/train_classifier.py --data data/diabetic_retinopathy/colored_images/ 0.15 --epochs 5 --model vit_ti_p16_224\n```\n\n* In the above command:\n\n  * `data/diabetic_retinopathy/colored_images/` is the data folder, with the images stored inside their respective class folders\n\n  * `0.15` is the validation split, needed because the dataset does not contain a validation folder\n\n* If you have a validation split\n\n```bash\npython tools/train_classifier.py --train-dir data/plant_disease_recognition/train/ --valid-dir data/plant_disease_recognition/valid/ --epochs 5 --model vit_ti_p16_224\n```\n\n* In the above command:\n  * `--train-dir` should be the path to the training directory, with the images inside their respective class folders.\n  * `--valid-dir` should be the path to the validation directory, with the images inside their respective class folders.\n\n### All Available Model Flags for `--model`\n\n```\nvit_b_p32_224\nvit_ti_p16_224\nvit_ti_p16_384\nvit_b_p16_224\nswin_b_p4_w7_224\nswin_t_p4_w7_224\nswin_s_p4_w7_224\nswin_l_p4_w7_224\nmobilevit_s\nmobilevit_xs\nmobilevit_xxs\n```\n\n### DETR Training\n\n* The dataset annotations should be in XML format. 
The dataset (according to `--data` flag) given in following can be found here =\u003e https://www.kaggle.com/datasets/sovitrath/aquarium-data\n\n```bash\npython tools/train_detector.py --model detr_resnet50 --epochs 2 --data data/aquarium.yaml\n```\n\n### DETR Image Inference (using trained weights)\n\nReplace weights and input file path as per your requirement.\n\n```bash\npython tools/inference_image_detect.py --weights runs/training/res_1/best_model.pth --input image.jpg\n```\n\nYou can also provide the path to a directory to run inference on all images in that directory.\n\n```bash\npython tools/inference_image_detect.py --weights runs/training/res_1/best_model.pth --input image_directory\n```\n\n### DETR Video Inference (using trained weights)\n\nReplace weights and input file path as per your requirement. You can add `--show` to the command to visualize the detection on screen.\n\n```bash\npython tools/inference_video_detect.py --weights runs/training/res_1/best_model.pth --input video.mp4\n```\n\n## DETR Video Inference Commands (COCO pretrained models)\n\n***All commands to be executed from the root project directory (`vision_transformers`)***\n\n```bash\npython tools/inference_video_detect.py --model detr_resnet50 --show --input example_test_data/video_1.mp4\n                                               detr_resnet50_dc5            \u003cpath/to/your/file\u003e\n                                               detr_resnet101               \n                                               detr_resnet101_dc5\n```\n\n### Tracking using COCO Pretrained Weights\n\n```bash\n# Track all COCO classes.\npython tools/inference_video_detect.py --track --model detr_resnet50 --show --input example_test_data/video_1.mp4\n                                                       detr_resnet50_dc5            \u003cpath/to/your/file\u003e\n                                                       detr_resnet101               \n                                                      
 detr_resnet101_dc5\n\n# Track only the person class (for DETR, object indices start from 2 for COCO pretrained models). Check `data/test_video_config.yaml` for more information.\npython tools/inference_video_detect.py --track --model detr_resnet50 --show --input ../inference_data/video_4.mp4 --classes 2\n\n# Track the person and motorcycle classes (for DETR, object indices start from 2 for COCO pretrained models). Check `data/test_video_config.yaml` for more information.\npython tools/inference_video_detect.py --track --model detr_resnet50 --show --input ../inference_data/video_4.mp4 --classes 2 5\n```\n\n### Tracking using Custom Trained Weights\n\nProvide the path to the trained weights with `--weights` instead of passing `--model`.\n\n```bash\npython tools/inference_video_detect.py --track --weights runs/training/res_1/best_model.pth --show --input ../inference_data/video_4.mp4\n```\n\n## [Examples](https://github.com/sovit-123/vision_transformers/tree/main/examples)\n\n- [ViT Base 16 | 224x224 pretrained fine-tuning on CIFAR10](https://github.com/sovit-123/vision_transformers/blob/main/examples/cifar10_vit_pretrained.ipynb)\n- [ViT Tiny 16 | 224x224 pretrained fine-tuning on CIFAR10](https://github.com/sovit-123/vision_transformers/blob/main/examples/cifar10_vit_tiny_p16_224.ipynb)\n- [DETR image inference notebook](https://github.com/sovit-123/vision_transformers/blob/main/examples/detr_image_inference.ipynb)\n- [DETR video inference script](https://github.com/sovit-123/vision_transformers/blob/main/examples/detr_video_inference.py) (**Fine Tuning Coming Soon**) --- [Check commands here](#DETR-Video-Inference-Commands)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsovit-123%2Fvision_transformers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsovit-123%2Fvision_transformers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsovit-123%2Fvision_transformers/lists"}