{"id":13869863,"url":"https://github.com/Alibaba-MIIL/STAM","last_synced_at":"2025-07-15T18:32:11.288Z","repository":{"id":37687506,"uuid":"351482998","full_name":"Alibaba-MIIL/STAM","owner":"Alibaba-MIIL","description":"Official implementation of \"An Image is Worth 16x16 Words, What is a Video Worth?\"  (2021 paper) ","archived":false,"fork":false,"pushed_at":"2022-08-23T18:08:31.000Z","size":39,"stargazers_count":219,"open_issues_count":6,"forks_count":31,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-11-23T15:35:50.655Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Alibaba-MIIL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-25T15:20:58.000Z","updated_at":"2024-04-02T01:04:45.000Z","dependencies_parsed_at":"2022-09-15T10:12:46.804Z","dependency_job_id":null,"html_url":"https://github.com/Alibaba-MIIL/STAM","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Alibaba-MIIL/STAM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-MIIL%2FSTAM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-MIIL%2FSTAM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-MIIL%2FSTAM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-MIIL%2FSTAM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Alibaba-MIIL","download_url":"https://codeload.github.com/Alibaba-MIIL/STAM/tar.gz/refs/heads
/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alibaba-MIIL%2FSTAM/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265451456,"owners_count":23767769,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T20:01:20.039Z","updated_at":"2025-07-15T18:32:11.020Z","avatar_url":"https://github.com/Alibaba-MIIL.png","language":"Python","funding_links":[],"categories":["Python","其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# An Image is Worth 16x16 Words, What is a Video Worth?\n\n\u003c!-- [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/an-image-is-worth-16x16-words-what-is-a-video/action-classification-on-kinetics-400)](https://paperswithcode.com/sota/action-classification-on-kinetics-400?p=an-image-is-worth-16x16-words-what-is-a-video) --\u003e\n\n[paper](https://arxiv.org/pdf/2103.13915.pdf) \n\nOfficial PyTorch Implementation\n\n\u003e Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor\u003cbr/\u003e\n\u003e DAMO Academy, Alibaba Group\n\n\n\n**Abstract**\n\n\u003e Leading methods in the domain of action recognition try to\ndistill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the\nArt (SotA) accuracy, usually make use of 3D convolution\nlayers as a way to abstract the temporal information from\nvideo frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a\ncollection of closely sampled frames. 
Since each short clip\ncovers a small fraction of an input video, multiple clips are\nsampled at inference in order to cover the whole temporal\nlength of the video. This leads to increased computational\nload and is impractical for real-world applications. We address the computational bottleneck by significantly reducing\nthe number of frames required for inference. Our approach\nrelies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient\ninformation in each frame. Therefore our approach is very\ninput efficient, and can achieve SotA results (on Kinetics\ndataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach\n78.8 top-1 accuracy with ×30 less frames per video, and\n×40 faster inference than the current leading method\n\u003e\n\n## Update 2/5/2021:  Improved results\nDue to improved training hyperparameters, and using KD training, we were able to improve\n STAM results on Kinetics400 (+ ~1.5%).  We are releasing the pretrained weights of the improved\n  models (see Pretrained Models below). \n\n## Main Article Results\n\nSTAM models accuracy and GPU throughput on Kinetics400, compared to X3D. All measurements were\n done on Nvidia V100 GPU, with mixed precision. 
All models are trained on input resolution of 224.\n\u003cp align=\"center\"\u003e\n \u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003eModels\u003c/th\u003e\n    \u003cth\u003eTop-1 Accuracy \u003cbr\u003e(%)\u003c/th\u003e\n    \u003cth\u003eFlops × views\u003cbr\u003e(10^9)\u003c/th\u003e\n    \u003cth\u003e# Input Frames\u003c/th\u003e\n    \u003cth\u003eRuntime\u003cbr\u003e(Videos/sec)\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eX3D-M\u003c/td\u003e\n    \u003ctd\u003e76.0\u003c/td\u003e\n    \u003ctd\u003e6.2 × 30 \u003c/td\u003e\n    \u003ctd\u003e480\u003c/td\u003e\n    \u003ctd\u003e1.3\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eX3D-L\u003c/td\u003e\n    \u003ctd\u003e77.5\u003c/td\u003e\n    \u003ctd\u003e24.8 × 30\u003c/td\u003e\n    \u003ctd\u003e480\u003c/td\u003e\n    \u003ctd\u003e0.46\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eX3D-XL\u003c/td\u003e\n    \u003ctd\u003e79.1\u003c/td\u003e\n    \u003ctd\u003e48.4 × 30\u003c/td\u003e\n    \u003ctd\u003e480\u003c/td\u003e\n    \u003ctd\u003eN/A\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eX3D-XXL\u003c/td\u003e\n    \u003ctd\u003e80.4\u003c/td\u003e\n    \u003ctd\u003e194 × 30\u003c/td\u003e\n    \u003ctd\u003e480\u003c/td\u003e\n    \u003ctd\u003eN/A\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTimeSformer-L\u003c/td\u003e\n    \u003ctd\u003e80.7\u003c/td\u003e\n    \u003ctd\u003e2380 × 3\u003c/td\u003e\n    \u003ctd\u003e 288 \u003c/td\u003e\n    \u003ctd\u003eN/A\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eViViT-L\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e81.3\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e3992 × 12\u003c/td\u003e\n    \u003ctd\u003e384\u003c/td\u003e\n    \u003ctd\u003eN/A\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSTAM-8\u003c/td\u003e\n    \u003ctd\u003e77.5\u003c/td\u003e\n    
\u003ctd\u003e\u003cb\u003e135 × 1\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e8\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e---\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSTAM-16\u003c/td\u003e\n    \u003ctd\u003e79.3\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e270 × 1\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e16\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e20.0\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSTAM-32\u003c/td\u003e\n    \u003ctd\u003e79.95\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e540 × 1\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e32\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e---\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSTAM-64\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e80.5\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003e1080 × 1\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e64\u003c/td\u003e\n    \u003ctd\u003e4.8\u003c/td\u003e\n  \u003c/tr\u003e\n \u003c/table\u003e\n\u003c/p\u003e\n\n## Pretrained Models\n\nWe provide a collection of STAM models pre-trained on Kinetics400. \n\n| Model name  | checkpoint\n| ------------ | :--------------: |\n| STAM_8 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_8.pth) |\n| STAM_16 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_16.pth) |\n| STAM_32 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_32.pth) |\n| STAM_64 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_64.pth) |\n\n\n## Reproduce Article Scores\nWe provide code for reproducing the validation top-1 score of STAM\nmodels on Kinetics400. First, download pretrained models from the links above.\n\nThen, run the infer.py script. 
For example, for stam_16 (input size 224)\nrun:\n```bash\npython -m infer \\\n--val_dir=/path/to/kinetics_val_folder \\\n--model_path=/model/path/to/stam_16.pth \\\n--model_name=stam_16 \\\n--input_size=224\n```\n\n\n## Citations\n\n```bibtex\n@misc{sharir2021image,\n    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, \n    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},\n    year    = {2021},\n    eprint  = {2103.13915},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CV}\n}\n```\n\n## Acknowledgements\n\nWe thank Tal Ridnik for discussions and comments.\n\nSome components of this code implementation are adapted from the excellent\n[repository of Ross Wightman](https://github.com/rwightman/pytorch-image-models). Check it out and give it a star while\nyou are at it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlibaba-MIIL%2FSTAM","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAlibaba-MIIL%2FSTAM","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAlibaba-MIIL%2FSTAM/lists"}