{"id":13958474,"url":"https://github.com/facebookresearch/Motionformer","last_synced_at":"2025-07-21T00:30:53.332Z","repository":{"id":37941059,"uuid":"373427173","full_name":"facebookresearch/Motionformer","owner":"facebookresearch","description":"Code + pre-trained models for the paper Keeping Your Eye on the Ball Trajectory Attention in Video Transformers","archived":true,"fork":false,"pushed_at":"2022-06-13T23:03:22.000Z","size":1489,"stargazers_count":227,"open_issues_count":8,"forks_count":30,"subscribers_count":11,"default_branch":"main","last_synced_at":"2025-02-23T00:14:57.224Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facebookresearch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-03T07:59:07.000Z","updated_at":"2024-11-25T02:03:19.000Z","dependencies_parsed_at":"2022-07-12T17:03:56.653Z","dependency_job_id":null,"html_url":"https://github.com/facebookresearch/Motionformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/facebookresearch/Motionformer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMotionformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMotionformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMotionformer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMotionformer/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/facebookresearch","download_url":"https://codeload.github.com/facebookresearch/Motionformer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facebookresearch%2FMotionformer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266221247,"owners_count":23894964,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:37.292Z","updated_at":"2025-07-21T00:30:52.678Z","avatar_url":"https://github.com/facebookresearch.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# Motionformer\n\nThis is an official PyTorch implementation of the paper [Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers](https://arxiv.org/abs/2106.05392). In this repository, we provide PyTorch code for training and testing our proposed Motionformer model. Motionformer uses the proposed *trajectory attention* to achieve state-of-the-art results on several video action recognition benchmarks such as Kinetics-400 and Something-Something V2.\n\nIf you find Motionformer useful in your research, please use the following BibTeX entry for citation.\n\n```BibTeX\n@inproceedings{patrick2021keeping,\n   title={Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers}, \n   author={Mandela Patrick and Dylan Campbell and Yuki M. Asano and Ishan Misra and Florian Metze and Christoph Feichtenhofer and Andrea Vedaldi and Jo\\ão F. 
Henriques},\n   year={2021},\n   booktitle={Advances in Neural Information Processing Systems (NeurIPS)},\n}\n```\n\n# Model Zoo\n\nWe provide Motionformer models pretrained on Kinetics-400 (K400), Kinetics-600 (K600), Something-Something-V2 (SSv2), and Epic-Kitchens datasets.\n\n| name | dataset | # of frames | spatial crop | acc@1 | acc@5 | url |\n| --- | --- | --- | --- | --- | --- | --- |\n| Joint | K400 | 16 | 224 | 79.2 | 94.2 | [model](https://dl.fbaipublicfiles.com/motionformer/k400_joint_224_16x4.pyth) |\n| Divided | K400 | 16 | 224 | 78.5 | 93.8 | [model](https://dl.fbaipublicfiles.com/motionformer/k400_divided_224_16x4.pyth) |\n| Motionformer | K400 | 16 | 224 | 79.7 | 94.2 | [model](https://dl.fbaipublicfiles.com/motionformer/k400_motionformer_224_16x4.pyth) |\n| Motionformer-HR | K400 | 16 | 336 | 81.1 | 95.2 | [model](https://dl.fbaipublicfiles.com/motionformer/k400_motionformer_336_16x8.pyth) |\n| Motionformer-L | K400 | 32 | 224 | 80.2 | 94.8 | [model](https://dl.fbaipublicfiles.com/motionformer/k400_motionformer_224_32x3.pyth) |\n\n| name | dataset | # of frames | spatial crop | acc@1 | acc@5 | url |\n| --- | --- | --- | --- | --- | --- | --- |\n| Motionformer | K600 | 16 | 224 | 81.6 | 95.6 | [model](https://dl.fbaipublicfiles.com/motionformer/k600_motionformer_224_16x4.pyth) |\n| Motionformer-HR | K600 | 16 | 336 | 82.7 | 96.1 | [model](https://dl.fbaipublicfiles.com/motionformer/k600_motionformer_336_16x8.pyth) |\n| Motionformer-L | K600 | 32 | 224 | 82.2 | 96.0 | [model](https://dl.fbaipublicfiles.com/motionformer/k600_motionformer_224_32x3.pyth) |\n\n| name | dataset | # of frames | spatial crop | acc@1 | acc@5 | url |\n| --- | --- | --- | --- | --- | --- | --- |\n| Joint | SSv2 | 16 | 224 | 64.0 | 88.4 | [model](https://dl.fbaipublicfiles.com/motionformer/ssv2_joint_224_16x4.pyth) |\n| Divided | SSv2 | 16 | 224 | 64.2 | 88.6 | [model](https://dl.fbaipublicfiles.com/motionformer/ssv2_divided_224_16x4.pyth) |\n| Motionformer | SSv2 | 16 | 224 | 
66.5 | 90.1 | [model](https://dl.fbaipublicfiles.com/motionformer/ssv2_motionformer_224_16x4.pyth) |\n| Motionformer-HR | SSv2 | 16 | 336 | 67.1 | 90.6 | [model](https://dl.fbaipublicfiles.com/motionformer/ssv2_motionformer_336_16x4.pyth) |\n| Motionformer-L | SSv2 | 32 | 224 | 68.1 | 91.2 | [model](https://dl.fbaipublicfiles.com/motionformer/ssv2_motionformer_224_32x3.pyth) |\n\n| name | dataset | # of frames | spatial crop | action acc | noun acc | url |\n| --- | --- | --- | --- | --- | --- | --- |\n| Motionformer | EK | 16 | 224 | 43.1 | 56.5 | [model](https://dl.fbaipublicfiles.com/motionformer/ek_motionformer_224_16x4.pyth) |\n| Motionformer-HR | EK | 16 | 336 | 44.5 | 58.5 | [model](https://dl.fbaipublicfiles.com/motionformer/ek_motionformer_336_16x4.pyth) |\n| Motionformer-L | EK | 32 | 224 | 44.1 | 57.6 | [model](https://dl.fbaipublicfiles.com/motionformer/ek_motionformer_224_32x3.pyth) |\n\n# Installation\n\nFirst, create a conda virtual environment and activate it:\n```\nconda create -n motionformer python=3.8.5 -y\nsource activate motionformer\n```\n\nThen, install the following packages:\n\n- torchvision: `pip install torchvision` or `conda install torchvision -c pytorch`\n- [fvcore](https://github.com/facebookresearch/fvcore/): `pip install 'git+https://github.com/facebookresearch/fvcore'`\n- simplejson: `pip install simplejson`\n- einops: `pip install einops`\n- timm: `pip install timm`\n- PyAV: `conda install av -c conda-forge`\n- psutil: `pip install psutil`\n- scikit-learn: `pip install scikit-learn`\n- OpenCV: `pip install opencv-python`\n- tensorboard: `pip install tensorboard`\n- matplotlib: `pip install matplotlib`\n- pandas: `pip install pandas`\n- ffmpeg: `pip install ffmpeg-python`\n\nOR:\n\nSimply create a conda environment with all packages from the provided YAML file:\n\n`conda env create -f environment.yml`\n\nLastly, build the Motionformer codebase by running:\n```\ngit clone https://github.com/facebookresearch/Motionformer\ncd Motionformer\npython 
setup.py build develop\n```\n\n# Usage\n\n## Dataset Preparation\n\nPlease use the dataset preparation instructions provided in [DATASET.md](slowfast/datasets/DATASET.md).\n\n## Training the Default Motionformer\n\nThe default Motionformer uses trajectory attention and operates on 16-frame clips cropped at 224x224 spatial resolution. It can be trained using the following command:\n\n```\npython tools/run_net.py \\\n  --cfg configs/K400/motionformer_224_16x4.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 8 \\\n  TRAIN.BATCH_SIZE 8\n```\nYou may need to pass the location of your dataset on the command line by adding `DATA.PATH_TO_DATA_DIR path_to_your_dataset`, or you can simply add\n\n```\nDATA:\n  PATH_TO_DATA_DIR: path_to_your_dataset\n```\n\nto the YAML config file, so that you do not need to pass it on the command line every time.\n\nWe improved the trajectory attention relative to the original code. To use the improved version, set the `VIT.USE_ORIGINAL_TRAJ_ATTN_CODE` flag to `False`:\n```\nVIT:\n  USE_ORIGINAL_TRAJ_ATTN_CODE: False\n```\n\n## Using a Different Number of GPUs\n\nIf you want to use a smaller number of GPUs, you need to modify the .yaml configuration files in [`configs/`](configs/). Specifically, you need to modify the `NUM_GPUS`, `TRAIN.BATCH_SIZE`, `TEST.BATCH_SIZE`, and `DATA_LOADER.NUM_WORKERS` entries in each configuration file. 
The `BATCH_SIZE` entry should be equal to or greater than the `NUM_GPUS` entry.\n\n## Using Different Self-Attention Schemes\n\nIf you want to experiment with different space-time self-attention schemes, e.g., joint space-time attention or divided space-time attention, use the following commands:\n\n\n```\npython tools/run_net.py \\\n  --cfg configs/K400/joint_224_16x4.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 8 \\\n  TRAIN.BATCH_SIZE 8\n```\n\nand\n\n```\npython tools/run_net.py \\\n  --cfg configs/K400/divided_224_16x4.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 8 \\\n  TRAIN.BATCH_SIZE 8\n```\n\n## Training Different Motionformer Variants\n\nIf you want to train more powerful Motionformer variants, e.g., Motionformer-HR (operating on 16-frame clips sampled at 336x336 spatial resolution) and Motionformer-L (operating on 32-frame clips sampled at 224x224 spatial resolution), use the following commands:\n\n```\npython tools/run_net.py \\\n  --cfg configs/K400/motionformer_336_16x8.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 8 \\\n  TRAIN.BATCH_SIZE 8\n```\n\nand\n\n```\npython tools/run_net.py \\\n  --cfg configs/K400/motionformer_224_32x3.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  NUM_GPUS 8 \\\n  TRAIN.BATCH_SIZE 8\n```\n\nNote that for these models you will need a set of GPUs with ~32GB of memory.\n\n## Inference\n\nUse `TRAIN.ENABLE` and `TEST.ENABLE` to control whether training or testing is required for a given run. 
When testing, you also have to provide the path to the checkpoint model via `TEST.CHECKPOINT_FILE_PATH`.\n```\npython tools/run_net.py \\\n  --cfg configs/K400/motionformer_224_16x4.yaml \\\n  DATA.PATH_TO_DATA_DIR path_to_your_dataset \\\n  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \\\n  TRAIN.ENABLE False\n```\n\nAlternatively, you can modify the provided SLURM script and run the following:\n\n```\nsbatch slurm_scripts/test.sh configs/K400/motionformer_224_16x4.yaml path_to_your_checkpoint\n```\n\n## Single-Node Training via Slurm\n\nTo train Motionformer via Slurm, please check out our single-node Slurm training script [`slurm_scripts/run_single_node_job.sh`](slurm_scripts/run_single_node_job.sh).\n\n```\nsbatch slurm_scripts/run_single_node_job.sh configs/K400/motionformer_224_16x4.yaml /your/job/dir/${JOB_NAME}/\n```\n\n## Multi-Node Training via Submitit\n\nDistributed training is available via Slurm and submitit:\n\n```\npip install submitit\n```\n\nTo train the Motionformer model on Kinetics using 8 nodes with 8 GPUs each, use the following command:\n```\npython run_with_submitit.py --cfg configs/K400/motionformer_224_16x4.yaml --job_dir  /your/job/dir/${JOB_NAME}/ --partition $PARTITION --num_shards 8 --use_volta32\n```\n\nWe provide a script for launching Slurm jobs in [`slurm_scripts/run_multi_node_job.sh`](slurm_scripts/run_multi_node_job.sh).\n\n```\nsbatch slurm_scripts/run_multi_node_job.sh configs/K400/motionformer_224_16x4.yaml /your/job/dir/${JOB_NAME}/\n```\n\nPlease note that the hyper-parameters in the configs were used with 8 nodes with 8 GPUs (32 GB) each. 
Please scale the batch size and learning rate appropriately for your cluster configuration.\n\n## Finetuning\n\nTo finetune from an existing PyTorch checkpoint, add the following options on the command line, or add the corresponding entries to the YAML config:\n\n```\nTRAIN.CHECKPOINT_EPOCH_RESET True\nTRAIN.CHECKPOINT_FILE_PATH path_to_your_PyTorch_checkpoint\n```\n\n# Environment\n\nThe code was developed using Python 3.8.5 on Ubuntu 20.04. For training, we used eight GPU compute nodes, each containing 8 Tesla V100 GPUs (64 GPUs in total). Other platforms or GPU cards have not been fully tested.\n\n# License\n\nThe majority of this work is licensed under [CC-NC 4.0 International license](LICENSE). However, portions of the project are available under separate license terms: [SlowFast](https://github.com/facebookresearch/SlowFast) and [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) are licensed under the Apache 2.0 license.\n\n# Contributing\n\nWe actively welcome your pull requests. Please see [CONTRIBUTING.md](CONTRIBUTING.md) and [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) for more info.\n\n# Acknowledgements\n\nMotionformer is built on top of [PySlowFast](https://github.com/facebookresearch/SlowFast), [TimeSformer](https://github.com/facebookresearch/TimeSformer) and [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) by [Ross Wightman](https://github.com/rwightman). We thank the authors for releasing their code. 
If you use our model, please consider citing these works as well:\n\n```BibTeX\n@misc{fan2020pyslowfast,\n  author =       {Haoqi Fan and Yanghao Li and Bo Xiong and Wan-Yen Lo and\n                  Christoph Feichtenhofer},\n  title =        {PySlowFast},\n  howpublished = {\\url{https://github.com/facebookresearch/slowfast}},\n  year =         {2020}\n}\n```\n\n```BibTeX\n@inproceedings{gberta_2021_ICML,\n    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},\n    title = {Is Space-Time Attention All You Need for Video Understanding?},\n    booktitle   = {Proceedings of the International Conference on Machine Learning (ICML)}, \n    month = {July},\n    year = {2021}\n}\n```\n\n```BibTeX\n@misc{rw2019timm,\n  author = {Ross Wightman},\n  title = {PyTorch Image Models},\n  year = {2019},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  doi = {10.5281/zenodo.4414861},\n  howpublished = {\\url{https://github.com/rwightman/pytorch-image-models}}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FMotionformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacebookresearch%2FMotionformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacebookresearch%2FMotionformer/lists"}