Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)
https://github.com/Alibaba-MIIL/STAM

Host: GitHub
URL: https://github.com/Alibaba-MIIL/STAM
Owner: Alibaba-MIIL
License: apache-2.0
Created: 2021-03-25T15:20:58.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2022-08-23T18:08:31.000Z (almost 2 years ago)
Last Synced: 2024-01-16T10:46:17.615Z (5 months ago)
Language: Python
Homepage:
Size: 38.1 KB
Stars: 215
Watchers: 11
Forks: 31
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-stars - Alibaba-MIIL/STAM - Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper) (Python)
StarryDivineSky - Alibaba-MIIL/STAM

README

# An Image is Worth 16x16 Words, What is a Video Worth?

[paper](https://arxiv.org/pdf/2103.13915.pdf)

Official PyTorch Implementation

> Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor

> DAMO Academy, Alibaba Group

**Abstract**

> Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy, usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 less frames per video, and ×40 faster inference than the current leading method.
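
To make the phrase "global attention over video frames" concrete, here is a minimal sketch of single-head self-attention applied across per-frame embeddings. It is not the repository's implementation: the function name `global_frame_attention` is hypothetical, the example uses plain Python lists instead of PyTorch tensors, and it assumes identity query/key/value projections where a trained model would learn them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_frame_attention(frames):
    """Single-head self-attention over T frame embeddings of size D.

    Assumption: queries, keys and values are the embeddings themselves
    (identity projections). A real model would learn these projections.
    """
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in frames:
        # Each frame scores itself against every frame in the video (T scores).
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in frames]
        weights = softmax(scores)
        # The output is a weighted mix of all frame embeddings.
        mixed = [sum(w * value[d] for w, value in zip(weights, frames))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


# Four toy "frame embeddings" of dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
attended = global_frame_attention(frames)
print(len(attended), len(attended[0]))
```

Because every frame attends to all T frames, the cost grows with T squared, which is one reason a small number of input frames keeps this kind of temporal attention cheap.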

## Update 2/5/2021: Improved results
With improved training hyperparameters and knowledge distillation (KD) training, we improved
STAM results on Kinetics400 by roughly 1.5%. We are releasing the pretrained weights of the improved
models (see Pretrained Models below).

## Main Article Results

Accuracy and GPU throughput of STAM models on Kinetics400, compared to X3D. All measurements were
taken on an NVIDIA V100 GPU with mixed precision. All models were trained at an input resolution of 224.

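| Models | Top-1 Accuracy (%) | Flops × views (10^9) | # Input Frames | Runtime (Videos/sec) |
| ------------ | :--------------: | :--------------: | :--------------: | :--------------: |
| X3D-M | 76.0 | 6.2 × 30 | 480 | 1.3 |
| X3D-L | 77.5 | 24.8 × 30 | 480 | 0.46 |
| X3D-XL | 79.1 | 48.4 × 30 | 480 | N/A |
| X3D-XXL | 80.4 | 194 × 30 | 480 | N/A |
| TimeSformer-L | 80.7 | 2380 × 3 | 288 | N/A |
| ViViT-L | 81.3 | 3992 × 12 | 384 | N/A |
| STAM-8 | 77.5 | 135 × 1 | 8 | --- |
| STAM-16 | 79.3 | 270 × 1 | 16 | 20.0 |
| STAM-32 | 79.95 | 540 × 1 | 32 | --- |
| STAM-64 | 80.5 | 1080 × 1 | 64 | 4.8 |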
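
The efficiency figures can be checked with a few lines of arithmetic on the table values. The sketch below compares STAM-16 with X3D-L; picking X3D-L as the baseline is an assumption made for illustration, since the README does not say which row the abstract's "×30" and "×40" figures refer to.

```python
# Values copied from the results table:
# (GFLOPs per view, number of views, input frames, videos/sec).
models = {
    "X3D-L": (24.8, 30, 480, 0.46),
    "STAM-16": (270.0, 1, 16, 20.0),
}


def total_gflops(name):
    """Total inference cost: FLOPs per view times number of views."""
    flops_per_view, views, _, _ = models[name]
    return flops_per_view * views


def frame_reduction(baseline, candidate):
    """How many times fewer input frames the candidate uses."""
    return models[baseline][2] / models[candidate][2]


def speedup(baseline, candidate):
    """Ratio of measured throughputs (videos/sec)."""
    return models[candidate][3] / models[baseline][3]


print(f"X3D-L total GFLOPs:   {total_gflops('X3D-L'):.1f}")
print(f"STAM-16 total GFLOPs: {total_gflops('STAM-16'):.1f}")
print(f"Frame reduction:      {frame_reduction('X3D-L', 'STAM-16'):.1f}x")
print(f"Throughput speedup:   {speedup('X3D-L', 'STAM-16'):.1f}x")
```

Under this choice of baseline, STAM-16 uses 30 times fewer frames and runs a little over 40 times faster, which is in line with the abstract.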

## Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

| Model name | Checkpoint |
| ------------ | :--------------: |
| STAM_8 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_8.pth) |
| STAM_16 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_16.pth) |
| STAM_32 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_32.pth) |
| STAM_64 | [link](https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2/stam_64.pth) |
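
The four checkpoint links share one URL pattern, so a small helper can build them and fetch a file. This is a convenience sketch rather than part of the repository: `checkpoint_url` and `download_checkpoint` are hypothetical names, and the download relies only on the standard library's `urllib.request.urlretrieve`.

```python
import os
import urllib.request

# Base path taken from the checkpoint links in the table above.
BASE_URL = "https://miil-public-eu.oss-eu-central-1.aliyuncs.com/model-zoo/STAM/v2"
FRAME_COUNTS = (8, 16, 32, 64)


def checkpoint_url(num_frames):
    """Return the checkpoint URL for a STAM model with `num_frames` input frames."""
    if num_frames not in FRAME_COUNTS:
        raise ValueError(f"no STAM checkpoint listed for {num_frames} frames")
    return f"{BASE_URL}/stam_{num_frames}.pth"


def download_checkpoint(num_frames, dest_dir="."):
    """Download the checkpoint into `dest_dir` and return the local path."""
    url = checkpoint_url(num_frames)
    dest = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)  # requires network access
    return dest


print(checkpoint_url(16))
```

Calling `download_checkpoint(16)` would save `stam_16.pth` in the current directory, assuming the link is still reachable.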

## Reproduce Article Scores
We provide code for reproducing the validation top-1 score of STAM
models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224)
run:
```bash
python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16 \
--input_size=224
```
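
For reference, the top-1 score is the percentage of validation videos whose highest-scoring class matches the ground-truth label. The sketch below illustrates that metric on toy data; `top1_accuracy` is a hypothetical helper and does not reproduce how `infer.py` computes or reports its result.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose arg-max class equals the label."""
    if not labels or len(logits) != len(labels):
        raise ValueError("logits and labels must be non-empty and the same length")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: 4 videos, 3 classes.
logits = [
    [0.1, 2.0, -1.0],  # predicts class 1
    [1.5, 0.2, 0.3],   # predicts class 0
    [0.0, 0.1, 0.9],   # predicts class 2
    [0.4, 0.3, 0.2],   # predicts class 0
]
labels = [1, 0, 1, 0]
print(top1_accuracy(logits, labels))  # 75.0
```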

## Citations

```bibtex
@misc{sharir2021image,
title = {An Image is Worth 16x16 Words, What is a Video Worth?},
author = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
year = {2021},
eprint = {2103.13915},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
```

## Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent
[repository of Ross Wightman](https://github.com/rwightman/pytorch-image-models). Check it out and give it a star while
you are at it.