https://github.com/quva-lab/sigma

eccv2024 kinetics representation-learning self-supervised-learning something-something-v2 video


Host: GitHub
URL: https://github.com/quva-lab/sigma
Owner: QUVA-Lab
License: bsd-3-clause-clear
Created: 2024-07-16T09:20:08.000Z (almost 2 years ago)
Default Branch: gh-pages
Last Pushed: 2024-12-30T08:38:21.000Z (over 1 year ago)
Last Synced: 2025-04-21T10:55:18.302Z (about 1 year ago)
Topics: eccv2024, kinetics, representation-learning, self-supervised-learning, something-something-v2, video
Language: Python
Homepage: https://quva-lab.github.io/SIGMA/
Size: 8.98 MB
Stars: 16
Watchers: 7
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# SIGMA: Sinkhorn-Guided Masked Video Modeling (ECCV 2024)

[![arXiv](https://img.shields.io/badge/cs.CV-2407.15447-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/html/2407.15447v1)

[![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Models-SIGMA-blue)](https://huggingface.co/SMSD75/SIGMA) 

[![Static Badge](https://img.shields.io/badge/website-SIGMA-8A2BE2)](https://quva-lab.github.io/SIGMA/)

![SIGMA Framework](figs/method.jpg)

 

### 🔥 Sinkhorn-Guided Masked Video Modeling

Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet they fall short in capturing higher-level semantics because they reconstruct predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modeling (SIGMA), a novel video pretraining method that jointly learns the video model and a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions, as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task in which the video model predicts the cluster assignments of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally aware, and robust video representations, improving upon state-of-the-art methods.
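
The clustering step described above can be illustrated with a minimal sketch. It assumes a standard Sinkhorn-Knopp iteration for the optimal-transport step and a cross-entropy loss for the symmetric prediction task. The function names and the values of `epsilon`, `temperature`, and `n_iters` are illustrative assumptions, not taken from this repository; see the repository code for the actual implementation.

```python
import math


def sinkhorn_assign(scores, n_iters=3, epsilon=0.05):
    """Turn a (num_features x num_clusters) score matrix into soft cluster
    assignments whose total mass is spread evenly over the clusters.

    Hypothetical helper for illustration only.
    """
    n, k = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for stability.
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every cluster receives the same total mass (1 / k).
        col_sums = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col_sums[j] * k) for j in range(k)] for i in range(n)]
        # Row step: every feature receives the same total mass (1 / n).
        row_sums = [sum(row) for row in q]
        q = [[v / (row_sums[i] * n) for v in row] for i, row in enumerate(q)]
    # Rescale so each row is a probability distribution over clusters.
    return [[v * n for v in row] for row in q]


def softmax(row, temperature=0.1):
    top = max(row)
    exps = [math.exp((s - top) / temperature) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets, scores):
    """Mean cross-entropy between target assignments and softmax(scores)."""
    loss = 0.0
    for t_row, s_row in zip(targets, scores):
        p_row = softmax(s_row)
        loss -= sum(t * math.log(p) for t, p in zip(t_row, p_row))
    return loss / len(targets)


def symmetric_loss(model_scores, proj_scores):
    """Each branch predicts the Sinkhorn targets computed from the other."""
    return 0.5 * (
        cross_entropy(sinkhorn_assign(proj_scores), model_scores)
        + cross_entropy(sinkhorn_assign(model_scores), proj_scores)
    )


# Toy example: 4 tube features scored against 2 clusters by each branch.
model_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
proj_scores = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9], [0.5, 0.5]]
print(sinkhorn_assign(model_scores))
print(symmetric_loss(model_scores, proj_scores))
```

In the toy input every feature scores cluster 0 highest, yet the column step forces both clusters to receive equal mass. This is the equal-partition constraint that rules out the trivial solution of assigning everything to one cluster.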

### ✨ Something-Something V2

|  Method  |  Extra Data  | Backbone | Resolution | #Frames x Clips x Crops | Epoch | Top-1 |
| :------: | :----------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE |   ***no***   |  ViT-S   |  224x224   |         16x2x3          | 2400  | 66.8  |
| VideoMAE |   ***no***   |  ViT-B   |  224x224   |         16x2x3          |  800  | 69.6  |
| SIGMA    | ***Img-1k*** |  ViT-S   |  224x224   |         16x2x3          | 2400  | 68.6  |
| SIGMA    | ***Img-1k*** |  ViT-B   |  224x224   |         16x2x3          |  800  | 70.9  |

### ✨ Kinetics-400

|  Method  |  Extra Data  | Backbone | Resolution | #Frames x Clips x Crops | Epoch | Top-1 |
| :------: | :----------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE |   ***no***   |  ViT-S   |  224x224   |         16x5x3          | 1600  | 79.0  |
| VideoMAE |   ***no***   |  ViT-B   |  224x224   |         16x5x3          |  800  | 80.0  |
| SIGMA    | ***Img-1k*** |  ViT-S   |  224x224   |         16x5x3          |  800  | 79.4  |
| SIGMA    | ***Img-1k*** |  ViT-B   |  224x224   |         16x5x3          |  800  | 81.6  |
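
The `#Frames x Clips x Crops` column in the tables above follows the usual multi-view testing convention: `16x5x3` reads as 16 frames per clip, 5 temporal clips, and 3 spatial crops, giving 15 views per video. The sketch below assumes that per-view class scores are averaged before taking the top-1 prediction. The README does not state the aggregation rule, and `multi_view_top1` is a hypothetical helper, not a function from this repository.

```python
def multi_view_top1(view_scores, labels):
    """Top-1 accuracy (in percent) from scores averaged over views.

    view_scores[v][i][c] is the score of class c for video i under view v,
    where a view is one (clip, crop) pair. Hypothetical helper.
    """
    n_views = len(view_scores)
    correct = 0
    for i, label in enumerate(labels):
        n_classes = len(view_scores[0][i])
        # Average the class scores of video i over all views.
        mean = [
            sum(view_scores[v][i][c] for v in range(n_views)) / n_views
            for c in range(n_classes)
        ]
        # The predicted class is the index of the highest mean score.
        pred = max(range(n_classes), key=mean.__getitem__)
        correct += int(pred == label)
    return 100.0 * correct / len(labels)


# Toy example: 2 views, 2 videos, 2 classes.
view_scores = [
    [[0.6, 0.4], [0.9, 0.1]],  # view 0
    [[0.2, 0.8], [0.7, 0.3]],  # view 1
]
print(multi_view_top1(view_scores, labels=[1, 0]))
```

In the toy example, view 0 alone would misclassify the first video, while the averaged scores favor class 1. This illustrates why evaluating over several clips and crops can differ from single-view accuracy.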

## 🔨 Installation

Please follow the instructions in [INSTALL.md](INSTALL.md).

## ➡️ Data Preparation

Please follow the instructions in [DATASET.md](DATASET.md) for data preparation.

## 🔄 Pre-training

Pre-training instructions are in [PRETRAIN.md](PRETRAIN.md).

Our models are available on [Hugging Face](https://huggingface.co/SMSD75/SIGMA), including both the pretrained and **finetuned** versions.

## ⤴️ Fine-tuning with pre-trained models

Fine-tuning instructions are in [FINETUNE.md](FINETUNE.md).

## 📍Model Zoo

## ⚠️ Our code is based on [VideoMAE](https://github.com/MCG-NJU/VideoMAE) code base. 

## ✏️ Citation

If you find this project helpful, please consider leaving a star ⭐️ and citing our paper:

```
@inproceedings{salehi2024sigma,
  title={SIGMA: Sinkhorn-Guided Masked Video Modeling},
  author={Salehi, Mohammadreza and Dorkenwald, Michael and Thoker, Fida Mohammad and Gavves, Efstratios and Snoek, Cees GM and Asano, Yuki M},
  booktitle={European Conference on Computer Vision},
  year={2024}
}
```
