Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/facebookresearch/ijepa
Official codebase for I-JEPA, the Image-based Joint-Embedding Predictive Architecture. First outlined in the CVPR paper, "Self-supervised learning from images with a joint-embedding predictive architecture."
- Host: GitHub
- URL: https://github.com/facebookresearch/ijepa
- Owner: facebookresearch
- License: other
- Archived: true
- Created: 2023-06-12T14:48:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-08T19:41:39.000Z (8 months ago)
- Last Synced: 2024-09-22T07:31:36.219Z (3 months ago)
- Language: Python
- Homepage:
- Size: 32.2 KB
- Stars: 2,793
- Watchers: 59
- Forks: 355
- Open Issues: 37
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- AiTreasureBox - [facebookresearch/ijepa](https://github.com/facebookresearch/ijepa) ![GitHub stars](https://img.shields.io/github/stars/facebookresearch/ijepa.svg) - Official codebase for I-JEPA, the Image-based Joint-Embedding Predictive Architecture. First outlined in the CVPR paper, "Self-supervised learning from images with a joint-embedding predictive architecture." (Repos)
README
# I-JEPA
Official PyTorch codebase for I-JEPA (the **Image-based Joint-Embedding Predictive Architecture**) published @ CVPR-23.
[\[arXiv\]](https://arxiv.org/pdf/2301.08243.pdf) [\[JEPAs\]](https://ai.facebook.com/blog/yann-lecun-advances-in-ai-research/) [\[blogpost\]](https://ai.facebook.com/blog/yann-lecun-ai-model-i-jepa/)

## Method
I-JEPA is a method for self-supervised learning. At a high level, I-JEPA predicts the representations of part of an image from the representations of other parts of the same image. Notably, this approach learns semantic image features:
1. without relying on pre-specified invariances to hand-crafted data transformations, which tend to be biased toward particular downstream tasks,
2. and without having the model fill in pixel-level details, which tends to result in less semantically meaningful representations.

![ijepa](https://github.com/facebookresearch/ijepa/assets/7530871/dbad94ab-ac35-433b-8b4c-ca227886d311)
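At the level of a single training step, the idea can be sketched as follows. This is a simplified, illustrative snippet rather than the repository's actual training loop (see src/train.py); the module names, the patch-gathering helper, the encoder/predictor signatures, and the exact loss are assumptions made for exposition.

```
import torch
import torch.nn.functional as F

def take_patches(x, idx):
    # x: [B, N, D] patch representations; idx: [B, M] patch indices to keep.
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

def ijepa_step(context_encoder, target_encoder, predictor, optimizer,
               images, ctx_idx, tgt_idx, ema_momentum=0.996):
    # 1) Targets: encode the full view once with the target encoder, then
    #    select the representations of the (masked-out) target blocks.
    with torch.no_grad():
        h = target_encoder(images)                  # [B, N, D]
        h = F.layer_norm(h, (h.size(-1),))
        targets = take_patches(h, tgt_idx)          # [B, M, D]

    # 2) Context: encode only the visible context block.
    z = context_encoder(images, patch_idx=ctx_idx)  # [B, K, D] (hypothetical signature)

    # 3) Predict the target-block representations from the context representations.
    preds = predictor(z, ctx_idx, tgt_idx)          # [B, M, D]

    # 4) Regression loss in representation space -- no pixel reconstruction.
    loss = F.smooth_l1_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5) The target encoder is an exponential moving average of the context encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema_momentum).add_(p_c.detach(), alpha=1.0 - ema_momentum)
    return loss.item()
```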
## Visualizations
As opposed to generative methods that have a pixel decoder, I-JEPA has a predictor that makes predictions in latent space.
The predictor in I-JEPA can be seen as a primitive (and restricted) world-model that is able to model spatial uncertainty in a static image from a partially observable context.
This world model is semantic in the sense that it predicts high-level information about unseen regions in the image, rather than pixel-level details.

We trained a stochastic decoder that maps the I-JEPA predicted representations back to pixel space as sketches.
The model correctly captures positional uncertainty and produces high-level object parts with the correct pose (e.g., dog's head, wolf's front legs).

![ijepa-predictor-sketch](https://github.com/facebookresearch/ijepa/assets/7530871/9b66e461-fc8b-4b12-9f06-63ec4dfc1452)
Caption: Illustrating how the predictor learns to model the semantics of the world. For each image, the portion outside of the blue box is encoded and given to the predictor as context. The predictor outputs a representation for what it expects to be in the region within the blue box. To visualize the prediction, we train a generative model that produces a sketch of the contents represented by the predictor output, and we show a sample output within the blue box. The predictor recognizes the semantics of what parts should be filled in (the top of the dog's head, the bird's leg, the wolf's legs, the other side of the building).

## Evaluations
I-JEPA pretraining is also computationally efficient.
It does not involve any overhead associated with applying more computationally intensive data augmentations to produce multiple views.
Only one view of the image needs to be processed by the target encoder, and only the context blocks need to be processed by the context encoder.
Empirically, I-JEPA learns strong off-the-shelf semantic representations without the use of hand-crafted view augmentations.

![1percenteval](https://github.com/facebookresearch/ijepa/assets/7530871/e6e5291f-ca51-43a4-a6cf-069811094ece)

![lineareval](https://github.com/facebookresearch/ijepa/assets/7530871/d8cffa73-5350-444e-987a-7e131a86d767)
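Concretely, the savings come from the number of tokens each encoder processes: the target encoder sees the single full view once, while the context encoder only runs over the patches inside the context block. The sketch below illustrates this with a generic ViT-style pipeline; `patch_embed`, `pos_embed`, `blocks`, and the index tensor are illustrative names, not the repository's API.

```
import torch

def encode_context_only(patch_embed, pos_embed, blocks, images, ctx_idx):
    # images: [B, C, H, W]; ctx_idx: [B, K] indices of the context-block patches.
    x = patch_embed(images) + pos_embed              # [B, N, D] tokens for all patches
    x = torch.gather(                                # keep only the K context tokens
        x, 1, ctx_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
    for blk in blocks:                               # attention/MLP cost now scales with K << N
        x = blk(x)
    return x                                         # [B, K, D]
```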
## Pretrained models

| arch. | patch size | resolution | epochs | data | download |
|-------|------------|------------|--------|------|----------|
| ViT-H | 14x14 | 224x224 | 300 | ImageNet-1K | full checkpoint, logs, configs |
| ViT-H | 16x16 | 448x448 | 300 | ImageNet-1K | full checkpoint, logs, configs |
| ViT-H | 14x14 | 224x224 | 66 | ImageNet-22K | full checkpoint, logs, configs |
| ViT-g | 16x16 | 224x224 | 44 | ImageNet-22K | full checkpoint, logs, configs |
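The released checkpoints are standard PyTorch files and can be inspected directly. The snippet below is a generic loading sketch; the file name is a placeholder and the exact dictionary keys are not asserted here.

```
import torch

# Load a downloaded full checkpoint on CPU and list its top-level entries
# (typically encoder/predictor state dicts plus optimizer bookkeeping).
ckpt = torch.load("path/to/full_checkpoint.pth.tar", map_location="cpu")
print(list(ckpt.keys()))
```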
## Code Structure
```
.
├── configs               # directory in which all experiment '.yaml' configs are stored
├── src                   # the package
│   ├── train.py          # the I-JEPA training loop
│   ├── helper.py         # helper functions for model/optimizer init and checkpoint loading
│   ├── transforms.py     # pre-training data transforms
│   ├── datasets          # datasets, data loaders, ...
│   ├── models            # model definitions
│   ├── masks             # mask collators, masking utilities, ...
│   └── utils             # shared utilities
├── main_distributed.py   # entrypoint for launching distributed I-JEPA pretraining on a SLURM cluster
└── main.py               # entrypoint for launching I-JEPA pretraining locally on your machine
```

**Config files:**
Note that all experiment parameters are specified in config files (as opposed to command-line arguments). See the [configs/](configs/) directory for example config files.
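As a rough illustration of this config-driven setup (a sketch only, not the actual entrypoints; the printed structure depends on the chosen YAML file), an experiment script can be reduced to parsing one `--fname` argument and loading everything else from YAML:

```
import argparse
import yaml

# Sketch of a config-driven entrypoint: hyperparameters come from the YAML file,
# not from individual command-line flags.
parser = argparse.ArgumentParser()
parser.add_argument("--fname", type=str, help="path to the experiment .yaml config")
parser.add_argument("--devices", type=str, nargs="+", default=["cuda:0"])
args = parser.parse_args()

with open(args.fname, "r") as f:
    params = yaml.safe_load(f)   # nested dict of experiment parameters

print(params)                    # inspect the parsed configuration
# launch_training(params, devices=args.devices)  # hypothetical hand-off to the training loop
```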
## Launching I-JEPA pretraining

### Single-GPU training
This implementation starts from [main.py](main.py), which parses the experiment config file and runs pre-training locally on a multi-GPU (or single-GPU) machine. For example, to run I-JEPA pretraining on GPUs "0", "1", and "2" of a local machine using the config [configs/in1k_vith14_ep300.yaml](configs/in1k_vith14_ep300.yaml), type the command:
```
python main.py \
--fname configs/in1k_vith14_ep300.yaml \
--devices cuda:0 cuda:1 cuda:2
```
*Note: This example is for illustrative purposes only; to reproduce our results, the ViT-H/14 config should be run on 16 A100 80G GPUs for an effective batch size of 2048.*

### Multi-GPU training
In the multi-GPU setting, the implementation starts from [main_distributed.py](main_distributed.py), which, in addition to parsing the config file, also allows you to specify details about distributed training. For distributed training, we use the popular open-source [submitit](https://github.com/facebookincubator/submitit) tool and provide examples for a SLURM cluster.

For example, to pre-train on 16 A100 80G GPUs using the pre-training experiment config specified inside [configs/in1k_vith14_ep300.yaml](configs/in1k_vith14_ep300.yaml), type the command:
```
python main_distributed.py \
--fname configs/in1k_vith14_ep300.yaml \
--folder $path_to_save_submitit_logs \
--partition $slurm_partition \
--nodes 2 --tasks-per-node 8 \
--time 1000
```
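For reference, the general submitit pattern for submitting such a job to SLURM looks roughly like the following. This is a simplified sketch rather than the repository's actual main_distributed.py; the log folder, partition name, and the no-op job function are placeholders, while the node/task/timeout values mirror the command above.

```
import submitit

# Sketch: submit a (placeholder) pretraining function to SLURM via submitit.
executor = submitit.AutoExecutor(folder="path/to/submitit_logs")
executor.update_parameters(
    nodes=2,
    tasks_per_node=8,              # one task per GPU
    slurm_partition="your_partition",
    timeout_min=1000,
)

def pretrain():
    # The real launcher would parse the config and start distributed training here.
    pass

job = executor.submit(pretrain)
print(job.job_id)
```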
---

### Requirements
* Python 3.8 (or newer)
* PyTorch 2.0
* torchvision
* Other dependencies: pyyaml, numpy, opencv, submitit

## License
See the [LICENSE](./LICENSE) file for details about the license under which this code is made available.

## Citation
If you find this repository useful in your research, please consider giving a star :star: and a citation
```
@article{assran2023self,
  title={Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture},
  author={Assran, Mahmoud and Duval, Quentin and Misra, Ishan and Bojanowski, Piotr and Vincent, Pascal and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2301.08243},
  year={2023}
}
```