
# Autoregressive Pre-training of Large Vision Encoders



This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of
visual and multimodal learning:

- **AIMv2**: [`Multimodal Autoregressive Pre-training of Large Vision Encoders`](https://arxiv.org/abs/2411.14402) [[`BibTeX`](#citation)]

  Enrico Fini*, Mustafa Shukor*, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju,
  Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang,
  Joshua M. Susskind, and Alaaeldin El-Nouby*

- **AIMv1**: [`Scalable Pre-training of Large Autoregressive Image Models`](https://arxiv.org/abs/2401.08541) [[`BibTeX`](#citation)]

  Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar,
  Joshua M. Susskind, and Armand Joulin

*: Equal technical contribution

If you're looking for the original AIM model (AIMv1), please refer to the README [here](aim-v1/README.md).

---

## Overview of AIMv2
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective.
AIMv2 pre-training is simple and straightforward to implement, and it scales effectively. Some AIMv2 highlights include:

1. Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance with AIMv2-3B achieving *89.5% on ImageNet using a frozen trunk*.

![gh_aimv2_dark](aim-v2/assets/aimv2_overview_dark.png#gh-dark-mode-only)
![gh_aimv2_light](aim-v2/assets/aimv2_overview_light.png#gh-light-mode-only)

## AIMv2 Model Gallery


We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions, with PyTorch, JAX, MLX, and HuggingFace support:

+ [[`AIMv2 with 224px`]](#aimv2-with-224px)
+ [[`AIMv2 with 336px`]](#aimv2-with-336px)
+ [[`AIMv2 with 448px`]](#aimv2-with-448px)
+ [[`AIMv2 with Native Resolution`]](#aimv2-with-native-resolution)
+ [[`AIMv2 distilled ViT-Large`]](#aimv2-distilled-vit-large) (*recommended for multimodal applications*)
+ [[`Zero-shot Adapted AIMv2`]](#zero-shot-adapted-aimv2)

## Installation
Please install PyTorch using the official [installation instructions](https://pytorch.org/get-started/locally/).
Afterward, install the package as:
```commandline
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
```
We also offer [MLX](https://ml-explore.github.io/mlx/) backend support for research and experimentation on Apple silicon.
To enable MLX support, simply run:
```commandline
pip install mlx
```

## Examples

### Using PyTorch

```python
from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# load an image and the pre-trained AIMv2 backbone (PyTorch backend)
img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)

# preprocess, add a batch dimension, and extract features
inp = transform(img).unsqueeze(0)
features = model(inp)
```
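
Continuing the PyTorch snippet, frozen-trunk evaluation (highlight 3 in the overview) amounts to training a lightweight head on top of the frozen features. The sketch below is illustrative only: it assumes `model(inp)` returns patch features of shape `(batch, num_patches, dim)` and uses mean pooling with a linear head, which is not necessarily the protocol behind the reported numbers.

```python
import torch

# Illustrative linear probe on frozen AIMv2 features (continues the PyTorch example above).
# Assumes `model(inp)` returns patch features of shape (batch, num_patches, dim).
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # freeze the trunk

with torch.no_grad():
    feats = model(inp)

pooled = feats.mean(dim=1)                       # mean pooling over patch tokens (illustrative)
probe = torch.nn.Linear(pooled.shape[-1], 1000)  # e.g. an ImageNet-1k head
logits = probe(pooled)                           # only `probe` would be trained in a linear probe
```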

### Using MLX

```python
from PIL import Image
import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# load an image and the pre-trained AIMv2 backbone (MLX backend)
img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

# preprocess with the torch transforms, then convert the tensor to an MLX array
inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
features = model(inp)
```

### Using JAX

```python
from PIL import Image
import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# load an image and the pre-trained AIMv2 backbone (JAX backend);
# the Flax module and its parameters are returned separately
img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

# preprocess, convert to a JAX array, and run the model via `apply`
inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
features = model.apply({"params": params}, inp)
```

## Pre-trained Checkpoints
The pre-trained models can be accessed via [HuggingFace Hub](https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c) as:
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```
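
If you only need the visual features, the sketch below extracts them from the output object, assuming the remote-code model follows the standard `transformers` convention of exposing `last_hidden_state`:

```python
# Assumption: the output carries per-patch features in `last_hidden_state`,
# as is standard for `transformers` vision encoders.
features = outputs.last_hidden_state   # expected shape: (batch, num_patches, hidden_dim)
pooled = features.mean(dim=1)          # simple mean pooling over patch tokens (illustrative)
```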

### AIMv2 with 224px



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗 link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗 link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗 link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗 link | link |
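
The `model_id` column is the identifier accepted by `load_pretrained` (and, with the `apple/` prefix, by the HuggingFace loaders). For example, a sketch that swaps in the huge 224px variant, following the same pattern as the Examples section above:

```python
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# same pattern as the PyTorch example above, with a different capacity and resolution
model = load_pretrained("aimv2-huge-patch14-224", backend="torch")
transform = val_transforms(img_size=224)
```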

### AIMv2 with 336px



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗 link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗 link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗 link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗 link | link |

### AIMv2 with 448px



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗 link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗 link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗 link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗 link | link |

### AIMv2 with Native Resolution
We additionally provide an AIMv2-L checkpoint that is fine-tuned to process a wide range of image resolutions and
aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and
*a 2D sinusoidal positional embedding* is added to the linearly projected input patches.
*This checkpoint supports a number of patches in the range [112, 4096]*.
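
As a quick sanity check of that range, the number of patches for an input is simply `(H // 14) * (W // 14)`; the helper below is ours, for illustration only, and is not part of the package:

```python
# Illustrative helper (not part of the package): count 14x14 patches for a given resolution.
PATCH_SIZE = 14
MIN_PATCHES, MAX_PATCHES = 112, 4096

def num_patches(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    return (height // patch_size) * (width // patch_size)

# e.g. a 448x336 image yields 32 * 24 = 768 patches, well inside [112, 4096]
assert MIN_PATCHES <= num_patches(448, 336) <= MAX_PATCHES
```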



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗 link | link |

### AIMv2 distilled ViT-Large
We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers remarkably strong performance on multimodal
understanding benchmarks.



| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MMEp |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L-distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |



| model_id | #params | Res. | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗 link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗 link | link |

### Zero-shot Adapted AIMv2
We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.



| model | #params | zero-shot IN-1k | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
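
For context, LiT-style zero-shot recognition scores an image embedding against text embeddings of class prompts. The sketch below shows only that scoring step on placeholder tensors; the encoder calls themselves are omitted, and the random embeddings stand in for the outputs of the LiT-tuned vision and text encoders.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the LiT-tuned vision and text encoder outputs.
embed_dim, num_classes = 1024, 1000
image_emb = torch.randn(1, embed_dim)
class_text_embs = torch.randn(num_classes, embed_dim)  # one embedding per class prompt

# Zero-shot classification: cosine similarity between the image and every class prompt.
image_emb = F.normalize(image_emb, dim=-1)
class_text_embs = F.normalize(class_text_embs, dim=-1)
logits = image_emb @ class_text_embs.t()   # shape: (1, num_classes)
# In practice the similarities are scaled by a learned temperature before the softmax.
predicted_class = logits.softmax(dim=-1).argmax(dim=-1)
```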

## Citation
If you find our work useful, please consider citing us as:

### AIMv2 bibtex

```bibtex
@misc{fini2024multimodal,
title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
author={Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis BΓ©thune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
year={2024},
eprint={2411.14402},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

### AIMv1 bibtex

```bibtex
@InProceedings{pmlr-v235-el-nouby24a,
title = {Scalable Pre-training of Large Autoregressive Image Models},
author = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {12371--12384},
year = {2024},
}
```

## License
Please check out the repository [LICENSE](LICENSE) before using the provided code and models.