https://github.com/apple/ml-aim
This repository provides the code and model checkpoints for AIMv1 and AIMv2 research projects.
- Host: GitHub
- URL: https://github.com/apple/ml-aim
- Owner: apple
- License: other
- Created: 2024-01-12T19:07:45.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-22T05:26:18.000Z (6 months ago)
- Last Synced: 2025-04-03T03:17:19.393Z (about 1 month ago)
- Topics: jax, large-scale-vision-models, mlx, pytorch
- Language: Python
- Homepage:
- Size: 788 KB
- Stars: 1,251
- Watchers: 26
- Forks: 61
- Open Issues: 17
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# Autoregressive Pre-training of Large Vision Encoders
This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of
visual and multimodal learning:

- **AIMv2**: [`Multimodal Autoregressive Pre-training of Large Vision Encoders`](https://arxiv.org/abs/2411.14402) [[`BibTeX`](#citation)]
  Enrico Fini*, Mustafa Shukor*, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju,
  Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang,
  Joshua M. Susskind, and Alaaeldin El-Nouby*
- **AIMv1**: [`Scalable Pre-training of Large Autoregressive Image Models`](https://arxiv.org/abs/2401.08541) [[`BibTeX`](#citation)]
  Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar,
  Joshua M Susskind, Armand Joulin

*: Equal technical contribution

If you're looking for the original AIM model (AIMv1), please refer to the README [here](aim-v1/README.md).
---
## Overview of AIMv2
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective.
The AIMv2 pre-training recipe is simple, straightforward to train, and scales effectively. Some AIMv2 highlights include:

1. Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance with AIMv2-3B achieving *89.5% on ImageNet using a frozen trunk*.
## AIMv2 Model Gallery
We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions:
+ [[`AIMv2 with 224px`]](#aimv2-with-224px)
+ [[`AIMv2 with 336px`]](#aimv2-with-336px)
+ [[`AIMv2 with 448px`]](#aimv2-with-448px)
+ [[`AIMv2 with Native Resolution`]](#aimv2-with-native-resolution)
+ [[`AIMv2 distilled ViT-Large`]](#aimv2-distilled-vit-large) (*recommended for multimodal applications*)
+ [[`Zero-shot Adapted AIMv2`]](#zero-shot-adapted-aimv2)

## Installation
Please install PyTorch using the official [installation instructions](https://pytorch.org/get-started/locally/).
Afterward, install the package as:
```commandline
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
```
We also offer [MLX](https://ml-explore.github.io/mlx/) backend support for research and experimentation on Apple silicon.
To enable MLX support, simply run:
```commandline
pip install mlx
```

## Examples
### Using PyTorch
```python
from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)  # add a batch dimension
features = model(inp)
```

### Using MLX
```python
from PIL import Image
import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)  # add a batch dimension
inp = mx.array(inp.numpy())  # convert the torch tensor to an MLX array
features = model(inp)
```

### Using JAX
```python
from PIL import Image
import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

inp = transform(img).unsqueeze(0)  # add a batch dimension
inp = jnp.array(inp)  # convert the torch tensor to a JAX array
features = model.apply({"params": params}, inp)
```

## Pre-trained Checkpoints
The pre-trained models can be accessed via [HuggingFace Hub](https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c) as:
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```

### AIMv2 with 224px
| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗link | link |

### AIMv2 with 336px
| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗link | link |

### AIMv2 with 448px
| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗link | link |

### AIMv2 with Native Resolution
We additionally provide an AIMv2-L checkpoint that is finetuned to process a wide range of image resolutions and
aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and
*a 2D sinusoidal positional embedding* is added to the linearly projected input patches.
*This checkpoint supports a number of patches in the range [112, 4096]*.

| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗link | link |

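The patch-count constraint and the positional embedding described for the native-resolution checkpoint can be sketched in plain Python. This is a minimal illustration and not the repository's implementation: the helper names (`num_patches`, `is_supported`, `sincos_2d`), the `temperature` value, the sine/cosine channel ordering, and the assumption that image sides are floored to a multiple of the patch size are all hypothetical choices. Only `patch_size=14` and the range [112, 4096] come from the text above.

```python
import math

PATCH_SIZE = 14                        # stated in the text above
MIN_PATCHES, MAX_PATCHES = 112, 4096   # supported range stated in the text above


def num_patches(height, width, patch_size=PATCH_SIZE):
    """Count non-overlapping patches (assumption: sides are floored to a multiple of patch_size)."""
    return (height // patch_size) * (width // patch_size)


def is_supported(height, width):
    """Check whether the patch count falls inside the documented range."""
    return MIN_PATCHES <= num_patches(height, width) <= MAX_PATCHES


def sincos_2d(grid_h, grid_w, dim, temperature=10000.0):
    """Common fixed 2D sine/cosine embedding: one row of length `dim` per patch.

    Assumption: a quarter of the channels each for sin(x), cos(x), sin(y), cos(y).
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    quarter = dim // 4
    omega = [1.0 / temperature ** (i / quarter) for i in range(quarter)]
    table = []
    for y in range(grid_h):          # row-major order over the patch grid
        for x in range(grid_w):
            table.append(
                [math.sin(x * w) for w in omega]
                + [math.cos(x * w) for w in omega]
                + [math.sin(y * w) for w in omega]
                + [math.cos(y * w) for w in omega]
            )
    return table


if __name__ == "__main__":
    for side in (224, 336, 448):
        print(side, num_patches(side, side))   # 256, 576, and 1024 patches
    print(is_supported(896, 896))              # 64 x 64 = 4096 patches, the upper bound
```

Under these assumptions, the fixed-resolution checkpoints at 224px, 336px, and 448px correspond to 256, 576, and 1024 patches respectively, all well inside the native checkpoint's supported range.
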
### AIMv2 distilled ViT-Large
We provide an AIMv2-L checkpoint distilled from AIMv2-3B that performs remarkably well on multimodal
understanding benchmarks.

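| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MMEp |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L-distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |
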
| model_id | #params | Res. | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗link | link |

### Zero-shot Adapted AIMv2
We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.
| model | #params | zero-shot IN-1k | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |

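As a rough illustration of what zero-shot recognition with paired vision and text encoders usually looks like, the sketch below scores one image embedding against one text embedding per class using cosine similarity followed by a softmax. It does not use the repository's API: the embeddings are made-up toy vectors standing in for encoder outputs, and `l2_normalize`, `zero_shot_probs`, the prompt-per-class setup, and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities between one image and each class text."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(sum(a * b for a, b in zip(img, txt)) / temperature)
    top = max(logits)                                   # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy stand-ins: in practice these would be outputs of the vision and text encoders.
    class_names = ["cat", "dog", "car"]
    text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    image_emb = [0.9, 0.1, 0.0]

    probs = zero_shot_probs(image_emb, text_embs)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(class_names[best])
```

No classifier is trained in this setup: adding a class only requires adding a text embedding for it, which is what makes the recognition zero-shot.
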
## Citation
If you find our work useful, please consider citing us as:

### AIMv2 bibtex
```bibtex
@misc{fini2024multimodal,
title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
author={Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis Béthune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
year={2024},
eprint={2411.14402},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

### AIMv1 bibtex
```bibtex
@InProceedings{pmlr-v235-el-nouby24a,
title = {Scalable Pre-training of Large Autoregressive Image Models},
author = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {12371--12384},
year = {2024},
}
```

## License
Please check out the repository [LICENSE](LICENSE) before using the provided code and models.