
# Autoregressive Pre-training of Large Vision Encoders



This repository is the entry point for all things AIM, a family of autoregressive models that push the boundaries of
visual and multimodal learning:

- **AIMv2**: [`Multimodal Autoregressive Pre-training of Large Vision Encoders`](https://arxiv.org/abs/2411.14402) [[`BibTeX`](#citation)]

  Enrico Fini*, Mustafa Shukor*, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju,
  Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang,
  Joshua M. Susskind, and Alaaeldin El-Nouby*

- **AIMv1**: [`Scalable Pre-training of Large Autoregressive Image Models`](https://arxiv.org/abs/2401.08541) [[`BibTeX`](#citation)]

  Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar,
  Joshua M. Susskind, and Armand Joulin

*: Equal technical contribution

If you're looking for the original AIM model (AIMv1), please refer to the README [here](aim-v1/README.md).

---

## Overview of AIMv2
We introduce the AIMv2 family of vision models pre-trained with a multimodal autoregressive objective.
AIMv2 pre-training is simple and straightforward to implement, and it scales effectively. Some AIMv2 highlights include:

1. Outperforms OpenAI CLIP and SigLIP on the majority of multimodal understanding benchmarks.
2. Outperforms DINOv2 on open-vocabulary object detection and referring expression comprehension.
3. Exhibits strong recognition performance with AIMv2-3B achieving *89.5% on ImageNet using a frozen trunk*.

![gh_aimv2_dark](aim-v2/assets/aimv2_overview_dark.png#gh-dark-mode-only)
![gh_aimv2_light](aim-v2/assets/aimv2_overview_light.png#gh-light-mode-only)

## AIMv2 Model Gallery


We share with the community AIMv2 pre-trained checkpoints of varying capacities and pre-training resolutions, with PyTorch, JAX, MLX, and HuggingFace support:

+ [[`AIMv2 with 224px`]](#aimv2-with-224px)
+ [[`AIMv2 with 336px`]](#aimv2-with-336px)
+ [[`AIMv2 with 448px`]](#aimv2-with-448px)
+ [[`AIMv2 with Native Resolution`]](#aimv2-with-native-resolution)
+ [[`AIMv2 distilled ViT-Large`]](#aimv2-distilled-vit-large) (*recommended for multimodal applications*)
+ [[`Zero-shot Adapted AIMv2`]](#zero-shot-adapted-aimv2)

## Installation
Please install PyTorch using the official [installation instructions](https://pytorch.org/get-started/locally/).
Afterward, install the package as:
```commandline
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v1'
pip install 'git+https://github.com/apple/ml-aim.git#subdirectory=aim-v2'
```
We also offer [MLX](https://ml-explore.github.io/mlx/) backend support for research and experimentation on Apple silicon.
To enable MLX support, simply run:
```commandline
pip install mlx
```

## Examples

### Using PyTorch

```python
from PIL import Image

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# load an image and the pre-trained AIMv2 backbone (PyTorch backend)
img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="torch")
transform = val_transforms(img_size=336)

# preprocess, add a batch dimension, and extract features
inp = transform(img).unsqueeze(0)
features = model(inp)
```
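
Continuing the PyTorch snippet, frozen-trunk evaluation (highlight 3 in the overview) amounts to training a lightweight head on top of the frozen features. The sketch below is illustrative only: it assumes `model(inp)` returns patch features of shape `(batch, num_patches, dim)` and uses mean pooling with a linear head, which is not necessarily the protocol behind the reported numbers.

```python
import torch

# Illustrative linear probe on frozen AIMv2 features (continues the PyTorch example above).
# Assumes `model(inp)` returns patch features of shape (batch, num_patches, dim).
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # freeze the trunk

with torch.no_grad():
    feats = model(inp)

pooled = feats.mean(dim=1)                       # mean pooling over patch tokens (illustrative)
probe = torch.nn.Linear(pooled.shape[-1], 1000)  # e.g. an ImageNet-1k head
logits = probe(pooled)                           # only `probe` would be trained in a linear probe
```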

### Using MLX

```python
from PIL import Image
import mlx.core as mx

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# load an image and the pre-trained AIMv2 backbone (MLX backend)
img = Image.open(...)
model = load_pretrained("aimv2-large-patch14-336", backend="mlx")
transform = val_transforms(img_size=336)

# preprocess with the torch transforms, then convert the tensor to an MLX array
inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
features = model(inp)
```

### Using JAX

```python
from PIL import Image
import jax.numpy as jnp

from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# load an image and the pre-trained AIMv2 backbone (JAX backend);
# the Flax module and its parameters are returned separately
img = Image.open(...)
model, params = load_pretrained("aimv2-large-patch14-336", backend="jax")
transform = val_transforms(img_size=336)

# preprocess, convert to a JAX array, and run the model via `apply`
inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
features = model.apply({"params": params}, inp)
```

## Pre-trained Checkpoints
The pre-trained models can be accessed via [HuggingFace Hub](https://huggingface.co/collections/apple/aimv2-6720fe1558d94c7805f7688c) as:
```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

image = Image.open(...)
processor = AutoImageProcessor.from_pretrained("apple/aimv2-large-patch14-336")
model = AutoModel.from_pretrained("apple/aimv2-large-patch14-336", trust_remote_code=True)

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
```
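
If you only need the visual features, the sketch below extracts them from the output object, assuming the remote-code model follows the standard `transformers` convention of exposing `last_hidden_state`:

```python
# Assumption: the output carries per-patch features in `last_hidden_state`,
# as is standard for `transformers` vision encoders.
features = outputs.last_hidden_state   # expected shape: (batch, num_patches, hidden_dim)
pooled = features.mean(dim=1)          # simple mean pooling over patch tokens (illustrative)
```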

### AIMv2 with 224px



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224 | 0.3B | 86.6 | 🤗 link | link |
| aimv2-huge-patch14-224 | 0.6B | 87.5 | 🤗 link | link |
| aimv2-1B-patch14-224 | 1.2B | 88.1 | 🤗 link | link |
| aimv2-3B-patch14-224 | 2.7B | 88.5 | 🤗 link | link |
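
The `model_id` column is the identifier accepted by `load_pretrained` (and, with the `apple/` prefix, by the HuggingFace loaders). For example, a sketch that swaps in the huge 224px variant, following the same pattern as the Examples section above:

```python
from aim.v2.utils import load_pretrained
from aim.v1.torch.data import val_transforms

# same pattern as the PyTorch example above, with a different capacity and resolution
model = load_pretrained("aimv2-huge-patch14-224", backend="torch")
transform = val_transforms(img_size=224)
```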

### AIMv2 with 336px



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-336 | 0.3B | 87.6 | 🤗 link | link |
| aimv2-huge-patch14-336 | 0.6B | 88.2 | 🤗 link | link |
| aimv2-1B-patch14-336 | 1.2B | 88.7 | 🤗 link | link |
| aimv2-3B-patch14-336 | 2.7B | 89.2 | 🤗 link | link |

### AIMv2 with 448px



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-448 | 0.3B | 87.9 | 🤗 link | link |
| aimv2-huge-patch14-448 | 0.6B | 88.6 | 🤗 link | link |
| aimv2-1B-patch14-448 | 1.2B | 89.0 | 🤗 link | link |
| aimv2-3B-patch14-448 | 2.7B | 89.5 | 🤗 link | link |

### AIMv2 with Native Resolution
We additionally provide an AIMv2-L checkpoint that is fine-tuned to process a wide range of image resolutions and
aspect ratios. Regardless of the aspect ratio, the image is patchified (patch_size=14) and
*a 2D sinusoidal positional embedding* is added to the linearly projected input patches.
*This checkpoint supports a number of patches in the range [112, 4096]*.
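
As a quick sanity check of that range, the number of patches for an input is simply `(H // 14) * (W // 14)`; the helper below is ours, for illustration only, and is not part of the package:

```python
# Illustrative helper (not part of the package): count 14x14 patches for a given resolution.
PATCH_SIZE = 14
MIN_PATCHES, MAX_PATCHES = 112, 4096

def num_patches(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    return (height // patch_size) * (width // patch_size)

# e.g. a 448x336 image yields 32 * 24 = 768 patches, well inside [112, 4096]
assert MIN_PATCHES <= num_patches(448, 336) <= MAX_PATCHES
```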



| model_id | #params | IN-1k | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-native | 0.3B | 87.3 | 🤗 link | link |

### AIMv2 distilled ViT-Large
We provide an AIMv2-L checkpoint distilled from AIMv2-3B that delivers remarkably strong performance on multimodal
understanding benchmarks.



| Model | VQAv2 | GQA | OKVQA | TextVQA | DocVQA | InfoVQA | ChartQA | SciQA | MMEp |
|---|---|---|---|---|---|---|---|---|---|
| AIMv2-L | 80.2 | 72.6 | 60.9 | 53.9 | 26.8 | 22.4 | 20.3 | 74.5 | 1457 |
| AIMv2-L-distilled | 81.1 | 73.0 | 61.4 | 53.5 | 29.2 | 23.3 | 24.0 | 76.3 | 1627 |



| model_id | #params | Res. | HF Link | Backbone |
|---|---|---|---|---|
| aimv2-large-patch14-224-distilled | 0.3B | 224px | 🤗 link | link |
| aimv2-large-patch14-336-distilled | 0.3B | 336px | 🤗 link | link |

### Zero-shot Adapted AIMv2
We provide the AIMv2-L vision and text encoders after LiT tuning to enable zero-shot recognition.



| model | #params | zero-shot IN-1k | Backbone |
|---|---|---|---|
| AIMv2-L | 0.3B | 77.0 | link |
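
For context, LiT-style zero-shot recognition scores an image embedding against text embeddings of class prompts. The sketch below shows only that scoring step on placeholder tensors; the encoder calls themselves are omitted, and the random embeddings stand in for the outputs of the LiT-tuned vision and text encoders.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the LiT-tuned vision and text encoder outputs.
embed_dim, num_classes = 1024, 1000
image_emb = torch.randn(1, embed_dim)
class_text_embs = torch.randn(num_classes, embed_dim)  # one embedding per class prompt

# Zero-shot classification: cosine similarity between the image and every class prompt.
image_emb = F.normalize(image_emb, dim=-1)
class_text_embs = F.normalize(class_text_embs, dim=-1)
logits = image_emb @ class_text_embs.t()   # shape: (1, num_classes)
# In practice the similarities are scaled by a learned temperature before the softmax.
predicted_class = logits.softmax(dim=-1).argmax(dim=-1)
```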

## Citation
If you find our work useful, please consider citing us as:

### AIMv2 bibtex

```bibtex
@misc{fini2024multimodal,
title={Multimodal Autoregressive Pre-training of Large Vision Encoders},
author={Enrico Fini and Mustafa Shukor and Xiujun Li and Philipp Dufter and Michal Klein and David Haldimann and Sai Aitharaju and Victor Guilherme Turrisi da Costa and Louis BΓ©thune and Zhe Gan and Alexander T Toshev and Marcin Eichner and Moin Nabi and Yinfei Yang and Joshua M. Susskind and Alaaeldin El-Nouby},
year={2024},
eprint={2411.14402},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

### AIMv1 bibtex

```bibtex
@InProceedings{pmlr-v235-el-nouby24a,
title = {Scalable Pre-training of Large Autoregressive Image Models},
author = {El-Nouby, Alaaeldin and Klein, Michal and Zhai, Shuangfei and Bautista, Miguel \'{A}ngel and Shankar, Vaishaal and Toshev, Alexander T and Susskind, Joshua M. and Joulin, Armand},
booktitle = {Proceedings of the 41st International Conference on Machine Learning},
pages = {12371--12384},
year = {2024},
}
```

## License
Please check out the repository [LICENSE](LICENSE) before using the provided code and models.