{"id":15035763,"url":"https://github.com/apple/ml-4m","last_synced_at":"2025-05-14T12:12:34.917Z","repository":{"id":232294568,"uuid":"783932817","full_name":"apple/ml-4m","owner":"apple","description":"4M: Massively Multimodal Masked Modeling","archived":false,"fork":false,"pushed_at":"2025-03-07T15:31:36.000Z","size":12815,"stargazers_count":1717,"open_issues_count":14,"forks_count":103,"subscribers_count":34,"default_branch":"main","last_synced_at":"2025-05-03T20:02:42.375Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://4m.epfl.ch","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apple.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-08T21:31:33.000Z","updated_at":"2025-04-30T07:29:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"68065033-8ab4-4a01-8b77-53c2a2b5d2a5","html_url":"https://github.com/apple/ml-4m","commit_stats":null,"previous_names":["apple/ml-4m"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-4m","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-4m/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-4m/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apple%2Fml-4m/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apple","download_url":"https://codeload.github.com/apple/ml-4m
/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254140768,"owners_count":22021220,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-24T20:29:26.070Z","updated_at":"2025-05-14T12:12:34.874Z","avatar_url":"https://github.com/apple.png","language":"Python","funding_links":[],"categories":["🏗️ 3D Generation \u0026 Depth","Python"],"sub_categories":["🛠️ Additional 3D Tools"],"readme":"# 4M: Massively Multimodal Masked Modeling\n\n*A framework for training any-to-any multimodal foundation models. \u003cbr\u003eScalable. Open-sourced. 
Across tens of modalities and tasks.*\n\nEPFL - Apple\n\n[`Website`](https://4m.epfl.ch) | [`BibTeX`](#citation)  | [`🤗 Demo`](https://huggingface.co/spaces/EPFL-VILAB/4M)\n\nOfficial implementation and pre-trained models for:\n\n[**4M: Massively Multimodal Masked Modeling**](https://arxiv.org/abs/2312.06647), NeurIPS 2023 (Spotlight) \u003cbr\u003e\n*[David Mizrahi](https://dmizrahi.com/)\\*, [Roman Bachmann](https://roman-bachmann.github.io/)\\*, [Oğuzhan Fatih Kar](https://ofkar.github.io/), [Teresa Yeo](https://aserety.github.io/), [Mingfei Gao](https://fly6464.github.io/), [Afshin Dehghan](https://www.afshindehghan.com/), [Amir Zamir](https://vilab.epfl.ch/zamir/)*\n\n[**4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities**](https://arxiv.org/abs/2406.09406), NeurIPS 2024 \u003cbr\u003e\n*[Roman Bachmann](https://roman-bachmann.github.io/)\\*, [Oğuzhan Fatih Kar](https://ofkar.github.io/)\\*, [David Mizrahi](https://dmizrahi.com/)\\*, [Ali Garjani](https://garjania.github.io/), [Mingfei Gao](https://fly6464.github.io/), [David Griffiths](https://www.dgriffiths.uk/), [Jiaming Hu](https://scholar.google.com/citations?user=vm3imKsAAAAJ\u0026hl=en), [Afshin Dehghan](https://www.afshindehghan.com/), [Amir Zamir](https://vilab.epfl.ch/zamir/)*\n\n\u003cbr\u003e\n\n![4M main figure](./assets/4M_main_fig_darkmode.png#gh-dark-mode-only)\n![4M main figure](./assets/4M_main_fig_lightmode.png#gh-light-mode-only)\n\n4M is a framework for training \"any-to-any\" foundation models, using tokenization and masking to scale to many diverse modalities. Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models. 
We are releasing code and models for \"4M: Massively Multimodal Masked Modeling\" (here denoted 4M-7), as well as \"4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities\" (here denoted 4M-21).\n\n## Table of contents\n- [Usage](#usage)\n    - [Installation](#installation)\n    - [Getting started](#getting-started)\n    - [Data](#data)\n    - [Tokenization](#tokenization)\n    - [4M Training](#4m-training)\n    - [Generation](#generation)\n- [Model Zoo](#model-zoo)\n    - [4M models](#4m-models)\n    - [4M text-to-image specialist models](#4m-text-to-image-specialist-models)\n    - [4M super-resolution models](#4m-super-resolution-models)\n    - [Tokenizers](#tokenizers)\n- [License](#license)\n- [Citation](#citation)\n\n## Usage\n\n### Installation\n\n1. Clone this repository and navigate to the root directory:\n```\ngit clone https://github.com/apple/ml-4m\ncd ml-4m\n```\n\n2. Create a new conda environment, then install the package and its dependencies:\n```\nconda create -n fourm python=3.9 -y\nconda activate fourm\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\n3. Verify that CUDA is available in PyTorch by running the following in a Python shell:\n```\n# Run in Python shell\nimport torch\nprint(torch.cuda.is_available())  # Should return True\n```\nIf CUDA is not available, consider re-installing PyTorch following the [official installation instructions](https://pytorch.org/get-started/locally/). 
Likewise, if you want to install xFormers (optional, for faster tokenizers), follow [their README](https://github.com/facebookresearch/xformers) to ensure that the CUDA version is correct.\n\n### Getting started\n\nWe provide a demo wrapper to quickly get started with using 4M models for RGB-to-all or {caption, bounding boxes}-to-all generation tasks.\nFor example, to generate all modalities from a given RGB input, call:\n\n```python\nfrom fourm.demo_4M_sampler import Demo4MSampler, img_from_url\nsampler = Demo4MSampler(fm='EPFL-VILAB/4M-21_XL').cuda()\nimg = img_from_url('https://storage.googleapis.com/four_m_site/images/demo_rgb.png') # 1x3x224x224 ImageNet-standardized PyTorch Tensor\npreds = sampler({'rgb@224': img.cuda()}, seed=None) \nsampler.plot_modalities(preds, save_path=None)\n```\n\nYou should expect to see an output like the following:\n\n![4M demo sampler output](./assets/4M_demo_sample_darkmode.jpg#gh-dark-mode-only)\n![4M demo sampler output](./assets/4M_demo_sample_lightmode.jpg#gh-light-mode-only)\n\nFor performing caption-to-all generation, you can replace the sampler input by: `preds = sampler({'caption': 'A lake house with a boat in front [S_1]'})`.\nFor a list of available 4M models, please see the model zoo below, and see [README_GENERATION.md](README_GENERATION.md) for more instructions on generation.\n\n### Data  \n\nSee [README_DATA.md](README_DATA.md) for instructions on how to prepare aligned multimodal datasets.\n\n### Tokenization  \n\nSee [README_TOKENIZATION.md](README_TOKENIZATION.md) for instructions on how to train modality-specific tokenizers.\n\n### 4M Training\n\nSee [README_TRAINING.md](README_TRAINING.md) for instructions on how to train 4M models.\n\n### Generation\n\nSee [README_GENERATION.md](README_GENERATION.md) for instructions on how to use 4M models for inference / generation. 
We also provide a [generation notebook](notebooks/generation_4M-21.ipynb) that contains examples for 4M inference, specifically performing conditional image generation and common vision tasks (i.e. RGB-to-All).\n\n\n## Model Zoo\n\nWe provide 4M and tokenizer checkpoints as [safetensors](https://huggingface.co/docs/safetensors/en/index), and also offer easy loading via [Hugging Face Hub](https://huggingface.co/docs/hub/index).\n\n### 4M models\n\n| Model   | # Mod. | Datasets | # Params | Config | Weights         |\n| ------- | ------ | -------- | -------- | ------ | --------------- |\n| 4M-B | 7 | CC12M | 198M | [Config](cfgs/default/4m/models/main/4m-b_mod7_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7_B_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7_B_CC12M) |\n| 4M-B | 7 | COYO700M | 198M | [Config](cfgs/default/4m/models/main/4m-b_mod7_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7_B_COYO700M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7_B_COYO700M) |\n| 4M-B | 21 | CC12M+COYO700M+C4 | 198M | [Config](cfgs/default/4m/models/main/4m-b_mod21_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-21_B/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-21_B) |\n| 4M-L | 7 | CC12M | 705M | [Config](cfgs/default/4m/models/main/4m-l_mod7_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7_L_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7_L_CC12M) |\n| 4M-L | 7 | COYO700M | 705M | [Config](cfgs/default/4m/models/main/4m-l_mod7_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7_L_COYO700M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7_L_COYO700M) |\n| 4M-L | 21 | CC12M+COYO700M+C4 | 705M | [Config](cfgs/default/4m/models/main/4m-l_mod21_500b.yaml) | 
[Checkpoint](https://huggingface.co/EPFL-VILAB/4M-21_L/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-21_L) |\n| 4M-XL | 7 | CC12M | 2.8B | [Config](cfgs/default/4m/models/main/4m-xl_mod7_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7_XL_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7_XL_CC12M) |\n| 4M-XL | 7 | COYO700M | 2.8B | [Config](cfgs/default/4m/models/main/4m-xl_mod7_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7_XL_COYO700M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7_XL_COYO700M) |\n| 4M-XL | 21 | CC12M+COYO700M+C4 | 2.8B | [Config](cfgs/default/4m/models/main/4m-xl_mod21_500b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-21_XL/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-21_XL) |\n\nTo load models from Hugging Face Hub:\n```python\nfrom fourm.models.fm import FM\n\nfm7b_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7_B_CC12M')\nfm7b_coyo   = FM.from_pretrained('EPFL-VILAB/4M-7_B_COYO700M')\nfm21b       = FM.from_pretrained('EPFL-VILAB/4M-21_B')\n\nfm7l_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7_L_CC12M')\nfm7l_coyo   = FM.from_pretrained('EPFL-VILAB/4M-7_L_COYO700M')\nfm21l       = FM.from_pretrained('EPFL-VILAB/4M-21_L')\n\nfm7xl_cc12m = FM.from_pretrained('EPFL-VILAB/4M-7_XL_CC12M')\nfm7xl_coyo  = FM.from_pretrained('EPFL-VILAB/4M-7_XL_COYO700M')\nfm21xl      = FM.from_pretrained('EPFL-VILAB/4M-21_XL')\n```\n\nTo load the checkpoints manually, first download the safetensors files from the above links and call:\n```python\nfrom fourm.utils import load_safetensors\nfrom fourm.models.fm import FM\n\nckpt, config = load_safetensors('/path/to/checkpoint.safetensors')\nfm = FM(config=config)\nfm.load_state_dict(ckpt)\n```\n\n### 4M text-to-image specialist models\n\nThese models were initialized with the standard 4M-7 CC12M models, but continued training with a 
modality mixture heavily biased towards text inputs. They are still able to perform all other tasks, but perform better at text-to-image generation compared to the non-finetuned models.\n\n| Model   | # Mod. | Datasets | # Params | Config | Weights         |\n| ------- | ------ | -------- | -------- | ------ | --------------- |\n| 4M-T2I-B | 7 | CC12M | 198M | [Config](cfgs/default/4m/models/specialized/4m-b_mod7_500b--spec_text2im_100b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7-T2I_B_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7-T2I_B_CC12M) |\n| 4M-T2I-L | 7 | CC12M | 705M | [Config](cfgs/default/4m/models/specialized/4m-l_mod7_500b--spec_text2im_100b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7-T2I_L_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7-T2I_L_CC12M) |\n| 4M-T2I-XL | 7 | CC12M | 2.8B | [Config](cfgs/default/4m/models/specialized/4m-xl_mod7_500b--spec_text2im_100b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7-T2I_XL_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7-T2I_XL_CC12M) |\n\nTo load models from Hugging Face Hub:\n```python\nfrom fourm.models.fm import FM\n\nfm7b_t2i_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-T2I_B_CC12M')\nfm7l_t2i_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-T2I_L_CC12M')\nfm7xl_t2i_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-T2I_XL_CC12M')\n```\n\nLoading manually from checkpoints is performed in the same way as above for the base 4M models.\n\n### 4M super-resolution models\n\n| Model   | # Mod. 
| Datasets | # Params | Config | Weights         |\n| ------- | ------ | -------- | -------- | ------ | --------------- |\n| 4M-SR-L | 7 | CC12M | 198M | [Config](cfgs/default/4m/models/superres/4m-l_mod7_500b--sr_448_100b.yaml) | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M-7-SR_L_CC12M/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M-7-SR_L_CC12M) |\n\nTo load models from Hugging Face Hub:\n```python\nfrom fourm.models.fm import FM\n\nfm7l_sr_cc12m  = FM.from_pretrained('EPFL-VILAB/4M-7-SR_L_CC12M')\n```\n\nLoading manually from checkpoints is performed in the same way as above for the base 4M models.\n\n### Tokenizers\n\n| Modality                   | Resolution | Number of tokens | Codebook size   | Diffusion decoder | Weights |\n|----------------------------|------------|------------------|-----------------|-------------------|---------|\n| RGB                        | 224-448    | 196-784          | 16k             | ✓                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_rgb_16k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_rgb_16k_224-448) |\n| Depth                      | 224-448    | 196-784          |  8k             | ✓                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_depth_8k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_depth_8k_224-448) |\n| Normals                    | 224-448    | 196-784          |  8k             | ✓                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_normal_8k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_normal_8k_224-448) |\n| Edges (Canny, SAM)         | 224-512    | 196-1024         |  8k             | ✓                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_edge_8k_224-512/resolve/main/model.safetensors) / [HF 
Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_edge_8k_224-512) |\n| COCO semantic segmentation | 224-448    | 196-784          |  4k             | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_semseg_4k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_semseg_4k_224-448) |\n| CLIP-B/16                  | 224-448    | 196-784          |  8k             | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_CLIP-B16_8k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_CLIP-B16_8k_224-448) |\n| DINOv2-B/14                | 224-448    | 256-1024         |  8k             | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_DINOv2-B14_8k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_DINOv2-B14_8k_224-448) |\n| DINOv2-B/14 (global)       | 224        | 16               |  8k             | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_DINOv2-B14-global_8k_16_224/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_DINOv2-B14-global_8k_16_224) |\n| ImageBind-H/14             | 224-448    | 256-1024         |  8k             | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_ImageBind-H14_8k_224-448/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_ImageBind-H14_8k_224-448) |\n| ImageBind-H/14 (global)    | 224        | 16               | 8k              | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_ImageBind-H14-global_8k_16_224/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_ImageBind-H14-global_8k_16_224) |\n| SAM instances              | -          | 64               | 1k             | ✗                
 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_sam-instance_1k_64/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_sam-instance_1k_64) |\n| 3D Human poses             | -          | 8                | 1k             | ✗                 | [Checkpoint](https://huggingface.co/EPFL-VILAB/4M_tokenizers_human-poses_1k_8/resolve/main/model.safetensors) / [HF Hub](https://huggingface.co/EPFL-VILAB/4M_tokenizers_human-poses_1k_8) |\n\nTo load models from Hugging Face Hub:\n```python\nfrom fourm.vq.vqvae import VQVAE, DiVAE\n\n# 4M-7 modalities\ntok_rgb = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_rgb_16k_224-448')\ntok_depth = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_depth_8k_224-448')\ntok_normal = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_normal_8k_224-448')\ntok_semseg = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_semseg_4k_224-448')\ntok_clip = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_CLIP-B16_8k_224-448')\n\n# 4M-21 modalities\ntok_edge = DiVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_edge_8k_224-512')\ntok_dinov2 = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_DINOv2-B14_8k_224-448')\ntok_dinov2_global = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_DINOv2-B14-global_8k_16_224')\ntok_imagebind = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_ImageBind-H14_8k_224-448')\ntok_imagebind_global = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_ImageBind-H14-global_8k_16_224')\nsam_instance = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_sam-instance_1k_64')\nhuman_poses = VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_human-poses_1k_8')\n```\n\nTo load the checkpoints manually, first download the safetensors files from the above links and call:\n```python\nfrom fourm.utils import load_safetensors\nfrom fourm.vq.vqvae import VQVAE, DiVAE\n\nckpt, config = load_safetensors('/path/to/checkpoint.safetensors')\ntok = VQVAE(config=config) # Or DiVAE for models with a diffusion 
decoder\ntok.load_state_dict(ckpt)\n```\n\n\n## License\n\nThe code in this repository is released under the Apache 2.0 license as found in the [LICENSE](LICENSE) file.\n\nThe model weights in this repository are released under the Sample Code license as found in the [LICENSE_WEIGHTS](LICENSE_WEIGHTS) file.\n\n## Citation\n\nIf you find this repository helpful, please consider citing our work:\n```\n@inproceedings{4m,\n    title={{4M}: Massively Multimodal Masked Modeling},\n    author={David Mizrahi and Roman Bachmann and O{\\u{g}}uzhan Fatih Kar and Teresa Yeo and Mingfei Gao and Afshin Dehghan and Amir Zamir},\n    booktitle={Thirty-seventh Conference on Neural Information Processing Systems},\n    year={2023},\n}\n\n@article{4m21,\n    title={{4M-21}: An Any-to-Any Vision Model for Tens of Tasks and Modalities},\n    author={Roman Bachmann and O{\\u{g}}uzhan Fatih Kar and David Mizrahi and Ali Garjani and Mingfei Gao and David Griffiths and Jiaming Hu and Afshin Dehghan and Amir Zamir},\n    journal={arXiv 2024},\n    year={2024},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-4m","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapple%2Fml-4m","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapple%2Fml-4m/lists"}