https://github.com/facebookresearch/ImageBind
ImageBind One Embedding Space to Bind Them All
- Host: GitHub
- URL: https://github.com/facebookresearch/ImageBind
- Owner: facebookresearch
- License: other
- Created: 2023-03-23T15:52:47.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-07-31T18:44:13.000Z (3 months ago)
- Last Synced: 2024-10-10T13:41:32.155Z (29 days ago)
- Language: Python
- Size: 2.56 MB
- Stars: 8,270
- Watchers: 99
- Forks: 758
- Open Issues: 82
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome-foundation-model - ImageBind
- AiTreasureBox - facebookresearch/ImageBind - ImageBind One Embedding Space to Bind Them All (Repos)
- ai-game-devtools - ImageBind
- awesome-generative-ai - facebookresearch/ImageBind
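- StarryDivineSky - facebookresearch/ImageBind - Images, text, audio, depth, thermal, and IMU data (an inertial measurement unit measures an object's acceleration, angular velocity, magnetic field, altitude, and so on). It enables novel emergent applications "out of the box", including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation. (Other_Machine Vision / Web Services_Other)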
- awesome-llm-and-aigc - ImageBind
README
# ImageBind: One Embedding Space To Bind Them All
**[FAIR, Meta AI](https://ai.facebook.com/research/)**
Rohit Girdhar*,
Alaaeldin El-Nouby*,
Zhuang Liu,
Mannat Singh,
Kalyan Vasudev Alwala,
Armand Joulin,
Ishan Misra*

To appear at CVPR 2023 (*Highlighted paper*)
[[`Paper`](https://facebookresearch.github.io/ImageBind/paper)] [[`Blog`](https://ai.facebook.com/blog/imagebind-six-modalities-binding-ai/)] [[`Demo`](https://imagebind.metademolab.com/)] [[`Supplementary Video`](https://dl.fbaipublicfiles.com/imagebind/imagebind_video.mp4)] [[`BibTex`](#citing-imagebind)]
PyTorch implementation and pretrained models for ImageBind. For details, see the paper: **[ImageBind: One Embedding Space To Bind Them All](https://facebookresearch.github.io/ImageBind/paper)**.
ImageBind learns a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation.
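As a rough illustration of what "composing modalities with arithmetic" can mean in a joint embedding space, the sketch below adds two unit-normalized embeddings and retrieves the closest gallery item by cosine similarity. The 3-dimensional vectors are made-up toy values, not real ImageBind outputs, and the helper names (`normalize`, `compose`, `nearest`) are hypothetical and not part of the `imagebind` package.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def compose(a, b):
    """Add two unit-normalized embeddings and re-normalize the sum."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])


def nearest(query, candidates):
    """Return the name of the candidate with the highest cosine similarity."""
    q = normalize(query)

    def score(item):
        return sum(x * y for x, y in zip(q, normalize(item[1])))

    return max(candidates.items(), key=score)[0]


# Toy embeddings (assumed values, for illustration only).
image_beach = [1.0, 0.0, 0.0]  # stands in for an image embedding
audio_dog = [0.0, 1.0, 0.0]    # stands in for an audio embedding

gallery = {
    "beach": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "dog_on_beach": [0.7, 0.7, 0.1],
    "car": [0.0, 0.1, 0.9],
}

query = compose(image_beach, audio_dog)
print(nearest(query, gallery))  # dog_on_beach
```

With real embeddings, the same idea would apply to the vectors returned by the model for each modality, assuming they are comparable after normalization.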
![ImageBind](https://user-images.githubusercontent.com/8495451/236859695-ffa13364-3e39-4d99-a8da-fbfab17f9a6b.gif)
## ImageBind model
Emergent zero-shot classification performance.
| Model | IN1k | K400 | NYU-D | ESC | LLVIP | Ego4D | download |
|---|---|---|---|---|---|---|---|
| imagebind_huge | 77.7 | 50.0 | 54.0 | 66.9 | 63.4 | 25.0 | checkpoint |
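Zero-shot classification in a joint embedding space is commonly done by embedding the class names as text and predicting the class whose text embedding is most similar to the sample embedding. The sketch below shows that scheme and a top-1 accuracy computation on toy 2-dimensional vectors. It is an assumption about the general technique, not the evaluation code behind the numbers above, and `zero_shot_predict` and `top1_accuracy` are hypothetical helpers.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def zero_shot_predict(sample, class_embeddings):
    """Return the class whose text embedding scores highest against the sample."""
    return max(class_embeddings, key=lambda name: dot(sample, class_embeddings[name]))


def top1_accuracy(samples, labels, class_embeddings):
    """Percentage of samples whose predicted class equals the true label."""
    correct = sum(
        zero_shot_predict(s, class_embeddings) == y for s, y in zip(samples, labels)
    )
    return 100.0 * correct / len(labels)


# Toy data (assumed values, for illustration only).
class_embeddings = {"dog": [1.0, 0.0], "car": [0.0, 1.0]}
samples = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
labels = ["dog", "car", "car", "dog"]

print(top1_accuracy(samples, labels, class_embeddings))  # 75.0
```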
## Usage
Install PyTorch 1.13+ and the other third-party dependencies.
```shell
conda create --name imagebind python=3.10 -y
conda activate imagebind

pip install .
```

For Windows users, you might need to install `soundfile` for reading/writing audio files. (Thanks @congyue1977)
```
pip install soundfile
```

Extract and compare features across modalities (e.g. Image, Text and Audio).
```python
from imagebind import data
import torch
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

text_list = ["A dog.", "A car", "A bird"]
image_paths = [".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths = [".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

# Expected output:
#
# Vision x Text:
# tensor([[9.9761e-01, 2.3694e-03, 1.8612e-05],
#         [3.3836e-05, 9.9994e-01, 2.4118e-05],
#         [4.7997e-05, 1.3496e-02, 9.8646e-01]])
#
# Audio x Text:
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
#
# Vision x Audio:
# tensor([[0.8070, 0.1088, 0.0842],
#         [0.1036, 0.7884, 0.1079],
#         [0.0018, 0.0022, 0.9960]])
```
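Each row of the printed matrices is a softmax distribution over the columns, so the best match for row `i` is the column with the largest probability. As a small sketch, the snippet below reads the matches off the "Vision x Text" values from the expected output using plain Python lists. `best_matches` is a hypothetical helper, not part of the `imagebind` package.

```python
text_list = ["A dog.", "A car", "A bird"]

# "Vision x Text" probabilities copied from the expected output.
vision_x_text = [
    [9.9761e-01, 2.3694e-03, 1.8612e-05],
    [3.3836e-05, 9.9994e-01, 2.4118e-05],
    [4.7997e-05, 1.3496e-02, 9.8646e-01],
]


def best_matches(probabilities, labels):
    """For each row, return the label at the column with the highest probability."""
    return [
        labels[max(range(len(row)), key=row.__getitem__)] for row in probabilities
    ]


print(best_matches(vision_x_text, text_list))  # ['A dog.', 'A car', 'A bird']
```

When working with the tensors directly, `torch.argmax(probabilities, dim=-1)` returns the same column indices.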
## Model card
Please see the [model card](model_card.md) for details.

## License
ImageBind code and model weights are released under the CC-BY-NC 4.0 license. See [LICENSE](LICENSE) for additional details.
## Contributing
See [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).
## Citing ImageBind
If you find this repository useful, please consider giving it a star :star: and citing it:
```
@inproceedings{girdhar2023imagebind,
title={ImageBind: One Embedding Space To Bind Them All},
author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang
and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},
booktitle={CVPR},
year={2023}
}
```