https://github.com/ofa-sys/one-peace
A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
- Host: GitHub
- URL: https://github.com/ofa-sys/one-peace
- Owner: OFA-Sys
- License: apache-2.0
- Created: 2023-05-18T23:53:24.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-06T04:13:22.000Z (about 1 year ago)
- Last Synced: 2024-11-29T15:50:51.633Z (11 months ago)
- Topics: audio-language, contrastive-loss, foundation-models, multimodal, representation-learning, vision-and-language, vision-language, vision-transformer
- Language: Python
- Homepage:
- Size: 29.9 MB
- Stars: 977
- Watchers: 14
- Forks: 63
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
# README
📖 Paper | 🤗 Demo | 🤖 ModelScope | Checkpoints | Datasets
ONE-PEACE is a general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results in vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.

The architecture and pretraining tasks of ONE-PEACE are shown below. With its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to expand to unlimited modalities.
# Online Demo
We provide an [online demo](https://huggingface.co/spaces/OFA-Sys/ONE-PEACE_Multimodal_Retrieval) on Hugging Face Spaces. In this demo, you can combine multiple modalities to retrieve related images, such as audio-to-image, audio+text-to-image, audio+image-to-image, and even audio+image+text-to-image retrieval.
# News
* **2023.7.20:** Released the [visual grounding API](https://github.com/OFA-Sys/ONE-PEACE#visual-grounding), which you can use to locate objects in a picture.
* **2023.6.23:** Released vision tasks fine-tuning scripts and checkpoints. See [guidance for vision tasks](one_peace_vision/README.md) for more details.
* **2023.6.04:** Released the pretraining scripts. See [guidance for pretraining](one_peace/README.md#pretraining) for more details.
* **2023.5.30:** Released the finetuned checkpoints and scripts for audio(-language) tasks.
* **2023.5.29:** Released the finetuned checkpoints for vision-language tasks.
* **2023.5.27:** 🔥 We have provided the [multimodal retrieval demo](https://huggingface.co/spaces/OFA-Sys/ONE-PEACE_Multimodal_Retrieval) on Hugging Face Spaces. Have fun!
* **2023.5.25:** Released the [multimodal embedding API](https://github.com/OFA-Sys/ONE-PEACE#multi-modal-embedding), which enables quick extraction of image, audio, and text representations.
* **2023.5.23:** Released the [pretrained checkpoint](checkpoints.md), as well as [finetuning & inference scripts](one_peace/README.md) for vision-language tasks.
* **2023.5.19:** Released the paper and code. Pretrained & finetuned checkpoints, training & inference scripts, as well as demos will be released as soon as possible.
# Models and Results
## Model Card
We list the parameters and pretrained checkpoints of ONE-PEACE below. Note that ONE-PEACE can be disassembled into different branches to handle different tasks.
We also provide the vision branch of ONE-PEACE, which can be used on its own to perform vision tasks.
| Model | Ckpt | Params | Hidden size | Intermediate size | Attention heads | Layers |
|-------|------|--------|-------------|-------------------|-----------------|--------|
| ONE-PEACE | Download | 4B | 1536 | 6144 | 24 | 40 |
| ONE-PEACE (Vision Branch) | Download | 1.5B | 1536 | 6144 | 24 | 40 |
## Results
### Vision Tasks
| Task | Image Classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
|------|----------------------|-----------------------|----------------------------------|--------------------------|
| Dataset | ImageNet-1K | ADE20K | COCO | Kinetics 400 |
| Split | val | val | val | val |
| Metric | Acc. | mIoU<sup>ss</sup> / mIoU<sup>ms</sup> | AP<sup>box</sup> / AP<sup>mask</sup> | Top-1 Acc. / Top-5 Acc. |
| ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |
### Audio Tasks
| Task | Audio-Text Retrieval | | | | Audio Classification | | | Audio Question Answering |
|------|------|------|------|------|------|------|------|------|
| Dataset | AudioCaps | | Clotho | | ESC-50 | FSD50K | VGGSound (Audio-Visual) | AVQA |
| Split | test | | evaluation | | full | eval | test | val |
| Metric | T2A R@1 | A2T R@1 | T2A R@1 | A2T R@1 | Zero-shot Acc. | MAP | Acc. | Acc. |
| ONE-PEACE | 42.5 | 51.0 | 22.4 | 27.1 | 91.8 | 69.7 | 68.2 | 92.2 |
### Vision-Language Tasks
| Task | Image-Text Retrieval (w/o ranking) | | | | Visual Grounding | | | VQA | Visual Reasoning |
|------|------|------|------|------|------|------|------|------|------|
| Dataset | COCO | | Flickr30K | | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
| Split | test | | test | | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
| Metric | I2T R@1 | T2I R@1 | I2T R@1 | T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
| ONE-PEACE | 84.1 | 65.4 | 97.6 | 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |
# Requirements and Installation
* 3.6 <= Python <= 3.10
* PyTorch >= 1.10.0 (recommend 1.13.1)
* CUDA Version >= 10.2 (recommend 11.6)
* Install required packages:
```bash
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
```
* For faster training, install the [Apex](https://github.com/NVIDIA/apex) library (optional):
```bash
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
```
* Install the [xFormers](https://github.com/facebookresearch/xformers) library to use memory-efficient attention (optional):
```bash
conda install xformers -c xformers
```
* Install the [FlashAttention](https://github.com/HazyResearch/flash-attention) library to use its faster LayerNorm kernel (optional):
```bash
git clone --recursive https://github.com/HazyResearch/flash-attention
cd flash-attention && pip install .
cd csrc/layer_norm && pip install .
```
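After installation, you can optionally run a quick sanity check. The snippet below is a minimal sketch: it only reports the PyTorch/CUDA setup and whether the optional acceleration libraries are importable; the module names checked here (`apex`, `xformers`, `flash_attn`) are assumptions based on their usual import names.
```python
# Optional sanity check: report the core environment and whether the optional
# acceleration libraries can be imported. The module names below are assumptions
# (the usual import names), not requirements.
import importlib.util

import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

for pkg in ("apex", "xformers", "flash_attn"):
    status = "found" if importlib.util.find_spec(pkg) is not None else "not installed (optional)"
    print(f"{pkg}: {status}")
```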
# Datasets and Checkpoints
See [datasets.md](datasets.md) and [checkpoints.md](checkpoints.md).
# Usage
## API
We provide simple code snippets to show how to use the ONE-PEACE API.
### Multi-modal Embedding
We use ONE-PEACE to compute embeddings for text, images, and audio, as well as their similarities:
```python
import torch
from one_peace.models import from_pretrained
device = "cuda" if torch.cuda.is_available() else "cpu"
# "ONE-PEACE" can also be replaced with ckpt path
model = from_pretrained("ONE-PEACE", device=device, dtype="float32")
# process raw data
src_tokens = model.process_text(["cow", "dog", "elephant"])
src_images = model.process_image(["assets/dog.JPEG", "assets/elephant.JPEG"])
src_audios, audio_padding_masks = model.process_audio(["assets/cow.flac", "assets/dog.flac"])
with torch.no_grad():
    # extract normalized features
    text_features = model.extract_text_features(src_tokens)
    image_features = model.extract_image_features(src_images)
    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)

    # compute similarity
    i2t_similarity = image_features @ text_features.T
    a2t_similarity = audio_features @ text_features.T

print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
```
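The same embeddings can be used for the kind of combined-modality retrieval shown in the online demo. The sketch below continues the snippet above (reusing `image_features`, `text_features`, and `audio_features`) and ranks the gallery images against an audio-only query and an audio+text query; fusing the query by summing and re-normalizing the feature vectors is an illustrative assumption, not necessarily the fusion rule used in the demo.
```python
# Rank the gallery images against an audio-only query and an audio+text query.
# Reuses image_features, text_features, and audio_features from the snippet above.
audio_query = audio_features[0:1]   # the "cow" audio clip
text_query = text_features[0:1]     # the text "cow"

# Illustrative fusion rule (assumption): sum the normalized features, then re-normalize.
combined_query = audio_query + text_query
combined_query = combined_query / combined_query.norm(dim=-1, keepdim=True)

for name, query in [("audio only", audio_query), ("audio + text", combined_query)]:
    scores = (query @ image_features.T).squeeze(0)  # cosine similarities
    best = scores.argmax().item()
    print(f"{name}: best image index {best}, scores {scores.tolist()}")
```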
### Visual Grounding
We use ONE-PEACE to perform visual grounding on anime pictures:
```python
import torch
import cv2
from one_peace.models import from_pretrained
device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_Grounding",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process raw data
image_text_list = [
    ("assets/pokemons.jpg", "a blue turtle-like pokemon with round head"),
    ("assets/pokemons.jpg", "Bulbasaur"),
    ("assets/pokemons.jpg", "Charmander"),
    ("assets/pokemons.jpg", "Squirtle"),
    ("assets/one_piece.jpeg", "Brook"),
    ("assets/one_piece.jpeg", "Franky"),
    ("assets/one_piece.jpeg", "Monkey D. Luffy"),
    ("assets/one_piece.jpeg", "Nami"),
    ("assets/one_piece.jpeg", "Nico Robin"),
    ("assets/one_piece.jpeg", "Roronoa Zoro"),
    ("assets/one_piece.jpeg", "Tony Tony Chopper"),
    ("assets/one_piece.jpeg", "Usopp"),
    ("assets/one_piece.jpeg", "Vinsmoke Sanji"),
]
(src_images, image_widths, image_heights), src_tokens = model.process_image_text_pairs(
    image_text_list, return_image_sizes=True
)

with torch.no_grad():
    # extract features
    vl_features = model.extract_vl_features(src_images, src_tokens).sigmoid()
    # extract coords (scale normalized outputs to pixel coordinates)
    vl_features[:, ::2] *= image_widths.unsqueeze(1)
    vl_features[:, 1::2] *= image_heights.unsqueeze(1)
    coords = vl_features.cpu().tolist()

# display results
for i, image_text_pair in enumerate(image_text_list):
    image, text = image_text_pair
    img = cv2.imread(image)
    cv2.rectangle(
        img,
        (int(coords[i][0]), int(coords[i][1])),
        (int(coords[i][2]), int(coords[i][3])),
        (0, 255, 0),
        3
    )
    cv2.imshow(text, img)
    cv2.waitKey(3500)
    cv2.destroyAllWindows()
```
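`cv2.imshow` requires a display, so the loop above will fail on a headless server. The sketch below is a minimal variation that writes the annotated images to disk instead; it reuses `coords` and `image_text_list` from the snippet above, and the output directory name is arbitrary.
```python
import os

import cv2

# Headless alternative: draw each predicted box and save the annotated image
# to disk instead of opening a window. Reuses coords and image_text_list from above.
os.makedirs("grounding_outputs", exist_ok=True)
for i, (image, text) in enumerate(image_text_list):
    img = cv2.imread(image)
    x0, y0, x1, y1 = (int(v) for v in coords[i][:4])
    cv2.rectangle(img, (x0, y0), (x1, y1), (0, 255, 0), 3)
    cv2.putText(img, text, (x0, max(y0 - 10, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    out_path = os.path.join("grounding_outputs", f"{i}_{text.replace(' ', '_')}.jpg")
    cv2.imwrite(out_path, img)
    print("saved", out_path)
```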
### Audio Classification
We use ONE-PEACE to perform audio classification:
```python
import torch
import json
from one_peace.models import from_pretrained
id2label = json.load(open("assets/vggsound_id2label.json"))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_VGGSound",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process audio
audio_list = ["assets/cow.flac", "assets/dog.flac"]
src_audios, audio_padding_masks = model.process_audio(audio_list)

with torch.no_grad():
    # extract audio features
    audio_logits = model.extract_audio_features(src_audios, audio_padding_masks)
    print(audio_logits.size())

predict_label_ids = audio_logits.argmax(1).cpu().tolist()
for audio, predict_label_id in zip(audio_list, predict_label_ids):
    predict_label = id2label[str(predict_label_id)]
    print('audio: {}, predict label: {}'.format(audio, predict_label))
```
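If you want a ranked list of classes rather than a single argmax label, you can apply a softmax over the logits. The sketch below reuses `audio_logits`, `audio_list`, and `id2label` from the snippet above, and assumes the extracted features are raw classification logits (as the argmax above suggests).
```python
# Top-5 predictions per clip: softmax over the classification logits, then take
# the five most probable classes. Reuses audio_logits, audio_list, and id2label.
probs = audio_logits.softmax(dim=-1)
top_probs, top_ids = probs.topk(5, dim=-1)
for audio, p_row, id_row in zip(audio_list, top_probs.cpu().tolist(), top_ids.cpu().tolist()):
    labels = ", ".join(f"{id2label[str(i)]}: {p:.3f}" for i, p in zip(id_row, p_row))
    print(f"audio: {audio}, top-5: {labels}")
```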
## Training & Inference
If the API alone does not meet your needs, we provide comprehensive training and inference instructions for [audio & multimodal](one_peace/README.md) and [vision](one_peace_vision/README.md) tasks.
# Gallery
## Visual Grounding (unseen domain)

## Emergent Zero-shot Retrieval



# Acknowledgement
* [Fairseq](https://github.com/pytorch/fairseq): A sequence modeling toolkit with flexible configuration and a highly extensible code structure.
* [xFormers](https://github.com/facebookresearch/xformers): A toolbox to accelerate research on Transformers.
* [FlashAttention](https://github.com/HazyResearch/flash-attention): The official implementation of FlashAttention, which greatly speeds up multi-head attention.
* [Apex](https://github.com/NVIDIA/apex): A repository that provides useful model acceleration and memory optimization techniques.
## Getting Involved
Feel free to submit GitHub issues or pull requests; contributions to the project are welcome!
To contact us, don't hesitate to send an email to `zheluo.wp@alibaba-inc.com` or `saimeng.wsj@alibaba-inc.com`!
# Citation
If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
```BibTeX
@article{wang2023one,
title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
journal={arXiv preprint arXiv:2305.11172},
year={2023}
}
```