https://github.com/pku-yuangroup/chat-univi

[CVPR 2024 Highlight🔥] Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Host: GitHub
URL: https://github.com/pku-yuangroup/chat-univi
Owner: PKU-YuanGroup
License: apache-2.0
Created: 2023-11-13T11:52:56.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-10-16T11:08:54.000Z (12 months ago)
Last Synced: 2025-04-14T18:04:54.433Z (6 months ago)
Topics: image-understanding, large-language-models, video-understanding, vision-language-model
Language: Python
Homepage: https://arxiv.org/abs/2311.08046
Size: 38.2 MB
Stars: 931
Watchers: 9
Forks: 46
Open Issues: 19
Metadata Files:
- Readme: README.md
- License: LICENSE

README

          






# Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding



 If you like our project, please give us a star ⭐ on GitHub for the latest update.




[![Demo](https://img.shields.io/badge/⚡-Hugging%20Face%20Demo-yellow.svg)](https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi)

[![hf](https://img.shields.io/badge/🤗-Hugging%20Face-blue.svg)](https://huggingface.co/Chat-UniVi)

[![arXiv](https://img.shields.io/badge/Arxiv-2311.08046-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2311.08046)

[![License](https://img.shields.io/badge/Code%20License-Apache2.0-yellow)](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/main/LICENSE)

[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FPKU-YuanGroup%2FChat-UniVi&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=Visitor&edge_flat=false)](https://hits.seeyoufarm.com)

[![GitHub issues](https://img.shields.io/github/issues/PKU-YuanGroup/Chat-UniVi?color=critical&label=Issues)](https://github.com/PKU-YuanGroup/Chat-UniVi/issues?q=is%3Aopen+is%3Aissue)

[![GitHub closed issues](https://img.shields.io/github/issues-closed/PKU-YuanGroup/Chat-UniVi?color=success&label=Issues)](https://github.com/PKU-YuanGroup/Chat-UniVi/issues?q=is%3Aissue+is%3Aclosed)



[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/image-based-generative-performance)](https://paperswithcode.com/sota/image-based-generative-performance?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/video-based-generative-performance)](https://paperswithcode.com/sota/video-based-generative-performance?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/video-based-generative-performance-1)](https://paperswithcode.com/sota/video-based-generative-performance-1?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/video-based-generative-performance-5)](https://paperswithcode.com/sota/video-based-generative-performance-5?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/video-based-generative-performance-4)](https://paperswithcode.com/sota/video-based-generative-performance-4?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/video-based-generative-performance-3)](https://paperswithcode.com/sota/video-based-generative-performance-3?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/video-based-generative-performance-2)](https://paperswithcode.com/sota/video-based-generative-performance-2?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/zeroshot-video-question-answer-on-msrvtt-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msrvtt-qa?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/zeroshot-video-question-answer-on-tgif-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-tgif-qa?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/zeroshot-video-question-answer-on-msvd-qa)](https://paperswithcode.com/sota/zeroshot-video-question-answer-on-msvd-qa?p=chat-univi-unified-visual-representation) 


[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/chat-univi-unified-visual-representation/science-question-answering-on-scienceqa)](https://paperswithcode.com/sota/science-question-answering-on-scienceqa?p=chat-univi-unified-visual-representation) 


💡 I also have other LLM projects that may interest you ✨. 


    

> [**MoH: Multi-Head Attention as Mixture-of-Head Attention**](https://github.com/SkyworkAI/MoH) 


> Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan 


[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/SkyworkAI/MoH)  [![github](https://img.shields.io/github/stars/SkyworkAI/MoH.svg?style=social)](https://github.com/SkyworkAI/MoH) [![arXiv](https://img.shields.io/badge/Arxiv-2410.11842-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.11842) 


    

> [**MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts**](https://github.com/SkyworkAI/MoE-plus-plus) 


> Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan 


[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/SkyworkAI/MoE-plus-plus)  [![github](https://img.shields.io/github/stars/SkyworkAI/MoE-plus-plus.svg?style=social)](https://github.com/SkyworkAI/MoE-plus-plus) [![arXiv](https://img.shields.io/badge/Arxiv-2410.07348-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.07348) 





## 📣 News

* **[2024/04/05]** We have corrected the temporal understanding score in the video understanding evaluation: the actual model performance is 47.9, not the previously stated 57.8. We sincerely apologize for any inconvenience our oversight may have caused you.

* **[2024/04/05]** **Chat-UniVi** has been selected as a **Highlight** paper at CVPR 2024! (Top 3% of 11532 submissions).

* **[2024/02/27]** Our **Chat-UniVi** has been accepted by CVPR 2024!

* **[2024/01/05]**  We enhance the video loading code by [introducing support for variable-length videos](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/d216cb52bff5ebf6e41eaa56d07a85568e294651/ChatUniVi/eval/model_video_general.py#L29). This improvement involves eliminating the previous zero-filling operation on the video. We find that this updated video loading method significantly boosts performance ([Results](https://github.com/PKU-YuanGroup/Chat-UniVi?tab=readme-ov-file#videoqa)).

* **[2023/12/05]**  The visualization script is available at [VISUALIZATION.md](VISUALIZATION.md).

* **[2023/11/22]**  ⚡ The **online demo** is available at [Hugging Face Demo](https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi). Welcome to try!

* **[2023/11/22]**  The processed data is available at [DATA.md](DATA.md).

* **[2023/11/21]**  💡 We release [Chat-UniVi-13B](https://huggingface.co/Chat-UniVi/Chat-UniVi-13B/tree/main). Our proposed unified visual representation framework greatly reduces the number of visual tokens, so you can train **13B unified image and video understanding models** in full parameters directly on **8 A100 GPUs** within **3 days**. Chat-UniVi-13B has better performance ([Results](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/main/results/Chat-UniVi-13B.md)). The training code for Chat-UniVi-13B has been updated ([TRAIN_AND_VALIDATE.md](TRAIN_AND_VALIDATE.md)).

* **[2023/11/21]**  We provide inference code for [video understanding](https://github.com/PKU-YuanGroup/Chat-UniVi/tree/main#inference-for-video-understanding) and [image understanding](https://github.com/PKU-YuanGroup/Chat-UniVi/tree/main#inference-for-image-understanding).

* **[2023/11/21]**  We enhance the video loading code by [introducing support for variable-length videos](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/d216cb52bff5ebf6e41eaa56d07a85568e294651/ChatUniVi/eval/model_video_general.py#L29). This improvement involves eliminating the previous zero-filling operation on the video. We find that this updated video loading method significantly boosts performance.

* **[2023/11/15]**  Code is available now! Welcome to **watch** 👀 this repository for the latest updates.

## 😮 Highlights

### 💡 Unified visual representation for image and video

We employ **a set of dynamic visual tokens** to uniformly represent images and videos.

This representation framework empowers the model to efficiently utilize **a limited number of visual tokens** to simultaneously capture **the spatial details necessary for images** and **the comprehensive temporal relationship required for videos**.
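
One way to picture how a large set of patch features can be condensed into a smaller set of "dynamic" tokens is clustering-based merging. The sketch below is an illustration only, not the repository's implementation: it assumes a density-peak clustering scheme with k-nearest-neighbour densities (DPC-KNN, the kind of method the paper describes), and the function name `merge_tokens_dpc_knn` and its parameters are hypothetical.

```python
import numpy as np


def merge_tokens_dpc_knn(tokens, num_clusters, k=5):
    """Merge N token vectors (shape N x D) into `num_clusters` averaged tokens.

    Hypothetical helper for illustration; requires N >= 2 and num_clusters <= N.
    """
    n = tokens.shape[0]
    # Pairwise Euclidean distances between all tokens.
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Local density: large when the k nearest neighbours are close.
    k = max(1, min(k, n - 1))
    knn_sq = np.sort(dist, axis=1)[:, 1:k + 1] ** 2
    density = np.exp(-knn_sq.mean(axis=1))
    # Distance to the nearest token of higher density
    # (the densest token gets the largest distance instead).
    higher = density[None, :] > density[:, None]
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()
    # Cluster centres: tokens that are both dense and far from any denser token.
    centers = np.argsort(-(density * delta))[:num_clusters]
    # Assign each token to its closest centre; a centre always joins its own cluster.
    assign = dist[:, centers].argmin(axis=1)
    assign[centers] = np.arange(len(centers))
    # Each merged token is the mean of its cluster members.
    merged = np.stack([tokens[assign == c].mean(axis=0) for c in range(len(centers))])
    return merged, assign


rng = np.random.default_rng(0)
# Two well-separated groups of 10 toy "patch features" each.
tokens = np.concatenate([rng.normal(0.0, 0.1, (10, 4)), rng.normal(5.0, 0.1, (10, 4))])
merged, assign = merge_tokens_dpc_knn(tokens, num_clusters=2)
print(merged.shape)  # (2, 4)
```

Because the number of clusters is a parameter, the same routine can yield few tokens for a simple frame and more for a detailed image, which is the intuition behind a limited but adaptive token budget.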







### 🔥 Joint training strategy, making LLMs understand both image and video

Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both mediums without requiring any modifications.







### 🤗 High performance, complementary learning with image and video

Extensive experimental results demonstrate that Chat-UniVi, as a unified model, consistently outperforms even existing methods exclusively designed for either images or videos.







## ⚡ Demo

Before running the demo, change the model path on line 15 of the demo script (`main_demo_7B.py` or `main_demo_13B.py`, matching the command you use). Then run the demo:

```bash
# For Chat-UniVi-7B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_7B:app --host 0.0.0.0 --port 8888

# For Chat-UniVi-13B
CUDA_VISIBLE_DEVICES=0 uvicorn main_demo_13B:app --host 0.0.0.0 --port 8888
```

### A conversation with both image and video







### A conversation that includes multiple videos







### A conversation that includes multiple images







### A conversation that includes a video







### A conversation in Chinese

With a translation API, our model can also support conversations in Chinese. We will add code to support Chinese conversations in future updates.







## 🚀 Main Results

### Image understanding

Following LLaVA, we report scores relative to GPT-4 for instruction-following questions.



    

| Methods | LLM | Conversation | Detail Description | Complex Reasoning | All |
| --- | --- | --- | --- | --- | --- |
| Chat-UniVi-7B | Vicuna-7B | 84.1 | 74.2 | 93.7 | 84.2 |
| Chat-UniVi-13B | Vicuna-13B | 84.1 | 79.4 | 94.7 | 86.1 |

    



### Video understanding

Following Video-ChatGPT, we report the relative scores between the output of the model and the ground truth, with the assistance of GPT. It is worth noting that the results reported in Video-ChatGPT span a range from 0 to 5. To standardize the metrics, we normalize all scores to a scale of 0 to 100.



    

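| Methods | LLM | Correct | Detail | Context | Temporal | Consistency |
| --- | --- | --- | --- | --- | --- | --- |
| Chat-UniVi-7B | Vicuna-7B | 57.8 | 58.2 | 69.2 | 47.9 | 56.2 |
| Chat-UniVi-13B | Vicuna-13B | 59.4 | 59.8 | 70.5 | - | 60.6 |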
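
The rescaling from the 0 to 5 range used by Video-ChatGPT to the 0 to 100 range reported here can be sketched as follows. The README does not spell out the formula, so the linear mapping and the helper name `normalize_score` are assumptions.

```python
def normalize_score(score, max_score=5.0):
    """Linearly rescale a score in [0, max_score] to the 0-100 range (assumed mapping)."""
    if not 0.0 <= score <= max_score:
        raise ValueError(f"score {score} is outside [0, {max_score}]")
    return 100.0 * score / max_score


print(normalize_score(2.5))  # 50.0
```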

    



### ScienceQA

We report both zero-shot and fine-tuning results on the ScienceQA test set. 



    

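| Methods | LLM | Average | Subject: NAT | Subject: SOC | Subject: LAN | Context Modality: TXT | Context Modality: IMG | Context Modality: NO | Grade: G1-6 | Grade: G7-12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chat-UniVi-7B | Vicuna-7B | 88.78 | 88.50 | 93.03 | 85.91 | 88.51 | 85.97 | 88.15 | 88.88 | 88.60 |
| Chat-UniVi-13B | Vicuna-13B | 90.99 | 90.41 | 95.05 | 88.91 | 89.64 | 88.05 | 90.94 | 91.19 | 90.64 |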

    



### VideoQA

We follow the evaluation protocol in Video-ChatGPT, i.e., employing GPT-assisted evaluation to assess the capabilities of models.



    

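| Methods | LLM Size | MSRVTT-QA Accuracy | MSRVTT-QA Score | MSVD-QA Accuracy | MSVD-QA Score | TGIF-QA Accuracy | TGIF-QA Score | ActivityNet-QA Accuracy | ActivityNet-QA Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Video-LLaMA | 7B | 29.6 | 1.8 | 51.6 | 2.5 | - | - | 12.4 | 1.1 |
| LLaMA-Adapter | 7B | 43.8 | 2.7 | 54.9 | 3.1 | - | - | 34.2 | 2.7 |
| VideoChat | 7B | 45.0 | 2.5 | 56.3 | 2.8 | 34.4 | 2.3 | 26.5 | 2.2 |
| Video-ChatGPT | 7B | 49.3 | 2.8 | 64.9 | 3.3 | 51.4 | 3.0 | 35.2 | 2.7 |
| Video-LLaVA | 7B | 59.2 | 3.5 | 70.7 | 3.9 | 70.0 | 4.0 | 45.3 | 3.3 |
| Chat-UniVi-7B | 7B | 54.6 | 3.1 | 65.0 | 3.6 | 60.3 | 3.4 | 45.8 | 3.2 |
| Chat-UniVi-7B with new video loading code | 7B | 55.0 | 3.1 | 69.3 | 3.7 | 69.0 | 3.8 | 46.1 | 3.3 |
| Chat-UniVi-7B v1.5 | 7B | 57.5 | 3.2 | 68.8 | 3.7 | 70.0 | 3.8 | 47.2 | 3.3 |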
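
As a rough illustration of how the Accuracy and Score columns can be derived under a GPT-assisted protocol, the sketch below aggregates per-question judgments. The record format (a `pred` field holding "yes" or "no" and a `score` field from 0 to 5) is modeled on the Video-ChatGPT evaluation scripts and is an assumption here, as is the helper name `summarize_judgments`.

```python
def summarize_judgments(judgments):
    """Aggregate per-question judgments into (accuracy in %, mean score)."""
    if not judgments:
        raise ValueError("no judgments to summarize")
    # A question counts as correct when the judge answered "yes".
    num_yes = sum(1 for j in judgments if j["pred"].strip().lower() == "yes")
    accuracy = 100.0 * num_yes / len(judgments)
    mean_score = sum(j["score"] for j in judgments) / len(judgments)
    return accuracy, mean_score


# Toy judgments, not real evaluation output.
judgments = [
    {"pred": "yes", "score": 4},
    {"pred": "no", "score": 1},
    {"pred": "yes", "score": 5},
    {"pred": "yes", "score": 4},
]
print(summarize_judgments(judgments))  # (75.0, 3.5)
```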

    



### Hallucination Evaluation (POPE)

Our model also achieves strong results on the POPE object hallucination benchmark.



    

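| Methods | LLM Size | Random Accuracy | Random F1-Score | Random Yes | Popular Accuracy | Popular F1-Score | Popular Yes | Adversarial Accuracy | Adversarial F1-Score | Adversarial Yes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA | 7B | 72.16 | 78.22 | 76.29 | 61.37 | 71.52 | 85.63 | 58.67 | 70.12 | 88.33 |
| Video-LLaVA | 7B | 86.2 | 85.2 | 42.0 | 85.3 | 84.0 | 42.1 | 81.6 | 80.8 | 45.8 |
| Chat-UniVi-7B | 7B | 85.19 | 86.05 | 54.67 | 69.50 | 74.39 | 69.10 | 64.97 | 71.54 | 73.10 |
| Chat-UniVi-7B v1.5 | 7B | 87.01 | 86.09 | 41.86 | 85.87 | 84.76 | 42.73 | 83.23 | 82.31 | 44.77 |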
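
For readers unfamiliar with the three columns reported per split, the sketch below computes them from yes/no answers using the standard definitions of accuracy and F1. It reads the "Yes" column as the percentage of questions answered "yes", which is an assumption, and `pope_metrics` is a hypothetical helper rather than the benchmark's official script.

```python
def pope_metrics(predictions, labels):
    """Return accuracy, F1 and yes-ratio (all in %) for yes/no answers."""
    pairs = list(zip(predictions, labels))
    if not pairs:
        raise ValueError("no answers to score")
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 100.0 * 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Share of questions the model answered "yes", regardless of correctness.
    yes_ratio = 100.0 * (tp + fp) / len(pairs)
    return {"accuracy": accuracy, "f1": f1, "yes_ratio": yes_ratio}


# Toy answers, not real benchmark output.
preds = ["yes", "yes", "no", "no", "yes", "no"]
truth = ["yes", "no", "no", "yes", "yes", "no"]
print(pope_metrics(preds, truth))
```

A very high yes-ratio together with low accuracy suggests a model that tends to affirm objects that are not present.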

    



## 😍 Visualization

### Visualization for the image inputs







### Visualization for the video inputs







## 🛠️ Requirements and Installation

Attention! If you are using Windows, please comment out `deepspeed` in pyproject.toml (line 20), as installing `deepspeed` may result in errors on Windows (see [Link](https://github.com/PKU-YuanGroup/Chat-UniVi/issues/2#issue-2007607645)). Keep in mind that `deepspeed` is needed only for training. If you only run inference and do not train models, it is recommended to comment it out.

* Python >= 3.10

* Install required packages:

```bash
git clone https://github.com/PKU-YuanGroup/Chat-UniVi
cd Chat-UniVi
conda create -n chatunivi python=3.10 -y
conda activate chatunivi
pip install --upgrade pip
pip install -e .
# Optional: not needed if you only intend to perform inference.
# pip install ninja
# pip install flash-attn --no-build-isolation
```
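
After installation, a quick sanity check of the environment can save a failed run later. The sketch below only verifies the Python version requirement stated above and reports whether the optional training packages are importable; `check_environment` is a hypothetical helper, not part of the repository.

```python
import importlib.util
import sys


def check_environment(optional=("deepspeed", "flash_attn", "ninja")):
    """Return (python_ok, {module_name: is_installed}) for the current interpreter."""
    # The README asks for Python >= 3.10.
    python_ok = sys.version_info >= (3, 10)
    # find_spec returns None when a top-level module cannot be located.
    found = {name: importlib.util.find_spec(name) is not None for name in optional}
    return python_ok, found


python_ok, found = check_environment()
print("Python >= 3.10:", python_ok)
for name, installed in found.items():
    print(f"{name}: {'installed' if installed else 'missing (only needed for training)'}")
```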

## 🤖 API

**We open-source the preprocessing code for all modalities.** To load the model from the Hugging Face model hub or from a local path, you can use the following code snippets.

### Inference for Video Understanding

```python
import torch
import os
from ChatUniVi.constants import *
from ChatUniVi.conversation import conv_templates, SeparatorStyle
from ChatUniVi.model.builder import load_pretrained_model
from ChatUniVi.utils import disable_torch_init
from ChatUniVi.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from PIL import Image
from decord import VideoReader, cpu
import numpy as np


def _get_rawvideo_dec(video_path, image_processor, max_frames=MAX_IMAGE_LENGTH, image_resolution=224, video_framerate=1, s=None, e=None):
    """Decode a video with decord and return (frames tensor of shape T x 3 x H x W, number of frames)."""
    if s is None:
        start_time, end_time = None, None
    else:
        # Clamp the optional [s, e] window (in seconds) to non-negative, ordered, non-empty bounds.
        start_time = max(int(s), 0)
        end_time = max(int(e), 0)
        if start_time > end_time:
            start_time, end_time = end_time, start_time
        elif start_time == end_time:
            end_time = start_time + 1

    if not os.path.exists(video_path):
        raise FileNotFoundError(video_path)
    vreader = VideoReader(video_path, ctx=cpu(0))

    fps = vreader.get_avg_fps()
    f_start = 0 if start_time is None else int(start_time * fps)
    f_end = int(min(1000000000 if end_time is None else end_time * fps, len(vreader) - 1))
    num_frames = f_end - f_start + 1
    if num_frames <= 0:
        # Raise instead of printing and returning None, which would fail later when unpacked.
        raise ValueError("video path: {} error.".format(video_path))

    # Keep `video_framerate` frames per second, then cap the total at `max_frames`.
    sample_fps = int(video_framerate)
    t_stride = int(round(float(fps) / sample_fps))
    all_pos = list(range(f_start, f_end + 1, t_stride))
    if len(all_pos) > max_frames:
        sample_pos = [all_pos[_] for _ in np.linspace(0, len(all_pos) - 1, num=max_frames, dtype=int)]
    else:
        sample_pos = all_pos

    patch_images = [Image.fromarray(f) for f in vreader.get_batch(sample_pos).asnumpy()]
    patch_images = torch.stack([image_processor.preprocess(img, return_tensors='pt')['pixel_values'][0] for img in patch_images])
    slice_len = patch_images.shape[0]
    return patch_images, slice_len


if __name__ == '__main__':
    # Model Parameter
    model_path = "Chat-UniVi/Chat-UniVi"  # or "Chat-UniVi/Chat-UniVi-13B", "Chat-UniVi/Chat-UniVi-v1.5"
    video_path = "${video_path}"  # placeholder: replace with the path to your video

    # The number of visual tokens varies with the length of the video. "max_frames" is the maximum number of frames.
    # When the video is long, it is uniformly downsampled so that at most "max_frames" frames are kept.
    max_frames = 100

    # The number of frames retained per second in the video.
    video_framerate = 1

    # Input Text
    qs = "Describe the video."

    # Sampling Parameter
    conv_mode = "simple"
    temperature = 0.2
    top_p = None
    num_beams = 1

    disable_torch_init()
    model_path = os.path.expanduser(model_path)
    model_name = "ChatUniVi"
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name)

    # Register the special visual tokens the checkpoint expects.
    mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
    mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
    if mm_use_im_patch_token:
        tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
    if mm_use_im_start_end:
        tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    vision_tower = model.get_vision_tower()
    if not vision_tower.is_loaded:
        vision_tower.load_model()
    image_processor = vision_tower.image_processor

    if model.config.config["use_cluster"]:
        # nn.Module.to() converts each module in place.
        for n, m in model.named_modules():
            m = m.to(dtype=torch.bfloat16)

    # Run inference if a video path is provided
    if video_path is not None:
        video_frames, slice_len = _get_rawvideo_dec(video_path, image_processor, max_frames=max_frames, video_framerate=video_framerate)

        cur_prompt = qs
        # One image placeholder token per sampled frame.
        if model.config.mm_use_im_start_end:
            qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN * slice_len + DEFAULT_IM_END_TOKEN + '\n' + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN * slice_len + '\n' + qs

        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=video_frames.half().cuda(),
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                num_beams=num_beams,
                output_scores=True,
                return_dict_in_generate=True,
                max_new_tokens=1024,
                use_cache=True,
                stopping_criteria=[stopping_criteria])

        output_ids = output_ids.sequences
        input_token_len = input_ids.shape[1]
        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
        if n_diff_input_output > 0:
            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')

        # Decode only the newly generated tokens and strip the stop string.
        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
        outputs = outputs.strip()
        if outputs.endswith(stop_str):
            outputs = outputs[:-len(stop_str)]
        outputs = outputs.strip()
        print(outputs)
```
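
The frame-selection step inside `_get_rawvideo_dec` can be tried without a GPU or a video file. The sketch below isolates that logic for a whole clip: take one frame every `fps / video_framerate` frames, then, if too many remain, keep `max_frames` of them evenly spaced. The helper name `sample_frame_indices` is hypothetical, and the `max(1, ...)` guard on the stride is an addition that the snippet above does not have.

```python
import numpy as np


def sample_frame_indices(total_frames, fps, video_framerate=1, max_frames=100):
    """Return the frame indices kept for a clip of `total_frames` frames."""
    if total_frames <= 0:
        return []
    # Keep roughly `video_framerate` frames per second of video.
    t_stride = max(1, int(round(float(fps) / int(video_framerate))))
    all_pos = list(range(0, total_frames, t_stride))
    if len(all_pos) > max_frames:
        # Too many candidates: pick `max_frames` of them, evenly spaced.
        picks = np.linspace(0, len(all_pos) - 1, num=max_frames, dtype=int)
        return [all_pos[i] for i in picks]
    return all_pos


short_clip = sample_frame_indices(total_frames=300, fps=30.0)    # 10 s at 30 fps
long_clip = sample_frame_indices(total_frames=18000, fps=30.0)   # 10 min at 30 fps
print(len(short_clip), len(long_clip))  # 10 100
```

Short clips therefore produce fewer frames (and fewer visual tokens) instead of being padded up to a fixed length.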

### Inference for Image Understanding

```python
import torch
import os
from ChatUniVi.constants import *
from ChatUniVi.conversation import conv_templates, SeparatorStyle
from ChatUniVi.model.builder import load_pretrained_model
from ChatUniVi.utils import disable_torch_init
from ChatUniVi.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from PIL import Image


if __name__ == '__main__':
    # Model Parameter
    model_path = "Chat-UniVi/Chat-UniVi"  # or "Chat-UniVi/Chat-UniVi-13B", "Chat-UniVi/Chat-UniVi-v1.5"
    image_path = "${image_path}"  # placeholder: replace with the path to your image

    # Input Text
    qs = "Describe the image."

    # Sampling Parameter
    conv_mode = "simple"
    temperature = 0.2
    top_p = None
    num_beams = 1

    disable_torch_init()
    model_path = os.path.expanduser(model_path)
    model_name = "ChatUniVi"
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name)

    # Register the special visual tokens the checkpoint expects.
    mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
    mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
    if mm_use_im_patch_token:
        tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
    if mm_use_im_start_end:
        tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
    model.resize_token_embeddings(len(tokenizer))

    vision_tower = model.get_vision_tower()
    if not vision_tower.is_loaded:
        vision_tower.load_model()
    image_processor = vision_tower.image_processor

    # Run inference if an image path is provided
    if image_path is not None:
        cur_prompt = qs
        # A single image uses one image placeholder token.
        if model.config.mm_use_im_start_end:
            qs = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + '\n' + qs

        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()

        input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

        image = Image.open(image_path)
        image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

        with torch.inference_mode():
            output_ids = model.generate(
                input_ids,
                images=image_tensor.unsqueeze(0).half().cuda(),
                do_sample=True,
                temperature=temperature,
                top_p=top_p,
                num_beams=num_beams,
                max_new_tokens=1024,
                use_cache=True,
                stopping_criteria=[stopping_criteria])

        input_token_len = input_ids.shape[1]
        n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
        if n_diff_input_output > 0:
            print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')

        # Decode only the newly generated tokens and strip the stop string.
        outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
        outputs = outputs.strip()
        if outputs.endswith(stop_str):
            outputs = outputs[:-len(stop_str)]
        outputs = outputs.strip()
        print(outputs)
```
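
Both snippets build the prompt the same way: one image placeholder per visual input (one for an image, `slice_len` for a video), optionally wrapped in start and end markers, followed by the question. The sketch below reproduces that string construction on its own. The token strings are stand-ins chosen for illustration (the real values come from `ChatUniVi.constants`), and `build_visual_prompt` is a hypothetical helper.

```python
# Stand-in values for illustration; the real constants are imported from ChatUniVi.constants.
IMAGE_TOKEN = "<image>"
IM_START_TOKEN = "<im_start>"
IM_END_TOKEN = "<im_end>"


def build_visual_prompt(question, num_visual_inputs=1, use_start_end=False):
    """Prefix `question` with one placeholder per image or sampled video frame."""
    placeholders = IMAGE_TOKEN * num_visual_inputs
    if use_start_end:
        placeholders = IM_START_TOKEN + placeholders + IM_END_TOKEN
    return placeholders + "\n" + question


print(repr(build_visual_prompt("Describe the image.")))
# '<image>\nDescribe the image.'
print(repr(build_visual_prompt("Describe the video.", num_visual_inputs=3, use_start_end=True)))
# '<im_start><image><image><image><im_end>\nDescribe the video.'
```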

## 🗝️ Training & Validating

* Data preparation instructions are in [DATA.md](DATA.md).

* Training and validation instructions are in [TRAIN_AND_VALIDATE.md](TRAIN_AND_VALIDATE.md).

## 👍 Acknowledgement

* [LLaVA](https://github.com/haotian-liu/LLaVA) The codebase we built upon; an efficient large language and vision assistant.

* [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT) Thanks for contributing the evaluation code and dataset.

## 🤝 Related Projects

* [Video-LLaVA](https://github.com/PKU-YuanGroup/Video-LLaVA) This framework exhibits remarkable interactive capabilities between images and videos.

## 🔒 License

* The majority of this project is released under the Apache 2.0 license as found in the [LICENSE](https://github.com/PKU-YuanGroup/Chat-UniVi/blob/main/LICENSE) file.

* The service is a research preview intended for non-commercial use only, subject to the model [License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA, [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and [Privacy Practices](https://chrome.google.com/webstore/detail/sharegpt-share-your-chatg/daiacboceoaocpibfodeljbdfacokfjb) of ShareGPT. Please contact us if you find any potential violations.

## ✏️ Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

```
@article{jin2023chatunivi,
  title={Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding},
  author={Peng Jin and Ryuichi Takanobu and Caiwan Zhang and Xiaochun Cao and Li Yuan},
  journal={arXiv preprint arXiv:2311.08046},
  year={2023}
}
```

## ✨ Contributors