Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tk-21st/devan

[ACL24] DeVAn: Dense Video Annotation for Video-Language Models
https://github.com/tk-21st/devan

Last synced: 2 months ago
JSON representation

[ACL24] DeVAn: Dense Video Annotation for Video-Language Models

Host: GitHub
URL: https://github.com/tk-21st/devan
Owner: TK-21st
Created: 2024-02-14T22:20:45.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-10-21T18:35:18.000Z (3 months ago)
Last Synced: 2024-10-22T10:26:48.800Z (3 months ago)
Language: Python
Homepage: https://www.tingkai-liu.org/DeVAn/
Size: 52.4 MB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # DeVAn: Dense Video Annotation for Video-Language Models

This repository contains code and data related to submission to ACL ARR 2024 - _DeVAn: Dense Video Annotation for Video-Language Models_.

For more details on the dataset and example videos and annotations, refer to our [website](https://tk-21st.github.io/DeVAn/).

## Data

Please see ~`data/devan_acl24_release_v1.jsonl.gz`~ `data/devan_acl24_release_v1.1.jsonl.gz` for V1 of release data. 

V1 release contains 8K videos, their corresponding youtube id, start/end timestamp, ground truth captions/summaries and predicted caption/summaries from models evaluated.

Refer to the data [README](https://github.com/TK-21st/DeVAn/blob/main/data/README.md) for instructions of downloading and post-processing videos from Youtube.

## Evaluation

### Inference

To reproduce inference results reported in the manuscript, please first ensure that you have downloaded the video data following steps listed above.

1. Video-ChatGPT: Please follow descriptions in [official repo](https://github.com/mbzuai-oryx/Video-ChatGPT), then modify `inference/videochatgpt/infer.py` based on your input/output/ckpt paths.

2. ImageBind-LLM: Please follow descriptions in [official repo](https://github.com/OpenGVLab/LLaMA-Adapter/tree/main/imagebind_LLM), then modify `inference/infer_imagebind_llm.py` based on your input/output/ckpt paths.

3. Video-LLaMA: Please follow descriptions in [official repo](https://github.com/DAMO-NLP-SG/Video-LLaMA), then modify `inference/infer_videollama.py` based on your input/output/ckpt paths.

4. VideoCoCa: Code and ckpt for our finetuned VideoCoCa model will be released upon publication.

### Evaluation

To reproduce the evaluation results reported in the manuscript, please first ensure that you have completed the inference step above.

For generation tasks, refer to `evaluation/ngram_metrics.py` and `evaluation/model_based_metrics.py` to compute N-gram based and model-based metrics for model inference results.

## Leaderboard

We evaluate a range of different models on both video-to-text generation and text-to-video retrieval tasks.

### Caption Tasks

|           Model           | Audio | BLEU-4 | ROUGE-L | CIDEr | BLEURT | R@1 | R@5 | R@10 |

|:-------------------------:|:-----:|:------:|:-------:|:-----:|:------:|:---:|:---:|:----:|

|        Human (Avg)        |  Raw  |   6.3  |   32.1  |  53.9 |  50.5  |  -  |  -  |   -  |

|        Human (Min)        |  Raw  |   4.5  |   29.5  |  47.1 |  48.6  |  -  |  -  |   -  |

|       ImageBind-LLM       |  N/A  |   0.3  |   20.0  |  2.1  |  34.0  |  -  |  -  |   -  |

| Video-LLaMA2-Instruct 13B |  N/A  |   0.1  |   7.9   |  0.0  |  47.2  |  -  |  -  |   -  |

| Video-LLaMA2-Instruct 13B |  Raw  |   0.1  |   7.9   |  0.0  |  47.1  |  -  |  -  |   -  |

|  Video-LLaMA2-Instruct 7B |  N/A  |   0.1  |   10.8  |  0.0  |  43.6  |  -  |  -  |   -  |

|  Video-LLaMA2-Instruct 7B |  Raw  |   0.1  |   10.8  |  0.0  |  43.6  |  -  |  -  |   -  |

|        VideoChatGPT       |  N/A  |   0.4  |   19.9  |  2.0  |  40.5  |  -  |  -  |   -  |

|         VideoCoCa         |  N/A  |   0.2  |   13.2  |  2.3  |  17.6  | 32% | 50% |  58% |

|         VideoCoCa         |  ASR  |   0.8  |   20.3  |  9.2  |  21.9  | 36% | 53% |  59% |

### Summary Tasks

|           Model           | Audio | BLEU-4 | ROUGE-L | CIDEr | BLEURT | R@1 | R@5 | R@10 |

|:-------------------------:|:-----:|:------:|:-------:|:-----:|:------:|:---:|:---:|:----:|

|        Human (Avg)        |  Raw  |  15.7  |   34.5  |  36.9 |  55.6  |  -  |  -  |   -  |

|        Human (Min)        |  Raw  |  12.4  |   32.1  |  30.9 |  53.6  |  -  |  -  |   -  |

|       ImageBind-LLM       |  N/A  |   1.5  |   22.7  |  1.1  |  45.8  |  -  |  -  |   -  |

| Video-LLaMA2-Instruct 13B |  N/A  |   0.5  |   18.2  |  0.0  |  39.9  |  -  |  -  |   -  |

| Video-LLaMA2-Instruct 13B |  Raw  |   0.5  |   18.2  |  0.0  |  40.0  |  -  |  -  |   -  |

|  Video-LLaMA2-Instruct 7B |  N/A  |   0.5  |   19.1  |  0.0  |  43.9  |  -  |  -  |   -  |

|  Video-LLaMA2-Instruct 7B |  Raw  |   0.5  |   19.1  |  0.1  |  43.9  |  -  |  -  |   -  |

|        VideoChatGPT       |  N/A  |   2.9  |   24.4  |  5.8  |  46.7  |  -  |  -  |   -  |

|         VideoCoCa         |  N/A  |   0.9  |   16.4  |  3.3  |  23.9  | 25% | 41% |  48% |

|         VideoCoCa         |  ASR  |   2.0  |   21.6  |  5.5  |  22.9  | 27% | 42% |  48% |