Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://mbzuai-oryx.github.io/VideoGLaMM/
Last synced: about 2 months ago
- Host: GitHub
- URL: https://mbzuai-oryx.github.io/VideoGLaMM/
- Owner: mbzuai-oryx
- Created: 2024-10-31T12:00:44.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-07T21:12:55.000Z (3 months ago)
- Last Synced: 2024-11-07T21:32:34.560Z (3 months ago)
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- Awesome-Segment-Anything
README
# VideoGLaMM
![](https://i.imgur.com/waxVImv.png)

[Shehan Munasinghe](https://github.com/shehanmunasinghe), [Hanan Gani](https://github.com/hananshafi), [Wenqi Zhu](#), [Jiale Cao](https://jialecao001.github.io/), [Eric Xing](https://www.cs.cmu.edu/~epxing/), [Fahad Shahbaz Khan](https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en) and [Salman Khan](https://salman-h-khan.github.io/)

**Mohamed bin Zayed University of Artificial Intelligence, Tianjin University, Linköping University, Australian National University, Carnegie Mellon University**

[![Website](https://img.shields.io/badge/Project-Website-87CEEB)](https://mbzuai-oryx.github.io/VideoGLaMM/)
[![paper](https://img.shields.io/badge/arXiv-Paper-.svg)](https://arxiv.org/abs/2411.04923)

---
## Latest Updates

- Code and checkpoints will be released soon. Stay tuned!

---

## Overview
VideoGLaMM is a large multimodal video model capable of pixel-level visual grounding. The model responds to natural language queries from the user and intertwines spatio-temporal object masks in its generated textual responses to provide a detailed understanding of video content. VideoGLaMM seamlessly connects three key components: a Large Language Model (LLM), dual vision encoders, and a spatio-temporal pixel decoder. The dual vision encoders extract spatial and temporal features separately, which are jointly passed to the LLM to output responses rich in both spatial and temporal cues. This is facilitated by end-to-end training on our proposed Grounded Conversation Generation (GCG) benchmark dataset, featuring 38k video-QA triplets with 87k objects and 671k fine-grained masks.
---
## Highlights

1. We introduce the Video Grounded Large Multimodal Model (VideoGLaMM), a large video multimodal model capable of pixel-level visual grounding, featuring an end-to-end alignment mechanism.
2. To achieve fine-grained spatio-temporal alignment, we introduce a benchmark Grounded Conversation Generation (GCG) dataset consisting of 38k grounded video-QA triplets, 83k objects, and roughly 671k fine-grained spatio-temporal masks (a minimal record sketch follows this list).
3. We assess the performance of VideoGLaMM across diverse tasks spanning grounded conversation generation, visual grounding, and referring video segmentation, where it achieves state-of-the-art performance.
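As a rough illustration of what a single grounded video-QA sample in such a dataset could contain, here is a minimal sketch; the field names and types are assumptions for this example, not the released annotation format.

```python
# A minimal, hypothetical sketch of one grounded video-QA sample; field names
# are illustrative assumptions, not the actual GCG annotation schema.
from dataclasses import dataclass
from typing import List

Mask = List[List[int]]  # a binary H x W mask for a single frame


@dataclass
class GroundedObject:
    phrase: str               # noun phrase grounded in the answer, e.g. "the brown dog"
    frame_masks: List[Mask]   # one fine-grained mask per annotated frame


@dataclass
class GCGSample:
    video_id: str
    question: str                   # user query about the video
    answer: str                     # textual response containing the grounded phrases
    objects: List[GroundedObject]   # spatio-temporal masks tied to those phrases
```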
---
## Architecture
VideoGLaMM consists of the following key components: (i) a Spatio-Temporal Dual Encoder, (ii) Dual Alignment V-L Adapters for image and video features, (iii) a Large Language Model (LLM), (iv) an L-V Adapter, and (v) a Promptable Pixel Decoder.
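Since the official code is not yet released, the following is only a rough sketch of how these five components might be wired together in a forward pass; every module is a stand-in (plain linear layers and a tiny transformer), and the names and shapes are assumptions for illustration, not the actual VideoGLaMM implementation.

```python
# Hypothetical wiring of the components listed above; all modules are stand-ins.
import torch
import torch.nn as nn


class VideoGLaMMSketch(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, img_size=224):
        super().__init__()
        feat = 3 * img_size * img_size
        # (i) Spatio-Temporal Dual Encoder: stand-ins for an image and a video backbone.
        self.spatial_encoder = nn.Linear(feat, d_model)
        self.temporal_encoder = nn.Linear(feat, d_model)
        # (ii) Dual Alignment V-L Adapters: project visual features into the LLM token space.
        self.spatial_adapter = nn.Linear(d_model, d_model)
        self.temporal_adapter = nn.Linear(d_model, d_model)
        # (iii) LLM stand-in: consumes visual + text tokens, emits hidden states and text logits.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # (iv) L-V Adapter: maps LLM hidden states back toward the vision/mask space.
        self.lv_adapter = nn.Linear(d_model, d_model)
        # (v) Promptable Pixel Decoder stand-in: turns a segmentation embedding into a mask.
        self.pixel_decoder = nn.Linear(d_model, img_size * img_size)
        self.img_size = img_size

    def forward(self, frames, text_embeds):
        # frames: (B, T, 3, H, W); text_embeds: (B, L, d_model) already-embedded text tokens
        b = frames.shape[0]
        flat = frames.flatten(2)                                        # (B, T, 3*H*W)
        spatial = self.spatial_adapter(self.spatial_encoder(flat))      # per-frame tokens
        temporal = self.temporal_adapter(self.temporal_encoder(flat))   # temporal tokens
        hidden = self.llm(torch.cat([spatial, temporal, text_embeds], dim=1))
        logits = self.lm_head(hidden)                  # grounded textual response
        seg_embed = self.lv_adapter(hidden[:, -1])     # e.g. a [SEG]-style token embedding
        mask = self.pixel_decoder(seg_embed).view(b, self.img_size, self.img_size)
        return logits, mask


# Toy usage with random inputs:
model = VideoGLaMMSketch()
logits, mask = model(torch.randn(1, 8, 3, 224, 224), torch.randn(1, 16, 512))
print(logits.shape, mask.shape)   # torch.Size([1, 32, 32000]) torch.Size([1, 224, 224])
```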
---
## Benchmark and Annotation Pipeline
We propose a semi-automatic annotation pipeline for creating a grounded conversation generation (GCG) dataset for videos.
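The README does not spell out the pipeline's stages, so the following is only one plausible shape of a semi-automatic workflow (automatic captioning and mask generation followed by human verification); every step and helper below is an assumption for illustration.

```python
# Schematic of a possible semi-automatic GCG annotation workflow. The concrete
# steps, models, and helpers are assumptions; the README only states that the
# pipeline is semi-automatic, not how it is implemented.
from typing import Callable, Dict, List


def extract_noun_phrases(caption: str) -> List[str]:
    # Placeholder: a real pipeline would use an NLP parser or an LLM here.
    return [chunk.strip() for chunk in caption.split(",") if chunk.strip()]


def annotate_video(frames: List,
                   caption_model: Callable[[List], str],
                   segment_model: Callable[[List, str], List],
                   human_review: Callable[[Dict], Dict]) -> Dict:
    # 1. Automatic: caption the clip to obtain candidate object phrases.
    caption = caption_model(frames)
    phrases = extract_noun_phrases(caption)
    # 2. Automatic: prompt a segmenter/tracker with each phrase for per-frame masks.
    objects = [{"phrase": p, "masks": segment_model(frames, p)} for p in phrases]
    # 3. Manual: annotators verify and correct the masks and text ("semi-automatic").
    return human_review({"answer": caption, "objects": objects})
```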
---
## Examples

Given user queries, VideoGLaMM generates textual responses and grounds objects and phrases using pixel-level masks, showing its detailed understanding of the video.
---
## Citation
```bibtex
@article{munasinghe2024videoglamm,
  title={VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos},
  author={Shehan Munasinghe and Hanan Gani and Wenqi Zhu and Jiale Cao and Eric Xing and Fahad Khan and Salman Khan},
  journal={ArXiv},
  year={2024},
  url={https://arxiv.org/abs/2411.04923}
}
```

---
[www.ival-mbzuai.com](https://www.ival-mbzuai.com) · [github.com/mbzuai-oryx](https://github.com/mbzuai-oryx) · [mbzuai.ac.ae](https://mbzuai.ac.ae)