Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tootouch/reveca
Generic Event Boundary Captioning (GEBC) Challenge at LOVEU@CVPR 2022 - 3rd place (REVECA)
- Host: GitHub
- URL: https://github.com/tootouch/reveca
- Owner: TooTouch
- License: mit
- Created: 2022-05-10T15:58:51.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-17T07:01:47.000Z (over 1 year ago)
- Last Synced: 2024-05-20T15:49:14.103Z (6 months ago)
- Language: Python
- Homepage:
- Size: 117 MB
- Stars: 26
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Generic Event Boundary Captioning (GEBC) Challenge at CVPR 2022 LOVEU workshop [[paper](https://arxiv.org/abs/2206.09178)]
*Jaehyuk Heo, YongGi Jeong, Sunwoo Kim, Jaehee Kim, Pilsung Kang*
*School of Industrial & Management Engineering, Korea University*
*Seoul, Korea*

We propose the Rich Encoder-decoder framework for Video Event Captioner (REVECA). Our model achieved 3rd place in the [GEBC Challenge](https://codalab.lisn.upsaclay.fr/competitions/4157#results).
# Environments
1. Build a Docker image and create a container:

```bash
cd docker
bash docker_build.sh $image_name
```

2. Install the required packages:
```bash
pip install -r requirements.txt
```

# Datasets
Download Kinetics-GEBC and its annotations from [here](https://sites.google.com/view/loveucvpr22/home?authuser=0) and save the files in `./datasets` (a quick sanity-check sketch follows at the end of this section):
```
datasets/
└── annotations
    ├── testset_highest_f1.json
    ├── trainset_highest_f1.json
    └── valset_highest_f1.json
```

Our model uses two video features: segmentation masks and TSN features.
1. We use semantic segmentation masks for training the model. The segmentation model is [Mask2Former](https://github.com/facebookresearch/Mask2Former).
![](https://github.com/TooTouch/REVECA/blob/main/assets/run_with_seg.gif)
2. We use TSN features extracted with Temporal Segment Networks. The TSN features released for the GEBC Challenge can be downloaded [here](https://drive.google.com/drive/folders/1kOauKJY4MphWJhjYcXcCcdmP-071Fu6D?usp=sharing).
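Before training, it can help to sanity-check the downloaded annotation files. A minimal sketch, assuming only that the files are valid JSON and stored in the layout above (the exact schema is defined by the challenge):

```python
import json
from pathlib import Path

ANN_DIR = Path("./datasets/annotations")

for name in ["trainset_highest_f1.json", "valset_highest_f1.json", "testset_highest_f1.json"]:
    with open(ANN_DIR / name) as f:
        data = json.load(f)
    # Report size and a peek at the top-level structure,
    # without assuming anything about the annotation schema.
    if isinstance(data, dict):
        print(f"{name}: dict with {len(data)} keys, e.g. {list(data)[:3]}")
    else:
        print(f"{name}: list with {len(data)} entries")
```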
# Methods
Our video understanding model, REVECA, is based on CoCa. It combines three techniques: (1) Temporal-based Pairwise Difference (TPD), (2) frame position embedding, and (3) LoRA. We use `timm == 0.6.2.dev0` and `loralib`, and modify timm's `vision_transformer.py` to apply LoRA.
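As an illustration only (not the authors' exact patch), the sketch below shows how `loralib` can wrap the attention projections of a timm ViT; the backbone name, the rank `r=8`, and the choice of adapting only the query/value projections are assumptions:

```python
import timm
import loralib as lora

# Load a pretrained ViT from timm (the specific backbone is an assumption).
model = timm.create_model("vit_base_patch16_224", pretrained=True)

for block in model.blocks:
    old = block.attn.qkv  # nn.Linear(dim, 3 * dim) holding Q, K, V together
    # MergedLinear adds LoRA only to the query and value projections,
    # as in the original LoRA paper; r=8 is an assumed rank.
    new = lora.MergedLinear(
        old.in_features, old.out_features,
        r=8, enable_lora=[True, False, True],
        bias=old.bias is not None,
    )
    # Copy the pretrained weights into the LoRA-wrapped layer.
    new.weight.data.copy_(old.weight.data)
    if old.bias is not None:
        new.bias.data.copy_(old.bias.data)
    block.attn.qkv = new

# Freeze everything except the LoRA parameters before fine-tuning.
lora.mark_only_lora_as_trainable(model)
```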
# Results
Method | Avg. | CIDEr | SPICE | ROUGE-L
---|---|---|---|---
CNN+LSTM | 29.94 | 49.73 | 13.62 | 26.46
Robust Change Captioning | 34.16 | 58.56 | 16.34 | 27.57
UniVL-revised | 36.64 | 65.74 | 18.06 | 26.12
ActBERT-revised | 40.80 | 74.71 | 19.52 | 28.15
**REVECA (our model)** | **50.97** | **93.91** | **24.66** | **34.34**

# Saved Model
Our final model weights can be downloaded [here](https://drive.google.com/file/d/1sQZXg5-L6i5l6brCyu5HCsaoRvlVSiuO/view?usp=sharing).
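A minimal loading sketch, assuming the download is a standard PyTorch checkpoint (the filename and the checkpoint layout are assumptions; adjust to whatever the file actually contains):

```python
import torch

# Load onto CPU first; move to GPU later if available.
checkpoint = torch.load("reveca_final.pth", map_location="cpu")

# The file may be a raw state dict or a dict wrapping one under "state_dict".
state_dict = checkpoint.get("state_dict", checkpoint)
print(f"{len(state_dict)} tensors in checkpoint")
# model.load_state_dict(state_dict)  # with the REVECA model constructed first
```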
# Citation