https://github.com/xaxm007/video-captioning-transformer

For understanding how the transformer works.

# Video-Captioning-Transformer
Notes for understanding the Transformer (a personal note to help me re-implement the original project).

# Video Captioning Transformer Project

This project generates captions for videos using a Transformer model. It combines several repositories, datasets, and pre-trained models into a single video captioning pipeline. The guide below covers setup and usage.

## Table of Contents

1. [Repositories](#repositories)
2. [Datasets](#datasets)
3. [Pre-trained Models](#pre-trained-models)
4. [Dependencies](#dependencies)
5. [Setup Instructions](#setup-instructions)
6. [Usage](#usage)
7. [Notes](#notes)

## Repositories

### Main Repositories

- **Video-Captioning-Transformer**
  - Repository: [Video-Captioning-Transformer](https://github.com/Kamino666/Video-Captioning-Transformer/tree/master)
  - Description: Transformer model for video captioning.

- **Video-Features**
  - Repository: [Video-Features](https://github.com/Kamino666/video_features/tree/master)
  - Description: Repository for extracting video features.

## Datasets

- **Dataloader**
  - Repository: [MSVD Dataloader](https://github.com/albanie/collaborative-experts/blob/master/misc/datasets/msvd/README.md)
  - Description: Dataloader for the MSVD dataset.

- **Baidu Dataset**
  - Link: [Baidu MSRVTT and MSVD Dataset](https://pan.baidu.com/s/1xG5F856VNEjNXD6JcG_4NA?pwd=aupi#list/path=%2Fsharelink3411495947-318895376070041%2FMSRVTT%20and%20MSVD&parentPath=%2Fsharelink3411495947-318895376070041)
  - Password: `aupi`
  - Description: MSRVTT and MSVD datasets available for download.

## Pre-trained Models

- **CLIP4Clip Model**
  - Model File: [clip4clip_msrvtt.pth](https://drive.google.com/file/d/1-aA6Zc-cK38TjC0JPfbttE009Bh3BtG_/view)
  - Paper: [CLIP4Clip Paper](https://arxiv.org/pdf/2104.08860)
  - Repository: [CLIP4Clip Repo](https://github.com/ArrowLuo/CLIP4Clip?tab=readme-ov-file)

- **I3D Model**
  - Repository: [I3D Model](https://github.com/hassony2/kinetics_i3d_pytorch)
  - Description: Pre-trained I3D model for extracting video features.

## Dependencies

- **mmcv**
  - Installation Guide: [mmcv Installation](https://mmcv.readthedocs.io/en/latest/get_started/installation.html)
  - Note: Follow the instructions carefully to avoid errors.
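
After installing, it can help to confirm that the packages are visible to the active interpreter before starting a long run. The sketch below uses only the standard library. The package list (`mmcv`, `torch`) and the helper name `missing_packages` are assumptions for illustration, not taken from the original project's requirements.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that the import system cannot locate."""
    # find_spec looks up a top-level package without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package list; adjust it to match the project's actual requirements.
REQUIRED = ["mmcv", "torch"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Run it inside the activated conda environment so that it checks the same interpreter that will be used for training.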

## Setup Instructions

### 1. Create Conda Environment

```sh
conda create -n video_captioning python=3.8
conda activate video_captioning
```

To ensure the project runs smoothly, follow these additional steps:

### Setting Up Data Loaders

1. Navigate to the `Video-Captioning-Transformer` repository.

2. Configure the data loader to use the MSVD dataset:
   - Edit the configuration file to set the path to your MSVD dataset.
   - Example:
     ```yaml
     dataset:
       name: MSVD
       path: /path/to/your/MSVD/dataset
     ```

3. Configure the data loader to use the MSRVTT dataset:
   - Edit the configuration file to set the path to your MSRVTT dataset.
   - Example:
     ```yaml
     dataset:
       name: MSRVTT
       path: /path/to/your/MSRVTT/dataset
     ```
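
A wrong dataset path is easier to catch before training than halfway through it. The following is a minimal sketch that checks a parsed configuration, assuming it is a dict with a `dataset` section holding `name` and `path` as in the examples above. The function name `validate_dataset_config` is hypothetical, and the real project's configuration schema may differ.

```python
from pathlib import Path

# Dataset names mentioned in this note; treated here as the only valid values.
KNOWN_DATASETS = {"MSVD", "MSRVTT"}


def validate_dataset_config(config):
    """Return a list of problems found in a parsed config (empty if none)."""
    if not isinstance(config, dict) or not isinstance(config.get("dataset"), dict):
        return ["missing 'dataset' section"]

    dataset = config["dataset"]
    problems = []

    name = dataset.get("name")
    if name not in KNOWN_DATASETS:
        problems.append(f"unknown dataset name: {name!r}")

    path = dataset.get("path")
    if not path:
        problems.append("missing dataset path")
    elif not Path(path).is_dir():
        problems.append(f"dataset path is not a directory: {path}")

    return problems
```

The dict would typically come from loading the YAML file, for example with PyYAML's `yaml.safe_load`. An empty list means the checks passed.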

### Training the Model

1. Ensure you are in the `Video-Captioning-Transformer` directory.

2. Run the training script with the appropriate configuration:
```sh
python train.py --config configs/train_config.yaml
```
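
When launching training from another Python script, the same command can be assembled programmatically. This is a sketch only: `build_train_command` is a hypothetical helper, and it assumes `train.py` accepts `--config` exactly as in the command above.

```python
import sys
from pathlib import Path


def build_train_command(config_path, script="train.py"):
    """Build the argument list for the training command shown above."""
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError(f"config file not found: {config}")
    # Use the current interpreter so the active conda environment is respected.
    return [sys.executable, script, "--config", str(config)]


# Example usage (run from the Video-Captioning-Transformer directory):
#   import subprocess
#   subprocess.run(build_train_command("configs/train_config.yaml"), check=True)
```

Checking that the config file exists first gives a clear error before any model or data loading starts.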

### Additional Transformer Repositories

In addition to the main repositories, the project also integrates the following repositories for enhanced transformer capabilities:

- **BMT (Bi-modal Transformer)**
  - Repository: [BMT](https://github.com/v-iashin/BMT)
  - Description: Bi-modal Transformer for dense video captioning using audio and visual cues.

- **MDVC (Multi-modal Dense Video Captioning)**
  - Repository: [MDVC](https://github.com/v-iashin/MDVC)
  - Description: Repository for multi-modal dense video captioning.

These repositories provide additional transformer architectures for video captioning that complement the main model.