https://github.com/xaxm007/video-captioning-transformer
For understanding the workings of a transformer.
- Host: GitHub
- URL: https://github.com/xaxm007/video-captioning-transformer
- Owner: xaxm007
- Created: 2024-06-10T09:49:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-29T16:07:38.000Z (over 1 year ago)
- Last Synced: 2025-02-09T13:44:31.360Z (8 months ago)
- Topics: deep-learning, note, progress, transformer, video-captioning, video-captioning-transformer
- Size: 11.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# Video-Captioning-Transformer
For understanding transformers (this is just a note to myself for implementing the original project).

# Video Captioning Transformer Project
This project aims to generate captions for videos using a Transformer model. The project integrates multiple repositories, datasets, and pre-trained models to create a comprehensive video captioning solution. Below is a detailed guide on setting up and using the project.
## Table of Contents
1. [Repositories](#repositories)
2. [Datasets](#datasets)
3. [Pre-trained Models](#pre-trained-models)
4. [Dependencies](#dependencies)
5. [Setup Instructions](#setup-instructions)
6. [Usage](#usage)
7. [Notes](#notes)

## Repositories
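The repositories below are used as separate checkouts. A minimal fetch sketch (the side-by-side layout is an assumption; any directory structure works):

```sh
# Hedged sketch: clone the two main repositories side by side.
git clone https://github.com/Kamino666/Video-Captioning-Transformer.git
git clone https://github.com/Kamino666/video_features.git
```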
### Main Repositories
- **Video-Captioning-Transformer**
- Repository: [Video-Captioning-Transformer](https://github.com/Kamino666/Video-Captioning-Transformer/tree/master)
- Description: Transformer model for video captioning.
- **Video-Features**
- Repository: [Video-Features](https://github.com/Kamino666/video_features/tree/master)
- Description: Repository for extracting video features.

## Datasets
- **Dataloader**
- Repository: [MSVD Dataloader](https://github.com/albanie/collaborative-experts/blob/master/misc/datasets/msvd/README.md)
- Description: Dataloader for the MSVD dataset.
- **Baidu Dataset**
- Link: [Baidu MSRVTT and MSVD Dataset](https://pan.baidu.com/s/1xG5F856VNEjNXD6JcG_4NA?pwd=aupi#list/path=%2Fsharelink3411495947-318895376070041%2FMSRVTT%20and%20MSVD&parentPath=%2Fsharelink3411495947-318895376070041)
- Password: `aupi`
- Description: MSRVTT and MSVD datasets available for download.

## Pre-trained Models
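The models in this section ship as PyTorch checkpoints. After downloading one, a quick sanity check that it loads and what it contains (a hedged sketch: the file name matches the CLIP4Clip link below, and the checkpoint is assumed to be a plain state dict):

```sh
# Hedged sketch: load a downloaded checkpoint on CPU and list its top-level keys.
python -c "import torch; ckpt = torch.load('clip4clip_msrvtt.pth', map_location='cpu'); print(list(ckpt)[:10])"
```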
- **CLIP4Clip Model**
- Model File: [clip4clip_msrvtt.pth](https://drive.google.com/file/d/1-aA6Zc-cK38TjC0JPfbttE009Bh3BtG_/view)
- Paper: [CLIP4Clip Paper](https://arxiv.org/pdf/2104.08860)
- Repository: [CLIP4Clip Repo](https://github.com/ArrowLuo/CLIP4Clip?tab=readme-ov-file)
- **I3D Model**
- Repository: [I3D Model](https://github.com/hassony2/kinetics_i3d_pytorch)
- Description: Pre-trained I3D model for extracting video features.

## Dependencies
- **mmcv**
- Installation Guide: [mmcv Installation](https://mmcv.readthedocs.io/en/latest/get_started/installation.html)
- Note: Follow the instructions carefully to avoid build errors; an example install command is shown in the Setup Instructions below.

## Setup Instructions
### 1. Create Conda Environment
```sh
conda create -n video_captioning python=3.8
conda activate video_captioning
```

To ensure the project runs smoothly, follow these additional steps:
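First, install the core dependencies into the new environment. This is a hedged sketch: the package set and the `mim`-based mmcv install are assumptions, so defer to the mmcv installation guide linked above for the command matching your CUDA and PyTorch versions.

```sh
# Hedged sketch: core packages; pin versions to match your CUDA setup.
pip install torch torchvision   # PyTorch backbone
pip install -U openmim          # OpenMMLab package manager
mim install mmcv                # selects an mmcv build compatible with the installed torch
```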
### Setting Up Data Loaders
1. Navigate to the `Video-Captioning-Transformer` repository.
2. Configure the data loader to use the MSVD dataset:
- Edit the configuration file to set the path to your MSVD dataset.
- Example:
```yaml
dataset:
  name: MSVD
  path: /path/to/your/MSVD/dataset
```

3. Configure the data loader to use the MSRVTT dataset:
- Edit the configuration file to set the path to your MSRVTT dataset.
- Example:
```yaml
dataset:
  name: MSRVTT
  path: /path/to/your/MSRVTT/dataset
```

### Training the Model
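Before launching training, it is worth verifying that the dataset paths set in the configs above actually exist (a minimal sketch using the placeholder paths from the examples; substitute your real paths):

```sh
# Hedged sketch: fail fast if a configured dataset directory is missing.
for d in /path/to/your/MSVD/dataset /path/to/your/MSRVTT/dataset; do
  [ -d "$d" ] || echo "missing: $d"
done
```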
1. Ensure you are in the `Video-Captioning-Transformer` directory.
2. Run the training script with the appropriate configuration:
```sh
python train.py --config configs/train_config.yaml
```
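If training fails to start, a quick way to confirm the config file at least parses as valid YAML (a hedged sketch; assumes PyYAML is available in the environment):

```sh
# Hedged sketch: parse the training config and print it; a traceback indicates a YAML error.
python -c "import yaml, pprint; pprint.pprint(yaml.safe_load(open('configs/train_config.yaml')))"
```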
### Additional Transformer Repositories

In addition to the main repositories, the project also integrates the following repositories for enhanced transformer capabilities:
- **BMT (Bi-modal Transformer)**
- Repository: [BMT](https://github.com/v-iashin/BMT)
- Description: Bi-modal transformer for dense video captioning.
- **MDVC (Multi-modal Dense Video Captioning)**
- Repository: [MDVC](https://github.com/v-iashin/MDVC)
- Description: Multi-modal dense video captioning model.

These repositories offer additional transformer architectures and functionality, further enhancing the capabilities of the video captioning transformer model.