{"id":13665184,"url":"https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition","last_synced_at":"2025-04-26T08:31:43.273Z","repository":{"id":166715682,"uuid":"277030962","full_name":"FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition","owner":"FingerRec","description":"[Arxiv2020] The code for our paper 《Self-Supervised Temporal-Discriminative Representation Learning for Video Action Recognition》 https://arxiv.org/abs/2008.02129","archived":false,"fork":false,"pushed_at":"2020-09-19T03:21:00.000Z","size":7134,"stargazers_count":77,"open_issues_count":0,"forks_count":5,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-08-03T06:01:45.162Z","etag":null,"topics":["representation-learning","self-supervised-learning","unsupervised","video-action-recognition"],"latest_commit_sha":null,"homepage":"https://zhuanlan.zhihu.com/p/176774543","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FingerRec.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-07-04T03:26:25.000Z","updated_at":"2024-04-23T19:19:11.000Z","dependencies_parsed_at":"2023-07-28T15:16:04.184Z","dependency_job_id":null,"html_url":"https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition","commit_stats":null,"previous_names":["fingerrec/self-supervised-temporal-discriminative-representation-learning-for-video-action-recognition"],"tags_count":0,"template":false,"template_full_name":null,"repositor
y_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FingerRec","download_url":"https://codeload.github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224031926,"owners_count":17244361,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["representation-learning","self-supervised-learning","unsupervised","video-action-recognition"],"created_at":"2024-08-02T06:00:25.893Z","updated_at":"2024-11-11T00:30:31.051Z","avatar_url":"https://github.com/FingerRec.png","language":"Python","funding_links":[],"categories":["2020"],"sub_categories":["Arxiv (with code or interesting)"],"readme":"# Self-Supervised Temporal-Discriminative Representation Learning\n\nThe source code for our 
paper \n\n\"Self-Supervised Temporal-Discriminative Representation\nLearning for Video Action Recognition\" [paper](https://arxiv.org/abs/2008.02129)\n\n## Overview\n\n**Without a single label available, our method learns to focus strongly on motion regions!**\n\n![example](https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/raw/master/resources/example.gif)\n\n\u003e Our self-supervised VTDL significantly outperforms existing\nself-supervised learning methods in video action recognition, and even achieves better results than fully-supervised methods on UCF101 and HMDB51\nwhen a small-scale video dataset (with only thousands of videos) is\nused for pre-training!\n\n![sample_acc.png](https://i.loli.net/2020/07/04/WENzTnSv8f6cLyR.png)\n\n### Requirements\n- Python3\n- PyTorch 1.1+\n- PIL\n\n## Structure\n- datasets\n    - list\n        - hmdb51: the train/val lists of HMDB51\n        - ucf101: the train/val lists of UCF101\n        - kinetics-400: the train/val lists of Kinetics-400\n- experiments\n    - logs: detailed experiment records\n    - TemporalDis\n        - hmdb51\n        - ucf101\n        - kinetics\n    - gradientes: \n    - visualization\n- src\n    - data: data loading\n    - loss: the losses evaluated in this paper\n    - model: network architectures\n    - scripts: train/eval scripts\n    - TC: detailed implementation of spatio-temporal consistency\n    - utils\n    - feature_extract.py\n    - main.py\n    - trainer.py\n    - option.py\n## Dataset\n\nSee [dataset.md](https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/blob/master/dataset.md). 
Prepare each dataset as a txt file list. The splits of hmdb51/ucf101/kinetics-400 can be downloaded from \n[Google Drive](https://drive.google.com/drive/folders/1gOOOlBJtH1p2weC-k7TD3HcC2HSHBryw?usp=sharing).\n\nEach row of the txt file includes\n\u003e video_path class frames_num\n\n## VTDL\n### Network Architecture\nThe networks are defined in **src/model/[backbone].py**\n\n|  Method   | #logits_channel  |\n|  ----  | ----  | \n| C3D  | 512 |\n| R2P1D  | 2048 |\n| I3D | 1024 |\n| R3D  | 2048 |\n\n\n### Step1: self-supervised learning\n#### HMDB51\n```bash\nbash scripts/TemporalDisc/hmdb51.sh\n```\n#### UCF101\n```bash\nbash scripts/TemporalDisc/ucf101.sh\n```\n#### Kinetics-400\n```bash\nbash scripts/TemporalDisc/kinetics.sh\n```\n\n**Note: More training options and ablation studies can be found in scripts**\n\n### Step2: Transfer to action recognition\n#### HMDB51\n```bash\n#!/usr/bin/env bash\npython main.py \\\n--method ft \\\n--train_list ../datasets/lists/hmdb51/hmdb51_rgb_train_split_1.txt \\\n--val_list ../datasets/lists/hmdb51/hmdb51_rgb_val_split_1.txt \\\n--dataset hmdb51 \\\n--arch i3d \\\n--mode rgb \\\n--lr 0.001 \\\n--lr_steps 10 20 25 30 35 40 \\\n--epochs 45 \\\n--batch_size 4 \\\n--data_length 64 \\\n--workers 8 \\\n--dropout 0.5 \\\n--gpus 2 \\\n--logs_path ../experiments/logs/hmdb51_i3d_ft \\\n--print-freq 100 \\\n--weights ../experiments/TemporalDis/hmdb51/models/04-16-2328_aug_CJ/ckpt_epoch_48.pth\n```\n#### UCF101\n```bash\n#!/usr/bin/env bash\npython main.py \\\n--method ft \\\n--train_list ../datasets/lists/ucf101/ucf101_rgb_train_split_1.txt \\\n--val_list ../datasets/lists/ucf101/ucf101_rgb_val_split_1.txt \\\n--dataset ucf101 \\\n--arch i3d \\\n--mode rgb \\\n--lr 0.0005 \\\n--lr_steps 10 20 25 30 35 40 \\\n--epochs 45 \\\n--batch_size 4 \\\n--data_length 64 \\\n--workers 8 \\\n--dropout 0.5 \\\n--gpus 2 \\\n--logs_path ../experiments/logs/ucf101_i3d_ft \\\n--print-freq 100 \\\n--weights 
../experiments/TemporalDis/ucf101/models/04-18-2208_aug_CJ/ckpt_epoch_45.pth\n```\n**Note: More training options and ablation studies can be found in scripts**\n\n## Results\n### Step2: Transfer\n\nWith the same experimental settings, the results are reported below:\n\n|  Method   | UCF101  | HMDB51 |\n|  ----  | ----  | ---- |\n| Baseline  | 60.3 | 22.6 |\n| + BA | 63.3 | 26.2 |\n| + Temporal Discriminative  | 72.7 | 41.2 |\n| + TCA | 82.3 | 52.9 |\n\n#### Trained models/logs/performance\n\nWe provide trained models, logs, and performance records on Google Drive.\n##### Baseline + BA\n\n![BA_fine_tune_performance.png](https://i.loli.net/2020/07/04/V1YzdPhxjJARnKS.png)\n\n[performance](https://drive.google.com/file/d/13O7JpIGYxspgOsJTCKLh37ZeKR2slPgz/view?usp=sharing);\n\n[trained_model](https://drive.google.com/file/d/10J5fKKkDF58njdsLGXkknk0RWkaINJ31/view?usp=sharing); \n\n[logs](https://drive.google.com/file/d/1K1_U692NZq5F53DDPDkpw7JFiEqIqole/view?usp=sharing)\n\n##### Baseline + BA + Temporal Discriminative\n\n![wo_TCA_fine_tune_performance.png](https://i.loli.net/2020/07/04/ZYkEIRcMqxzD6AX.png)\n\n[performance](https://drive.google.com/file/d/1ZHUhFsHoyyIWTnoDB1XEG2WBJLgxTtA9/view?usp=sharing);\n\n[trained_model](https://drive.google.com/file/d/1HJQwzRwNs5nOseAnPW87iDpUMXOs78a8/view?usp=sharing);\n\n[logs](https://drive.google.com/file/d/1Q-tq9bf-J8caHxJJDMlXo4KX5cNJ_75w/view?usp=sharing)\n\n##### Baseline + BA + Temporal Discriminative + TCA\n\n**(a). Pretrain**\n\nLoss curve:\n\n![loss.png](https://i.loli.net/2020/09/18/4dMhnxJtjupE1QH.png)\n\nIns Prob:\n\n![prob.png](https://i.loli.net/2020/09/18/4QIj5PZR1pi3rXn.png)\n\n[pretrained_weight](https://drive.google.com/file/d/1g1reGkcD2xwztzwfGRPLzz0JMOP3yo7a/view?usp=sharing)\n\nThis pretrained model can achieve 52.7% on HMDB51.\n\n**(b). 
Finetune**\n\n\n![VTDL_fine_tune_performance.png](https://i.loli.net/2020/07/04/14TlPxKcvOyLguM.png)\n\n[performance](https://drive.google.com/file/d/16GL88PLLOpLoO_yWjK2XcYPKKnqRBtcr/view?usp=sharing);\n\n[trained_model](https://drive.google.com/file/d/1TdxIBrdLcgKabL5A_FusomWLe4oXDcDK/view?usp=sharing);\n\n[logs](https://drive.google.com/file/d/13vfdQusv2Gd42nYXe4p1nWEByIJ9cLoz/view?usp=sharing)\n\n\nThe results above are reported with a single video clip. At test time, we average the predictions of ten clips as the final prediction, which leads to an improvement of around 2-3%.\n```bash\npython test.py\n```\n\n\n## Feature Extractor\nAs STCR can be easily extended to other video representation tasks, we offer a script to perform feature extraction.\n```bash\npython feature_extractor.py\n```\n\n\nThe features will be saved as a single NumPy file in the format [video_nums, features_dim].\n\n\n## Citation\n\nPlease cite our paper if you find this code useful for your research.\n\n```\n@Article{wang2020self,\n  author  = {Jinpeng Wang and Yiqi Lin and Andy J. Ma and Pong C. Yuen},\n  title   = {Self-supervised Temporal Discriminative Learning for Video Representation Learning},\n  journal = {arXiv preprint arXiv:2008.02129},\n  year    = {2020},\n}\n```\n\n\n## Others\n\nThe project is partly based on [Unsupervised Embedding Learning](https://github.com/mangye16/Unsupervised_Embedding_Learning) and [MOCO](https://github.com/facebookresearch/moco).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFingerRec%2FSelf-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition/lists"}