{"id":28447854,"url":"https://github.com/opengvlab/perception_test_iccv2023","last_synced_at":"2025-06-30T15:32:10.720Z","repository":{"id":199419865,"uuid":"702331229","full_name":"OpenGVLab/perception_test_iccv2023","owner":"OpenGVLab","description":"Champion Solutions repository for Perception Test challenges in ICCV2023 workshop.","archived":false,"fork":false,"pushed_at":"2023-10-18T11:52:44.000Z","size":17725,"stargazers_count":13,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-06T12:07:22.177Z","etag":null,"topics":["audio-visual","deep-learning","iccv2023"],"latest_commit_sha":null,"homepage":"https://ptchallenge-workshop.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenGVLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-09T05:44:20.000Z","updated_at":"2024-08-20T09:47:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"7718115e-64d7-4eb4-abef-47cead83d086","html_url":"https://github.com/OpenGVLab/perception_test_iccv2023","commit_stats":null,"previous_names":["opengvlab/perception_test_iccv2023"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/OpenGVLab/perception_test_iccv2023","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2Fperception_test_iccv2023","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2Fperception_test_iccv2023/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2Fperception_test_iccv2023/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2Fperception_test_iccv2023/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OpenGVLab","download_url":"https://codeload.github.com/OpenGVLab/perception_test_iccv2023/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenGVLab%2Fperception_test_iccv2023/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262800690,"owners_count":23366414,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-visual","deep-learning","iccv2023"],"created_at":"2025-06-06T12:07:22.006Z","updated_at":"2025-06-30T15:32:10.710Z","avatar_url":"https://github.com/OpenGVLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# perception_test_iccv2023\nChampion Solutions repository for Perception Test challenges in ICCV2023 workshop.\n\n## Introduction  \n\nWe achieved the best performance in the Temporal Sound Localisation task and were runner-up in the Temporal Action Localisation task. In this repository, we provide the pretrained video and audio features, checkpoints, and code for feature extraction, training, and inference.\n\n## Get Started  \n\nPlease refer to INSTALL.md to install the prerequisite packages.  \n\n## Feature Extraction  \n\n### TAL  \n\nFor the video features, we use the UMT large model pre-trained on Something Something-V2 and the VideoMAE model pre-trained on the Ego4D-Verb dataset. The Ego4D weights can be found [here](https://github.com/OpenGVLab/ego4d-eccv2022-solutions). 
These two features are concatenated before being fed into the ActionFormer model during both training and inference.\n\nFor the audio features, we use the BEATs model as the feature extractor and adopt its iter3+ checkpoint pre-trained on the AudioSet-2M dataset. We provide scripts to extract BEATs and CAV-MAE features (the latter are not used in our solution); run `python audio_feat_extract.py` to extract audio features.\n\n### TSL  \n\nFor the video features, we use the [UMT large model](https://github.com/OpenGVLab/unmasked_teacher) pre-trained on Something Something-V2 and fine-tuned on the Perception Test temporal action localisation training set.\n\nFor the audio features, we use the BEATs model as the feature extractor and adopt its iter3+ checkpoint pre-trained on the AudioSet-2M dataset. We provide scripts to extract BEATs and CAV-MAE features (the latter are not used in our solution); run `python audio_feat_extract.py` to extract audio features.\n\n### Download  \n\n| Features | Modality | Task | Download Link |\n|---|---|---|---|\n| BEATs_iter3+ | Audio | TAL\\\u0026TSL | [Download](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/pt_tsl_beats_iter3_feature.zip) |\n| Ego4d_verb | Video | TAL | [Download](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/pt_tal_videomae_large_ego4d_verb_feature_s4.zip) |\n| UMT-L Sth Sth-V2 | Video | TAL | [Download](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/pt_tal_umt_large_sthv2_feature_s4.zip) |\n| UMT-L Sth Sth-V2 ft | Video | TSL | [Download](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/pt_tal_umt_large_sthv2_perception_test_ft1_feature_s2.zip) |\n\n\n## Temporal Sound Localisation  \n\n### Training  \n\n`cd ./tsl/` \n\n`python train.py configs/perception_tsl_multi_train.yaml`  \n\n### Inference  \n\nInference on the validation set:  \n\n`cd ./tsl/`  \n\n`python eval.py 
configs/perception_tsl_multi_valid.yaml ./ckpt/XXX -epoch=XX`  \n\nInference on the test set:  \n\n`cd ./tsl/` \n\n`python eval.py configs/perception_tsl_multi_test.yaml ./ckpt/XXX -epoch=XX --saveonly`  \n\n## Temporal Action Localisation  \n\n### Training  \n\n`cd ./tal/` \n\n`python train.py configs/perception_tal_multi_train.yaml`  \n\n### Inference  \n\nInference on the validation set:  \n\n`cd ./tal/` \n\n`python eval.py configs/perception_tal_multi_valid.yaml ./ckpt/XXX -epoch=XX`  \n\nInference on the test set:  \n\n`cd ./tal/` \n\n`python eval.py configs/perception_tal_multi_test.yaml ./ckpt/XXX -epoch=XX --saveonly`  \n\n## Checkpoints  \n\nWe release the checkpoints in the table below.  \n\n| Method | Task | mAP (Valid) | Download |\n|---|---|---|---|\n| BEATs + UMT | TSL | 26.70 | [ckpt](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/tsl_multi_epoch20.pth.tar) |\n| BEATs + UMT ft | TSL | 39.25 | [ckpt](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/tsl_multi_ft_epoch20.pth.tar) |\n| BEATs + UMT | TAL | 44.14 | [ckpt](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/tal_multi_umtonly.pth.tar) |\n| BEATs + UMT\\\u0026VideoMAE | TAL | 46.75 | [ckpt](https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/opengvlab/perception_test_iccv2023/tal_multi.pth.tar) |\n\n\n## Contact  \n\nIf you have any questions, please contact [Jiashuo Yu](mailto:yujiashuo[at]pjlab.org.cn) or [Guo Chen](mailto:chenguo1177[at]gmail.com).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengvlab%2Fperception_test_iccv2023","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopengvlab%2Fperception_test_iccv2023","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopengvlab%2Fperception_test_iccv2023/lists"}