{"id":24297841,"url":"https://github.com/ttgeng233/unav","last_synced_at":"2025-09-26T00:30:56.086Z","repository":{"id":164447784,"uuid":"624166915","full_name":"ttgeng233/UnAV","owner":"ttgeng233","description":"Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline (CVPR 2023)","archived":false,"fork":false,"pushed_at":"2024-02-12T07:20:59.000Z","size":20838,"stargazers_count":58,"open_issues_count":6,"forks_count":4,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-11-16T04:10:54.887Z","etag":null,"topics":["audio-visual-events","audio-visual-learning","multi-modal-learning"],"latest_commit_sha":null,"homepage":"https://unav100.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ttgeng233.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-05T22:17:26.000Z","updated_at":"2024-11-07T03:19:06.000Z","dependencies_parsed_at":"2024-11-16T04:10:59.790Z","dependency_job_id":"88bf1bc0-3660-45aa-8e71-4f8e6681e740","html_url":"https://github.com/ttgeng233/UnAV","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttgeng233%2FUnAV","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttgeng233%2FUnAV/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttgeng233%2FUnAV/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ttgeng233%2FUnAV/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ttgeng233","download_url":"https://codeload.github.com/ttgeng233/UnAV/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234268873,"owners_count":18805646,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-visual-events","audio-visual-learning","multi-modal-learning"],"created_at":"2025-01-16T20:35:43.835Z","updated_at":"2025-09-26T00:30:50.681Z","avatar_url":"https://github.com/ttgeng233.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline\n\n[[Project page]](https://unav100.github.io/) [[ArXiv]](https://arxiv.org/abs/2303.12930v2)  [[Dataset(Google drive)]](https://drive.google.com/drive/folders/1X4eoCPPtqi0_IKd2JGOv_q383xd27Ybb) [[Dataset(Baidu drive)]](https://pan.baidu.com/share/init?surl=en_Ni7X8-zKwqbUG7ITXig\u0026pwd=6c48) [[Benchmark]](https://paperswithcode.com/sota/audio-visual-event-localization-on-unav-100)\n\nThis repository contains the code for the CVPR 2023 paper \"[Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline](https://openaccess.thecvf.com/content/CVPR2023/html/Geng_Dense-Localizing_Audio-Visual_Events_in_Untrimmed_Videos_A_Large-Scale_Benchmark_and_CVPR_2023_paper.html)\". The paper introduces the first Untrimmed Audio-Visual (UnAV-100) dataset and proposes to solve the audio-visual event localization problem in more realistic and challenging scenarios. 
\n\n\n## Requirements\nThe implementation is based on PyTorch. Follow [INSTALL.md](INSTALL.md) to install the required dependencies.\n\n## Data preparation\nThe proposed UnAV-100 dataset can be downloaded from the [[Project Page]](https://unav100.github.io/), including YouTube links of raw videos, annotations and extracted features. \n\nIf you want to use your own choice of video features, you can download the raw videos from this [link](https://pan.baidu.com/s/1N2bNc288vK9PDpHkrPBx2A) (Baidu Drive, pwd: qslx). A download script for the raw videos is also provided at `scripts/video_download.py`. \n\n**Note**: after downloading the data, unpack the files under `data/unav100`. The folder structure should look like:\n```\nThis folder\n│   README.md\n│   ...  \n└───data/\n│    └───unav100/\n│    \t └───annotations/\n│               └───unav100_annotations.json\n│    \t └───av_features/   \n│               └───__2MwJ2uHu0_flow.npy    # mix all features together\n│               └───__2MwJ2uHu0_rgb.npy \n│               └───__2MwJ2uHu0_vggish.npy \n|                   ...\n└───libs\n│   ...\n```\n## Training\nRun ```train.py``` to train the model on the UnAV-100 dataset. This will create an experiment folder under ```./ckpt``` that stores the training config, logs, and checkpoints.\n```\npython ./train.py ./configs/avel_unav100.yaml --output reproduce\n```\n\n## Evaluation\nRun ```eval.py``` to evaluate the trained model. 
\n```\npython ./eval.py ./configs/avel_unav100.yaml ./ckpt/avel_unav100_reproduce\n```\n[Optional] We also provide a pretrained model for UnAV-100, which can be downloaded from [this link](https://drive.google.com/file/d/1qiC2osEaBSH8HFvF0WY_535F21CM3JXj/view?usp=share_link).\n\n## Citation\nIf you find our dataset and code useful for your research, please cite our paper:\n```\n@inproceedings{geng2023dense,\n  title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},\n  author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},\n  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},\n  pages={22942--22951},\n  year={2023}\n}\n```\n\n## Acknowledgement\nThe I3D RGB \u0026 flow video features and the VGGish audio features were extracted using [video_features](https://github.com/v-iashin/video_features). Our baseline model was implemented based on [ActionFormer](https://github.com/happyharrycn/actionformer_release). We thank the authors for sharing their code. If you use our code, please consider citing their work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fttgeng233%2Funav","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fttgeng233%2Funav","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fttgeng233%2Funav/lists"}