{"id":13862028,"url":"https://github.com/RetroCirce/HTS-Audio-Transformer","last_synced_at":"2025-07-14T11:32:13.174Z","repository":{"id":38225216,"uuid":"454608227","full_name":"RetroCirce/HTS-Audio-Transformer","owner":"RetroCirce","description":"The official code repo of \"HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection\"","archived":false,"fork":false,"pushed_at":"2024-08-16T19:13:26.000Z","size":918,"stargazers_count":415,"open_issues_count":26,"forks_count":68,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-07-02T19:03:31.014Z","etag":null,"topics":["audio-classification","music-information-retrieval","python","sound-event-detection","transformer-models"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2202.00874","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RetroCirce.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-02-02T01:24:44.000Z","updated_at":"2025-06-27T20:12:32.000Z","dependencies_parsed_at":"2022-06-27T00:03:41.343Z","dependency_job_id":null,"html_url":"https://github.com/RetroCirce/HTS-Audio-Transformer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RetroCirce/HTS-Audio-Transformer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RetroCirce%2FHTS-Audio-Transformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RetroCirce%2FHTS-Audio-Transformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RetroCirce%2FHTS-Audio-Transformer/releases","manifests
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RetroCirce%2FHTS-Audio-Transformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RetroCirce","download_url":"https://codeload.github.com/RetroCirce/HTS-Audio-Transformer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RetroCirce%2FHTS-Audio-Transformer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265285594,"owners_count":23740559,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-classification","music-information-retrieval","python","sound-event-detection","transformer-models"],"created_at":"2024-08-05T06:01:35.136Z","updated_at":"2025-07-14T11:32:12.809Z","avatar_url":"https://github.com/RetroCirce.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Hierarchical Token Semantic Audio Transformer\n\n\n## Introduction\n\nThe code repository for \"[HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection](https://arxiv.org/abs/2202.00874)\", published at ICASSP 2022.\n\nIn this paper, we devise a model, HTS-AT, by combining a [swin transformer](https://github.com/microsoft/Swin-Transformer) with a token-semantic module and adapt it to **audio classification** and **sound event detection tasks**. HTS-AT is an efficient and lightweight audio transformer with a hierarchical structure and has only 30 million parameters. 
It achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. \n\n![HTS-AT Architecture](fig/arch.png)\n\n## Classification Results on AudioSet, ESC-50, and Speech Command V2 (mAP)\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"fig/ac_result.png\" align=\"center\" alt=\"HTS-AT ClS Result\" width=\"50%\"/\u003e\n\u003c/p\u003e\n\n\n## Localization/Detection Results on DESED dataset (F1-Score)\n\n![HTS-AT Localization Result](fig/local_result.png)\n\n\n## Getting Started\n\n### Install Requirements\n```\npip install -r requirements.txt\n```\n\nWe do not include PyTorch in the requirements, since different machines require different versions of CUDA and toolkits. Make sure you install PyTorch by following [the official guidance](https://pytorch.org/).\n\nInstall 'SOX' and 'ffmpeg'. We recommend that you run this code on Linux inside a Conda environment. 
In that case, you can install them with:\n```\nsudo apt install sox \nconda install -c conda-forge ffmpeg\n```\n### Download and Process Datasets\n\n* config.py\n```\nchange the variable \"dataset_path\" to your AudioSet path\nchange the variable \"desed_folder\" to your DESED path\nchange the classes_num to 527\n```\n\n* [AudioSet](https://research.google.com/audioset/download.html)\n```\n./create_index.sh # \n// remember to change the paths in the script\n// more information about this script is in https://github.com/qiuqiangkong/audioset_tagging_cnn\n\npython main.py save_idc \n// count the number of samples in each class and save the npy files\n```\n* [ESC-50](https://github.com/karolpiczak/ESC-50)\n```\nOpen the Jupyter notebook at esc-50/prep_esc50.ipynb and run it\n```\n* [Speech Command V2](https://arxiv.org/pdf/1804.03209.pdf)\n```\nOpen the Jupyter notebook at scv2/prep_scv2.ipynb and run it\n```\n* [DESED Dataset](https://project.inria.fr/desed/) \n```\npython conver_desed.py \n// will produce the npy data files\n```\n\n### Set the Configuration File: config.py\n\nThe script *config.py* contains all the configurations you need to set to run the code. \nPlease read the introductory comments in the file and change the settings accordingly.\n\n**IMPORTANT NOTICE**\n\nLike many transformer architectures, HTS-AT needs **warm-up**; otherwise the model will underfit in the beginning. To find a proper warm-up step or warm-up epoch, please pay attention to [these two hyperparameters](https://github.com/RetroCirce/HTS-Audio-Transformer/blob/a6caae40149bd6667b0d898793bdeaf7c26bda47/config.py#L33-L34) in the configuration file. The default settings work for the full AudioSet (2.2M data samples). If your dataset has a different number of samples (e.g. 100K, 1M, 10M, etc.), you might need to adjust the warm-up step or epoch accordingly. 
\n\nThe most important settings are:\nIf you want to train/test your model on AudioSet, you need to set:\n```\ndataset_path = \"your processed audioset folder\"\ndataset_type = \"audioset\"\nbalanced_data = True\nloss_type = \"clip_bce\"\nsample_rate = 32000\nhop_size = 320 \nclasses_num = 527\n```\n\nIf you want to train/test your model on ESC-50, you need to set:\n```\ndataset_path = \"your processed ESC-50 folder\"\ndataset_type = \"esc-50\"\nloss_type = \"clip_ce\"\nsample_rate = 32000\nhop_size = 320 \nclasses_num = 50\n```\n\nIf you want to train/test your model on Speech Command V2, you need to set:\n```\ndataset_path = \"your processed SCV2 folder\"\ndataset_type = \"scv2\"\nloss_type = \"clip_bce\"\nsample_rate = 16000\nhop_size = 160\nclasses_num = 35\n```\n\nIf you want to test your model on DESED, you need to set:\n```\nresume_checkpoint = \"Your checkpoint on AudioSet\"\nheatmap_dir = \"localization results output folder\"\ntest_file = \"output heatmap name\"\nfl_local = True\nfl_dataset = \"Your DESED npy file\"\n```\n\n### Train and Evaluation\n\n**Notice: Our model now supports running on a single GPU.**\n\nAll scripts are run through main.py:\n```\nTrain: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py train\n\nTest: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test\n\nEnsemble Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py esm_test \n// See config.py for settings of ensemble testing\n\nWeight Average: python main.py weight_average\n// See config.py for settings of weight averaging\n```\n\n### Localization on DESED\n```\nCUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test\n// make sure that fl_local=True in config.py\npython fl_evaluate.py\n// organize and gather the localization results\nfl_evaluate_f1.ipynb\n// Follow the notebook to produce the results\n```\n\n### Model Checkpoints:\n\nWe provide the model checkpoints on three datasets (and additionally DESED dataset) in this 
[link](https://drive.google.com/drive/folders/1f5VYMk0uos_YnuBshgmaTVioXbs7Kmz6?usp=sharing). Feel free to download and test them.\n\n## Citing\n```\n@inproceedings{htsat-ke2022,\n  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},\n  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},\n  booktitle = {{ICASSP} 2022}\n}\n```\nOur work is based on [Swin Transformer](https://github.com/microsoft/Swin-Transformer), which is a famous image classification transformer model.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRetroCirce%2FHTS-Audio-Transformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRetroCirce%2FHTS-Audio-Transformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRetroCirce%2FHTS-Audio-Transformer/lists"}