{"id":13862226,"url":"https://github.com/LAION-AI/CLAP","last_synced_at":"2025-07-14T11:32:58.839Z","repository":{"id":37685313,"uuid":"466845016","full_name":"LAION-AI/CLAP","owner":"LAION-AI","description":"Contrastive Language-Audio Pretraining","archived":false,"fork":false,"pushed_at":"2024-07-09T02:20:16.000Z","size":9577,"stargazers_count":1422,"open_issues_count":55,"forks_count":137,"subscribers_count":29,"default_branch":"main","last_synced_at":"2024-11-19T11:08:54.594Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2211.06687","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LAION-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-06T20:12:49.000Z","updated_at":"2024-11-18T10:41:18.000Z","dependencies_parsed_at":"2023-11-09T00:28:36.648Z","dependency_job_id":"fb54352e-c9c9-4216-87d5-5fb58dce811d","html_url":"https://github.com/LAION-AI/CLAP","commit_stats":{"total_commits":706,"total_committers":10,"mean_commits":70.6,"dds":"0.23796033994334276","last_synced_commit":"39d746ce24b2bf3c43b73a9c795e9406a235e45d"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2FCLAP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2FCLAP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2FCLAP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LAION-AI%2FCLAP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LAION-AI","download_url":"https://codeload.github.com/LAION-AI/CLAP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225974360,"owners_count":17553940,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-05T06:01:39.880Z","updated_at":"2025-07-14T11:32:58.827Z","avatar_url":"https://github.com/LAION-AI.png","language":"Python","readme":"# CLAP\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/logo.PNG\" alt=\"The Contrastive Language-Audio Pretraining Model Architecture\" width=\"60%\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://arxiv.org/abs/2211.06687\"\u003e\u003cimg src=\"https://img.shields.io/badge/arXiv-2211.06687-brightgreen.svg?style=flat-square\"/\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/laion-clap\"\u003e\u003cimg src=\"https://badge.fury.io/py/laion-clap.svg\"/\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/clap\"\u003e\u003cimg src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Transformers-blue\"/\u003e\u003c/a\u003e\n\u003c/p\u003e\n \n### This repository provides representations of audios and texts via Contrastive Language-Audio Pretraining (CLAP)\n\nWith CLAP, you can extract a latent representation of any given audio and text for your own model, or for different downstream tasks.\n\nAll codes are comming officially with the following paper, accepted by IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023:\n - [Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation](https://arxiv.org/abs/2211.06687)\n\n**New Updates:** \n\n\u003cb\u003e1. We release new CLAP pretrained checkpoints pretrained on music and speech data collecstions from [our dataset collection repo](https://github.com/LAION-AI/audio-dataset).\u003c/b\u003e\n\n\u003cb\u003e2. CLAP model is incorporated and supported by [HuggingFace Transformers](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/clap). Many thanks to [Younes Belkada](https://huggingface.co/ybelkada) and [Arthur Zucker](https://fr.linkedin.com/in/arthur-zucker-8a0445144) for contributing to the HuggingFace support. \u003c/b\u003e\n\n## About this project\n\nThis project is a project in [LAION](https://laion.ai/) that aims at learning better audio understanding and getting more audio data. \nThis is an opensource project. We adopt the codebase of [open_clip](https://github.com/mlfoundations/open_clip) for this project. \n\nmany thanks to \u003ca href=\"https://github.com/cfoster0/CLAP\"\u003e@cfoster0\u003c/a\u003e for allowing us to use his repo name.\n\n## Architecture\nContrastive Language-Audio Pretraining, known as CLAP. Referring to the CLIP (Contrastive Language-Image Pretraining) architecture, the CLAP architecture is as follows.  
## About this project\n\nThis is a [LAION](https://laion.ai/) project that aims at learning better audio understanding and collecting more audio data. \nThis is an open-source project. We adopt the codebase of [open_clip](https://github.com/mlfoundations/open_clip) for this project. \n\nMany thanks to \u003ca href=\"https://github.com/cfoster0/CLAP\"\u003e@cfoster0\u003c/a\u003e for allowing us to use his repo name.\n\n## Architecture\nContrastive Language-Audio Pretraining, known as CLAP, follows the CLIP (Contrastive Language-Image Pretraining) architecture. The CLAP architecture is as follows.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/audioclip-arch.png\" alt=\"The Contrastive Language-Audio Pretraining Model Architecture\" width=\"60%\"/\u003e\n\u003c/p\u003e\n\n## Quick Start\nWe provide the PyPI library for our CLAP model:\n```bash\npip install laion-clap\n```\n\nThen you can follow the usage below or refer to [unit_test.py](https://github.com/LAION-AI/CLAP/blob/laion_clap_pip/src/laion_clap/unit_test.py).\n\nFor the documentation of the API, please refer to [hook.py](https://github.com/LAION-AI/CLAP/blob/main/src/laion_clap/hook.py).\n\n```python\nimport numpy as np\nimport librosa\nimport torch\nimport laion_clap\n\n# quantization\ndef int16_to_float32(x):\n    return (x / 32767.0).astype('float32')\n\n\ndef float32_to_int16(x):\n    x = np.clip(x, a_min=-1., a_max=1.)\n    return (x * 32767.).astype('int16')\n\nmodel = laion_clap.CLAP_Module(enable_fusion=False)\nmodel.load_ckpt() # download the default pretrained checkpoint.\n\n# Directly get audio embeddings from audio files\naudio_file = [\n    '/home/data/test_clap_short.wav',\n    '/home/data/test_clap_long.wav'\n]\naudio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=False)\nprint(audio_embed[:,-20:])\nprint(audio_embed.shape)\n\n# Get audio embeddings from audio data\naudio_data, _ = librosa.load('/home/data/test_clap_short.wav', sr=48000) # sample rate should be 48000\naudio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)\naudio_embed = model.get_audio_embedding_from_data(x = audio_data, use_tensor=False)\nprint(audio_embed[:,-20:])\nprint(audio_embed.shape)\n\n# Directly get audio embeddings from audio files, but return torch tensor\naudio_file = [\n    '/home/data/test_clap_short.wav',\n    '/home/data/test_clap_long.wav'\n]\naudio_embed = model.get_audio_embedding_from_filelist(x = audio_file, use_tensor=True)\nprint(audio_embed[:,-20:])\nprint(audio_embed.shape)\n\n# Get audio embeddings from audio data\naudio_data, _ = librosa.load('/home/data/test_clap_short.wav', sr=48000) # sample rate should be 48000\naudio_data = audio_data.reshape(1, -1) # Make it (1,T) or (N,T)\naudio_data = torch.from_numpy(int16_to_float32(float32_to_int16(audio_data))).float() # quantize before sending it to the model\naudio_embed = model.get_audio_embedding_from_data(x = audio_data, use_tensor=True)\nprint(audio_embed[:,-20:])\nprint(audio_embed.shape)\n\n# Get text embeddings from texts:\ntext_data = [\"I love the contrastive learning\", \"I love the pretrain model\"]\ntext_embed = model.get_text_embedding(text_data)\nprint(text_embed)\nprint(text_embed.shape)\n\n# Get text embeddings from texts, but return torch tensor:\ntext_data = [\"I love the contrastive learning\", \"I love the pretrain model\"]\ntext_embed = model.get_text_embedding(text_data, use_tensor=True)\nprint(text_embed)\nprint(text_embed.shape)\n```\n\n## Pretrained Models\nThe pretrained checkpoints can be found [here](https://huggingface.co/lukewys/laion_clap/tree/main).\nPlease refer to the previous section for how to load and run the checkpoints.\nFor the PyPI library, [630k-audioset-best.pt](https://huggingface.co/lukewys/laion_clap/blob/main/630k-audioset-best.pt) and [630k-audioset-fusion-best.pt](https://huggingface.co/lukewys/laion_clap/blob/main/630k-audioset-fusion-best.pt) are our default models (non-fusion and fusion).\n\nWe further provide the pretrained models below according to your use case:\n\n* For general audio shorter than 10 seconds: 
[630k-audioset-best.pt](https://huggingface.co/lukewys/laion_clap/blob/main/630k-audioset-best.pt) or [630k-best.pt](https://huggingface.co/lukewys/laion_clap/blob/main/630k-best.pt)\n* For general audio with variable lengths: [630k-audioset-fusion-best.pt](https://huggingface.co/lukewys/laion_clap/blob/main/630k-audioset-fusion-best.pt) or [630k-fusion-best.pt](https://huggingface.co/lukewys/laion_clap/blob/main/630k-fusion-best.pt)\n* For music: [music_audioset_epoch_15_esc_90.14.pt](https://huggingface.co/lukewys/laion_clap/blob/main/music_audioset_epoch_15_esc_90.14.pt)\n* For music and speech: [music_speech_epoch_15_esc_89.25.pt](https://huggingface.co/lukewys/laion_clap/blob/main/music_speech_epoch_15_esc_89.25.pt)\n* For speech, music and general audio: [music_speech_audioset_epoch_15_esc_89.98.pt](https://huggingface.co/lukewys/laion_clap/blob/main/music_speech_audioset_epoch_15_esc_89.98.pt)\n\nThe checkpoint listed here for each model setting is the one with the highest average mAP score in training.\nThe average mAP score is calculated by averaging 4 scores: A--\u003eT mAP@10 on AudioCaps, T--\u003eA mAP@10 on AudioCaps, A--\u003eT mAP@10 on Clotho, and T--\u003eA mAP@10 on Clotho.\n\nTo use the above pretrained models, you need to load the checkpoint yourself, as shown in the loading example below.\n\nUpdate 2023.4.7: we have released 3 larger CLAP models trained on music and speech datasets in addition to LAION-Audio-630k. Here are descriptions of the models and their performance:\n\n - `music_speech_audioset_epoch_15_esc_89.98.pt`: trained on music + speech + Audioset + LAION-Audio-630k. The zeroshot ESC50 performance is 89.98%, the GTZAN performance is 51%.\n - `music_audioset_epoch_15_esc_90.14.pt`: trained on music + Audioset + LAION-Audio-630k. The zeroshot ESC50 performance is 90.14%, the GTZAN performance is 71%.\n - `music_speech_epoch_15_esc_89.25.pt`: trained on music + speech + LAION-Audio-630k. The zeroshot ESC50 performance is 89.25%, the GTZAN performance is 69%.\n\nThese models use a larger audio encoder. To load them using the pip API:\n```python\nimport laion_clap\nmodel = laion_clap.CLAP_Module(enable_fusion=False, amodel='HTSAT-base')\nmodel.load_ckpt('checkpoint_path/checkpoint_name.pt')\n```\n\nPlease note that this is a temporary release for people who are working on larger-scale downstream tasks.\nWe will release a more comprehensive version of the model with detailed experiments in the future.\nUse this model at your own risk.\n\n* None of the new checkpoints were trained with fusion. The training dataset size for `music_speech_audioset_epoch_15_esc_89.98.pt` is around 4M samples. The zeroshot GTZAN score is evaluated using the prompt `This audio is a \u003cgenre\u003e song.`\n\n\u003c!-- We provide CLAP's performance on audio classification tasks under the zero-shot setting or the supervised setting. 
More results can be found in our paper.\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/LAION-AI/CLAP/main/assets/clap-zeroshot.PNG\" alt=\"Zero-shot Performance\" width=\"100%\"/\u003e\n\u003c/p\u003e --\u003e\n\n## Environment Installation\nIf you want to inspect and reuse our model in your project instead of using the pip library directly, you need to install the same environment we use. Please run the following commands:\n```bash\nconda create -n clap python=3.10\nconda activate clap\ngit clone https://github.com/LAION-AI/CLAP.git\ncd CLAP\n# you can also install pytorch by following the official instructions (https://pytorch.org/get-started/locally/)\npip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html\npip install -r requirements.txt\n```\n\n## Dataset format\nWe use training data in webdataset format. For details of our dataset, please see https://github.com/LAION-AI/audio-dataset.\n\nFor copyright reasons, we cannot release the dataset this model is trained on. However, we have released [LAION-audio-630K](https://github.com/LAION-AI/audio-dataset/tree/main/laion-audio-630k), the data source we used to compose the dataset, with a link to each audio clip and its caption. Please refer to [LAION-audio-630K](https://github.com/LAION-AI/audio-dataset/tree/main/laion-audio-630k) for more details. You can download the dataset, preprocess it on your own, and train on it locally. To train on the local dataset, replace `--remotedata` in the training scripts (see the [experiment_scripts](./experiment_scripts) folder) with `--datasetpath \u003cyour dir to datasets\u003e`.\n\nYou can find an example of our dataset format [here](https://drive.google.com/drive/folders/1scyH43eQAcrBz-5fAw44C6RNBhC3ejvX?usp=sharing).\nIt contains the full ESC50 dataset, split according to the first 5-fold split.\n\n
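If you want to peek inside one shard of the example dataset linked above before wiring up the full training pipeline, here is a rough standalone sketch (not the official data loader) that iterates over a webdataset tar shard. It assumes the ESC50-style layout used later in this README, where each sample is a `.flac` file paired with a `.json` file carrying a `tag` field; the shard path and the `soundfile` dependency are assumptions for illustration.\n\n```python\n# Rough sketch (not the official loader): inspect one webdataset shard.\n# Assumes flac/json sample pairs with a 'tag' field, as in the ESC50 example below.\nimport io\nimport json\nimport tarfile\n\nimport soundfile as sf  # assumed dependency: pip install soundfile\n\nshard_path = './ESC50_1/test/0.tar'  # hypothetical path to one shard\n\n# Group tar members by sample stem, e.g. 'sample_0001.flac' + 'sample_0001.json'\nsamples = {}\nwith tarfile.open(shard_path) as tar:\n    for member in tar.getmembers():\n        if not member.isfile() or '.' not in member.name:\n            continue\n        stem, ext = member.name.rsplit('.', 1)\n        samples.setdefault(stem, {})[ext] = tar.extractfile(member).read()\n\nfor stem, files in samples.items():\n    meta = json.loads(files['json'])\n    audio, sr = sf.read(io.BytesIO(files['flac']))\n    print(stem, sr, audio.shape, meta.get('tag'))\n```\n\n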
## Training, Fine-tuning and Evaluation\nYou can find the scripts for training, fine-tuning and evaluation (zero-shot and retrieval) in the [experiment_scripts](./experiment_scripts) folder. \nThe scripts included there are the ones we used to train our model on a SLURM cluster. \nYou need to change the scripts to fit your own environment.\nFor example, in a single-machine multi-GPU setting, you might want to use `torchrun` instead of `srun` to run the script.\nTo train on a single-GPU machine, use `CUDA_VISIBLE_DEVICES=0 python -m ...` instead of `srun`.\nWe use [Weights and Biases](https://wandb.ai/site) for experiment logging. You need to configure Weights and Biases in your environment.\nTo train on a local dataset, replace `--remotedata` in the training scripts (see the [experiment_scripts](./experiment_scripts) folder) with `--datasetpath \u003cyour dir to datasets\u003e`.\n\n## Core Code\nPlease refer to [main.py](https://github.com/LAION-AI/CLAP/blob/laion_clap_pip/src/laion_clap/training/main.py), [train.py](https://github.com/LAION-AI/CLAP/blob/laion_clap_pip/src/laion_clap/training/train.py), [data.py](https://github.com/LAION-AI/CLAP/blob/laion_clap_pip/src/laion_clap/training/data.py), and [model.py](https://github.com/LAION-AI/CLAP/blob/laion_clap_pip/src/laion_clap/clap_module/model.py) to quickly get familiar with our model.\n\n## Reproducibility\nAn example of the preprocessed Clotho dataset in webdataset format can be downloaded [here](https://drive.google.com/drive/folders/1mU9mBOe11jTFCrQRJQsUa4S-3TlNuYoI?usp=sharing) (by downloading, you agree to the license described in the [Clotho dataset](https://zenodo.org/record/3490684#.Y9ALPeyZP1w)). The audio encoder pretrained with 48kHz AudioSet can be found [here](https://drive.google.com/drive/folders/1SMQyzJvc6DwJNuhQ_WI8tlCFL5HG2vk6?usp=sharing), where `HTSAT-fullset-imagenet-map=0.467.ckpt` is the checkpoint used to initialize our HTSAT audio encoder. You should get a similar result by loading the audio encoder checkpoint and training on the same dataset.\n\nThe script to train the model on the Clotho dataset is included [here](experiment_scripts/train-only-clotho.sh). You need to change `datasetpath` and `pretrained-audio` to point to your own directories. You can check the [report](https://stability.wandb.io/clap/clap/reports/CLAP-trained-on-Clotho-dataset--VmlldzoyNzY?accessToken=c0erq9hhp7h880jclihd9j9if679s6bylwto33vo14yo5jg40ppe38qeoafoonpz) of the training script on a single A100 GPU for reference.\n\nBecause most of the datasets have copyright restrictions, unfortunately we cannot directly share the other preprocessed datasets. 
The captions generated by the keyword-to-caption model for AudioSet can be found [here](https://github.com/LAION-AI/audio-dataset/tree/main/laion-audio-630k#keyword-to-caption-augmentation).\n\n## Zeroshot Classification with ESC50 official split\n\nHere is example code to run zeroshot classification on the **first** ESC50 official split with the pip API:\n\n```python\nimport laion_clap\nimport glob\nimport json\nimport torch\nimport numpy as np\n\ndevice = torch.device('cuda:0')\n\n# download https://drive.google.com/drive/folders/1scyH43eQAcrBz-5fAw44C6RNBhC3ejvX?usp=sharing and extract ./ESC50_1/test/0.tar to ./ESC50_1/test/\nesc50_test_dir = './ESC50_1/test/*/'\nclass_index_dict_path = './class_labels/ESC50_class_labels_indices_space.json'\n\n# Load the model\nmodel = laion_clap.CLAP_Module(enable_fusion=False, device=device)\nmodel.load_ckpt()\n\n# Get the class index dict\nclass_index_dict = {v: k for v, k in json.load(open(class_index_dict_path)).items()}\n\n# Get all the data\naudio_files = sorted(glob.glob(esc50_test_dir + '**/*.flac', recursive=True))\njson_files = sorted(glob.glob(esc50_test_dir + '**/*.json', recursive=True))\nground_truth_idx = [class_index_dict[json.load(open(jf))['tag'][0]] for jf in json_files]\n\nwith torch.no_grad():\n    ground_truth = torch.tensor(ground_truth_idx).view(-1, 1)\n\n    # Get text features\n    all_texts = [\"This is a sound of \" + t for t in class_index_dict.keys()]\n    text_embed = model.get_text_embedding(all_texts)\n    audio_embed = model.get_audio_embedding_from_filelist(x=audio_files)\n\n    ranking = torch.argsort(torch.tensor(audio_embed) @ torch.tensor(text_embed).t(), descending=True)\n    preds = torch.where(ranking == ground_truth)[1]\n    preds = preds.cpu().numpy()\n\n    metrics = {}\n    metrics[\"mean_rank\"] = preds.mean() + 1\n    metrics[\"median_rank\"] = np.floor(np.median(preds)) + 1\n    for k in [1, 5, 10]:\n        metrics[f\"R@{k}\"] = np.mean(preds \u003c k)\n    # map@10\n    metrics[\"mAP@10\"] = np.mean(np.where(preds \u003c 10, 1 / (preds + 1), 0.0))\n\n    print(\n        \"Zeroshot Classification Results: \"\n        + \"\\t\".join([f\"{k}: {round(v, 4):.4f}\" for k, v in metrics.items()])\n    )\n```\n\nFor the ESC50 dataset, you can either download our processed ESC50 in webdataset format from [here](https://drive.google.com/drive/folders/1scyH43eQAcrBz-5fAw44C6RNBhC3ejvX?usp=sharing) and extract `./test/0.tar` to `./test/`, or download the original ESC50 dataset and preprocess the labels into the format of `class_labels/ESC50_class_labels_indices_space.json` yourself (replace `_` with a space).\n\n
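As a hypothetical illustration of that label preprocessing (not a script shipped with this repo), the snippet below builds a label-to-index JSON in the space-separated format from the original ESC50 metadata. The column names `category` and `target` follow the standard `meta/esc50.csv` layout of the ESC-50 repository, and the paths are assumptions; please double-check them against your local copy.\n\n```python\n# Hypothetical helper: build a space-separated label-index JSON from ESC50 metadata.\n# Assumes the standard ESC-50 meta/esc50.csv columns 'category' and 'target'.\nimport csv\nimport json\n\nlabel_to_index = {}\nwith open('ESC-50-master/meta/esc50.csv', newline='') as f:\n    for row in csv.DictReader(f):\n        label = row['category'].replace('_', ' ')  # e.g. 'chirping_birds' -> 'chirping birds'\n        label_to_index[label] = int(row['target'])\n\nwith open('class_labels/ESC50_class_labels_indices_space.json', 'w') as f:\n    json.dump(label_to_index, f, indent=2)\n```\n\n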
The results should be the same as the following:\n\nFor `model = laion_clap.CLAP_Module(enable_fusion=True, device=device)`: `mean_rank: 1.2425\tmedian_rank: 1.0000\tR@1: 0.9050\tR@5: 0.9900\tR@10: 0.9925\tmAP@10: 0.9407`\n\nFor `model = laion_clap.CLAP_Module(enable_fusion=False, device=device)`: `mean_rank: 1.1450\tmedian_rank: 1.0000\tR@1: 0.9275\tR@5: 0.9975\tR@10: 1.0000\tmAP@10: 0.9556`\n\nNote that the results are slightly higher than those reported in the paper, because we use the train + test data of ESC50 and remove the data that overlaps with other training datasets (mainly Freesound).\n\n## Citation\nIf you find this project and the LAION-Audio-630K dataset useful, please cite our paper:\n```\n@inproceedings{laionclap2023,\n  title = {Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},\n  author = {Wu*, Yusong and Chen*, Ke and Zhang*, Tianyu and Hui*, Yuchen and Berg-Kirkpatrick, Taylor and Dubnov, Shlomo},\n  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},\n  year = {2023}\n}\n@inproceedings{htsatke2022,\n  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},\n  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},\n  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP},\n  year = {2022}\n}\n```\n\n## Acknowledgements\n\nThis project is a work in progress, so the codebase and model might not be perfect or bug-free. \nWe very much appreciate any kind of contribution or issue raised.\nIf you find a bug or have any suggestions, please feel free to open an issue or contact us.\nIf you would like to actively contribute to this project, please join the LAION Discord.\n","funding_links":[],"categories":["Python","Acknowledgement","Audio Embeddings \u0026 Representations"],"sub_categories":["Feel free to explore the repository and contribute!"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLAION-AI%2FCLAP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FLAION-AI%2FCLAP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FLAION-AI%2FCLAP/lists"}