{"id":29026121,"url":"https://github.com/mtg/da-tacos","last_synced_at":"2025-06-26T05:08:37.605Z","repository":{"id":50719809,"uuid":"217303951","full_name":"MTG/da-tacos","owner":"MTG","description":"A Dataset for Cover Song Identification and Understanding","archived":false,"fork":false,"pushed_at":"2023-02-23T14:49:23.000Z","size":92,"stargazers_count":54,"open_issues_count":4,"forks_count":4,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-04-15T00:14:59.656Z","etag":null,"topics":["audio-analysis","cover-song-identification","music-information-retrieval","music-similarity","open-datasets"],"latest_commit_sha":null,"homepage":"https://mtg.github.io/da-tacos","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MTG.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-10-24T13:18:50.000Z","updated_at":"2024-02-06T10:55:10.000Z","dependencies_parsed_at":"2022-08-31T02:11:58.081Z","dependency_job_id":"c420b9ec-e8c2-4c8c-9f5a-b07ace9bbc3b","html_url":"https://github.com/MTG/da-tacos","commit_stats":{"total_commits":33,"total_committers":4,"mean_commits":8.25,"dds":0.4545454545454546,"last_synced_commit":"b85e7a5cbb07e012afacee3e9d84b8f344ea5b1a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/MTG/da-tacos","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MTG%2Fda-tacos","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MTG%2Fda-tacos/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MTG%2Fda-tacos/releases","manifests_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MTG%2Fda-tacos/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MTG","download_url":"https://codeload.github.com/MTG/da-tacos/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MTG%2Fda-tacos/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262003991,"owners_count":23243358,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio-analysis","cover-song-identification","music-information-retrieval","music-similarity","open-datasets"],"created_at":"2025-06-26T05:08:30.493Z","updated_at":"2025-06-26T05:08:37.582Z","avatar_url":"https://github.com/MTG.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e \n    \u003cimg src=\"https://user-images.githubusercontent.com/32430027/67803708-e7ee5600-fa8d-11e9-8c63-5eeaea83e57a.png\" alt=\"Da-TACOS\" width=\"150\"/\u003e\n\u003c/p\u003e\n\nWe present Da-TACOS: a dataset for cover song identification and understanding. It contains two subsets, namely **the benchmark subset** (for benchmarking cover song identification systems) and **the cover analysis subset** (for analyzing the links among cover songs), with **pre-extracted features** and **metadata** for **15,000** and **10,000 songs**, respectively. The annotations included in the metadata are obtained with the API of [SecondHandSongs.com](https://secondhandsongs.com). 
All audio files we use to extract features are encoded in MP3 format and their sample rate is 44.1 kHz. Da-TACOS does not contain any audio files. For **the results** of **our analyses on modifiable musical characteristics** using the cover analysis subset and **our initial benchmarking of 7 state-of-the-art cover song identification algorithms** on the benchmark subset, you can look at our [publication](http://archives.ismir.net/ismir2019/paper/000038.pdf).\n\nFor organizing the data, we use the structure of SecondHandSongs where each song is called a **'performance'**, and each clique (cover group) is called a **'work'**. Based on this, the file names of the songs are their unique performance IDs (PID, e.g. `P_22`), and their labels with respect to their cliques are their work IDs (WID, e.g. `W_14`).\n\nMetadata for each song includes \n* performance title, \n* performance artist, \n* work title, \n* work artist, \n* release year, \n* SecondHandSongs.com performance ID, \n* SecondHandSongs.com work ID,  \n* whether the song is instrumental or not. \n\nIn addition, we matched the original metadata with MusicBrainz to obtain MusicBrainz ID (MBID), song length and genre/style tags. We would like to note that MusicBrainz-related information is not available for all the songs in Da-TACOS, and because we used only our own metadata for matching, we include all possible MBIDs for a particular song.\n\nFor facilitating **reproducibility** in cover song identification (CSI) research, we propose **a framework for feature extraction and benchmarking** in our supplementary repository: [acoss](https://github.com/furkanyesiler/acoss). **The feature extraction component** is designed to help CSI researchers find **the most commonly used features for CSI in a single place**. The parameter values we used to extract the features in Da-TACOS are shared in the same repository. Moreover, **the benchmarking component** includes our implementations of **7 state-of-the-art CSI systems**. 
We provide the performance results of **an initial benchmarking** of those **7 systems** on the benchmark subset of Da-TACOS. We encourage other CSI researchers to contribute to acoss by implementing their favorite feature extraction algorithms and their CSI systems to build up a knowledge base where CSI research can reach larger audiences. \n\nThe instructions for how to download and use the dataset are shared below. Please contact us if you have any questions or requests.\n\n## Structure\n\n### Metadata\n\nWe provide two metadata files that contain information about the benchmark subset and the cover analysis subset. Both metadata files are stored as Python dictionaries in `.json` format, and have the same hierarchical structure. \n\nAn example of loading the metadata files in Python:\n\n```python\nimport json\n\nwith open('./da-tacos_metadata/da-tacos_benchmark_subset_metadata.json') as f:\n\tbenchmark_metadata = json.load(f)\n```\n\nThe Python dictionary obtained with the code above will have the respective WIDs as keys. Each WID maps to a dictionary keyed by PID, and each PID entry holds the metadata of one song that belongs to that WID. 
An example can be seen below:\n\n```python\n\"W_163992\": { # work id\n\t\"P_547131\": { # performance id of the first song belonging to the clique 'W_163992'\n\t\t\"work_title\": \"Trade Winds, Trade Winds\",\n\t\t\"work_artist\": \"Aki Aleong\",\n\t\t\"perf_title\": \"Trade Winds, Trade Winds\",\n\t\t\"perf_artist\": \"Aki Aleong\",\n\t\t\"release_year\": \"1961\",\n\t\t\"work_id\": \"W_163992\",\n\t\t\"perf_id\": \"P_547131\",\n\t\t\"instrumental\": \"No\",\n\t\t\"perf_artist_mbid\": \"9bfa011f-8331-4c9a-b49b-d05bc7916605\",\n\t\t\"mb_performances\": {\n\t\t\t\"4ce274b3-0979-4b39-b8a3-5ae1de388c4a\": {\n\t\t\t\t\"length\": \"175000\"\n\t\t\t},\n\t\t\t\"7c10ba3b-6f1d-41ab-8b20-14b2567d384a\": {\n\t\t\t\t\"length\": \"177653\"\n\t\t\t}\n\t\t}\n\t},\n\t\"P_547140\": { # performance id of the second song belonging to the clique 'W_163992'\n\t\t\"work_title\": \"Trade Winds, Trade Winds\",\n\t\t\"work_artist\": \"Aki Aleong\",\n\t\t\"perf_title\": \"Trade Winds, Trade Winds\",\n\t\t\"perf_artist\": \"Dodie Stevens\",\n\t\t\"release_year\": \"1961\",\n\t\t\"work_id\": \"W_163992\",\n\t\t\"perf_id\": \"P_547140\",\n\t\t\"instrumental\": \"No\"\n\t}\n}\n```\n\n\n### Pre-extracted features\n\nThe list of features included in Da-TACOS can be seen below. All the features are extracted with [acoss](https://github.com/furkanyesiler/acoss/blob/master/acoss/features.py) repository that uses open-source feature extraction libraries such as [Essentia](https://essentia.upf.edu/documentation/), [LibROSA](https://librosa.github.io/librosa/), and [Madmom](https://github.com/CPJKU/madmom).\n\nTo facilitate the use of the dataset, we provide two options regarding the file structure.\n\n1- In `da-tacos_benchmark_subset_single_files` and `da-tacos_coveranalysis_subset_single_files` folders, we organize the data based on their respective cliques, and one file contains all the features for that particular song. 
\n\n```python\n{\n\t\"chroma_cens\": numpy.ndarray,\n\t\"crema\": numpy.ndarray,\n\t\"hpcp\": numpy.ndarray,\n\t\"key_extractor\": {\n\t\t\"key\": numpy.str_,\n\t\t\"scale\": numpy.str_,\n\t\t\"strength\": numpy.float64\n\t},\n\t\"madmom_features\": {\n\t\t\"novfn\": numpy.ndarray,\n\t\t\"onsets\": numpy.ndarray,\n\t\t\"snovfn\": numpy.ndarray,\n\t\t\"tempos\": numpy.ndarray\n\t},\n\t\"mfcc_htk\": numpy.ndarray,\n\t\"tags\": list of (numpy.str_, numpy.str_),\n\t\"label\": numpy.str_,\n\t\"track_id\": numpy.str_\n}\n```\n\n2- In `da-tacos_benchmark_subset_FEATURE` and `da-tacos_coveranalysis_subset_FEATURE` folders, the data is also organized by clique, but each of these folders contains only one feature per song. For instance, if you want to test your system that uses HPCP features, you can download `da-tacos_benchmark_subset_hpcp` to access the pre-computed HPCP features. An example of the contents of those files can be seen below:\n\n```python\n{\n\t\"hpcp\": numpy.ndarray,\n\t\"label\": numpy.str_,\n\t\"track_id\": numpy.str_\n}\n```\n\n## Using the dataset\n\n### Requirements\n\n* Python 3.6+\n* Create a virtual environment and install the requirements\n```bash\ngit clone https://github.com/MTG/da-tacos.git\ncd da-tacos\npython3 -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\n### Downloading the data\n\nThe dataset is currently stored only in Google Drive (it will be uploaded to Zenodo soon) and can be downloaded from this [link](https://drive.google.com/open?id=1GfFF_Kan_Qe69MF15i3-_LqE4wn3XNsb). We also provide a Python script that automatically downloads the folders you specify. 
Basic usage of this script can be seen below:\n\n```bash\npython download_da-tacos.py -h\n```\n```\nusage: download_da-tacos.py [-h]\n                            [--dataset {benchmark,coveranalysis,da-tacos}]                                                                          \n                            [--type {single_files,cens,crema,hpcp,key,madmom,mfcc,tags}]   \n                            [--source {gdrive,zenodo}]                                                         \n                            [--outputdir OUTPUTDIR]\n                            [--unpack]\n                            [--remove]\n\nDownload script for Da-TACOS \n\noptional arguments:                                                                                                       \n  -h, --help            show this help message and exit                                                                   \n  --dataset {metadata,benchmark,coveranalysis,da-tacos}                                                                      \n                        which subset to download. 'da-tacos' option downloads\n                        both subsets. the options other than 'metadata' will\n                        download the metadata as well. (default: metadata)                                                                     \n  --type {single_files,cens,crema,hpcp,key,madmom,mfcc,tags} [{single_files,cens,crema,hpcp,key,madmom,mfcc,tags} ...]                                     \n                        which folder to download. for downloading multiple\n                        folders, you can enter multiple arguments (e.g. '--\n                        type cens crema'). for detailed explanation, please\n                        check https://mtg.github.io/da-tacos/ (default:\n                        single_files)                  \n  --source {gdrive,zenodo}\n                        from which source to download the files. 
you can\n                        either download from Google Drive (gdrive) or from\n                        Zenodo (zenodo) (default: gdrive)                                           \n  --outputdir OUTPUTDIR                                                               \n                        directory to store the dataset (default: ./)                   \n  --unpack              unpack the zip files (default: False)                        \n  --remove              remove zip files after unpacking (default: False) \n```\n\n### Loading the data in python\n\nAll files (except the metadata) are stored in `.h5` format. We recommend using `deepdish` library for python to load the files. An example of how to load the data is shown below:\n\n```python\nimport deepdish as dd\n\nfile_path = './da-tacos_coveranalysis_subset_single_files/W_14/P_15.h5'\nP_15_data = dd.io.load(file_path)\n```\n\n## Citing the dataset\n\nPlease cite the following [publication](http://archives.ismir.net/ismir2019/paper/000038.pdf) when using the dataset:\n\n\u003e Furkan Yesiler, Chris Tralie, Albin Correya, Diego F. Silva, Philip Tovstogan, Emilia Gómez, and Xavier Serra. Da-TACOS: A Dataset for Cover Song Identification and Understanding. In Proc. of the 20th Int. Soc. for Music Information Retrieval Conf. (ISMIR), pages 327-334, Delft, The Netherlands, 2019.\n\nBibtex version:\n\n```\n@inproceedings{yesiler2019,\n    author = \"Furkan Yesiler and Chris Tralie and Albin Correya and Diego F. Silva and Philip Tovstogan and Emilia G{\\'{o}}mez and Xavier Serra\",\n    title = \"{Da-TACOS}: A Dataset for Cover Song Identification and Understanding\",\n    booktitle = \"Proc. of the 20th Int. Soc. for Music Information Retrieval Conf. 
(ISMIR)\",\n    year = \"2019\",\n    pages = \"327--334\",\n    address = \"Delft, The Netherlands\"\n}\n```\n\n## License\n\n* The code in this repository is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) \n* The metadata and the pre-extracted features are licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)\n\nCopyright 2019 Music Technology Group\n\n## Acknowledgments\n\nThis work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068 (MIP-Frontiers).\n\nThis work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 770376 (TROMPA).\n\n\u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/b/b7/Flag_of_Europe.svg\" height=\"64\" hspace=\"20\"\u003e\n\nOur logo uses svg vectors from https://www.svgrepo.com/.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtg%2Fda-tacos","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmtg%2Fda-tacos","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtg%2Fda-tacos/lists"}