{"id":48274860,"url":"https://github.com/idiap/speech-utility-bioacoustics","last_synced_at":"2026-04-04T22:25:35.719Z","repository":{"id":265247981,"uuid":"832625045","full_name":"idiap/speech-utility-bioacoustics","owner":"idiap","description":"On the utility of speech and audio foundation models for marmoset call analysis","archived":false,"fork":false,"pushed_at":"2024-11-28T13:43:46.000Z","size":779,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-05T00:20:22.570Z","etag":null,"topics":["audio","bio-acoustics","representation-learning","self-supervised-learning","speech"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/idiap.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-07-23T11:53:19.000Z","updated_at":"2025-01-08T12:40:04.000Z","dependencies_parsed_at":"2024-11-28T14:38:33.779Z","dependency_job_id":null,"html_url":"https://github.com/idiap/speech-utility-bioacoustics","commit_stats":null,"previous_names":["idiap/speech-utility-bioacoustics"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/idiap/speech-utility-bioacoustics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fspeech-utility-bioacoustics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fspeech-utility-bioacoustics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idia
p%2Fspeech-utility-bioacoustics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fspeech-utility-bioacoustics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/idiap","download_url":"https://codeload.github.com/idiap/speech-utility-bioacoustics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/idiap%2Fspeech-utility-bioacoustics/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31416763,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T20:09:54.854Z","status":"ssl_error","status_checked_at":"2026-04-04T20:09:44.350Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audio","bio-acoustics","representation-learning","self-supervised-learning","speech"],"created_at":"2026-04-04T22:25:35.050Z","updated_at":"2026-04-04T22:25:35.693Z","avatar_url":"https://github.com/idiap.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis\n\n\n[[Paper]](https://arxiv.org/abs/2407.16417)\n[[Slides]](https://eklavyafcb.github.io/docs/Sarkar_Interspeech_2024_Presentation.pdf)\n\n\u003cp 
align=\"center\"\u003e\n\n[![python](https://img.shields.io/badge/-Python_3.9-blue?logo=python\u0026logoColor=white)](https://github.com/pre-commit/pre-commit)\n[![pytorch](https://img.shields.io/badge/PyTorch_2.0+-ee4c2c?logo=pytorch\u0026logoColor=white)](https://pytorch.org/get-started/locally/)\n[![lightning](https://img.shields.io/badge/-Lightning_2.0+-792ee5?logo=pytorchlightning\u0026logoColor=white)](https://pytorchlightning.ai/)\n[![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)\n[![black](https://img.shields.io/badge/Code%20Style-Black-black.svg?labelColor=gray)](https://black.readthedocs.io/en/stable/)\n[![isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\n[![license](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://github.com/idiap/speech-utility-bioacoustics/blob/main/LICENSE)\n[![license](https://img.shields.io/badge/GitHub-Open%20source-green)](https://github.com/speech-utility-bioacoustics/)\n\u003cbr\u003e\u003cbr\u003e\n\u003cimg src=\"img/figure.jpg\" alt=\"header\" width=\"1000\"/\u003e\n\n\u003c/p\u003e\n\n## Cite\n\nThis repository contains the source code for the **ISCA Interspeech 2024** paper [On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis](https://vihar-2024.vihar.org/proceedings/) by E. Sarkar and M. Magimai Doss (2024). 
It was accepted at the _4th International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR 2024)_ workshop track.\n\nPlease cite the original authors in any publication that uses this work:\n\n```bib\n@inproceedings{sarkar24_vihar,\n  title     = {On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis},\n  author    = {Eklavya Sarkar and Mathew Magimai.-Doss},\n  year      = {2024},\n  booktitle = {4th International Workshop on Vocal Interactivity In-and-between Humans, Animals and Robots (VIHAR2024)},\n  doi       = {10.5281/zenodo.13935495},\n  isbn      = {978-2-9562029-3-6},\n}\n```\n\n## Dataset\n\nInfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labeled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher. The dataset contains a total of 169,318 labeled audio segments, which amounts to 72,921 vocalization segments after removing the \"Silence\" and \"Noise\" classes. There are 11 different call-types (excluding \"Silence\" and \"Noise\") and 10 different caller identities.\n\nThe dataset is publicly available [here](https://www.idiap.ch/en/dataset/infantmarmosetsvox/index_html), and contains a usable PyTorch `Dataset` and `Dataloader`. Any publication (e.g. 
conference paper, journal article, technical report, book chapter, etc.) resulting from the usage of InfantMarmosetsVox **must cite** this [paper](https://www.isca-speech.org/archive/interspeech_2023/sarkar23_interspeech.html):\n\n```bib\n@inproceedings{sarkar23_interspeech,\n  title     = {Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?},\n  author    = {Eklavya Sarkar and Mathew Magimai.-Doss},\n  year      = {2023},\n  booktitle = {INTERSPEECH 2023},\n  pages     = {1189--1193},\n  doi       = {10.21437/Interspeech.2023-1968},\n  issn      = {2958-1796},\n}\n```\n\nMore information on the usage is provided in the `README.txt` file of the dataset.\n\n## Installation\n\nThis package has very few requirements.\nTo create a new conda/mamba environment, install [conda](https://conda.io), then [mamba](https://mamba.readthedocs.io/en/latest/installation.html#existing-conda-install), and follow these steps:\n\n```bash\n# Clone project\ngit clone https://github.com/idiap/speech-utility-bioacoustics\ncd speech-utility-bioacoustics\n\n# Create and activate environment\nmamba env create -f environment.yml\nmamba activate marmosets\n```\n\n## Usage\nTrain a model with a chosen experiment configuration from [configs/experiment/](configs/experiment/):\n\n```bash\npython src/train.py experiment=experiment_name.yaml\n```\n\nYou can override any parameter from the command line like this:\n\n```bash\npython src/train.py trainer.max_epochs=20\n```\n\n## Experiments\nThe experiments conducted in this paper can be found in the [scripts](scripts) folder. It contains feature extraction, pairwise distance computation, and training scripts. 
\n\nSample run:\n\n```bash\n$ ./scripts/train/wavlm.sh\n```\n\nThese scripts use [gridtk](https://pypi.org/project/gridtk/) but can be reconfigured according to the user's needs.\n\n## Directory Structure\n\nThis directory is organized as follows:\n\n```\n.\n├── CITATION.cff            # Citation metadata\n├── configs                 # Experiment configs\n├── environment.yaml        # Environment file\n├── hydra_plugins           # Plugins\n├── img                     # Images\n├── LICENSE                 # License\n├── Makefile                # Setup\n├── MANIFEST.in             # Setup\n├── pyproject.toml          # Setup\n├── README.md               # This file\n├── requirements.txt        # Requirements\n├── scripts                 # Scripts\n├── setup.py                # Setup\n├── src                     # Python source code\n└── version.txt             # Version\n\n```\n\n## Contact\n\nFor questions or to report issues with this software package, please contact the first [author](mailto:eklavya.sarkar@idiap.ch).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fspeech-utility-bioacoustics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fidiap%2Fspeech-utility-bioacoustics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fidiap%2Fspeech-utility-bioacoustics/lists"}