{"id":13407907,"url":"https://github.com/awslabs/speech-representations","last_synced_at":"2025-03-14T12:31:50.382Z","repository":{"id":53675974,"uuid":"261021490","full_name":"awslabs/speech-representations","owner":"awslabs","description":"Code for DeCoAR (ICASSP 2020) and BERTphone (Odyssey 2020)","archived":false,"fork":false,"pushed_at":"2022-11-26T19:40:47.000Z","size":36,"stargazers_count":103,"open_issues_count":0,"forks_count":14,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-03-13T02:37:56.866Z","etag":null,"topics":["deep-learning","nlp","speech-recognition"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1912.01679","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/awslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-03T21:10:35.000Z","updated_at":"2024-05-29T02:21:37.000Z","dependencies_parsed_at":"2023-01-22T16:31:03.651Z","dependency_job_id":null,"html_url":"https://github.com/awslabs/speech-representations","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fspeech-representations","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fspeech-representations/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fspeech-representations/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fspeech-representations/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/awslabs","download_url":"https://codeload.github.com/awslabs/speech-representations/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243578269,"owners_count":20313794,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","nlp","speech-recognition"],"created_at":"2024-07-30T20:00:49.402Z","updated_at":"2025-03-14T12:31:50.079Z","avatar_url":"https://github.com/awslabs.png","language":"Python","funding_links":[],"categories":["Paper List","Python"],"sub_categories":["Generative"],"readme":"# Speech Representations\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n\nModels and code for deep learning representations developed by the AWS AI Speech team:\n\n- [DeCoAR (self-supervised contextual representations for speech recognition)](https://arxiv.org/abs/1912.01679)\n- [BERTphone (phonetically-aware acoustic BERT for speaker and language recognition)](https://www.isca-speech.org/archive/Odyssey_2020/abstracts/93.html)\n- [DeCoAR 2.0 (deep contextualized acoustic representation with vector quantization)](https://arxiv.org/abs/2012.06659)\n\n**NOTE: This repo is not actively maintained. For future experiments with DeCoAR and DeCoAR 2.0, we suggest using [the S3PRL speech toolkit](https://github.com/s3prl/s3prl), which has active and standardized featurizer/upstream/downstream wrappers for these models.**\n\n## Installation\n\nWe provide a library and CLI to featurize speech utterances. 
We hope to release training/fine-tuning code in the future.\n\n[Kaldi](https://github.com/kaldi-asr/kaldi) should be installed to `kaldi/`, or `$KALDI_ROOT` should be set.\n\nWe expect Python 3.6+. The BERTphone model is defined in MXNet and our DeCoAR models are defined in PyTorch. Clone this repository, then:\n```sh\npip install -e .\n# For DeCoAR\npip install torch fairseq\n# For BERTphone\npip install mxnet-mkl~=1.6.0   # ...or mxnet-cu102mkl for GPU w/ CUDA 10.2, etc.\npip install gluonnlp # optional; for featurizing with bertphone\n```\n\n\n## Pre-trained models\n\nFirst, download the model weights:\n```sh\nmkdir artifacts\ncd artifacts\n# For DeCoAR trained on LibriSpeech (257M)\nwget https://github.com/awslabs/speech-representations/releases/download/decoar/checkpoint_decoar.pt\n# For BERTphone 8 kHz (λ=0.2) trained on Fisher\nwget https://github.com/awslabs/speech-representations/releases/download/bertphone/bertphone_fisher_02-87159543.params\n# For DeCoAR 2.0:\nwget https://github.com/awslabs/speech-representations/releases/download/decoar2/checkpoint_decoar2.pt\n```\nWe support featurizing individual files with the CLI:\n```sh\nspeech-reps featurize --model {decoar,bertphone,decoar2} --in-wav \u003cinput_file\u003e.wav --out-npy \u003coutput_file\u003e.npy\n# --params \u003cfile\u003e: load custom weights (otherwise use `artifacts/`)\n# --gpu \u003cint\u003e:     use GPU (otherwise use CPU)\n```\nor in code:\n```python\nfrom speech_reps.featurize import DeCoARFeaturizer\n# Load the model on GPU 0\nfeaturizer = DeCoARFeaturizer('artifacts/checkpoint_decoar.pt', gpu=0)\n# Returns a (time, feature) NumPy array\ndata = featurizer.file_to_feats('my_wav_file.wav')\n```\n\nWe plan to support Kaldi `.scp` and `.ark` files soon. 
For now, batches can be processed with the underlying `featurizer._model`.\n\n\n## References\n\nIf you found our package or pre-trained models useful, please cite the relevant work:\n\n**[DeCoAR](https://arxiv.org/abs/1912.01679)**\n```\n@inproceedings{decoar,\n  author    = {Shaoshi Ling and Yuzong Liu and Julian Salazar and Katrin Kirchhoff},\n  title     = {Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition},\n  booktitle = {{ICASSP}},\n  pages     = {6429--6433},\n  publisher = {{IEEE}},\n  year      = {2020}\n}\n```\n**[BERTphone](https://www.isca-speech.org/archive/Odyssey_2020/abstracts/93.html)**\n```\n@inproceedings{bertphone,\n  author    = {Shaoshi Ling and Julian Salazar and Yuzong Liu and Katrin Kirchhoff},\n  title     = {BERTphone: Phonetically-aware Encoder Representations for Speaker and Language Recognition},\n  booktitle = {{Speaker Odyssey}},\n  publisher = {{ISCA}},\n  year      = {2020}\n}\n```\n**[DeCoAR 2.0](https://arxiv.org/abs/2012.06659)**\n```\n@misc{ling2020decoar,\n      title={DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization}, \n      author={Shaoshi Ling and Yuzong Liu},\n      year={2020},\n      eprint={2012.06659},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fspeech-representations","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fawslabs%2Fspeech-representations","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fspeech-representations/lists"}