{"id":20021748,"url":"https://github.com/bshall/cpc","last_synced_at":"2025-05-05T01:30:58.867Z","repository":{"id":119398724,"uuid":"417433100","full_name":"bshall/cpc","owner":"bshall","description":"CPC-big and k-means clustering for zero-resource speech processing","archived":false,"fork":false,"pushed_at":"2021-10-15T16:33:03.000Z","size":12,"stargazers_count":7,"open_issues_count":2,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-08T14:45:48.966Z","etag":null,"topics":["contrastive-predictive-coding","self-supervised-learning","speech"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2108.00917","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bshall.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-15T08:55:50.000Z","updated_at":"2024-03-28T03:17:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"080d8489-eb07-480e-b14e-d6421a2a3b45","html_url":"https://github.com/bshall/cpc","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fcpc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fcpc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fcpc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bshall%2Fcpc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bshall","download_url":"https://c
odeload.github.com/bshall/cpc/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252423013,"owners_count":21745531,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["contrastive-predictive-coding","self-supervised-learning","speech"],"created_at":"2024-11-13T08:38:05.989Z","updated_at":"2025-05-05T01:30:58.852Z","avatar_url":"https://github.com/bshall.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Contrastive Predictive Coding\n\nThe CPC-big model and k-means checkpoints used in [Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing](https://arxiv.org/abs/2108.00917).\n\nContrastive predictive coding (CPC) aims to learn representations of speech by distinguishing future observations from a set of negative examples. \nPrevious work has shown that linear classifiers trained on CPC features can accurately predict speaker and phone labels. \nHowever, it is unclear how the features actually capture speaker and phonetic information, and whether it is possible to normalize out the irrelevant details (depending on the downstream task). \nIn this paper, we first show that the per-utterance mean of CPC features captures speaker information to a large extent. \nConcretely, we find that comparing means performs well on a speaker verification task. \nNext, probing experiments show that standardizing the features effectively removes speaker information. 
\nBased on this observation, we propose a speaker normalization step to improve acoustic unit discovery using K-means clustering of CPC features. \nFinally, we show that a language model trained on the resulting units achieves some of the best results in the ZeroSpeech 2021 Challenge.\n\n## Basic Usage\n\n```python\nimport torch, torchaudio\nfrom sklearn.preprocessing import StandardScaler\n\n# Load model checkpoints\ncpc = torch.hub.load(\"bshall/cpc:main\", \"cpc\").cuda()\nkmeans = torch.hub.load(\"bshall/cpc:main\", \"kmeans50\")\n\n# Load audio\nwav, sr = torchaudio.load(\"path/to/wav\")\nassert sr == 16000\nwav = wav.unsqueeze(0).cuda()\n\nx = cpc.encode(wav).squeeze().cpu().numpy()  # Encode\nx = StandardScaler().fit_transform(x)  # Speaker normalize\ncodes = kmeans.predict(x)  # Discretize\n```\n\nNote that the `encode` function is stateful (keeps the hidden state of the LSTM from previous calls).\n\n## Encode an Audio Dataset\n\nClone the repo and use the `encode.py` script:\n\n```\nusage: encode.py [-h] in_dir out_dir\n\nEncode an audio dataset using CPC-big (with speaker normalization and discretization).\n\npositional arguments:\n  in_dir      Path to the directory to encode.\n  out_dir     Path to the output directory.\n\noptional arguments:\n  -h, --help  show this help message and exit\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbshall%2Fcpc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbshall%2Fcpc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbshall%2Fcpc/lists"}