{"id":23726035,"url":"https://github.com/voidful/hubert-cluster-code","last_synced_at":"2025-07-03T13:06:58.829Z","repository":{"id":109431936,"uuid":"384675547","full_name":"voidful/hubert-cluster-code","owner":"voidful","description":"Extract clustering feature from hubert","archived":false,"fork":false,"pushed_at":"2021-09-13T04:30:57.000Z","size":57507,"stargazers_count":5,"open_issues_count":0,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-23T17:08:57.878Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/voidful.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-10T10:54:19.000Z","updated_at":"2024-10-17T09:12:25.000Z","dependencies_parsed_at":"2023-03-13T14:12:51.927Z","dependency_job_id":null,"html_url":"https://github.com/voidful/hubert-cluster-code","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/voidful/hubert-cluster-code","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fhubert-cluster-code","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fhubert-cluster-code/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fhubert-cluster-code/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fhubert-cluster-code/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/voidful"
,"download_url":"https://codeload.github.com/voidful/hubert-cluster-code/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/voidful%2Fhubert-cluster-code/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263331773,"owners_count":23450155,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-31T00:18:15.643Z","updated_at":"2025-07-03T13:06:58.814Z","avatar_url":"https://github.com/voidful.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# hubert-cluster-code\nReference https://github.com/pytorch/fairseq/tree/master/examples/hubert/simple_kmeans   \n\n## Usage: Extract hubert code from clustering result\n`wget https://raw.githubusercontent.com/voidful/hubert-cluster-code/main/km_feat_100/km_feat_100_layer_20`\n\n```python\nfrom transformers import Wav2Vec2FeatureExtractor, HubertModel\nfrom datasets import load_dataset\nimport soundfile as sf\n\nprocessor = Wav2Vec2FeatureExtractor.from_pretrained(\"facebook/hubert-large-ll60k\")\nmodel = HubertModel.from_pretrained(\"facebook/hubert-large-ll60k\")\n\ndef map_to_array(batch):\n    speech, _ = sf.read(batch[\"file\"])\n    batch[\"speech\"] = speech\n    return batch\nds = load_dataset(\"patrickvonplaten/librispeech_asr_dummy\", \"clean\", split=\"validation\")\nds = ds.map(map_to_array)\n\ninput_values = processor(ds[\"speech\"][0], return_tensors=\"pt\").input_values  \nhidden_states = model(input_values,output_hidden_states=True).hidden_states\n\n\nimport numpy as np\nimport 
joblib\nimport torch\n\nclass ApplyKmeans(object):\n    def __init__(self, km_path, return_diff=False):\n        self.km_model = joblib.load(km_path)\n        self.C_np = self.km_model.cluster_centers_.transpose()\n        self.Cnorm_np = (self.C_np ** 2).sum(0, keepdims=True)\n        self.return_diff = return_diff\n        self.C = torch.from_numpy(self.C_np)\n        self.Cnorm = torch.from_numpy(self.Cnorm_np)\n        if torch.cuda.is_available():\n            self.C = self.C.cuda()\n            self.Cnorm = self.Cnorm.cuda()\n\n    def __call__(self, x):\n        if isinstance(x, torch.Tensor):\n            # Match the device of the centroids (GPU if available, else CPU)\n            x = x.to(self.C.device)\n            dist = torch.sqrt(\n                x.pow(2).sum(1, keepdim=True)\n                - 2 * torch.matmul(x, self.C)\n                + self.Cnorm\n            )\n            min_dist = dist.detach().min(dim=1)\n            if self.return_diff:\n                return min_dist.indices.cpu().numpy(), min_dist.values.cpu().numpy()\n            else:\n                return min_dist.indices.cpu().numpy()\n        else:\n            dist = np.sqrt(\n                (x ** 2).sum(1, keepdims=True)\n                - 2 * np.matmul(x, self.C_np)\n                + self.Cnorm_np\n            )\n            if self.return_diff:\n                return np.argmin(dist, axis=1), np.min(dist, axis=1)\n            else:\n                return np.argmin(dist, axis=1)\n            \napply_kmeans = ApplyKmeans('./km_100h_c500/km_feat_layer_22')\napply_kmeans(hidden_states[22].squeeze())\n```\n\nor using asrp\n```python\nimport asrp\n\nhc = asrp.HubertCode(\"facebook/hubert-large-ll60k\", './km_100h_c500/km_feat_layer_22', 22)\nhc('voice file path')\n```\n\n## Calculate kmeans cluster\n```shell\nexport FAIRSEQ_ROOT=~/fairseq/\npython $FAIRSEQ_ROOT/examples/hubert/simple_kmeans/dump_mfcc_feature.py ./train_100  train 1 0 ./mfcc_feat_100\n# * This would shard the tsv file into ${nshard} and extract features for the ${rank}-th shard, where rank is an integer in [0, 
nshard-1]. Features would be saved at ${feat_dir}/${split}_${rank}_${nshard}.{npy,len}.\n\nwget https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt\npython $FAIRSEQ_ROOT/examples/hubert/simple_kmeans/dump_hubert_feature.py ./train_100 train hubert_large_ll60k.pt 20 1 0 ./hubert_feat_100_layer_20;\npython $FAIRSEQ_ROOT/examples/hubert/simple_kmeans/learn_kmeans.py ./hubert_feat_100_layer_20 train 1 ./km_feat_100_layer_20 500 --percent -1\n```\n\n## Validate clustering result\n```shell\npython $FAIRSEQ_ROOT/examples/wav2vec/wav2vec_manifest.py ./librisample/ --dest ./test_ds/ --valid-percent 0\npython $FAIRSEQ_ROOT/examples/hubert/simple_kmeans/dump_mfcc_feature.py ./test_ds/  train 1 0 ./mfcc_feat_test\npython $FAIRSEQ_ROOT/examples/hubert/simple_kmeans/dump_hubert_feature.py ./test_ds train hubert_large_ll60k.pt 20 1 0 ./hubert_feat_test;\npython $FAIRSEQ_ROOT/examples/hubert/simple_kmeans/dump_km_label.py ./hubert_feat_test train ./km_feat_100_layer_20 1 0 ./lab_dir_test\n```\n* Check that the result in `./lab_dir_test` matches the codes extracted above.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fhubert-cluster-code","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvoidful%2Fhubert-cluster-code","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvoidful%2Fhubert-cluster-code/lists"}