{"id":14958157,"url":"https://github.com/rf5/transfusion-asr","last_synced_at":"2025-10-24T14:31:04.003Z","repository":{"id":112536617,"uuid":"550753584","full_name":"RF5/transfusion-asr","owner":"RF5","description":"Transcribing Speech with Multinomial Diffusion, training code and models.","archived":false,"fork":false,"pushed_at":"2023-09-27T10:40:42.000Z","size":182,"stargazers_count":66,"open_issues_count":0,"forks_count":5,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-06-05T18:47:54.712Z","etag":null,"topics":["asr","binomial-distribution","diffusion","discrete-diffusion","pytorch","speech-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RF5.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-13T09:16:25.000Z","updated_at":"2024-08-22T15:34:18.152Z","dependencies_parsed_at":"2023-09-27T13:45:36.864Z","dependency_job_id":null,"html_url":"https://github.com/RF5/transfusion-asr","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RF5%2Ftransfusion-asr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RF5%2Ftransfusion-asr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RF5%2Ftransfusion-asr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RF5%2Ftransfusion-asr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/RF5","download_url":"https://codeload.github.com/RF5/transfusion-asr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237982270,"owners_count":19397230,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asr","binomial-distribution","diffusion","discrete-diffusion","pytorch","speech-recognition"],"created_at":"2024-09-24T13:16:23.170Z","updated_at":"2025-10-24T14:31:03.502Z","avatar_url":"https://github.com/RF5.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TransFusion: Transcribing Speech with Multinomial Diffusion\n\nThe official code repo! This repo contains code for training, inference, and scoring of TransFusion ASR models from our paper, \"_TransFusion: Transcribing Speech with Multinomial Diffusion_\".\nThe trained checkpoints are available under the ['Releases' tab](https://github.com/RF5/transfusion-asr/releases), although the quickstart below will download them for you. Hope you find this useful!\n\nLinks:\n\n- arXiv: [https://arxiv.org/abs/2210.07677](https://arxiv.org/abs/2210.07677)\n- SACAIR 2022 proceedings: https://2022.sacair.org.za/proceedings/\n\n![TransFusion architecture](./TransFusion.png)\n\nFigure: the TransFusion diagram showing both training and inference, as given in the paper. 
\n\nAuthors:\n\n- [Matthew Baas](https://rf5.github.io/)*\n- [Kevin Eloff](https://kevineloff.github.io/)*\n- [Herman Kamper](https://www.kamperh.com/)\n\n*equal contribution\n\n---\n\n## Quickstart\n\nWe use torch hub to make model loading very easy -- no cloning of the repo needed!\nThe steps to perform ASR inference with the trained checkpoint are simple:\n\n1. **Install pip dependencies**: ensure `torch`, `torchaudio`, `numpy`, `omegaconf`, `fairseq`, `fastprogress`, `jiwer`, and `pandas` are installed (for full training dependencies see `requirements.txt`). Make sure you are using **Python 3.10 or above**, since this repo uses certain features introduced in Python 3.10.\n2. **Load models**: load the trained TransFusion model and frozen WavLM encoder:\n  ```python\n  import torch\n  import torchaudio\n\n  device = 'cpu' # or 'cuda' if you have enough GPU memory.\n  wavlm = torch.hub.load('RF5/transfusion-asr', 'wavlm_large', device=device)\n  transfusion = torch.hub.load('RF5/transfusion-asr', 'transfusion_small_462k', device=device)\n  ```\n3. **Compute WavLM features**: load a 16kHz waveform and compute the WavLM features:\n\n  ```python\n  path = '\u003cpath to arbitrary 16kHz waveform\u003e.wav'\n  x, sr = torchaudio.load(path)\n  assert sr == 16000\n  # get weighted WavLM features:\n  features = wavlm.extract_transfusion_features(x.to(device), wavlm) # (seq_len, dim)\n  ```\n4. 
**Predict transcript**: Perform multinomial diffusion using all the additional techniques from the paper:\n\n  ```python\n  pred_inds, pred_text = transfusion.perform_simple_inference(\n      transfusion, # pass in model to use in diffusion\n      features[None],  # add batch dimension to features\n      transfusion.diffuser, # diffuser containing diffusion parameters\n      transfusion.vocab, # vocab for converting indices to text / text to indices\n      transfusion.cfg # model/diffusion config dict\n  )\n  print(pred_text)\n  # prints out the predicted transcript of your utterance!\n  ```\n\nThat's it, trivial!\nYou can modify the diffusion parameters using the `DSH` class in `transfusion/score.py` and in the diffuser config. By default it uses the best settings found in the paper. \n\n\n## Checkpoints\n\nUnder the releases tab of this repo we provide two checkpoints:\n\n- The frozen WavLM encoder taken from the original WavLM authors, which we host here for convenience and torch hub integration.\n- The best TransFusion model presented in the paper, i.e. the model trained for 462k updates. \n\nPerformance on the LibriSpeech test sets is summarized below:\n\n| checkpoint | Params (M)| LS test-clean WER (%) | LS test-other WER (%) |\n| ----------- | :----: | :-----------: | :----: | \n| `transfusion_small_462k`   | 253 | 6.7 | 8.8 | \n\n## Training\n\nFor training you must also install [`deepspeed`](https://www.deepspeed.ai/).\n\n### Preparing data\n\nBefore training, one needs to prepare the data. The steps to do that for the LibriSpeech dataset are:\n\n1. First download and extract the [LibriSpeech](http://www.openslr.org/12) dataset. \n\n2. 
Then extract the WavLM features with the `extract.py` script:\n\n  ```\n  usage: python -m wavlm.extract [--librispeech_path PATH/TO/LIBRISPEECH] [--ckpt_path PATH/TO/WAVLM_LARGE_CKPT] [--out_path PATH/TO/FEAT]\n\n  required arguments:\n      --librispeech_path          root path of librispeech dataset\n      --out_path                  target directory to save WavLM features into\n      --ckpt_path                 path to pretrained WavLM checkpoint\n\n  optional arguments:\n      --seed \n      --device                    \n  ```\n\n3. Split data into train, validation, and test splits using the `split_data.py` script:\n\n  ```\n  usage: split_data.py --librispeech_path LIBRISPEECH_PATH --ls_wavlm_path LS_WAVLM_PATH [--include_test]\n\n  Generate train \u0026 valid csvs from dataset directories\n\n  options:\n    --librispeech_path LIBRISPEECH_PATH\n                          path to root of librispeech dataset\n    --ls_wavlm_path LS_WAVLM_PATH\n                          path to root of WavLM features extracted using extract.py\n    --include_test        include processing and saving test.csv for test subsets\n  ```\n  \n  Running this will save the train/valid/test csv files and a vocabulary dict as `vocab.pt` into a `./splits/` folder.\n\nNow you are ready to get training!\n\n### Training\n\nThe training, model, and distributed computing configuration is specified in `transfusion/config.py`, `deepspeed_cfg.json`, and `train.py`.\nTo train the model according to the paper specification, run `train.py` with the following deepspeed command:\n\n```\ndeepspeed --num_nodes 1 train.py train_csv=splits/train.csv valid_csv=splits/valid.csv  checkpoint_path=runs/pog-debug/ vocab_path=splits/vocab.pt batch_size=12  --deepspeed --deepspeed_config=deepspeed_cfg.json validation_interval=20000 checkpoint_interval=20000\n```\n\nThat's it! 
Now both logs and checkpoints will be saved into the `checkpoint_path` and the `output_path` specified in `deepspeed_cfg.json`.\n\nYou can get a detailed score of a trained checkpoint using the `transfusion/score.py` script (see its help message for usage), which is what is used to perform the final Librispeech evaluations. It contains all the special decoding strategies introduced in the paper as well as the main decoding hyperparameters.\n\n### Repository structure:\n\nThe repository is organized as follows:\n\n\n```\n├── transfusion\n│   ├── config.py                   # hyperparameters\n│   ├── dataset.py                  # data loading and processing\n│   ├── diffusion.py                # diffusion helper functions\n│   ├── eval.py                     # logging and evaluation metrics\n│   ├── model.py                    # model definition\n│   ├── score.py                    # evaluation function\n│   ├── utils.py                    # training helper functions\n│   └── wavlm_modules.py            # wavlm model modules (from original WavLM repo)\n├── wavlm\n│   ├── extract.py                  # wavlm feature extraction script\n│   ├── modules.py                  # wavlm helper functions (from original WavLM repo)\n│   └── WavLM.py                    # wavlm modules (from original WavLM repo)\n├── deepspeed_cfg.json              # deepspeed config\n├── hubconf.py                      # torchhub integration\n├── README.md\n├── requirements.txt\n├── split_data.py                   # splits data into train/valid/test subsets\n├── train.py                        # main training script\n└── TransFusion.png                 # TransFusion model\n```\n\n\n## Acknowledgements\n\nParts of code for this project are adapted from the following repositories -- please make sure to check them out! 
Thank you to the authors of:\n\n- https://github.com/andreas128/RePaint\n- https://github.com/ehoogeboom/multinomial_diffusion\n- https://github.com/microsoft/unilm/tree/master/wavlm\n\n\u003c!-- All experiments were performed on Stellenbosch University's High Performance Computing (HPC) cluster. --\u003e\n\n## Citation\n\n\n```bibtex\n@inproceedings{baas2022transfusion,\n  title={TransFusion: Transcribing Speech with Multinomial Diffusion},\n  author={Baas, Matthew and Eloff, Kevin and Kamper, Herman},\n  booktitle={SACAIR},\n  year=2022\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frf5%2Ftransfusion-asr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frf5%2Ftransfusion-asr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frf5%2Ftransfusion-asr/lists"}