{"id":17157544,"url":"https://github.com/andi611/mockingjay-speech-representation","last_synced_at":"2025-07-17T06:04:06.914Z","repository":{"id":35779466,"uuid":"270729896","full_name":"andi611/Mockingjay-Speech-Representation","owner":"andi611","description":"Official Implementation of Mockingjay in Pytorch","archived":false,"fork":false,"pushed_at":"2023-07-06T21:29:03.000Z","size":1640,"stargazers_count":54,"open_issues_count":4,"forks_count":12,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-13T13:38:47.168Z","etag":null,"topics":["apc","feature-extraction","mockingjay","phone-classification","phoneme-prediction","pytorch","pytorch-implementation","representation-learning","sentiment-classification","speaker-classification","speaker-recognition","speech","speech-representation"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1910.12638","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andi611.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-06-08T15:55:35.000Z","updated_at":"2025-03-13T14:48:07.000Z","dependencies_parsed_at":"2025-04-13T13:43:12.606Z","dependency_job_id":null,"html_url":"https://github.com/andi611/Mockingjay-Speech-Representation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/andi611/Mockingjay-Speech-Representation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andi611%2FMockingjay-Speech-Representation","tags_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/andi611%2FMockingjay-Speech-Representation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andi611%2FMockingjay-Speech-Representation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andi611%2FMockingjay-Speech-Representation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andi611","download_url":"https://codeload.github.com/andi611/Mockingjay-Speech-Representation/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andi611%2FMockingjay-Speech-Representation/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265571061,"owners_count":23790015,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apc","feature-extraction","mockingjay","phone-classification","phoneme-prediction","pytorch","pytorch-implementation","representation-learning","sentiment-classification","speaker-classification","speaker-recognition","speech","speech-representation"],"created_at":"2024-10-14T22:09:13.687Z","updated_at":"2025-07-17T06:04:06.882Z","avatar_url":"https://github.com/andi611.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🦜 Mockingjay\n---------------------------------------------------------\n## IMPORTANT:\n- **This repo is a legacy version from when the Mockingjay paper was first released.**\n- **For our improved and maintained implementation of Mockingjay, please visit the [S3PRL 
project](https://github.com/s3prl/s3prl).**\n---------------------------------------------------------\n\n### Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders\n[![Bitbucket open issues](https://img.shields.io/bitbucket/issues/andi611/Mockingjay-Speech-Representation)](https://github.com/andi611/Mockingjay-Speech-Representation/issues)\n[![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![GitHub](https://img.shields.io/github/license/andi611/Mockingjay-Speech-Representation)](https://en.wikipedia.org/wiki/MIT_License)\n\nThis is an open source project for Mockingjay, an unsupervised algorithm for learning speech representations introduced in the paper [\"Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders\"](https://arxiv.org/abs/1910.12638).\n\u003cimg src=\"https://github.com/andi611/Mockingjay-Speech-Representation/blob/master/paper/training.png\"\u003e\n\nFeel free to use or modify the code; bug reports and improvement suggestions are appreciated. If you have any questions, please contact f07942089@ntu.edu.tw. If you find this project helpful for your research, please consider citing [this paper](#Citation). Thanks!\n\n# Highlight\n## Pre-trained Models\nYou can find pre-trained models here:\n\n **[http://bit.ly/result_mockingjay](http://bit.ly/result_mockingjay)**\n\n Their usage is explained below and further in [Step 3 of the Instructions section](#Instructions).\n\n## Extracting Speech Representations\nWith this repo and the trained models, you can extract speech representations from your target dataset. 
To do so, run a forward pass of the trained model on the target dataset and retrieve the extracted features, as in the following example Python code ([example_extract.py](example_extract.py)):\n```python\nimport torch\nfrom runner_mockingjay import get_mockingjay_model\n\nexample_path = 'result/result_mockingjay/mockingjay_libri_sd1337_LinearLarge/mockingjay-500000.ckpt'\nmockingjay = get_mockingjay_model(from_path=example_path)\n\n# A batch of spectrograms: (batch_size, seq_len, mel_dim)\nspec = torch.zeros(3, 800, 160)\n\n# reps.shape: (batch_size, num_hidden_layers, seq_len, hidden_size)\nreps = mockingjay.forward(spec=spec, all_layers=True, tile=True)\n\n# reps.shape: (batch_size, num_hidden_layers, seq_len // downsample_rate, hidden_size)\nreps = mockingjay.forward(spec=spec, all_layers=True, tile=False)\n\n# reps.shape: (batch_size, seq_len, hidden_size)\nreps = mockingjay.forward(spec=spec, all_layers=False, tile=True)\n\n# reps.shape: (batch_size, seq_len // downsample_rate, hidden_size)\nreps = mockingjay.forward(spec=spec, all_layers=False, tile=False)\n```\n`spec` is the input spectrogram of the Mockingjay model, where:\n- `spec` must be a PyTorch tensor of shape `(seq_len, mel_dim)` or `(batch_size, seq_len, mel_dim)`.\n- `mel_dim` is the spectrogram feature dimension, `160` by default; see [utility/audio.py](utility/audio.py) for more preprocessing details.\n\n`reps` is a PyTorch tensor of various possible shapes, where:\n- `batch_size` is the inference batch size.\n- `num_hidden_layers` is the transformer encoder depth of the Mockingjay model.\n- `seq_len` is the maximum sequence length in the batch.\n- `downsample_rate` is the number of consecutive feature vectors stacked together to reduce the length of input sequences.\n- `hidden_size` is the dimensionality of the transformer encoder layers.\n\nThe output shape of `reps` is determined by two arguments:\n- `all_layers` is a boolean which controls whether to output all the 
Encoder layers; if `False`, only the hidden states of the last Encoder layer are returned.\n- `tile` is a boolean which controls whether to tile representations to match the input `seq_len` of `spec`.\n\nAs you can see, `reps` is essentially the Transformer Encoder hidden representations in the Mockingjay model. You can think of Mockingjay as a speech version of [BERT](https://arxiv.org/abs/1810.04805) if you are familiar with it.\n\nThere are many ways to incorporate `reps` into your downstream task. One of the easiest ways is to take only the outputs of the last Encoder layer (i.e., `all_layers=False`) as the input features to your downstream model; feel free to explore other mechanisms.\n\n## Fine-tuning with your own downstream SLP tasks\nWith this repo and the trained models, you can fine-tune the pre-trained Mockingjay model on your own dataset and tasks. To do so, take a look at the following example Python code ([example_finetune.py](example_finetune.py)):\n```python\nimport torch\nfrom runner_mockingjay import get_mockingjay_model\nfrom downstream.model import example_classifier\nfrom downstream.solver import get_mockingjay_optimizer\n\n# set up the mockingjay model\nexample_path = 'result/result_mockingjay/mockingjay_libri_sd1337_MelBase/mockingjay-500000.ckpt'\nsolver = get_mockingjay_model(from_path=example_path)\n\n# set up your downstream classifier model\n# features extracted from the MelBase model have dimension 768\nclassifier = example_classifier(input_dim=768, hidden_dim=128, class_num=2).cuda()\n\n# construct the Mockingjay optimizer\nparams = list(solver.mockingjay.named_parameters()) + list(classifier.named_parameters())\noptimizer = get_mockingjay_optimizer(params=params, lr=4e-3, warmup_proportion=0.7, training_steps=50000)\n\n# forward\nexample_inputs = torch.zeros(3, 800, 160) # A batch of spectrograms: (batch_size, seq_len, mel_dim)\nreps = solver.forward_fine_tune(spec=example_inputs) # returns: (batch_size, seq_len, hidden_size)\nloss = classifier(reps, 
torch.LongTensor([0, 1, 0]).cuda())\n\n# update\nloss.backward()\noptimizer.step()\n\n# save\nPATH_TO_SAVE_YOUR_MODEL = 'example.ckpt'\nstates = {'Classifier': classifier.state_dict(), 'Mockingjay': solver.mockingjay.state_dict()}\ntorch.save(states, PATH_TO_SAVE_YOUR_MODEL)\n```\n\n# Requirements\n\n- Python 3\n- PyTorch 1.3.0 or above\n- Computing power (a high-end GPU) and memory (both system RAM and GPU RAM) are **extremely important** if you'd like to train your own model.\n- Required packages and their uses are listed below, and also in [requirements.txt](requirements.txt):\n```\neditdistance     # error rate calculation\njoblib           # parallel feature extraction \u0026 decoding\nlibrosa          # feature extraction (for feature extraction only)\npydub            # audio segmentation (for MOSEI dataset preprocessing only)\npandas           # data management\ntensorboardX     # logger \u0026 monitor\ntorch            # model \u0026 learning\ntqdm             # verbosity\nyaml             # config parser\nmatplotlib       # visualization\nipdb             # optional debugger\nnumpy            # array computation\nscipy            # for feature extraction\n```\nThe above packages can be installed with the command:\n```bash\npip install -r requirements.txt\n```\nBelow we list packages that need special attention; we recommend installing them manually:\n```\napex             # non-essential, faster optimization (only needed if enabled in config)\nsentencepiece    # sub-word unit encoding (for feature extraction only, see https://github.com/google/sentencepiece#build-and-install-sentencepiece for installation instructions)\n```\n\n# Instructions\n\n***Before you start, make sure all the required packages listed above are installed correctly.***\n\n### Step 0. 
Preprocessing - Acoustic Feature Extraction \u0026 Text Encoding\n\nSee the [Preprocess wiki page](https://github.com/andi611/Mockingjay-Speech-Representation/wiki/Mockingjay-Preprocessing-Instructions) for preprocessing instructions.\n\n### Step 1. Configuring - Model Design \u0026 Hyperparameter Setup\n\nAll the parameters related to training/decoding are stored in a yaml file. Hyperparameter tuning and large-scale experiments can be managed easily this way. See [config files](config/) for the exact format and examples.\n\n### Step 2. Training the Mockingjay Model for Speech Representation Learning\n\nOnce the config file is ready, run the following command to train unsupervised end-to-end Mockingjay:\n```bash\npython3 runner_mockingjay.py --train\n```\nAll settings are parsed from the config file automatically to start training; the training log can be viewed in TensorBoard.\n\n### Step 3. Using Pre-trained Models on Downstream Tasks\n\nOnce a Mockingjay model has been trained, the generated representations can be used on downstream tasks.\nSee the [Experiment section](#Experiments) for reproducing the downstream task results reported in our paper, and see the [Highlight section](#Highlight) for incorporating the extracted representations into your own downstream task.\n\nPre-trained models and their configs can be downloaded from [HERE](http://bit.ly/result_mockingjay).\nTo load with the default path, place the models under the directory given by `--ckpdir=./result_mockingjay/` and specify the model file name manually with `--ckpt=`.\n\n### Step 4. Loading Pre-trained Models and Visualizing\nRun the following command to visualize the model-generated samples:\n```bash\n# visualize hidden representations\npython3 runner_mockingjay.py --plot\n# visualize spectrogram\npython3 runner_mockingjay.py --plot --with_head\n```\nNote that the arguments ```--ckpdir=XXX --ckpt=XXX``` need to be set correctly for the above commands to run properly.\n\n### Step 5. 
Monitor Training Log\n```bash\n# open TensorBoard to see log\ntensorboard --logdir=log/log_mockingjay/mockingjay_libri_sd1337/\n# or\npython3 -m tensorboard.main --logdir=log/log_mockingjay/mockingjay_libri_sd1337/\n```\n\n## Experiments\n\n### Application on downstream tasks\nSee the instructions on the [Downstream wiki page](https://github.com/andi611/Mockingjay-Speech-Representation/wiki/Downstream-Task-Instructions) to reproduce our experiments.\n\n### Comparing with APC\nSee the instructions on the [APC wiki page](https://github.com/andi611/Mockingjay-Speech-Representation/wiki/Reproducing-APC-to-compare-with-Mockingjay) to reproduce our experiments.\n\n\n# Reference\n1. [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/), McAuliffe et al.\n2. [CMU MultimodalSDK](https://github.com/A2Zadeh/CMU-MultimodalSDK/blob/master/README.md), Amir Zadeh.\n3. [PyTorch Transformers](https://github.com/huggingface/pytorch-transformers), Hugging Face.\n4. [Autoregressive Predictive Coding](https://arxiv.org/abs/1904.03240), Yu-An Chung.\n5. [End-to-end ASR Pytorch](https://github.com/Alexander-H-Liu/End-to-end-ASR-Pytorch), Alexander-H-Liu.\n6. [Tacotron Preprocessing](https://github.com/r9y9/tacotron_pytorch), Ryuichi Yamamoto (r9y9).\n\n## Citation\n```\n@article{Liu_2020,\n   title={Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders},\n   ISBN={9781509066315},\n   url={http://dx.doi.org/10.1109/ICASSP40776.2020.9054458},\n   DOI={10.1109/icassp40776.2020.9054458},\n   journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n   publisher={IEEE},\n   author={Liu, Andy T. 
and Yang, Shu-wen and Chi, Po-Han and Hsu, Po-chun and Lee, Hung-yi},\n   year={2020},\n   month={May}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandi611%2Fmockingjay-speech-representation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandi611%2Fmockingjay-speech-representation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandi611%2Fmockingjay-speech-representation/lists"}