{"id":20214308,"url":"https://github.com/yl4579/pl-bert","last_synced_at":"2025-04-12T20:45:37.396Z","repository":{"id":109947131,"uuid":"592570875","full_name":"yl4579/PL-BERT","owner":"yl4579","description":"Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions","archived":false,"fork":false,"pushed_at":"2025-01-13T23:31:25.000Z","size":3773,"stargazers_count":243,"open_issues_count":13,"forks_count":50,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-04-04T01:04:22.091Z","etag":null,"topics":["bert","bert-model","text-to-speech","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yl4579.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-01-24T02:27:11.000Z","updated_at":"2025-03-24T09:39:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"e6478f3f-e216-44ac-bbe7-b5347a73fa2e","html_url":"https://github.com/yl4579/PL-BERT","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yl4579%2FPL-BERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yl4579%2FPL-BERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yl4579%2FPL-BERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yl4579%2FPL-BERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yl4579","download_url":"https://codelo
ad.github.com/yl4579/PL-BERT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248631687,"owners_count":21136556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-model","text-to-speech","tts"],"created_at":"2024-11-14T06:15:11.324Z","updated_at":"2025-04-12T20:45:37.368Z","avatar_url":"https://github.com/yl4579.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions\n\n### Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani\n\n\u003e Large-scale pre-trained language models have been shown to be helpful in improving the naturalness of text-to-speech (TTS) models by enabling them to produce more naturalistic prosodic patterns. However, these models are usually word-level or sub-phoneme-level and jointly trained with phonemes, making them inefficient for the downstream TTS task where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes along with the regular masked phoneme predictions. 
Subjective evaluations show that our phoneme-level BERT encoder has significantly improved the mean opinion scores (MOS) of rated naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.\n\nPaper: [https://arxiv.org/abs/2301.08810](https://arxiv.org/abs/2301.08810)\n\nAudio samples: [https://pl-bert.github.io/](https://pl-bert.github.io/)\n\n## Pre-requisites\n1. Python \u003e= 3.7\n2. Clone this repository:\n```bash\ngit clone https://github.com/yl4579/PL-BERT.git\ncd PL-BERT\n```\n3. Create a new environment (recommended):\n```bash\nconda create --name BERT python=3.8\nconda activate BERT\npython -m ipykernel install --user --name BERT --display-name \"BERT\"\n```\n4. Install Python requirements:\n```bash\npip install pandas singleton-decorator datasets \"transformers\u003c4.33.3\" accelerate nltk phonemizer sacremoses pebble\n```\n\n## Preprocessing\nPlease refer to the notebook [preprocess.ipynb](https://github.com/yl4579/PL-BERT/blob/main/preprocess.ipynb) for more details. The preprocessing is for the English Wikipedia dataset only. I will make a new branch for Japanese if I have extra time to demonstrate training on other languages. You may also refer to [#6](https://github.com/yl4579/PL-BERT/issues/6#issuecomment-1797869275) for preprocessing in other languages like Japanese.\n\n## Training\nPlease run each cell in the notebook [train.ipynb](https://github.com/yl4579/PL-BERT/blob/main/train.ipynb). You will need to change the line\n`config_path = \"Configs/config.yml\"` in cell 2 if you wish to use a different config file. The training code is in a Jupyter notebook primarily because the initial experiment was conducted in a Jupyter notebook, but you can easily convert it to a Python script if you want to.\n\n## Finetuning\nHere is an example of how to use it for StyleTTS finetuning. You can use it for other TTS models by replacing the text encoder with the pre-trained PL-BERT.\n1. 
Modify line 683 in [models.py](https://github.com/yl4579/StyleTTS/blob/main/models.py#L683) with the following code to load the BERT model into StyleTTS:\n```python\nfrom collections import OrderedDict\n\nfrom transformers import AlbertConfig, AlbertModel\n\nlog_dir = \"YOUR PL-BERT CHECKPOINT PATH\"\nconfig_path = os.path.join(log_dir, \"config.yml\")\nwith open(config_path) as config_file:\n    plbert_config = yaml.safe_load(config_file)\n\nalbert_base_configuration = AlbertConfig(**plbert_config['model_params'])\nbert = AlbertModel(albert_base_configuration)\n\n# pick the latest checkpoint, named like step_N.t7\nckpts = [f for f in os.listdir(log_dir) if f.startswith(\"step_\")]\niters = [int(f.split('_')[-1].split('.')[0]) for f in ckpts if os.path.isfile(os.path.join(log_dir, f))]\nlatest_iter = sorted(iters)[-1]\n\ncheckpoint = torch.load(os.path.join(log_dir, \"step_\" + str(latest_iter) + \".t7\"), map_location='cpu')\nstate_dict = checkpoint['net']\n\n# keep only the encoder weights and strip the wrapper prefixes\nnew_state_dict = OrderedDict()\nfor k, v in state_dict.items():\n    name = k[7:] # remove `module.`\n    if name.startswith('encoder.'):\n        name = name[8:] # remove `encoder.`\n        new_state_dict[name] = v\nbert.load_state_dict(new_state_dict)\n\nnets = Munch(bert=bert,\n             # linear projection to match the hidden size (BERT 768, StyleTTS 512)\n             bert_encoder=nn.Linear(plbert_config['model_params']['hidden_size'], args.hidden_dim),\n             predictor=predictor,\n             decoder=decoder,\n             pitch_extractor=pitch_extractor,\n             text_encoder=text_encoder,\n             style_encoder=style_encoder,\n             text_aligner=text_aligner,\n             discriminator=discriminator)\n```\n2. 
Modify line 126 in [train_second.py](https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L126) with the following code to adjust the learning rate of the BERT model:\n```python\n# for stability\nfor g in optimizer.optimizers['bert'].param_groups:\n    g['betas'] = (0.9, 0.99)\n    g['lr'] = 1e-5\n    g['initial_lr'] = 1e-5\n    g['min_lr'] = 0\n    g['weight_decay'] = 0.01\n```\n3. Modify line 211 in [train_second.py](https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L211) with the following code to replace the text encoder with the BERT encoder:\n```python\n            bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state\n            d_en = model.bert_encoder(bert_dur).transpose(-1, -2)\n            d, _ = model.predictor(d_en, s,\n                                   input_lengths,\n                                   s2s_attn_mono,\n                                   m)\n```\n[line 257](https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L257):\n```python\n            _, p = model.predictor(d_en, s,\n                                   input_lengths,\n                                   s2s_attn_mono,\n                                   m)\n```\nand [line 415](https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L415):\n```python\n                bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state\n                d_en = model.bert_encoder(bert_dur).transpose(-1, -2)\n                d, p = model.predictor(d_en, s,\n                                       input_lengths,\n                                       s2s_attn_mono,\n                                       m)\n```\n\n4. 
Modify line 347 in [train_second.py](https://github.com/yl4579/StyleTTS/blob/main/train_second.py#L347) with the following code to make sure the parameters of the BERT model are updated:\n```python\n            optimizer.step('bert_encoder')\n            optimizer.step('bert')\n```\n\nThe PL-BERT model pre-trained on Wikipedia for 1M steps can be downloaded at: [PL-BERT link](https://github.com/yl4579/StyleTTS2/tree/main/Utils/PLBERT).\n\nThe demo on the LJSpeech dataset along with the pre-modified StyleTTS repo and pre-trained models can be downloaded here: [StyleTTS Link](https://huggingface.co/yl4579/StyleTTS/blob/main/LJSpeech_PLBERT/Models.zip). This zip file contains the code modifications above, the pre-trained PL-BERT model listed above, pre-trained StyleTTS w/ PL-BERT, pre-trained StyleTTS w/o PL-BERT and pre-trained HifiGAN on LJSpeech from the StyleTTS repo.\n\n## References\n- [NVIDIA/NeMo-text-processing](https://github.com/NVIDIA/NeMo-text-processing)\n- [tomaarsen/TTSTextNormalization](https://github.com/tomaarsen/TTSTextNormalization)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyl4579%2Fpl-bert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyl4579%2Fpl-bert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyl4579%2Fpl-bert/lists"}