{"id":23891575,"url":"https://github.com/mush42/optispeech","last_synced_at":"2025-04-05T21:06:15.896Z","repository":{"id":247790417,"uuid":"825688593","full_name":"mush42/optispeech","owner":"mush42","description":"A lightweight end-to-end text-to-speech model","archived":false,"fork":false,"pushed_at":"2025-02-23T21:12:15.000Z","size":23100,"stargazers_count":110,"open_issues_count":10,"forks_count":13,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-29T20:03:03.223Z","etag":null,"topics":["convnext","e2e","efficient","lightweight","pytorch","tts"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mush42.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-08T10:14:00.000Z","updated_at":"2025-03-09T16:32:31.000Z","dependencies_parsed_at":"2024-08-27T22:04:06.031Z","dependency_job_id":"11b8a234-d21d-4350-b8bf-c28e3e25b539","html_url":"https://github.com/mush42/optispeech","commit_stats":null,"previous_names":["mush42/optispeech"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mush42%2Foptispeech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mush42%2Foptispeech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mush42%2Foptispeech/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mush42%2Foptispeech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/mush42","download_url":"https://codeload.github.com/mush42/optispeech/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247399871,"owners_count":20932876,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convnext","e2e","efficient","lightweight","pytorch","tts"],"created_at":"2025-01-04T12:34:23.808Z","updated_at":"2025-04-05T21:06:15.862Z","avatar_url":"https://github.com/mush42.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n[![python](https://img.shields.io/badge/-Python_3.10-blue?logo=python\u0026logoColor=white)](https://www.python.org/downloads/release/python-3100/)\n[![pytorch](https://img.shields.io/badge/PyTorch_2.0+-ee4c2c?logo=pytorch\u0026logoColor=white)](https://pytorch.org/get-started/locally/)\n[![lightning](https://img.shields.io/badge/-Lightning_2.0+-792ee5?logo=pytorchlightning\u0026logoColor=white)](https://pytorchlightning.ai/)\n[![hydra](https://img.shields.io/badge/Config-Hydra_1.3-89b8cd)](https://hydra.cc/)\n[![black](https://img.shields.io/badge/Code%20Style-Black-black.svg?labelColor=gray)](https://black.readthedocs.io/en/stable/)\n[![isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n# OptiSpeech: Lightweight End-to-End text-to-speech model\n\n**OptiSpeech** is meant to be an **efficient**, **lightweight** and **fast** text-to-speech model for **on-device** 
text-to-speech.\n\nI would like to thank [Pneuma Solutions](https://pneumasolutions.com/) for providing GPU resources for training this model. Their support significantly accelerated my development process.\n\u003c/div\u003e\n\n## Audio sample\n\n\nhttps://github.com/user-attachments/assets/e5001404-100f-4453-b979-8ea7d4b44659\n\n\nhttps://github.com/user-attachments/assets/7a0d7ff8-a02c-4e8a-a38f-3b083c7c28d4\n\n\nNote that this is still WIP. Final model design decisions are still being made.\n\n## Installation\n\n### Inference-only\n\nIf you want an inference-only, minimal-dependency package that doesn't require `pytorch`, you can use [ospeech](https://github.com/mush42/optispeech/tree/main/ospeech/).\n\n### Training and development\n\nWe use [uv](https://github.com/astral-sh/uv) to manage the Python runtime and dependencies.\n\n[Install `uv` first](https://docs.astral.sh/uv/getting-started/installation/), then run the following:\n\n```bash\n$ git clone https://github.com/mush42/optispeech\n$ cd optispeech\n$ uv sync\n```\n\n## Inference\n\n### Command line API\n\n```bash\n$ python3 -m optispeech.infer --help\nusage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]\n                checkpoint text output_dir\n\nSpeaking text using OptiSpeech\n\npositional arguments:\n  checkpoint           Path to OptiSpeech checkpoint\n  text                 Text to synthesise\n  output_dir           Directory to write generated audio to.\n\noptions:\n  -h, --help           show this help message and exit\n  --d-factor D_FACTOR  Scale to control speech rate\n  --p-factor P_FACTOR  Scale to control pitch\n  --e-factor E_FACTOR  Scale to control energy\n  --cuda               Use GPU for inference\n```\n\n### Python API\n\n```python\nimport soundfile as sf\nimport torch\nfrom optispeech.model import OptiSpeech\n\n# Load model\ndevice = torch.device(\"cpu\")\nckpt_path = \"/path/to/checkpoint\"\nmodel = OptiSpeech.load_from_checkpoint(ckpt_path, 
map_location=\"cpu\")\nmodel = model.to(device)\nmodel = model.eval()\n\n# Text preprocessing and phonemization\nsentence = \"A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets resulting in a spectrum of light appearing in the sky.\"\ninference_inputs = model.prepare_input(sentence)\ninference_outputs = model.synthesize(inference_inputs)\n\ninference_outputs = inference_outputs.as_numpy()\nwav = inference_outputs.wav\nsf.write(\"output.wav\", wav.squeeze(), model.sample_rate)\n```\n\n## Training\n\nSince this code uses [Lightning-Hydra-Template](https://github.com/ashleve/lightning-hydra-template), you have all the powers that come with it.\n\nTraining is as easy as 1, 2, 3:\n\n### 1. Prepare Dataset\n\nGiven a dataset that is organized as follows:\n\n```bash\n├── train\n│   ├── metadata.csv\n│   └── wav\n│       ├── aud-00001-0003.wav\n│       └── ...\n└── val\n    ├── metadata.csv\n    └── wav\n        ├── aud-00764.wav\n        └── ...\n```\n\nThe `metadata.csv` file can contain 2, 3 or 4 columns delimited by **|** (bar character) in one of the following formats:\n\n- 2 columns: file_id|text\n- 3 columns: file_id|speaker_id|text\n- 4 columns: file_id|speaker_id|language_id|text\n\nUse the `preprocess_dataset` script to prepare the dataset for training:\n\n```bash\n$ python3 -m optispeech.tools.preprocess_dataset --help\nusage: preprocess_dataset.py [-h] [--format {ljspeech}] dataset input_dir output_dir\n\npositional arguments:\n  dataset              dataset config relative to `configs/data/` (without the suffix)\n  input_dir            original data directory\n  output_dir           Output directory to write datafiles + train.txt and val.txt\n\noptions:\n  -h, --help           show this help message and exit\n  --format {ljspeech}  Dataset format.\n```\n\nIf you are training on a new dataset, you must calculate and add **data_statistics** using the following script:\n\n```bash\n$ python3 -m 
optispeech.tools.generate_data_statistics --help\nusage: generate_data_statistics.py [-h] [-b BATCH_SIZE] [-f] [-o OUTPUT_DIR] input_config\n\npositional arguments:\n  input_config          The name of the yaml config file under configs/data\n\noptions:\n  -h, --help            show this help message and exit\n  -b BATCH_SIZE, --batch-size BATCH_SIZE\n                        Can have increased batch size for faster computation\n  -f, --force           force overwrite the file\n  -o OUTPUT_DIR, --output-dir OUTPUT_DIR\n                        Output directory to save the data statistics\n```\n\n### 2. [Optional] Choose your backbone\n\n**OptiSpeech** provides interchangeable backbones for the model's **encoder** and **decoder**. You can choose the backbone based on your target performance profile.\n\nTo help you choose, here's a quick computational-complexity analysis of the available backbones:\n\n| Backbone | Config File | FLOPs | MACs | #Params |\n| ---------- | ---------- | ---------- | ---------- | ---------- |\n| ConvNeXt | `optispeech.yaml` | 10.57 GFLOPs | 5.27 GMACs | 15.89 M |\n| Light | `light.yaml` | 7.88 GFLOPs | 3.93 GMACs | 10.74 M |\n| Transformer | `transformer.yaml` | 14.15 GFLOPs | 7.06 GMACs | 17.98 M |\n| Conformer | `conformer.yaml` | 20.42 GFLOPs | 10.19 GMACs | 24.35 M |\n\nThe default backbone is `ConvNeXt`. To change it, edit your experiment config.\n\n### 3. Start training\n\nTo start training, run the following command. Note that this training run uses the **config** from [hfc_female-en_US](./configs/experiment/hfc_female-en_US.yaml). 
You can copy and update it with your own config values, and pass the name of the custom config file (without extension) instead.\n\n```bash\n$ python3 -m optispeech.train experiment=hfc_female-en_US\n```\n\n## ONNX support\n\n### ONNX export\n\n```bash\n$ python3 -m optispeech.onnx.export --help\nusage: export.py [-h] [--opset OPSET] [--seed SEED] checkpoint_path output\n\nExport OptiSpeech checkpoints to ONNX\n\npositional arguments:\n  checkpoint_path  Path to the model checkpoint\n  output           Path to output `.onnx` file\n\noptions:\n  -h, --help       show this help message and exit\n  --opset OPSET    ONNX opset version to use (default 15\n  --seed SEED      Random seed\n```\n\n### ONNX inference\n\n```bash\n$ python3 -m optispeech.onnx.infer --help\nusage: infer.py [-h] [--d-factor D_FACTOR] [--p-factor P_FACTOR] [--e-factor E_FACTOR] [--cuda]\n                onnx_path text output_dir\n\nONNX inference of OptiSpeech\n\npositional arguments:\n  onnx_path            Path to the exported LeanSpeech ONNX model\n  text                 Text to speak\n  output_dir           Directory to write generated audio to.\n\noptions:\n  -h, --help           show this help message and exit\n  --d-factor D_FACTOR  Scale to control speech rate.\n  --p-factor P_FACTOR  Scale to control pitch.\n  --e-factor E_FACTOR  Scale to control energy.\n  --cuda               Use GPU for inference\n```\n\n## Acknowledgements\n\nRepositories I would like to acknowledge:\n\n- [BetterFastspeech2](https://github.com/shivammehta25/betterfastspeech2): for the repo backbone\n- [LightSpeech](https://github.com/microsoft/NeuralSpeech/tree/master/LightSpeech): for the transformer backbone\n- [JETS](https://github.com/espnet/espnet/tree/master/espnet2/gan_tts/jets): for the phoneme-mel alignment framework\n- [Vocos](https://github.com/gemelo-ai/vocos/): for pioneering the use of ConvNeXt in TTS\n- [Piper-TTS](https://github.com/rhasspy/piper): for leading the charge in on-device TTS. 
Also for the great phonemizer\n\n## Reference\n\n```\n@inproceedings{luo2021lightspeech,\n    title={Lightspeech: Lightweight and fast text to speech with neural architecture search},\n    author={Luo, Renqian and Tan, Xu and Wang, Rui and Qin, Tao and Li, Jinzhu and Zhao, Sheng and Chen, Enhong and Liu, Tie-Yan},\n    booktitle={ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n    pages={5699--5703},\n    year={2021},\n    organization={IEEE}\n}\n\n@article{siuzdak2023vocos,\n  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},\n  author={Siuzdak, Hubert},\n  journal={arXiv preprint arXiv:2306.00814},\n  year={2023}\n}\n\n@INPROCEEDINGS{10446890,\n  author={Okamoto, Takuma and Ohtani, Yamato and Toda, Tomoki and Kawai, Hisashi},\n  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n  title={Convnext-TTS And Convnext-VC: Convnext-Based Fast End-To-End Sequence-To-Sequence Text-To-Speech And Voice Conversion},\n  year={2024},\n  volume={},\n  number={},\n  pages={12456-12460},\n  keywords={Vocoders;Neural networks;Signal processing;Transformers;Real-time systems;Acoustics;Decoding;ConvNeXt;JETS;text-to-speech;voice conversion;WaveNeXt},\n  doi={10.1109/ICASSP48485.2024.10446890}\n}\n```\n\n## Licence\n\nCopyright (c) Musharraf Omer. MIT Licence. See [LICENSE](./LICENSE) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmush42%2Foptispeech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmush42%2Foptispeech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmush42%2Foptispeech/lists"}