https://github.com/replicate/cog-hierspeechpp
Cog wrapper for HierSpeech++
- Host: GitHub
- URL: https://github.com/replicate/cog-hierspeechpp
- Owner: replicate
- License: other
- Created: 2023-12-14T13:20:27.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-14T14:03:48.000Z (about 2 years ago)
- Last Synced: 2025-10-01T02:50:38.379Z (4 months ago)
- Language: Python
- Size: 310 KB
- Stars: 2
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Cog wrapper for HierSpeech++
A Cog wrapper for HierSpeech++, a text-to-speech model that can generate speech from text and a target voice for zero-shot speech synthesis. See the original [repository](https://github.com/sh-lee-prml/HierSpeechpp), [paper](https://arxiv.org/abs/2311.12454) and [Replicate demo](https://replicate.com/adirik/hierspeechpp) for details.
## API Usage
You need to have Cog and Docker installed to run this model locally. Follow the [model pushing guide](https://replicate.com/docs/guides/push-a-model) to push your own fork of HierSpeech++ to [Replicate](https://replicate.com).
To use the model, provide the text you want spoken and a sound file of the target voice. Optionally, provide a reference speech clip (.mp3 or .wav) instead of text; its speech content is then used for the output. The API returns an .mp3 file with the generated speech.
To build the Docker image with Cog and run a prediction:
```bash
cog predict -i input_text="This is a zero-shot text to speech model." -i target_voice=@examples/reference_1.wav
```
To start a local HTTP server on port 5000 that accepts prediction requests:
```bash
cog run -p 5000 python -m cog.server.http
```
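Once the server is running, a prediction can be requested over HTTP. The sketch below assumes Cog's standard `POST /predictions` endpoint with an `{"input": ...}` JSON body, with file inputs passed as data URIs. The helper names (`to_data_uri`, `build_request`, `predict`) are illustrative and not part of this repository, and the host and port are assumed to match the command above.

```python
import base64
import json
import mimetypes
import urllib.request
from pathlib import Path


def to_data_uri(path):
    """Encode a local file as a data URI so it can be sent as a file input."""
    mime = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_request(text, voice_path, host="http://localhost:5000"):
    """Build (but do not send) a POST request for the /predictions endpoint."""
    payload = {
        "input": {
            "input_text": text,
            "target_voice": to_data_uri(voice_path),
        }
    }
    return urllib.request.Request(
        f"{host}/predictions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, voice_path, host="http://localhost:5000"):
    """Send the request and return the parsed JSON response.

    Requires the server to be running; the result is expected under "output".
    """
    with urllib.request.urlopen(build_request(text, voice_path, host)) as resp:
        return json.load(resp)


# Example (needs a running server):
# result = predict("This is a zero-shot text to speech model.",
#                  "examples/reference_1.wav")
```

Separating `build_request` from `predict` keeps the payload inspectable without a running server, which helps when debugging input names.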
Input parameters are as follows:
- **input_text:** (optional) text input to the model. If provided, it will be used for the speech content of the output.
- **input_sound:** (optional) sound input to the model. If provided, it will be used for the speech content of the output.
- **target_voice:** a voice clip containing the speaker to synthesize.
- **denoise_ratio:** noise control. 0 means no noise reduction, 1 means maximum noise reduction. If noise reduction is desired, a value between 0.6 and 0.8 is recommended.
- **text_to_vector_temperature:** temperature for the text-to-vector model. Larger values produce slightly more random output.
- **output_sample_rate:** sample rate of the output audio file.
- **scale_output_volume:** scale normalization. If set to true, the output audio will be scaled according to the input sound if provided.
- **seed:** random seed to use for reproducibility.
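As a rough illustration of how these parameters combine, the sketch below assembles a `cog predict` argument list from keyword arguments. It is not part of this repository: the function name `cog_predict_command` is hypothetical, and the checks (at least one of `input_text` or `input_sound`, and `denoise_ratio` within 0 to 1) are assumptions inferred from the descriptions above rather than rules confirmed by the model code. File inputs get the `@` prefix used in the `cog predict` example earlier.

```python
import shlex

# Inputs that refer to files; cog predict expects an "@" prefix for these.
FILE_INPUTS = {"input_sound", "target_voice"}


def cog_predict_command(target_voice, input_text=None, input_sound=None, **options):
    """Return an argv list for `cog predict` built from the given inputs."""
    # Assumption: the speech content must come from text or a reference clip.
    if input_text is None and input_sound is None:
        raise ValueError("provide input_text or input_sound for the speech content")

    # Assumption: denoise_ratio is a fraction between 0 and 1, per the list above.
    ratio = options.get("denoise_ratio")
    if ratio is not None and not 0.0 <= ratio <= 1.0:
        raise ValueError("denoise_ratio must be between 0 and 1")

    inputs = {
        "input_text": input_text,
        "input_sound": input_sound,
        "target_voice": target_voice,
        **options,
    }
    argv = ["cog", "predict"]
    for name, value in inputs.items():
        if value is None:
            continue  # omit unset optional inputs
        if name in FILE_INPUTS:
            value = f"@{value}"
        argv += ["-i", f"{name}={value}"]
    return argv


# Example: print a shell-ready command line.
# print(shlex.join(cog_predict_command("examples/reference_1.wav",
#                                      input_text="Hello", denoise_ratio=0.7, seed=42)))
```

Returning a list rather than a string lets the result be passed directly to `subprocess.run` without extra shell quoting.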
## References
```
@article{Lee2023HierSpeechBT,
title={HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis},
author={Sang-Hoon Lee and Haram Choi and Seung-Bin Kim and Seong-Whan Lee},
journal={ArXiv},
year={2023},
volume={abs/2311.12454},
url={https://api.semanticscholar.org/CorpusID:265308903}
}
```