{"id":13586174,"url":"https://github.com/israelg99/deepvoice","last_synced_at":"2025-04-07T14:33:50.214Z","repository":{"id":65804135,"uuid":"83678024","full_name":"israelg99/deepvoice","owner":"israelg99","description":"Deep Voice: Real-time Neural Text-to-Speech","archived":false,"fork":false,"pushed_at":"2017-03-21T18:46:59.000Z","size":43,"stargazers_count":354,"open_issues_count":6,"forks_count":94,"subscribers_count":43,"default_branch":"master","last_synced_at":"2024-08-02T16:02:53.165Z","etag":null,"topics":["deep-learning","keras","machine-learning","phonemes","voice"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/israelg99.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-03-02T12:58:46.000Z","updated_at":"2024-06-18T05:56:17.000Z","dependencies_parsed_at":"2023-02-11T02:25:10.504Z","dependency_job_id":null,"html_url":"https://github.com/israelg99/deepvoice","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/israelg99%2Fdeepvoice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/israelg99%2Fdeepvoice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/israelg99%2Fdeepvoice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/israelg99%2Fdeepvoice/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/israelg99","download_url":"https://codeload.github.com/israelg99/deepvoice/tar.gz/refs/heads/master","host":{"name":"
GitHub","url":"https://github.com","kind":"github","repositories_count":223285143,"owners_count":17119843,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","keras","machine-learning","phonemes","voice"],"created_at":"2024-08-01T15:05:22.144Z","updated_at":"2024-11-06T04:31:08.877Z","avatar_url":"https://github.com/israelg99.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Deep Voice\n\n[![Join the chat at https://gitter.im/deep-voice/Lobby](https://badges.gitter.im/deep-voice/Lobby.svg)](https://gitter.im/deep-voice/Lobby?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)  \n*Based on the [Deep Voice paper](https://arxiv.org/pdf/1702.07825.pdf).*\n\nThis repository depends [on my Keras fork](https://github.com/israelg99/keras/tree/master) until it is merged with the official [Keras](https://github.com/fchollet/keras/tree/master) repository.  
\nTo install: `pip3 install git+https://github.com/israelg99/keras.git`  \n**This will override your previously installed Keras version.**  \n\nDeep Voice is a text-to-speech system based entirely on deep neural networks.\n\nDeep Voice comprises five models:\n\n- Grapheme-to-phoneme converter.\n- Phoneme Segmentation.\n- Phoneme duration predictor.\n- Frequency predictor.\n- Audio synthesis.\n\n## Grapheme-to-phoneme\n##### Abstract\nThe grapheme-to-phoneme converter converts written text (e.g., English characters) to phonemes (encoded using a phonemic alphabet such as ARPABET).\n\n#### Architecture\nBased on [this architecture](https://arxiv.org/pdf/1506.00196.pdf) but with some changes.\n\nThe Grapheme-to-phoneme converter is an encoder-decoder:\n\n- **Encoder**: multi-layer, bidirectional encoder, with a gated recurrent unit (GRU) nonlinearity.  \n- **Decoder**: identical to the encoder but unidirectional.\n\nIt takes written text as input.\n\n#### Setup\n- **Initialization**: every decoder layer is initialized to the final hidden state of the corresponding encoder forward layer.  \n- **Training**: the architecture is trained with teacher forcing.  \n- **Decoding**: performed using beam search.\n\n#### Hyperparameters\n- **Encoder**: 3 bidirectional layers with 1024 units each.  \n- **Decoder**: 3 unidirectional layers of the same size as the encoder.  \n- **Beam Search**: width of 5 candidates.  \n- **Dropout**: 0.95 rate after each recurrent layer.\n\n## Phoneme Segmentation\n##### Abstract\n- The phoneme segmentation model locates phoneme boundaries in the voice dataset.  \n- Given an audio file and a phoneme-by-phoneme transcription of the audio, the segmentation model identifies where in the audio each phoneme begins and ends.  \n- The phoneme segmentation model is trained to output the alignment between a given utterance and a sequence of target phonemes. 
This task is similar to the problem of aligning speech to written output in speech recognition.\n\n### Architecture\nThe segmentation model uses the convolutional recurrent neural network based on [Deep Speech 2](https://arxiv.org/pdf/1512.02595.pdf).\n\n***The architecture graph***\n\n1. Audio vector.\n2. 20 MFCCs with 10 ms stride.\n3. Double 2D convolutions (frequency bins * time).\n4. Triple bidirectional recurrent GRUs.\n5. Softmax.\n6. Output sequence of pairs.\n\n### Hyperparameters\n***Convolutions***\n- **Stride**: (9, 5).\n- **Dropout**: 0.95 rate after last convolution.\n\n***Recurrent layers***\n- **Dimensionality**: 512 GRU cells for each direction.\n- **Dropout**: 0.95 rate after the last recurrent layer.\n\n### Training\nThe segmentation model uses the [connectionist temporal classification (CTC) loss](ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf).\n\n## Phoneme Duration + Frequency Predictor\n#### Abstract\nA single architecture is used to jointly predict phoneme duration and time-dependent fundamental frequency.\n\n#### *Phoneme Duration* Abstract\nThe phoneme duration predictor predicts the temporal duration of every phoneme in a phoneme sequence (an utterance).\n\n#### *Frequency Predictor* Abstract\nThe frequency predictor predicts whether a phoneme is voiced. If it is, the model predicts the fundamental frequency (F0) throughout the phoneme’s duration.\n\n### Architecture\n1. A sequence of phonemes with stresses, encoded as one-hot vectors.\n2. Double fully-connected layers.\n3. Double unidirectional recurrent layers.\n4. 
Fully-connected layer.\n\n### Hyperparameters\n**Double fully-connected layers**\n- **Dimensionality**: 256.\n- **Dropout**: 0.8 rate after last layer.\n\n**Double unidirectional recurrent layers**\n- **Dimensionality**: 128 GRUs.\n- **Dropout**: 0.8 rate after last layer.\n\n## Audio Synthesis\n#### Abstract\n* Combines the outputs of the grapheme-to-phoneme, phoneme duration, and frequency predictor models.\n* Synthesizes audio at a high sampling rate, corresponding to the desired text.\n* Uses a WaveNet variant which requires fewer parameters and is faster to train.\n\n### Architecture\nThe architecture is based on [WaveNet](https://arxiv.org/pdf/1609.03499.pdf) but with some changes.\n\n***Will be updated soon.***\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fisraelg99%2Fdeepvoice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fisraelg99%2Fdeepvoice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fisraelg99%2Fdeepvoice/lists"}