https://github.com/nipponjo/tts_arabic

TTS for Arabic (FastPitch, Mixer-TTS) in the ONNX format
https://github.com/nipponjo/tts_arabic

arabic arabic-tts fastpitch hifigan mixer-tts multi-speaker-tts onnx onnxruntime python speech speech-synthesis text-to-speech tts tts-model vocos voice-synthesis

Last synced: 4 months ago
JSON representation

TTS for Arabic (FastPitch, Mixer-TTS) in the ONNX format

Host: GitHub
URL: https://github.com/nipponjo/tts_arabic
Owner: nipponjo
Created: 2024-04-20T20:44:08.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-07-19T12:11:06.000Z (4 months ago)
Last Synced: 2025-07-19T16:54:20.412Z (4 months ago)
Topics: arabic, arabic-tts, fastpitch, hifigan, mixer-tts, multi-speaker-tts, onnx, onnxruntime, python, speech, speech-synthesis, text-to-speech, tts, tts-model, vocos, voice-synthesis
Language: Python
Homepage:
Size: 85.9 KB
Stars: 25
Watchers: 3
Forks: 4
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

          Arabic TTS model (FastPitch, MixerTTS) from the [tts-arabic-pytorch](https://github.com/nipponjo/tts-arabic-pytorch) repo in the ONNX format.

Audio samples can be found [here](https://nipponjo.github.io/tts-arabic-speakers).

Install with:

```

pip install git+https://github.com/nipponjo/tts_arabic.git

```

Examples:

```python

# %%

from tts_arabic import tts

# %%

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."

wave = tts(text, speaker=2, pace=0.9, play=True)

# %% Buckwalter transliteration

text = ">als~alAmu Ealaykum yA Sadiyqiy."

wave = tts(text, speaker=0, play=True)

# %% Unvocalized input

text_unvoc = "القهوة مشروب يعد من بذور البن المحمصة"

wave = tts(text_unvoc, play=True, vowelizer='shakkelha')

```

Pretrained models:

|Model|Model ID|Type|#params|Paper|Output|

|-------|---|---|------|----|----|

|FastPitch|fastpitch|Text->Mel|46.3M|[arxiv](https://arxiv.org/abs/2006.06873)|Mel (80 bins)|

|MixerTTS|mixer128|Text->Mel|2.9M|[arxiv](https://arxiv.org/abs/2110.03584)|Mel (80 bins)|

|MixerTTS|mixer80|Text->Mel|1.5M|[arxiv](https://arxiv.org/abs/2110.03584)|Mel (80 bins)|

|HiFi-GAN|hifigan|Vocoder|13.9M|[arxiv](https://arxiv.org/abs/2010.05646)|Wave (22.05kHz)|

|Vocos|vocos|Vocoder|13.4M|[arxiv](https://arxiv.org/abs/2306.00814)|Wave (22.05kHz)|

|Vocos|vocos44|Vocoder|14.0M|[arxiv](https://arxiv.org/abs/2306.00814)|Wave (44.1kHz)|

The sequence of transformations is as follows:

*Text* → Phonemizer → *Phonemes* → Tokenizer → *Token Ids* → **Text->Mel** model → *Mel spectrogram* → **Vocoder** model → *Wave*

The `Text->Mel` models map token ids to mel frames. All models use the 80 bin configuration proposed by [HiFi-GAN](https://github.com/jik876/hifi-gan). This mel spectrogram contains frequencies up to 8kHz. The `vocoder` models map the mel spectrogram to a waveform. The vocoders with `vocoder_id` `hifigan` and `vocos` artificially extend the bandwidth to 11025Hz, and `vocos44` to 22050Hz. Samples for comparing the models can be found [here](https://nipponjo.github.io/tts-arabic-speakers/#models-cmp).

TTS options:

```python

from tts_arabic import tts

text = "اَلسَّلامُ عَلَيكُم يَا صَدِيقِي."

wave = tts(

    text, # input text

    speaker = 1, # speaker id; choose between 0,1,2,3

    pace = 1, # speaker pace

    denoise = 0.005, # vocoder denoiser strength

    volume = 0.9, # Max amplitude (between 0 and 1)

    play = True, # play audio?

    pitch_mul = 1, # pitch multiplier

    pitch_add = 0, # pitch offset

    vowelizer = None, # vowelizer model

    model_id = 'fastpitch', # Model ID for Text->Mel model

    vocoder_id = 'hifigan', # Model ID for vocoder model

    cuda = None, # Optional; CUDA device index

    save_to = './test.wav', # Optionally; save audio WAV file

    bits_per_sample = 32, # when save_to is specified (8, 16 or 32 bits)

    )

```

Vowelizer models:

|Model|Model ID|Paper|Repo|Architecture|

|-----|--------|---------|----|--|

|CATT|catt_eo|[arxiv](https://arxiv.org/abs/2407.03236)|[github](https://github.com/abjadai/catt)|Transformer Encoder|

|Shakkelha|shakkelha|[arxiv](https://arxiv.org/abs/1911.03531)|[github](https://github.com/AliOsm/shakkelha)|Bi-LSTM|

|Shakkala|shakkala|-|[github](https://github.com/Barqawiz/Shakkala)|Bi-LSTM|

References:

The vocoder `vocos44` was converted from ([patriotyk/vocos-mel-hifigan-compat-44100khz](https://huggingface.co/patriotyk/vocos-mel-hifigan-compat-44100khz)).

The vowelizer `catt_eo` was converted from https://github.com/abjadai/catt/releases/tag/v2 *best_eo_mlm_ns_epoch_193.pt* (License: [Apache-2.0](https://github.com/abjadai/catt?tab=Apache-2.0-1-ov-file#readme))

![DALL·E 2025-03-14 18 56 01 - A surreal digital painting of a camel in a vast desert, with a futuristic speaker embedded in its mouth, symbolizing text-to-speech technology  The ca](https://github.com/user-attachments/assets/bcd31436-1e76-4432-9072-d0695f4d87e0)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/nipponjo/tts_arabic

Awesome Lists containing this project

README