https://github.com/prithivsakthiur/orpheus-tts-edge

Play with Orpheus TTS, a Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been fine-tuned to deliver human-level speech synthesis 🔥🗣️

gradio gradio-python-llm huggingface-transformers llama llm orpheus tts

Host: GitHub
URL: https://github.com/prithivsakthiur/orpheus-tts-edge
Owner: PRITHIVSAKTHIUR
Created: 2025-03-20T17:59:28.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-03-29T14:07:59.000Z (about 1 year ago)
Last Synced: 2025-03-29T15:21:50.243Z (about 1 year ago)
Topics: gradio, gradio-python-llm, huggingface-transformers, llama, llm, orpheus, tts
Language: Python
Homepage:
Size: 813 KB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# **Orpheus-Edge-TTS-Demo**

https://github.com/user-attachments/assets/1896b9bc-dccf-4180-9e1d-5e658cf4e3e5

Orpheus TTS is a Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. The model has been fine-tuned to deliver human-level speech synthesis.

> [!WARNING]
> Don't forget to add the `HF_TOKEN` to the environment to access the gated Hugging Face models.
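
A minimal sketch of failing fast when the token is missing, so the problem surfaces before any model download starts. The helper name `resolve_hf_token` is hypothetical and not taken from this repository's `app.py`.

```python
import os


def resolve_hf_token(env=os.environ):
    """Return the Hugging Face token from the environment or raise early.

    `resolve_hf_token` is a hypothetical helper for illustration only.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated Hugging Face models cannot be downloaded."
        )
    return token


# The returned value could then be handed to huggingface_hub, for example:
#   from huggingface_hub import login
#   login(token=resolve_hf_token())
```

Passing the mapping as an argument keeps the check easy to test without touching the real environment.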

## Features

### 1. **Multimodal Input Support**
- **Text Input**: Process text-based queries with DeepHermes Llama for natural language understanding.
- **Image Input**: Analyze and describe images using Qwen2-VL.
- **Video Input**: Process videos by extracting key frames and summarizing content.

### 2. **Advanced Text-to-Speech (TTS)**
- **Orpheus TTS**: Generate realistic speech with customizable voices (`tara`, `dan`, `emma`, `josh`).
- **Emotion Support**: Add emotion tags such as `[laugh]`, `[sigh]`, or `[gasp]` (the form used in the examples below) to make the speech more expressive.
- **Direct TTS**: Convert text to speech directly using `@<voice>-tts` (e.g., `@tara-tts`).
- **LLM-Augmented TTS**: Generate a response using DeepHermes Llama and then convert it to speech using `@<voice>-llm` (e.g., `@tara-llm`).

### 3. **Video Processing**
- Use the `@video-infer` command to analyze and summarize video content. The system extracts key frames and processes them with Qwen2-VL.

### 4. **Customizable Parameters**
- Adjust generation parameters like `temperature`, `top-p`, `top-k`, and `repetition penalty` to fine-tune responses.

---

## Usage

### Commands
1. **Direct TTS**:
   - Use `@<voice>-tts` to convert text directly to speech.
   - Example: `@tara-tts Hey, I’m Tara, [laugh] and I’m a speech generation model!`

2. **LLM-Augmented TTS**:
   - Use `@<voice>-llm` to generate a response with DeepHermes Llama and then convert it to speech.
   - Example: `@tara-llm Explain the causes of rainbows.`

3. **Video Processing**:
   - Use `@video-infer` to analyze and summarize video content.
   - Example: `@video-infer Summarize the event in this video.`

4. **Regular Chat**:
   - Input text or upload images/videos for multimodal processing.
   - Example: `Write a Python program for array rotation.`
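
The four command forms above can be told apart with a small prefix parser. This is a sketch under assumptions, not the routing code in `app.py`: the names `Command` and `parse_command` are hypothetical, and falling back to regular chat for an unknown voice is an illustrative choice.

```python
import re
from typing import NamedTuple, Optional

# Voices listed in this README.
VOICES = ("tara", "dan", "emma", "josh")


class Command(NamedTuple):
    mode: str             # "tts", "llm", "video", or "chat"
    voice: Optional[str]  # set only for "tts" and "llm"
    text: str             # the message with the command prefix removed


_VOICE_CMD = re.compile(
    r"^@(?P<voice>[a-z]+)-(?P<mode>tts|llm)\b\s*(?P<text>.*)$",
    re.IGNORECASE | re.DOTALL,
)


def parse_command(message: str) -> Command:
    """Classify a chat message by its leading @-command, if any."""
    stripped = message.strip()
    if stripped.lower().startswith("@video-infer"):
        return Command("video", None, stripped[len("@video-infer"):].strip())
    match = _VOICE_CMD.match(stripped)
    if match and match.group("voice").lower() in VOICES:
        return Command(
            match.group("mode").lower(),
            match.group("voice").lower(),
            match.group("text").strip(),
        )
    # Anything else, including unknown voices, is treated as regular chat.
    return Command("chat", None, stripped)
```

For example, `parse_command("@tara-llm Explain the causes of rainbows.")` yields mode `llm`, voice `tara`, and the prompt text without the prefix.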

---

## Examples

### Text-to-Speech (TTS)
- `@josh-tts Hey! I’m Josh, [gasp] and wow, did I just surprise you with my realistic voice?`
- `@emma-tts Hey, I’m Emma, [sigh] and yes, I can talk just like a person… even when I’m tired.`

### LLM-Augmented TTS
- `@dan-llm Explain the General Relativity theorem in short.`
- `@tara-llm Who is Nikola Tesla, and why did he die?`

### Video Processing
- `@video-infer Summarize the event in this video.`
- `@video-infer Describe the video.`

### Multimodal Input
- `summarize the letter` (with an uploaded image).
- `Explain the causes of rainbows` (with an uploaded video).

---

## Setup

1. **Install Dependencies**:
   Ensure you have the required Python packages installed (the `dotenv` module is published on PyPI as `python-dotenv`):
   ```bash
   pip install torch gradio transformers huggingface-hub snac python-dotenv
   ```

2. **Environment Variables**:
   - Set `MAX_INPUT_TOKEN_LENGTH` in `.env` to control the maximum input token length for the LLM.

3. **Run the Application**:
   ```bash
   python app.py
   ```

4. **Access the Interface**:
   - The Gradio interface will launch locally. Use the provided examples or input your own queries.
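
A sketch of how `MAX_INPUT_TOKEN_LENGTH` from step 2 might be read and applied. The default of 4096 and the choice to keep the most recent tokens are assumptions for illustration; the function names are hypothetical and the actual handling in `app.py` may differ.

```python
import os

# Assumed fallback when the variable is absent; not a value from this repository.
DEFAULT_MAX_INPUT_TOKENS = 4096


def max_input_token_length(env=os.environ) -> int:
    """Read MAX_INPUT_TOKEN_LENGTH as a positive integer."""
    value = int(env.get("MAX_INPUT_TOKEN_LENGTH", DEFAULT_MAX_INPUT_TOKENS))
    if value < 1:
        raise ValueError("MAX_INPUT_TOKEN_LENGTH must be a positive integer")
    return value


def trim_input_ids(input_ids, limit):
    """Keep only the last `limit` token ids (assumed truncation policy)."""
    ids = list(input_ids)
    return ids[-limit:] if len(ids) > limit else ids


# With python-dotenv installed, values from `.env` can be loaded first:
#   from dotenv import load_dotenv
#   load_dotenv()
```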

---

## Models Used

1. **DeepHermes Llama**:
   - A fine-tuned Llama model for natural language understanding and generation.
   - Model ID: `prithivMLmods/DeepHermes-3-Llama-3-3B-Preview-abliterated`.

2. **Qwen2-VL**:
   - A multimodal model for image and video processing.
   - Model ID: `prithivMLmods/Qwen2-VL-OCR2-2B-Instruct`.

3. **Orpheus TTS**:
   - A high-quality text-to-speech model for generating realistic speech.
   - Model ID: `canopylabs/orpheus-3b-0.1-ft`.

4. **SNAC**:
   - A neural audio codec used for decoding TTS outputs.
   - Model ID: `hubertsiuzdak/snac_24khz`.
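
To illustrate how TTS output reaches the codec, the sketch below regroups a flat stream of audio codes into the three codebook layers that SNAC decodes. It assumes the 7-codes-per-frame layout and the per-position offset of 4096 used in the upstream Orpheus examples, and that the codes have already been shifted out of the LLM's vocabulary range. The function name is hypothetical, and this repository's `app.py` may differ in detail.

```python
CODEBOOK_SIZE = 4096   # assumed per-position offset in the flattened stream
FRAME_LENGTH = 7       # assumed number of codes per audio frame


def split_orpheus_frames(codes):
    """Regroup a flat code list into three SNAC layers (assumed layout).

    Position within each frame -> layer:
      0 -> layer 0; 1, 4 -> layer 1; 2, 3, 5, 6 -> layer 2.
    Trailing codes that do not fill a whole frame are dropped.
    """
    layer_of_position = (0, 1, 2, 2, 1, 2, 2)
    layers = ([], [], [])
    usable = len(codes) - len(codes) % FRAME_LENGTH
    for index in range(usable):
        position = index % FRAME_LENGTH
        # Remove the per-position offset so each code falls in [0, 4096).
        value = codes[index] - position * CODEBOOK_SIZE
        layers[layer_of_position[position]].append(value)
    return layers
```

Each returned list would then be converted to a tensor with a batch dimension before being passed to the SNAC decoder.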

---

## Customization

- **Voices**: Choose from `tara`, `dan`, `emma`, or `josh` for TTS.
- **Emotions**: Add emotion tags such as `[laugh]`, `[sigh]`, or `[gasp]` (the form used in the examples above) to make the speech more expressive.
- **Generation Parameters**: Adjust `temperature`, `top-p`, `top-k`, and `repetition penalty` to fine-tune responses.
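
One way to keep the generation parameters together and validated is a small dataclass whose fields map onto the keyword arguments accepted by `generate` in Hugging Face Transformers (`temperature`, `top_p`, `top_k`, `repetition_penalty`). The class name and all default values here are assumptions for illustration, not the defaults used by this demo.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GenerationParams:
    # Default values are illustrative assumptions.
    temperature: float = 0.6
    top_p: float = 0.9
    top_k: int = 50
    repetition_penalty: float = 1.2

    def __post_init__(self):
        if self.temperature <= 0:
            raise ValueError("temperature must be > 0")
        if not 0 < self.top_p <= 1:
            raise ValueError("top_p must be in (0, 1]")
        if self.top_k < 1:
            raise ValueError("top_k must be >= 1")
        if self.repetition_penalty <= 0:
            raise ValueError("repetition_penalty must be > 0")

    def to_generate_kwargs(self):
        """Return keyword arguments suitable for `model.generate(**kwargs)`."""
        kwargs = asdict(self)
        # Sampling must be enabled for temperature, top_p and top_k to apply.
        kwargs["do_sample"] = True
        return kwargs
```

Lower `temperature` and `top_p` make output more deterministic, while a `repetition_penalty` above 1.0 discourages repeated tokens.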

---

## Notes

- **Hardware Requirements**: A GPU is recommended for optimal performance, especially for TTS and video processing.
- **Limitations**:
  - Video processing is limited to 10 key frames per video.
  - TTS generation takes longer as the input text gets longer.
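
The 10-frame limit can be pictured as choosing a fixed number of frame indices from the whole video. The sketch below uses evenly spaced sampling, which is an assumption; the README does not say how key frames are chosen, and `select_frame_indices` is a hypothetical name.

```python
def select_frame_indices(total_frames, max_frames=10):
    """Pick at most `max_frames` evenly spaced frame indices (assumed strategy)."""
    if max_frames < 1:
        raise ValueError("max_frames must be >= 1")
    if total_frames <= max_frames:
        # Short clips: use every frame.
        return list(range(max(total_frames, 0)))
    # Integer arithmetic keeps the indices exact and strictly increasing.
    return [i * total_frames // max_frames for i in range(max_frames)]
```

The selected indices could then be used to seek in the video (for example with OpenCV) and pass the decoded frames to the vision model.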

---

## Acknowledgments

- **Hugging Face** for providing the models and tools.
- **Gradio** for the intuitive interface.
- **SNAC** and **Orpheus TTS** for high-quality speech synthesis.
