Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tristan-mcinnis/multimodal-voice-assistant
This project is a multi-modal AI voice assistant that uses OpenAI's GPT-4, audio processing with WhisperModel, speech recognition, clipboard extraction, and image processing to respond to user prompts.
ai assistant image-processing llm multimodal openai search-engine text-to-speech transcription tts vision
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/tristan-mcinnis/multimodal-voice-assistant
- Owner: tristan-mcinnis
- License: mit
- Created: 2024-06-22T02:02:42.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2024-06-22T02:07:16.000Z (7 months ago)
- Last Synced: 2024-10-10T18:42:42.671Z (3 months ago)
- Topics: ai, assistant, image-processing, llm, multimodal, openai, search-engine, text-to-speech, transcription, tts, vision
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Multi-Modal AI Voice Assistant
This project is a multi-modal AI voice assistant that uses OpenAI's GPT-4o, audio processing with WhisperModel, speech recognition, clipboard extraction, and image processing to respond to user prompts.
## Installation
To install the required dependencies, follow these steps:
1. Clone the repository:
```bash
git clone https://github.com/tristan-mcinnis/multimodal-voice-assistant
cd multimodal-voice-assistant
```
2. Create a virtual environment (optional but recommended):
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```
3. Install the required packages:
```bash
pip install -r requirements.txt
```

## Configuration
1. Obtain an OpenAI API key and replace the placeholder in the code:
```python
openai_client = OpenAI(api_key="sk-your-api-key")
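# Alternatively (a hypothetical variant, not from the original script):
# load the key from an environment variable so it is never committed to git.
import os
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])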
```

## Usage
1. Run the script:
```bash
python your_script_name.py
```
2. The assistant will listen for the wake word (`nova`) followed by your prompt. It supports various functionalities such as taking screenshots, capturing webcam images, and extracting clipboard content.
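The wake-word step can be sketched roughly as follows — a minimal illustration, not the project's actual code (`extract_prompt` and its exact trimming behavior are assumptions):

```python
def extract_prompt(transcript: str, wake_word: str = "nova"):
    """Return the text following the wake word, or None if it is absent."""
    lowered = transcript.lower()
    idx = lowered.find(wake_word.lower())
    if idx == -1:
        return None  # wake word not heard; ignore this utterance
    # Keep everything after the wake word, trimming stray punctuation.
    return transcript[idx + len(wake_word):].strip(" ,.!?")

print(extract_prompt("Hey Nova, take a screenshot"))  # take a screenshot
```

In the real script the transcript would come from the Whisper transcription step before this check runs.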
## Features
- **Wake Word Detection:** Starts listening for commands when the wake word 'nova' is detected.
- **Screenshot Capture:** Takes a screenshot and processes it for context.
- **Webcam Capture:** Captures an image from the webcam and processes it for context.
- **Clipboard Extraction:** Extracts text from the clipboard for additional context.
- **Enhanced Conversation Context:** Maintains a summary of previous exchanges for coherent responses.

## Dependencies
- `openai`: OpenAI API client
- `Pillow`: Image processing
- `faster-whisper`: Whisper model for audio transcription
- `SpeechRecognition`: Speech recognition
- `pyperclip`: Clipboard handling
- `opencv-python-headless`: Computer vision
- `pyaudio`: Audio handling
- `rich`: Rich text formatting in the terminal
- `pygame`: Image capture from webcam
- `duckduckgo-search`: DuckDuckGo search integration
- `scikit-learn`: Machine learning utilities
- `numpy`: Numerical computations

## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.