https://github.com/evangks/vision-voice-multimodal-app
An AI-powered chatbot that combines image understanding, voice input, and multilingual speech output. Users can upload images, ask questions by voice, and receive intelligent spoken answers. Showcases state-of-the-art models (Gemini, Whisper, Kokoro TTS) and robust Python engineering.
- Host: GitHub
- URL: https://github.com/evangks/vision-voice-multimodal-app
- Owner: EvanGks
- License: mit
- Created: 2025-05-24T18:46:41.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-06-10T07:49:40.000Z (4 months ago)
- Last Synced: 2025-06-10T08:35:51.966Z (4 months ago)
- Topics: ai-chatbot, computer-vision, deep-learning, flask-application, full-stack, gemini, gradio, kokoro, machine-learning, multimodal, nlp, portfolio, python, stt, tts, whisper
- Language: Python
- Homepage:
- Size: 224 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Multimodal AI Visual Assistant
A full-stack, production-grade multimodal AI assistant that enables users to interact with images using natural language and voice. Get instant, multilingual spoken answers powered by state-of-the-art AI models.
## Features
- **Image Upload & Analysis**: Ask questions about any image using your voice.
- **Voice Input & Transcription**: Record queries and have them transcribed in real time with OpenAI Whisper.
- **Advanced Multimodal Reasoning**: Google Gemini 2.5 Flash analyzes images and queries for detailed, factual answers.
- **Multilingual Voice Output**: Hear responses in your chosen language, accent, and gender using Kokoro TTS.
- **Voice Customization**: Choose from dozens of male and female voices across languages, and adjust the speech rate.
- **Accessible, Responsive UI**: Gradio interface designed for keyboard navigation, screen readers, and all devices.
- **Automated Asset Management**: All TTS models and voices are auto-downloaded on first run.
- **Robust Error Handling**: User-friendly error messages and backend logging.
- **Comprehensive Testing**: Pytest suite covers backend, frontend, and model logic.

## Screenshots
Below are some screenshots of the application in action. (Add your images to the `assets/` directory and reference them here.)

## Architecture
```
User
 │
 ▼
Gradio UI (app/frontend/gradio_app.py)
 │
 ▼
Flask API (app/backend/services/flask_app.py)
 │
 ▼
ModelManager (app/backend/utils/model_manager.py)
 ├─ Whisper (Speech-to-Text)
 ├─ Gemini (Image+Text Reasoning)
 └─ Kokoro TTS (Text-to-Speech)
```
- **Concurrent Backend & Frontend**: Both run together via `main.py` (threaded).
- **All assets and models are managed automatically.**

## Supported Languages & Voices
- **English** (American, British): Multiple male/female voices
- **Japanese**: Male/female voices
- **Mandarin Chinese**: Multiple voices
- **French**: Female voice
- **Spanish**: Male/female voices
- **Italian**: Male/female voices
- **Brazilian Portuguese**: Male/female voices
- **Hindi**: Male/female voices

See [`app/backend/utils/kokoro_voices.py`](app/backend/utils/kokoro_voices.py) for the full list.
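Kokoro voice IDs conventionally begin with a two-letter prefix, where the first letter marks the language and the second the gender (for example `af_` for American English, female). The sketch below shows one way a voice table could be decoded and filtered by language. The helper names `describe_voice` and `voices_for`, the table layout, and the sample IDs are illustrative assumptions and are not taken from `kokoro_voices.py`.

```python
# Hypothetical sketch: decode Kokoro-style voice IDs such as "af_heart".
# The prefix tables mirror the languages listed above; the helpers are
# illustrative and are not the repository's actual code.

LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "j": "Japanese",
    "z": "Mandarin Chinese",
    "f": "French",
    "e": "Spanish",
    "i": "Italian",
    "p": "Brazilian Portuguese",
    "h": "Hindi",
}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> dict:
    """Split an ID like 'af_heart' into language, gender, and name."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"Unrecognized voice ID: {voice_id!r}")
    lang_code, gender_code = prefix[0], prefix[1]
    if lang_code not in LANGUAGES or gender_code not in GENDERS:
        raise ValueError(f"Unknown prefix in voice ID: {voice_id!r}")
    return {
        "id": voice_id,
        "language": LANGUAGES[lang_code],
        "gender": GENDERS[gender_code],
        "name": name,
    }


def voices_for(language: str, voice_ids: list[str]) -> list[str]:
    """Return the IDs from voice_ids that belong to the given language."""
    return [v for v in voice_ids if describe_voice(v)["language"] == language]


# Example usage with a few sample IDs.
for voice in ["af_heart", "bm_george", "ff_siwis"]:
    print(describe_voice(voice))
```

Deriving language and gender from the ID keeps a UI dropdown in sync with the voice list without maintaining a second lookup table by hand.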
## Installation
1. **Clone the repository:**
```bash
git clone https://github.com/EvanGks/vision-voice-multimodal-app.git
cd vision-voice-multimodal-app
```
2. **Create and activate a virtual environment:**
```bash
python -m venv .venv
.venv\Scripts\activate      # On Windows
source .venv/bin/activate   # On macOS/Linux
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
4. **Configure environment variables:**
- Copy `.env.example` to `.env` and fill in your Google API key and Flask secret.
- **All configuration variables (API keys, model names, upload folders, etc.) are loaded exclusively from the `.env` file in the root directory.**
- The application does **not** read configuration from system environment variables; **all configuration must be set in `.env`**.
- All model assets will be auto-downloaded.
5. **Run the application:**
```bash
python main.py
```
- Access the UI at [http://localhost:7860](http://localhost:7860)

## Project Structure
```
Multimodal_AI/
├── app/
│   ├── backend/
│   │   ├── services/        # Flask API endpoints
│   │   ├── utils/           # Model management, voice metadata, text utils
│   │   └── kokoro_assets/   # TTS model assets (auto-downloaded)
│   ├── frontend/            # Gradio UI implementation
│   └── uploads/             # Uploaded files (auto-cleaned)
├── assets/                  # Screenshots and static assets
├── tests/                   # Test suite (API, UI, models)
├── .env, .env.example       # Environment variables
├── .gitignore               # Git ignore file
├── LICENSE                  # MIT License
├── main.py                  # Application entry point
├── README.md                # Project documentation
└── requirements.txt         # Python dependencies
```
- All code is modular and organized for clarity and extensibility.
- The `assets/` directory is the recommended place for screenshots and static files.

## Testing
Run all tests with:
```bash
pytest
```
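A test in this style might look like the sketch below. It exercises a hypothetical `is_allowed_upload` helper to illustrate the kind of upload-validation check such a suite could contain. The function name and the extension list are assumptions, not the repository's code. Pytest collects any function whose name starts with `test_`.

```python
# Hypothetical example: the helper and its rules are illustrative only.
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}  # assumed list


def is_allowed_upload(filename: str) -> bool:
    """Accept only bare file names with an allowed image extension."""
    path = PurePath(filename)
    # Reject names carrying directory components, e.g. "../photo.png".
    if path.name != filename:
        return False
    return path.suffix.lower() in ALLOWED_EXTENSIONS


def test_accepts_common_image_types():
    assert is_allowed_upload("photo.JPG")
    assert is_allowed_upload("diagram.png")


def test_rejects_other_types_and_paths():
    assert not is_allowed_upload("script.py")
    assert not is_allowed_upload("../photo.png")
    assert not is_allowed_upload("")
```

Saved in a file such as `tests/test_uploads.py` (a hypothetical name), these functions would be picked up by the `pytest` command above.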
- Tests cover API endpoints, UI workflow, and model logic.
- Test data and assets are auto-managed and cleaned up.

## Accessibility & UX
- **Keyboard navigation** and **screen reader** support.
- **High color contrast** and **responsive design**.
- **Clear feedback** for all user actions and errors.
- **Semantic HTML** and ARIA attributes for assistive technologies.
- **Resizable text** and mobile-friendly layout.

## Security
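Because this project reads its configuration only from `.env`, one way to honor that rule is to parse the file into a plain dictionary and never consult the process environment. The following is a minimal sketch that assumes simple `KEY=value` lines. The names `load_env_file` and `require` are hypothetical and not taken from the repository.

```python
# Hypothetical sketch: read settings from a .env file into a dict
# without reading from or writing to os.environ.
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Read KEY=value pairs, skipping blank lines and # comments."""
    config: dict[str, str] = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Trim whitespace and any quote characters around the value.
        config[key.strip()] = value.strip().strip("'\"")
    return config


def require(config: dict[str, str], key: str) -> str:
    """Return a setting, or fail fast with a clear message."""
    try:
        return config[key]
    except KeyError:
        raise RuntimeError(f"Missing required setting {key!r} in .env") from None
```

Keeping settings in a returned dictionary, rather than exporting them into the environment, means values set in the shell cannot silently override the file.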
- **No secrets in code**: all credentials are supplied via `.env`.
- **All configuration is managed via the `.env` file only. System environment variables are not used for configuration.**
- **Strict file upload validation** and privacy cleanup.
- **Sensitive/model files are git-ignored.**
- **API key management** and environment-based configuration.

## Extensibility
- Add new voices/languages by updating `kokoro_voices.py`.
- Swap or extend models via `ModelManager` and configure them in `.env`.
- **All configuration is centralized in `.env` for easy reproducibility and sharing.**
- Modular, testable codebase for rapid prototyping.

## Why This Project Stands Out
- **End-to-end AI workflow**: From voice to vision to speech, all in one app.
- **Production best practices**: Security, error handling, accessibility, and testing.
- **Portfolio-ready**: Demonstrates full-stack AI, modern Python, and real-world deployment skills.
- **Comprehensive documentation and code comments** for maintainability.
- **Automated asset management** for seamless setup.

## Future Enhancements
- Add support for additional languages and regional accents in TTS and STT.
- Implement real-time streaming responses for faster feedback.
- Develop a mobile-friendly or native mobile UI.
- Integrate more advanced image analysis models (e.g., OCR, object detection).
- Add user authentication and personalized settings.
- Enable cloud deployment with scalable infrastructure.
- Provide downloadable audio/text transcripts for user queries.
- Add progress indicators and better feedback for long-running operations.
- Expand accessibility features (e.g., high-contrast mode, localization).

## License
This project is licensed under the MIT License (c) 2025 Evan GKS. See the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [OpenAI Whisper](https://huggingface.co/openai/whisper-tiny)
- [Google Gemini](https://ai.google.dev/)
- [Kokoro TTS](https://github.com/hexgrad/Kokoro)
- [Gradio](https://gradio.app/)
- [Flask](https://flask.palletsprojects.com/)

## Contact
For questions or feedback, please reach out via:
- **GitHub:** [EvanGks](https://github.com/EvanGks)
- **X (Twitter):** [@Evan6471133782](https://x.com/Evan6471133782)
- **LinkedIn:** [Evangelos Gakias](https://www.linkedin.com/in/evangelos-gakias-346a9072)
- **Kaggle:** [evangelosgakias](https://www.kaggle.com/evangelosgakias)
- **Email:** [evangks88@gmail.com](mailto:evangks88@gmail.com)

---