https://github.com/gusanmaz/echosight
EchoSight is a tool that helps visually impaired individuals by audibly describing images taken with a Raspberry Pi Camera or supplied as an image path or URL on any operating system.
- Host: GitHub
- URL: https://github.com/gusanmaz/echosight
- Owner: gusanmaz
- Created: 2024-01-02T15:13:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-03T11:43:45.000Z (over 1 year ago)
- Last Synced: 2025-03-30T09:31:33.158Z (3 months ago)
- Topics: cogvl, coqui-tts, llm, llms, raspberry-pi, replicate, replicate-api, seamlessm4t, visual-audio, visual-audio-navigation, vllm
- Language: Python
- Homepage:
- Size: 213 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# EchoSight
EchoSight is designed to assist visually impaired individuals by providing audible descriptions of images captured by a camera. It operates in two modes: capturing images with a Raspberry Pi Camera and listening to their voice descriptions, or supplying an image path or URL on any operating system to hear a voice description.
## Output Files
The project generates multiple outputs during operation:
- **Image Files**: Captured or downloaded images are saved in the `output` directory.
- **Text Descriptions**: Text descriptions of the images in both English and Turkish are saved as `.txt` files in the `output` directory.
- **Audio Files**: The Turkish voice description of the image is saved as a `.wav` file in the `output` directory.
- **Log Files**: Event logs and errors are recorded in `events.log` files within the respective output subdirectories.

## Configurable Parameters
- **KEY_ACTION**: In `rpi.py`, this is set to 'KEY_S' by default. Modify the `KEY_ACTION` variable to change the key action.
- **CAMERA_DELAY**: In `rpi.py`, the default camera delay is 0.1 seconds. Adjust the `CAMERA_DELAY` variable to change this setting.
- **MAX_WIDTH**: In `image2speech.py`, the maximum image width for resizing is controlled by `MAX_WIDTH`. Alter this parameter as needed; a sketch of these settings follows this list.
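For orientation, here is a minimal sketch of how these settings and the resizing step might look in code; the `MAX_WIDTH` default and the `resize_to_max_width` helper are hypothetical, so consult `rpi.py` and `image2speech.py` for the actual values.

```python
from PIL import Image

KEY_ACTION = "KEY_S"   # capture trigger key in rpi.py (default per this README)
CAMERA_DELAY = 0.1     # camera delay in seconds in rpi.py (default per this README)
MAX_WIDTH = 1024       # placeholder value; see image2speech.py for the real default

def resize_to_max_width(path: str) -> Image.Image:
    """Hypothetical helper: downscale an image so its width stays within MAX_WIDTH."""
    img = Image.open(path)
    if img.width > MAX_WIDTH:
        new_height = round(img.height * MAX_WIDTH / img.width)
        img = img.resize((MAX_WIDTH, new_height))
    return img
```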
## Pre-requisites (For Raspberry Pi Usage)
- Ensure Raspberry Pi OS is installed.
- Use [Raspberry Pi Imager](https://downloads.raspberrypi.org/imager/imager_latest.exe) to prepare your SD card.
- Test your Raspberry Pi Camera: `libcamera-jpeg -o z.jpg`.
## Installation
- Obtain your Replicate.com API token:
- For Bash: `echo 'export REPLICATE_API_TOKEN=your_token_here' >> ~/.bashrc`.
- For Zsh: `echo 'export REPLICATE_API_TOKEN=your_token_here' >> ~/.zshrc`.
- Set `keyboard_path` correctly if automatic detection fails. Refer to [this guide](https://chat.openai.com/share/bd2753d8-0ee3-4963-8e26-9569575470eb).
- Clone and setup the EchoSight environment:
```bash
git clone https://github.com/gusanmaz/echosight
cd echosight
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
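After installation, you can sanity-check that the token export took effect. This snippet is a hedged sketch rather than part of the repository; it relies only on the fact that the `replicate` client reads `REPLICATE_API_TOKEN` from the environment.

```python
import os

# Verify the token exported in ~/.bashrc / ~/.zshrc is visible to Python;
# the replicate client picks up REPLICATE_API_TOKEN automatically.
token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    raise SystemExit("REPLICATE_API_TOKEN is not set; re-run the export step above.")
print(f"Replicate token found (ends in ...{token[-4:]})")
```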
### Usage

**(Raspberry Pi)** Capture images with the Raspberry Pi Camera by pressing a keyboard button (default: S) and listen to a voice description of each captured image:
* `python3 rpi.py`
**(All OSes)** Give an image path or URL to hear a voice description of the image:
* `python3 url2speech.py image_path_or_url`
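For example, `python3 url2speech.py https://example.com/photo.jpg` (a hypothetical URL, shown only for illustration) would download the image, write the descriptions and audio to the `output` directory, and play the voice description.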
### Models
This project uses models hosted on https://replicate.com/ to generate voice descriptions of the images. The models used in this project are linked below.
* **cogvlm**
* Replicate Model: https://replicate.com/cjwbw/cogvlm
* Github Repo: https://github.com/THUDM/CogVLM
* **Seamless Communication**
* Replicate Model: https://replicate.com/cjwbw/seamless-communication
* Github Repo: https://github.com/facebookresearch/seamless_communication
* **Coqui XTTS-v2**
* Replicate Model: https://replicate.com/cjwbw/coqui-xtts-v2
* Github Repo: https://github.com/coqui-ai/TTS

Future versions may incorporate different models, and the code could be adapted for easier experimentation with various models.
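As a rough illustration, a model like CogVLM can be invoked from Python with the `replicate` client, as in the hedged sketch below; the input field names (`image`, `query`) are assumptions about the model's schema, so verify them on the model page. The translation and TTS models follow the same `replicate.run` pattern.

```python
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

# Assumed input schema for cjwbw/cogvlm; check the model page for the actual
# field names, and pin a version hash ("cjwbw/cogvlm:<version>") for reproducibility.
output = replicate.run(
    "cjwbw/cogvlm",
    input={
        "image": open("output/capture.jpg", "rb"),
        "query": "Describe this image for a visually impaired person.",
    },
)
print(output)
```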
### Cost
* **Conservative Cost Estimate**: $0.20 per image
* **Conservative Runtime Estimate**: 40 seconds per image to produce audio (excluding time spent starting the models on Replicate.com)
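At these estimates, a batch of 100 images would cost about $20 and take roughly 67 minutes of model runtime.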
### License
Apache License 2.0