https://github.com/gusanmaz/echosight
EchoSight is a tool that helps visually impaired individuals by audibly describing images taken with a Raspberry Pi Camera or supplied as an image path or URL on any operating system.
- Host: GitHub
- URL: https://github.com/gusanmaz/echosight
- Owner: gusanmaz
- Created: 2024-01-02T15:13:53.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-03T11:43:45.000Z (over 1 year ago)
- Last Synced: 2025-03-30T09:31:33.158Z (3 months ago)
- Topics: cogvl, coqui-tts, llm, llms, raspberry-pi, replicate, replicate-api, seamlessm4t, visual-audio, visual-audio-navigation, vllm
- Language: Python
- Homepage:
- Size: 213 KB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# EchoSight
EchoSight is designed to assist visually impaired individuals by providing audible descriptions of images captured by a camera. It operates in two modes: capturing images with a Raspberry Pi Camera and listening to their voice descriptions, or supplying an image path or URL on any operating system to hear a voice description.
## Output Files
The project generates multiple outputs during operation:
- **Image Files**: Captured or downloaded images are saved in the `output` directory.
- **Text Descriptions**: Text descriptions of the images in both English and Turkish are saved as `.txt` files in the `output` directory.
- **Audio Files**: The Turkish voice description of the image is saved as a `.wav` file in the `output` directory.
- **Log Files**: Event logs and errors are recorded in `events.log` files within the respective output subdirectories.

## Configurable Parameters
- **KEY_ACTION**: In `rpi.py`, this is set to 'KEY_S' by default. Modify the `KEY_ACTION` variable to change the key action.
- **CAMERA_DELAY**: In `rpi.py`, the default camera delay is 0.1 seconds. Adjust the `CAMERA_DELAY` variable to change this setting.
- **MAX_WIDTH**: In `image2speech.py`, the maximum image width for resizing is controlled by `MAX_WIDTH`. Alter this parameter as needed; a sketch of these settings follows this list.
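For orientation, here is a minimal sketch of how these settings and the resizing step might look in code; the `MAX_WIDTH` default and the `resize_to_max_width` helper are hypothetical, so consult `rpi.py` and `image2speech.py` for the actual values.

```python
from PIL import Image

KEY_ACTION = "KEY_S"   # capture trigger key in rpi.py (default per this README)
CAMERA_DELAY = 0.1     # camera delay in seconds in rpi.py (default per this README)
MAX_WIDTH = 1024       # placeholder value; see image2speech.py for the real default

def resize_to_max_width(path: str) -> Image.Image:
    """Hypothetical helper: downscale an image so its width stays within MAX_WIDTH."""
    img = Image.open(path)
    if img.width > MAX_WIDTH:
        new_height = round(img.height * MAX_WIDTH / img.width)
        img = img.resize((MAX_WIDTH, new_height))
    return img
```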
## Pre-requisites (For Raspberry Pi Usage)
- Ensure Raspberry Pi OS is installed.
- Use [Raspberry Pi Imager](https://downloads.raspberrypi.org/imager/imager_latest.exe) to prepare your SD card.
- Test your Raspberry Pi Camera: `libcamera-jpeg -o z.jpg`.
## Installation
- Obtain your Replicate.com API token:
- For Bash: `echo 'export REPLICATE_API_TOKEN=your_token_here' >> ~/.bashrc`.
- For Zsh: `echo 'export REPLICATE_API_TOKEN=your_token_here' >> ~/.zshrc`.
- Set `keyboard_path` correctly if automatic detection fails. Refer to [this guide](https://chat.openai.com/share/bd2753d8-0ee3-4963-8e26-9569575470eb).
- Clone and setup the EchoSight environment:
```bash
git clone https://github.com/gusanmaz/echosight
cd echosight
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```
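After installation, you can sanity-check that the token export took effect. This snippet is a hedged sketch rather than part of the repository; it relies only on the fact that the `replicate` client reads `REPLICATE_API_TOKEN` from the environment.

```python
import os

# Verify the token exported in ~/.bashrc / ~/.zshrc is visible to Python;
# the replicate client picks up REPLICATE_API_TOKEN automatically.
token = os.environ.get("REPLICATE_API_TOKEN")
if not token:
    raise SystemExit("REPLICATE_API_TOKEN is not set; re-run the export step above.")
print(f"Replicate token found (ends in ...{token[-4:]})")
```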
### Usage

**(Raspberry Pi)** Capture images with the Raspberry Pi Camera by pressing a keyboard button (default: S) and listen to a voice description of each captured image:
* `python3 rpi.py`
**(All OSes)** Give an image path or URL to hear a voice description of the image:
* `python3 url2speech.py image_path_or_url`
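For example, `python3 url2speech.py https://example.com/photo.jpg` (a hypothetical URL, shown only for illustration) would download the image, write the descriptions and audio to the `output` directory, and play the voice description.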
### Models
This project uses models hosted on https://replicate.com/ to generate voice descriptions of the images. The models used in this project are linked below.
* **cogvlm**
* Replicate Model: https://replicate.com/cjwbw/cogvlm
* Github Repo: https://github.com/THUDM/CogVLM
* **Seamless Communication**
* Replicate Model: https://replicate.com/cjwbw/seamless-communication
* Github Repo: https://github.com/facebookresearch/seamless_communication
* **Coqui XTTS-v2**
* Replicate Model: https://replicate.com/cjwbw/coqui-xtts-v2
* Github Repo: https://github.com/coqui-ai/TTS

Future versions may incorporate different models, and the code could be adapted for easier experimentation with various models.
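As a rough illustration, a model like CogVLM can be invoked from Python with the `replicate` client, as in the hedged sketch below; the input field names (`image`, `query`) are assumptions about the model's schema, so verify them on the model page. The translation and TTS models follow the same `replicate.run` pattern.

```python
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

# Assumed input schema for cjwbw/cogvlm; check the model page for the actual
# field names, and pin a version hash ("cjwbw/cogvlm:<version>") for reproducibility.
output = replicate.run(
    "cjwbw/cogvlm",
    input={
        "image": open("output/capture.jpg", "rb"),
        "query": "Describe this image for a visually impaired person.",
    },
)
print(output)
```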
### Cost
* **Conservative Cost Estimate**: $0.20 per image
* **Conservative Runtime Estimate**: 40 seconds per image to produce audio (excluding time spent starting the models on Replicate.com)
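At these estimates, a batch of 100 images would cost about $20 and take roughly 67 minutes of model runtime.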
### License
Apache License 2.0