https://github.com/sovit-123/sam_molmo_whisper
An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language.
- Host: GitHub
- URL: https://github.com/sovit-123/sam_molmo_whisper
- Owner: sovit-123
- License: apache-2.0
- Created: 2024-10-10T01:43:26.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-28T01:20:58.000Z (3 months ago)
- Last Synced: 2025-04-20T00:00:04.963Z (about 1 month ago)
- Topics: molmo, segment-anything-model, segmentanythingmodel, vlm, whisper
- Language: Jupyter Notebook
- Homepage:
- Size: 16.7 MB
- Stars: 24
- Watchers: 2
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SAM_Molmo_Whisper
***Note: The project is in its very early stages and will change drastically in the near future. Things may break.***
**[Go to Setup](#setup)**
A simple integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language.
Capabilities:
* Segment objects with **SAM2.1** using point prompts.
* Points are obtained by **prompting Molmo** with natural language. Molmo accepts input either through the **text box (typing)** or through **Whisper via the microphone (speech to text)**; see the pipeline sketch below.

**Run the Gradio demo using**:
```
python app.py
```

https://github.com/user-attachments/assets/66a0620e-ede3-4018-8ee7-f261790747cb
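At a high level, the app chains three models: Whisper turns speech into a text prompt, Molmo answers the prompt with 2D point coordinates, and SAM2.1 turns those points into masks. The sketch below is a minimal, illustrative approximation of that flow; the checkpoint names (`openai/whisper-small`, `allenai/Molmo-7B-D-0924`, `facebook/sam2-hiera-large`), the file names, and the point-parsing regex are assumptions and may differ from what `app.py` actually does.

```
# Minimal sketch of the Whisper -> Molmo -> SAM2 flow. Checkpoints, file
# names, and the point-parsing regex are illustrative assumptions.
import re

import numpy as np
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig, pipeline
from sam2.sam2_image_predictor import SAM2ImagePredictor

# 1. Speech -> text with Whisper.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
prompt = asr("prompt.wav")["text"]  # e.g. "Point to the dog."

# 2. Text + image -> point coordinates with Molmo (usage per the model card).
processor = AutoProcessor.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
molmo = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo-7B-D-0924", trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
image = Image.open("input.jpg")
inputs = processor.process(images=[image], text=prompt)
inputs = {k: v.to(molmo.device).unsqueeze(0) for k, v in inputs.items()}
output = molmo.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)

# Molmo emits points such as <point x="61.5" y="40.6" ...>; coordinates are
# percentages of the image size, so scale them to pixels before prompting SAM2.
pairs = re.findall(r'x\d*="([\d.]+)"\s+y\d*="([\d.]+)"', answer)
points = np.array(
    [[float(x) / 100 * image.width, float(y) / 100 * image.height] for x, y in pairs]
)

# 3. Points -> masks with SAM2.1.
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
with torch.inference_mode():
    predictor.set_image(np.array(image))
    masks, scores, _ = predictor.predict(
        point_coords=points, point_labels=np.ones(len(points), dtype=np.int32)
    )
```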
## What's New
### October 30, 2024
* Added a tabbed interface for video segmentation. The process remains the same: prompt via text or voice, upload a video, and get segmentation maps for the objects (see the sketch below).
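For reference, a rough sketch of how point-prompted video segmentation works with SAM2's video predictor; the checkpoint, frame index, and point coordinates here are placeholders, not this repo's exact code.

```
# Rough sketch of point-prompted video segmentation with SAM2's video
# predictor; checkpoint, frame index, and coordinates are placeholders.
import numpy as np
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")

with torch.inference_mode():
    # SAM2's demo flow reads a directory of JPEG frames extracted from the video.
    state = predictor.init_state("./video_frames")

    # Seed the first frame with one foreground point (e.g. a point from Molmo).
    predictor.add_new_points_or_box(
        state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210.0, 350.0]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = foreground
    )

    # Propagate the seeded mask through the remaining frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one boolean mask per object
```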
## Setup
### Clone Repo
```
git clone https://github.com/sovit-123/SAM_Molmo_Whisper.git
```

```
cd SAM_Molmo_Whisper
```

### Installing Requirements
Install PyTorch, Hugging Face Transformers, and the rest of the base requirements:
```
pip install -r requirements.txt
```

### Install SAM2
*It is highly recommended to clone SAM2 into a separate directory, outside this project directory, and run the installation commands there.*
```
git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
```
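To confirm the editable install worked, here is a quick, hypothetical smoke test from a Python shell:

```
# Hypothetical smoke test: the import succeeds only if `pip install -e .` worked.
from sam2.sam2_image_predictor import SAM2ImagePredictor

print("SAM2 is importable:", SAM2ImagePredictor.__name__)
```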
### To Use CLIP Auto Labelling
After installing the requirements, install SpaCy's `en_core_web_sm` model:
```
spacy download en_core_web_sm
```
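The small English model gives SpaCy noun-phrase extraction, which is presumably the kind of phrase an auto-labelling step would hand to CLIP. A quick check that the download worked (illustrative only, not this repo's code):

```
# Illustrative check: extract noun chunks with the downloaded model, the kind
# of phrases an auto-labelling step could hand to CLIP (assumed usage).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Segment the brown dog next to the red car.")
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['the brown dog', 'the red car']
```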
### Run the App

```
python app.py
```