https://github.com/mouleshgs/yolo-dense-captioning

Real-time video captioning using YOLOv8 for object detection, GIT for initial captions, and LLaMA 3 for natural language enhancement — all running locally with Ollama.
dense-captioning image-to-text llama3 yolov8

Host: GitHub
URL: https://github.com/mouleshgs/yolo-dense-captioning
Owner: mouleshgs
Created: 2025-06-19T19:26:20.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-06-20T13:34:52.000Z (12 months ago)
Last Synced: 2025-06-20T14:33:05.606Z (12 months ago)
Topics: dense-captioning, image-to-text, llama3, yolov8
Language: Python
Homepage:
Size: 9.77 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# YOLO-Based Dense Captioning in Real-Time

This project performs real-time video analysis by combining object detection using YOLOv8, image captioning using GIT, and caption enhancement through a local LLaMA 3 language model.

The result is a more descriptive and natural understanding of the scene captured from a live video stream.

---

## Features

- Real-time video capture from webcam
- Object detection using YOLOv8 (via Ultralytics)
- Caption generation using GIT (`microsoft/git-base`)
- Caption enhancement using LLaMA 3 (via Ollama)
- Visual display of bounding boxes and enhanced captions on video frames
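The features above chain into a detect, caption, enhance flow for each frame. The sketch below shows one way such stages could be wired together; it is not the repository's actual code. `Detection`, `describe_frame`, the `fake_*` stand-ins, and the `min_confidence` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """Hypothetical container for one detected object."""
    label: str
    confidence: float
    box: Box


def describe_frame(frame, detect, caption, enhance, min_confidence=0.5):
    """Run detect -> caption -> enhance on a single frame.

    The three stages are passed in as callables so that real models
    (YOLOv8, GIT, LLaMA 3) or test stand-ins can be swapped freely.
    """
    detections = [d for d in detect(frame) if d.confidence >= min_confidence]
    raw_caption = caption(frame)
    labels = sorted({d.label for d in detections})
    return detections, enhance(raw_caption, labels)


# Stand-in stages so the sketch runs without any model installed.
def fake_detect(frame):
    return [
        Detection("person", 0.91, (10, 20, 110, 220)),
        Detection("cup", 0.30, (150, 80, 190, 130)),  # below the threshold
    ]


def fake_caption(frame):
    return "a man sitting at a desk"


def fake_enhance(raw_caption, labels):
    if not labels:
        return raw_caption
    return f"{raw_caption} (objects: {', '.join(labels)})"


detections, text = describe_frame(None, fake_detect, fake_caption, fake_enhance)
print(text)
```

Passing the stages in as callables keeps the per-frame logic independent of any particular model library, which makes it easy to test without a webcam or GPU.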

---

## Project Structure

```
yolo-dense-captioning/
├── main.py                 # Main script: handles video capture, detection, captioning, and enhancement
├── detector.py             # YOLOv8 detector class
├── caption_model.py        # GIT-based caption generator
├── caption_enhancer.py     # LLaMA 3 caption enhancer using Ollama's local API
├── utils/
│   ├── preprocessor.py     # Image/frame preprocessing helpers
│   └── visualizer.py       # Functions to draw bounding boxes and captions on frames
├── assets/                 # Optional: sample frames or icons
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```

---

## Installation

1. Clone the repository:

```bash
git clone https://github.com/mouleshgs/yolo-dense-captioning.git
cd yolo-dense-captioning
```

2. Set up a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```

3. Install the required packages:

```bash
pip install -r requirements.txt
```

4. Install and run Ollama for LLaMA 3:

- Download Ollama: https://ollama.com
- Pull the model:

```bash
ollama pull llama3
```
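
Once the model is pulled, Ollama serves it over a local HTTP API. The sketch below shows how a caption could be sent to that API for rewriting, assuming Ollama's default endpoint at `http://localhost:11434/api/generate`. The prompt wording and the `build_payload` and `enhance_caption` helpers are illustrative assumptions, not the contents of `caption_enhancer.py`.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your install uses another port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(raw_caption, labels, model="llama3"):
    """Build the JSON body for a non-streaming generate request."""
    prompt = (
        "Rewrite this image caption as one natural, descriptive sentence.\n"
        f"Caption: {raw_caption}\n"
        f"Detected objects: {', '.join(labels) or 'none'}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def enhance_caption(raw_caption, labels, url=OLLAMA_URL, timeout=30.0):
    """Ask the local model to rewrite the caption.

    Falls back to the raw caption if Ollama is unreachable or the
    reply cannot be parsed, so the video loop keeps running.
    """
    body = json.dumps(build_payload(raw_caption, labels)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            reply = json.load(response)
    except (OSError, ValueError):  # connection errors, timeouts, bad JSON
        return raw_caption
    return reply.get("response", "").strip() or raw_caption
```

With `"stream": False`, Ollama returns a single JSON object whose `response` field holds the generated text, which is simpler to handle than the default streamed output.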

---

## Running the Project

To start real-time captioning:

```bash
python main.py
```

The webcam will open, and you'll see live object detection with an enhanced caption displayed at the top of the screen. Press `q` to exit.
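
The capture loop described above can be sketched roughly as follows, assuming OpenCV (`cv2`) is installed. `run_preview`, `should_quit`, and the `process_frame` callback are hypothetical names; the real `main.py` may be organized differently.

```python
def should_quit(key_code, quit_key="q"):
    """Return True when the pressed key matches the quit key.

    cv2.waitKey returns -1 when no key was pressed during the wait.
    """
    return key_code != -1 and (key_code & 0xFF) == ord(quit_key)


def run_preview(process_frame, camera_index=0, window="Dense Captioning"):
    """Show annotated webcam frames until the quit key is pressed.

    `process_frame` takes a BGR frame and returns the frame to display,
    for example with boxes and a caption drawn on it.
    """
    import cv2  # imported lazily so should_quit has no OpenCV dependency

    capture = cv2.VideoCapture(camera_index)
    if not capture.isOpened():
        raise RuntimeError(f"Could not open camera {camera_index}")
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # camera disconnected or stream ended
                break
            cv2.imshow(window, process_frame(frame))
            if should_quit(cv2.waitKey(1)):
                break
    finally:
        # Always free the camera and close the window, even on errors.
        capture.release()
        cv2.destroyAllWindows()
```

Calling `run_preview(lambda frame: frame)` would display the unmodified webcam feed, which is a quick way to confirm the camera works before adding the models.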

---

## Notes

- The LLaMA 3 model runs locally via Ollama. Ensure Ollama is properly installed and accessible from your system PATH.
- Caption generation is based on `microsoft/git-base` and may vary depending on image clarity.
- For faster inference, you can switch to another Ollama model such as `llama3.1` or `llama3.2`; the smaller `llama3.2` variants generally respond faster.
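
For reference, captioning with `microsoft/git-base` through Hugging Face Transformers usually looks like the sketch below. It assumes `transformers` and PyTorch are installed; `load_git` and `caption_image` are illustrative helper names, not necessarily those used in `caption_model.py`.

```python
def load_git(model_name="microsoft/git-base"):
    """Load the GIT processor and model (downloads weights on first use)."""
    # Imported lazily so caption_image can be exercised without transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return processor, model


def caption_image(image, processor, model, max_length=50):
    """Generate one caption for an RGB image (for example a PIL.Image)."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

OpenCV frames are in BGR order, so a frame would typically be converted first, for example with `Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))`, before being passed to `caption_image`.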

---

## License

This project is for educational and research purposes. For commercial use, please check the licenses of Meta’s LLaMA 3, Microsoft's GIT, and Ultralytics YOLOv8.
