https://github.com/mouleshgs/yolo-dense-captioning

Real-time video captioning using YOLOv8 for object detection, GIT for initial captions, and LLaMA 3 for natural language enhancement — all running locally with Ollama.

dense-captioning image-to-text llama3 yolov8

# YOLO-Based Dense Captioning in Real-Time

This project performs real-time video analysis by combining object detection using YOLOv8, image captioning using GIT, and caption enhancement through a local LLaMA 3 language model.

The result is a more descriptive, natural-sounding caption of the scene captured from a live video stream.

---

## Features

- Real-time video capture from webcam
- Object detection using YOLOv8 (via Ultralytics)
- Caption generation using GIT (`microsoft/git-base`)
- Caption enhancement using LLaMA 3 (via Ollama)
- Visual display of bounding boxes and enhanced captions on video frames
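
As a rough illustration of the detection feature, the sketch below runs a YOLOv8 model through the Ultralytics API and converts the result into plain dictionaries. The weights file `yolov8n.pt`, the `min_conf` threshold, and the function names are assumptions made for this example; the repository's `detector.py` may be organized differently.

```python
from typing import Any


def load_detector(weights: str = "yolov8n.pt") -> Any:
    """Load a YOLOv8 model. `yolov8n.pt` is an assumed default, not the repo's choice."""
    from ultralytics import YOLO  # imported lazily: heavy dependency

    return YOLO(weights)


def detect(model: Any, frame: Any, min_conf: float = 0.5) -> list[dict]:
    """Run detection on one frame and keep boxes at or above `min_conf`."""
    result = model(frame)[0]  # Ultralytics returns one Results object per image
    boxes = result.boxes
    detections = []
    for xyxy, conf, cls_id in zip(
        boxes.xyxy.tolist(), boxes.conf.tolist(), boxes.cls.tolist()
    ):
        if conf < min_conf:
            continue
        detections.append(
            {
                "box": tuple(int(v) for v in xyxy),  # (x1, y1, x2, y2) in pixels
                "label": result.names[int(cls_id)],  # class id -> class name
                "confidence": float(conf),
            }
        )
    return detections
```

With the real dependency installed, `detect(load_detector(), frame)` would return one dictionary per detected object, which can then be drawn on the frame or passed along to the captioning stage.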

---

## Project Structure

```
yolo-dense-captioning/
├── main.py # Main script: handles video capture, detection, captioning, and enhancement
├── detector.py # YOLOv8 detector class
├── caption_model.py # GIT-based caption generator
├── caption_enhancer.py # LLaMA 3 caption enhancer using Ollama's local API
├── utils/
│   ├── preprocessor.py # Image/frame preprocessing helpers
│   └── visualizer.py # Functions to draw bounding boxes and captions on frames
├── assets/ # Optional: sample frames or icons
├── requirements.txt # Python dependencies
└── README.md # Project documentation
```
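
To give a feel for what a caption enhancer built on Ollama's local API could look like, here is a minimal sketch using only the standard library. It assumes Ollama's default endpoint `http://localhost:11434/api/generate`; the prompt wording and the names `build_payload` and `enhance_caption` are hypothetical and not taken from `caption_enhancer.py`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(caption: str, labels: list[str], model: str = "llama3") -> dict:
    """Build a non-streaming /api/generate request. The prompt text is illustrative."""
    objects = ", ".join(sorted(set(labels))) or "none"
    prompt = (
        "Rewrite this image caption as one natural, descriptive sentence.\n"
        f"Caption: {caption}\n"
        f"Detected objects: {objects}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def enhance_caption(
    caption: str, labels: list[str], model: str = "llama3", timeout: float = 30.0
) -> str:
    """Ask a running Ollama server to rewrite the caption; return the raw caption on failure."""
    data = json.dumps(build_payload(caption, labels, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):  # server unreachable, timeout, or invalid JSON
        return caption
    return body.get("response", caption).strip() or caption
```

Falling back to the unmodified caption is a design choice for this sketch: it keeps a live video loop running even when the language model is slow or unavailable.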

---

## Installation

1. Clone the repository:

```bash
git clone https://github.com/mouleshgs/yolo-dense-captioning.git
cd yolo-dense-captioning
```

2. Set up a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```

3. Install the required packages:

```bash
pip install -r requirements.txt
```

4. Install and run Ollama for LLaMA 3:

- Download Ollama: https://ollama.com
- Pull the model:

```bash
ollama pull llama3
```
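
Before launching the project, it can help to confirm that the Ollama server is reachable and that the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint on the default port 11434; the helper names `has_model` and `ollama_ready` are hypothetical and not part of this repository.

```python
import json
import urllib.request

TAGS_URL = "http://localhost:11434/api/tags"  # lists locally available models


def has_model(tags: dict, name: str) -> bool:
    """Return True if `name` (with or without a tag such as `:latest`) appears in the payload."""
    for entry in tags.get("models", []):
        full_name = entry.get("name", "")
        if full_name == name or full_name.split(":", 1)[0] == name:
            return True
    return False


def ollama_ready(name: str = "llama3", timeout: float = 5.0) -> bool:
    """Query the local Ollama server; False if it is unreachable or the model is missing."""
    try:
        with urllib.request.urlopen(TAGS_URL, timeout=timeout) as response:
            tags = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):  # connection refused, timeout, or invalid JSON
        return False
    return has_model(tags, name)
```

Calling `ollama_ready()` once at startup gives an early, readable failure instead of an error in the middle of the video loop.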

---

## Running the Project

To start real-time captioning:

```bash
python main.py
```

The webcam will open, and you'll see live object detection with an enhanced caption displayed at the top of the screen. Press `q` to exit.
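
The capture-and-quit behaviour described above can be sketched with OpenCV roughly as follows. This is not the repository's `main.py`: the `process_frame` callback (standing in for detection, captioning, and drawing), the window title, and the camera index are assumptions for illustration.

```python
from typing import Any, Callable


def is_quit_key(key: int) -> bool:
    """`cv2.waitKey` returns -1 when no key is pressed; compare only the low byte."""
    return (key & 0xFF) == ord("q")


def run(process_frame: Callable[[Any], Any], camera_index: int = 0) -> None:
    """Read webcam frames, pass each through `process_frame`, and show the result."""
    import cv2  # imported lazily so the helper above works without OpenCV

    capture = cv2.VideoCapture(camera_index)
    if not capture.isOpened():
        raise RuntimeError(f"Could not open camera {camera_index}")
    try:
        while True:
            ok, frame = capture.read()
            if not ok:
                break  # camera disconnected or stream ended
            cv2.imshow("Dense captioning", process_frame(frame))
            if is_quit_key(cv2.waitKey(1)):
                break
    finally:
        # Always free the camera and close the window, even after an error.
        capture.release()
        cv2.destroyAllWindows()
```

For example, `run(lambda frame: frame)` would simply display the raw webcam feed until `q` is pressed.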

---

## Notes

- The LLaMA 3 model runs locally via Ollama. Ensure Ollama is installed, its local server is running, and the `ollama` command is accessible from your system PATH.
- Caption generation is based on `microsoft/git-base` and may vary depending on image clarity.
- For faster inference, you can use an optimized model like `llama3.1` or `llama3.2`.
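
For reference, caption generation with `microsoft/git-base` through Hugging Face Transformers typically looks like the sketch below. The function names and the `max_length` value are assumptions, and the repository's `caption_model.py` may differ. The processor and model are passed in as arguments so they are loaded only once.

```python
from typing import Any


def load_git(name: str = "microsoft/git-base") -> tuple[Any, Any]:
    """Load the GIT processor and model (downloads weights on first use)."""
    from transformers import AutoModelForCausalLM, AutoProcessor  # lazy: heavy dependency

    processor = AutoProcessor.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    return processor, model


def caption_image(image: Any, processor: Any, model: Any, max_length: int = 50) -> str:
    """Generate one caption for an RGB image (PIL image or HxWx3 array)."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

OpenCV delivers frames in BGR order, so a frame would normally be converted with `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)` before being passed to `caption_image`.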

---

## License

This project is for educational and research purposes. For commercial use, please check the licenses of Meta’s LLaMA 3, Microsoft's GIT, and Ultralytics YOLOv8.