https://github.com/mouleshgs/yolo-dense-captioning
Real-time video captioning using YOLOv8 for object detection, GIT for initial captions, and LLaMA 3 for natural language enhancement — all running locally with Ollama.
- Host: GitHub
- URL: https://github.com/mouleshgs/yolo-dense-captioning
- Owner: mouleshgs
- Created: 2025-06-19T19:26:20.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-06-20T13:34:52.000Z (12 months ago)
- Last Synced: 2025-06-20T14:33:05.606Z (12 months ago)
- Topics: dense-captioning, image-to-text, llama3, yolov8
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# YOLO-Based Dense Captioning in Real-Time
This project performs real-time video analysis by combining object detection using YOLOv8, image captioning using GIT, and caption enhancement through a local LLaMA 3 language model.
The result is a more descriptive and natural understanding of the scene captured from a live video stream.
---
## Features
- Real-time video capture from webcam
- Object detection using YOLOv8 (via Ultralytics)
- Caption generation using GIT (`microsoft/git-base`)
- Caption enhancement using LLaMA 3 (via Ollama)
- Visual display of bounding boxes and enhanced captions on video frames
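
The features above compose into a per-frame pipeline: detect, caption, then enhance. Below is a minimal sketch of that flow, assuming each stage is a plain callable. The names `Detection`, `FrameResult`, and `process_frame` are hypothetical and are not taken from the repository's code; the stages are stand-ins so the sketch runs without any model installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Detection:
    """One detected object: class label, confidence, and pixel box (x1, y1, x2, y2)."""
    label: str
    confidence: float
    box: Tuple[int, int, int, int]


@dataclass
class FrameResult:
    detections: List[Detection]
    raw_caption: str
    enhanced_caption: str


def process_frame(
    frame: Any,
    detect: Callable[[Any], List[Detection]],
    caption: Callable[[Any], str],
    enhance: Callable[[str, List[str]], str],
) -> FrameResult:
    """Run detection, captioning, and enhancement on a single frame."""
    detections = detect(frame)
    raw = caption(frame)
    # Pass the unique labels, in first-seen order, as extra context for the enhancer.
    labels = list(dict.fromkeys(d.label for d in detections))
    return FrameResult(detections, raw, enhance(raw, labels))


# Stand-in stages (assumptions, not the real YOLOv8 / GIT / LLaMA 3 calls).
result = process_frame(
    frame=None,
    detect=lambda f: [
        Detection("person", 0.91, (10, 20, 110, 220)),
        Detection("person", 0.80, (120, 20, 200, 220)),
        Detection("laptop", 0.75, (50, 150, 150, 210)),
    ],
    caption=lambda f: "a man sitting at a desk",
    enhance=lambda text, labels: f"{text} (objects: {', '.join(labels)})",
)
print(result.enhanced_caption)  # a man sitting at a desk (objects: person, laptop)
```

Keeping the stages as injected callables makes it easy to swap in a different detector or language model without touching the loop that drives them.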
---
## Project Structure
```
yolo-dense-captioning/
├── main.py # Main script: handles video capture, detection, captioning, and enhancement
├── detector.py # YOLOv8 detector class
├── caption_model.py # GIT-based caption generator
├── caption_enhancer.py # LLaMA 3 caption enhancer using Ollama's local API
├── utils/
│ ├── preprocessor.py # Image/frame preprocessing helpers
│ └── visualizer.py # Functions to draw bounding boxes and captions on frames
├── assets/ # Optional: sample frames or icons
├── requirements.txt # Python dependencies
└── README.md # Project documentation
```
---
## Installation
1. Clone the repository:
```bash
git clone https://github.com/mouleshgs/yolo-dense-captioning.git
cd yolo-dense-captioning
```
2. Set up a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```
3. Install the required packages:
```bash
pip install -r requirements.txt
```
4. Install and run Ollama for LLaMA 3:
- Download Ollama: https://ollama.com
- Pull the model:
```bash
ollama pull llama3
```
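
Before launching the app, it can help to confirm that the Ollama server is running and that the model has been pulled. The sketch below assumes Ollama's default local address (`http://localhost:11434`) and its `/api/tags` endpoint, which lists locally available models. The helper names `model_available` and `check_ollama` are hypothetical and not part of this repository.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address; adjust if yours differs


def model_available(tags: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    for entry in tags.get("models", []):
        name = entry.get("name", "")
        # "llama3" should match "llama3:latest"; a full tag must match exactly.
        if name == model or name.split(":")[0] == model:
            return True
    return False


def check_ollama(model: str = "llama3", timeout: float = 3.0) -> bool:
    """Query the local Ollama server; False if it is unreachable or the model is missing."""
    try:
        with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags", timeout=timeout) as resp:
            return model_available(json.load(resp), model)
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

Calling `check_ollama()` returns `True` only when the server answers and lists `llama3`; a `False` result usually means the server is not running or `ollama pull llama3` has not completed.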
---
## Running the Project
To start real-time captioning:
```bash
python main.py
```
The webcam will open, and you'll see live object detection with an enhanced caption displayed at the top of the screen. Press `q` to exit.
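
The behaviour described above follows the usual OpenCV loop: read a frame, draw the overlay, show it, and poll the keyboard. Here is a minimal sketch, assuming OpenCV (`cv2`) is installed and the webcam is device index 0. `run` and `is_quit_key` are hypothetical names, the window title and caption text are placeholders, and the actual `main.py` may be organised differently.

```python
def is_quit_key(key: int) -> bool:
    """True when the code returned by cv2.waitKey corresponds to 'q'."""
    # waitKey returns -1 when no key was pressed; otherwise compare the low byte.
    return key != -1 and (key & 0xFF) == ord("q")


def run(camera_index: int = 0) -> None:
    import cv2  # imported here so the helper above works without OpenCV installed

    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened():
        raise RuntimeError(f"Could not open camera {camera_index}")
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # camera disconnected or stream ended
            caption = "enhanced caption goes here"  # placeholder for the pipeline output
            # Draw the caption near the top-left corner of the frame.
            cv2.putText(frame, caption, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                        0.8, (0, 255, 0), 2)
            cv2.imshow("Dense Captioning", frame)
            if is_quit_key(cv2.waitKey(1)):
                break
    finally:
        # Always release the camera and close the window, even on errors.
        cap.release()
        cv2.destroyAllWindows()
```

Calling `run()` opens the camera window; the `finally` block ensures the device is released whether the loop ends via `q`, a read failure, or an exception.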
---
## Notes
- The LLaMA 3 model runs locally via Ollama. Ensure Ollama is installed, accessible from your system PATH, and running, since the caption enhancer calls Ollama's local API.
- Captions are generated with `microsoft/git-base`, and their quality may vary with image clarity.
- For faster inference, you can try another Ollama model such as `llama3.1` or `llama3.2`; smaller variants generally respond faster.
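
Because the enhancer talks to Ollama over its local HTTP API, switching models comes down to changing the model name in the request. The sketch below rests on a few assumptions: the default address `http://localhost:11434`, the non-streaming form of the `/api/generate` endpoint, and an illustrative prompt. The function names are hypothetical and are not taken from `caption_enhancer.py`.

```python
import json
import urllib.request
from typing import List

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"  # assumed default address


def build_request(caption: str, labels: List[str], model: str = "llama3") -> dict:
    """Build the JSON body for a single non-streaming /api/generate call."""
    prompt = (
        "Rewrite this image caption as one natural, descriptive sentence.\n"
        f"Caption: {caption}\n"
        f"Detected objects: {', '.join(labels) if labels else 'none'}"
    )
    # stream=False asks Ollama for one JSON object instead of a stream of chunks.
    return {"model": model, "prompt": prompt, "stream": False}


def enhance_caption(caption: str, labels: List[str], model: str = "llama3",
                    timeout: float = 30.0) -> str:
    """Send the caption to Ollama; fall back to the raw caption on any failure."""
    body = json.dumps(build_request(caption, labels, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_GENERATE_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            text = json.load(resp).get("response", "").strip()
            return text or caption
    except (OSError, ValueError):
        return caption
```

For example, `enhance_caption("a cat on a sofa", ["cat", "couch"], model="llama3.2")` would try the smaller model. Falling back to the raw caption keeps the video loop responsive when the server is slow or unavailable.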
---
## License
This project is for educational and research purposes. For commercial use, please check the licenses of Meta's LLaMA 3, Microsoft's GIT, and Ultralytics YOLOv8.