https://github.com/byjlw/video-analyzer

Analyze videos using LLMs, Computer Vision and Automatic Speech Recognition
https://github.com/byjlw/video-analyzer

asr llms video video-processing

Last synced: 2 months ago
JSON representation

Analyze videos using LLMs, Computer Vision and Automatic Speech Recognition

Host: GitHub
URL: https://github.com/byjlw/video-analyzer
Owner: byjlw
License: apache-2.0
Created: 2024-11-10T17:49:24.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-03-19T06:03:18.000Z (2 months ago)
Last Synced: 2026-03-19T21:21:09.552Z (2 months ago)
Topics: asr, llms, video, video-processing
Language: Python
Homepage:
Size: 366 KB
Stars: 1,321
Watchers: 19
Forks: 183
Open Issues: 2
Metadata Files:
- Readme: readme.md
- Contributing: docs/CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

# Video Analysis using vision models like Llama3.2 Vision and OpenAI's Whisper Models

A video analysis tool that combines vision models like Llama's 11B vision model and Whisper to create a description by taking key frames, feeding them to the vision model to get details. It uses the details from each frame and the transcript, if available, to describe what's happening in the video.

## Table of Contents
- [Features](#features)
- [Requirements](#requirements)
- [System Requirements](#system-requirements)
- [Installation](#installation)
- [Ollama Setup](#ollama-setup)
- [OpenAI-compatible API Setup](#openai-compatible-api-setup-optional)
- [Usage](#usage)
- [Quick Start](#quick-start)
- [Sample Output](#sample-output)
- [Complete Usage Guide](docs/USAGES.md)
- [Design](#design)
- [Detailed Design Documentation](docs/DESIGN.md)
- [Project Structure](#project-structure)
- [Configuration](#configuration)
- [Output](#output)
- [Prompt Tuning](#prompt-tuning)
- [Uninstallation](#uninstallation)
- [License](#license)
- [Contributing](#contributing)

## Features
- 💻 Can run completely locally - no cloud services or API keys needed
- ☁️ Or, leverage any OpenAI API compatible LLM service (openrouter, openai, etc) for speed and scale
- 🎬 Intelligent key frame extraction from videos
- 🔊 High-quality audio transcription using OpenAI's Whisper
- 👁️ Frame analysis using Ollama and Llama3.2 11B Vision Model
- 📝 Natural language descriptions of video content
- 🔄 Automatic handling of poor quality audio
- 📊 Detailed JSON output of analysis results
- ⚙️ Highly configurable through command line arguments or config file

## Design
The system operates in three stages:

1. Frame Extraction & Audio Processing
- Uses OpenCV to extract key frames
- Processes audio using Whisper for transcription
- Handles poor quality audio with confidence checks

2. Frame Analysis
- Analyzes each frame using vision LLM
- Each analysis includes context from previous frames
- Maintains chronological progression
- Uses frame_analysis.txt prompt template

3. Video Reconstruction
- Combines frame analyses chronologically
- Integrates audio transcript
- Uses first frame to set the scene
- Creates comprehensive video description

![Design](docs/design.png)

## Requirements

### System Requirements
- Python 3.11 or higher
- FFmpeg (required for audio processing)
- When running LLMs locally (not necessary when using openrouter)
- At least 16GB RAM (32GB recommended)
- GPU at least 12GB of VRAM or Apple M Series with at least 32GB

### Installation

1. Clone the repository:
```bash
git clone https://github.com/byjlw/video-analyzer.git
cd video-analyzer
```

2. Create and activate a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```

3. Install the package:
```bash
pip install . # For regular installation
# OR
pip install -e . # For development installation
```

4. Install FFmpeg:
- Ubuntu/Debian:
```bash
sudo apt-get update && sudo apt-get install -y ffmpeg
```
- macOS:
```bash
brew install ffmpeg
```
- Windows:
```bash
choco install ffmpeg
```

### Ollama Setup

1. Install Ollama following the instructions at [ollama.ai](https://ollama.ai)

2. Pull the default vision model:
```bash
ollama pull llama3.2-vision
```

3. Start the Ollama service:
```bash
ollama serve
```

### OpenAI-compatible API Setup (Optional)

If you want to use OpenAI-compatible APIs (like OpenRouter or OpenAI) instead of Ollama:

1. Get an API key from your provider:
- [OpenRouter](https://openrouter.ai)
- [OpenAI](https://platform.openai.com)

2. Configure via command line:
```bash
# For OpenRouter
video-analyzer video.mp4 --client openai_api --api-key your-key --api-url https://openrouter.ai/api/v1 --model gpt-4o

# For OpenAI
video-analyzer video.mp4 --client openai_api --api-key your-key --api-url https://api.openai.com/v1 --model gpt-4o
```

Or add to config/config.json:
```json
{
"clients": {
"default": "openai_api",
"openai_api": {
"api_key": "your-api-key",
"api_url": "https://openrouter.ai/api/v1" # or https://api.openai.com/v1
}
}
}
```

Note: With OpenRouter, you can use llama 3.2 11b vision for free by adding :free to the model name

## Design
For detailed information about the project's design and implementation, including how to make changes, see [docs/DESIGN.md](docs/DESIGN.md).

## Usage

For detailed usage instructions and all available options, see [docs/USAGES.md](docs/USAGES.md).

### Quick Start

```bash
# Local analysis with Ollama (default)
video-analyzer video.mp4

# Cloud analysis with OpenRouter
video-analyzer video.mp4 \
--client openai_api \
--api-key your-key \
--api-url https://openrouter.ai/api/v1 \
--model meta-llama/llama-3.2-11b-vision-instruct:free

# Analysis with custom prompt
video-analyzer video.mp4 \
--prompt "What activities are happening in this video?" \
--whisper-model large
```

## Output

The tool generates a JSON file (`output\analysis.json`) containing:
- Metadata about the analysis
- Audio transcript (if available)
- Frame-by-frame analysis
- Final video description

### Sample Output
```
The video begins with a person with long blonde hair, wearing a pink t-shirt and yellow shorts, standing in front of a black plastic tub or container on wheels. The ground appears to be covered in wood chips.\n\nAs the video progresses, the person remains facing away from the camera, looking down at something inside the tub. ........
```
full sample output in `docs/sample_analysis.json`
## Configuration

The tool uses a cascading configuration system with command line arguments taking highest priority, followed by user config (config/config.json), and finally the default config. See [docs/USAGES.md](docs/USAGES.md) for detailed configuration options.

## Prompt Tuning

The prompts that drive frame analysis and video reconstruction can be automatically optimized for your specific content and use case using [video-analyzer-tune](video-analyzer-tune/README.md).

```bash
pip install video-analyzer-tune
```

Run `video-analyzer` on a few representative videos, edit the outputs to show what ideal results look like, then let DSPy MIPROv2 find better prompt instructions automatically. The tuned prompts are written as new files you point to via your config — the main package is unaffected.

See [video-analyzer-tune/README.md](video-analyzer-tune/README.md) for full instructions.

## Uninstallation

To uninstall the package:
```bash
pip uninstall video-analyzer
```

## License

Apache License

## Contributing

We welcome contributions! Please see [docs/CONTRIBUTING.md](docs/CONTRIBUTING.md) for detailed guidelines on how to:
- Review the project design
- Propose changes through GitHub Discussions
- Submit pull requests

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/byjlw/video-analyzer

Awesome Lists containing this project

README