Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tensorsense/vlm_databuilder
This SDK generates datasets for training Video LLMs from YouTube videos.
- Host: GitHub
- URL: https://github.com/tensorsense/vlm_databuilder
- Owner: tensorsense
- License: MIT
- Created: 2024-07-03T13:21:39.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-28T07:34:42.000Z (4 months ago)
- Last Synced: 2024-08-28T12:45:17.592Z (4 months ago)
- Topics: data-generation, data-science, llm, video-llms, vlm
- Language: Python
- Homepage: https://tensorsense.com
- Size: 571 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Vlm_Databuilder - This SDK generates datasets for training Video LLMs from YouTube videos. (Building / Datasets)
README
# TensorSense Data Generation SDK
This SDK generates datasets for training Video LLMs from YouTube videos. More sources coming later!
## 🐠 What it Does
- Generate search queries with GPT.
- Search YouTube for videos matching each query using [scrapetube](https://github.com/dermasmid/scrapetube) (sketched below).
- Download the videos that were found, along with their subtitles, using [yt-dlp](https://github.com/yt-dlp/yt-dlp) (sketched below).
- Detect segments in each video using CLIP and a fancy manual algorithm (sketched below).
- Generate annotations for each segment with GPT using the audio transcript (e.g. instructions) in two steps: first extract clues from the transcript, then generate annotations based on those clues (sketched below).
- Aggregate the annotated segments into a single file.
- Cut segments into separate video clips with [ffmpeg](https://ffmpeg.org/) (sketched below).

In the end you'll have a directory of useful video clips and an annotation file, which you can then train a model on.
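A minimal sketch of the search step, assuming scrapetube's `get_search()` generator API; the queries are illustrative stand-ins for the GPT-generated ones:

```python
import scrapetube

# In the real pipeline these come from GPT; hardcoded here for illustration.
queries = ["how to sharpen a kitchen knife", "basic yoga warm-up routine"]

urls = []
for query in queries:
    # get_search() yields raw YouTube video dicts; "videoId" holds the ID.
    for video in scrapetube.get_search(query, limit=5):
        urls.append(f"https://www.youtube.com/watch?v={video['videoId']}")
```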
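The download step might look like the sketch below; the output template, container format, and subtitle languages are assumptions, not the SDK's actual settings:

```python
import yt_dlp

ydl_opts = {
    "format": "mp4",                       # prefer an mp4 download
    "outtmpl": "data/videos/%(id)s.%(ext)s",
    "writesubtitles": True,                # manually uploaded subtitles
    "writeautomaticsub": True,             # fall back to auto-generated captions
    "subtitleslangs": ["en"],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)  # the URLs found in the search step
```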
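The segmentation algorithm itself isn't documented here, so the sketch below shows only the general idea: embed sampled frames with CLIP and start a new segment whenever neighboring embeddings diverge. The model checkpoint, threshold, and frame sampling are all placeholder choices:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def segment_boundaries(frames, timestamps, threshold=0.85):
    """frames: PIL images sampled from the video; timestamps: their times in seconds."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity

    boundaries = [timestamps[0]]
    for i in range(1, len(frames)):
        # A drop in similarity between neighboring frames suggests a scene change.
        if (emb[i] @ emb[i - 1]).item() < threshold:
            boundaries.append(timestamps[i])
    return boundaries
```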
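The two-step annotation flow could be sketched with the OpenAI Python client as below; the prompts and model name are hypothetical, and the real pipeline additionally relies on structured output and image input (see the `openai.deployment` note under Installation):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(transcript: str, model: str = "gpt-4o") -> str:
    # Step 1: extract concrete clues (actions, objects, instructions) from the transcript.
    clues = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"List the concrete instructions and actions in this transcript:\n{transcript}",
        }],
    ).choices[0].message.content

    # Step 2: turn the extracted clues into a training annotation.
    return client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Write a short annotation for a video segment based on these clues:\n{clues}",
        }],
    ).choices[0].message.content
```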
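Finally, cutting the clips can reduce to one ffmpeg stream-copy per segment; paths and timestamps here are made up:

```python
import subprocess

def cut_clip(src: str, start: float, end: float, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y",
         "-ss", str(start),       # seek to the segment start
         "-i", src,
         "-t", str(end - start),  # clip duration
         "-c", "copy",            # stream copy: fast, no re-encoding
         dst],
        check=True,
    )

cut_clip("data/videos/abc123.mp4", 12.5, 47.0, "data/clips/abc123_000.mp4")
```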
## 🐬 Installation
- Run `pip install -r requirements.txt`. If it doesn't work, try upgrading the packages with `pip install -U -r requirements.txt`.
- Create a `.env` file with:
  - `OPENAI_API_KEY` for OpenAI
  - `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` for Azure
  - `OPENAI_API_VERSION='2023-07-01-preview'`
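  For example, a `.env` for the Azure setup might look like this (the endpoint and key are placeholders):

  ```
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
  AZURE_OPENAI_API_KEY=your-key-here
  OPENAI_API_VERSION='2023-07-01-preview'
  ```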
- Set config params in the notebook (see the sketch after this list):
  - `openai.type`: `openai` or `azure`
  - `openai.temperature`: the higher it is, the more random/creative the output
  - `openai.deployment`: model for OpenAI / deployment for Azure. It needs to support structured output and image input; tested with gpt-4o on Azure.
  - `data_dir`: the path where all the results will be saved. Change it for each experiment/dataset.
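Putting those parameters together, the config might look like the sketch below; the actual variable names in the notebook may differ:

```python
# Hypothetical notebook config mirroring the parameters above.
config = {
    "openai": {
        "type": "azure",            # "openai" or "azure"
        "temperature": 0.7,         # higher = more random/creative output
        "deployment": "gpt-4o",     # model (OpenAI) / deployment name (Azure)
    },
    "data_dir": "data/my_experiment",  # change per experiment/dataset
}
```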
## 🐙 Usage

Please refer to [getting_started.ipynb](./getting_started.ipynb).
If you have your own videos with descriptions, you can skip the download and filtering steps and move straight to generating annotations!