Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/tensorsense/vlm_databuilder

This SDK generates datasets for training Video LLMs from youtube videos.
https://github.com/tensorsense/vlm_databuilder

data-generation data-science llm video-llms vlm

Last synced: 3 days ago
JSON representation

This SDK generates datasets for training Video LLMs from youtube videos.

Awesome Lists containing this project

README

        

# TensorSense Data Generation SDK

This SDK generates datasets for training Video LLMs from youtube videos. More sources coming later!

## 🐠 What it Does
- Generate search queries with GPT.
- Search for youtube videos for each query using [scrapetube](https://github.com/dermasmid/scrapetube).
- Download the videos that were found and subtitles using [yt-dlp](https://github.com/yt-dlp/yt-dlp).
- Detect segments from each video using CLIP and a fancy manual algorithm.
- Generate annotations for each segment with GPT using audio transcript (eg instructions) in 2 steps: first extract clues from the trancript, then generate annotations based on these clues.
- Aggregate segments with annotations into one file
- Cut segments into separate video clips with [ffmpeg](https://ffmpeg.org/).

In the end you'll have a directory with useful video clips and an annotation file, which you can then train a model on.

## 🐬 Installation
- `pip install -r requirements.txt`. If it doesn't work, try updating `pip install -U -r requirements.txt`.
- make `.env` file with:
- `OPENAI_API_KEY` for openai
- `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` for azure
- `OPENAI_API_VERSION='2023-07-01-preview'`
- set config params in the notebook:
- `openai.type`: openai/azure
- `openai.temperature`: the bigger, the more random/creative output will be
- `openai.deployment`: model for openai / deployment for azure. Needs to be able to do structured output and process images. Tested on gpt4o on azure.
- `data_dir`: the path where all the results will be saved. Change it for each experiment/dataset.

## 🐙 Usage

Please refer to [getting_started.ipynb](./getting_started.ipynb)

If you have your own videos with descriptions, you can skip the download/filtering steps and move straight to generating annotaions!