Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tensorsense/vlm_databuilder
This SDK generates datasets for training Video LLMs from YouTube videos.
- Host: GitHub
- URL: https://github.com/tensorsense/vlm_databuilder
- Owner: tensorsense
- License: MIT
- Created: 2024-07-03T13:21:39.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-28T07:34:42.000Z (4 months ago)
- Last Synced: 2024-08-28T12:45:17.592Z (4 months ago)
- Topics: data-generation, data-science, llm, video-llms, vlm
- Language: Python
- Homepage: https://tensorsense.com
- Size: 571 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome_ai_agents - Vlm_Databuilder - This SDK generates datasets for training Video LLMs from YouTube videos. (Building / Datasets)
README
# TensorSense Data Generation SDK
This SDK generates datasets for training Video LLMs from YouTube videos. More sources coming later!
## 🐠 What it Does
- Generate search queries with GPT.
- Search YouTube for videos matching each query using [scrapetube](https://github.com/dermasmid/scrapetube) (sketched below).
- Download the videos that were found, along with their subtitles, using [yt-dlp](https://github.com/yt-dlp/yt-dlp) (sketched below).
- Detect segments in each video using CLIP and a fancy manual algorithm (sketched below).
- Generate annotations for each segment with GPT using the audio transcript (e.g. instructions) in two steps: first extract clues from the transcript, then generate annotations based on those clues (sketched below).
- Aggregate the annotated segments into a single file.
- Cut segments into separate video clips with [ffmpeg](https://ffmpeg.org/) (sketched below).

In the end you'll have a directory of useful video clips and an annotation file, which you can then train a model on.
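A minimal sketch of the search step, assuming scrapetube's `get_search()` generator API; the queries are illustrative stand-ins for the GPT-generated ones:

```python
import scrapetube

# In the real pipeline these come from GPT; hardcoded here for illustration.
queries = ["how to sharpen a kitchen knife", "basic yoga warm-up routine"]

urls = []
for query in queries:
    # get_search() yields raw YouTube video dicts; "videoId" holds the ID.
    for video in scrapetube.get_search(query, limit=5):
        urls.append(f"https://www.youtube.com/watch?v={video['videoId']}")
```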
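The download step might look like the sketch below; the output template, container format, and subtitle languages are assumptions, not the SDK's actual settings:

```python
import yt_dlp

ydl_opts = {
    "format": "mp4",                       # prefer an mp4 download
    "outtmpl": "data/videos/%(id)s.%(ext)s",
    "writesubtitles": True,                # manually uploaded subtitles
    "writeautomaticsub": True,             # fall back to auto-generated captions
    "subtitleslangs": ["en"],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)  # the URLs found in the search step
```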
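The segmentation algorithm itself isn't documented here, so the sketch below shows only the general idea: embed sampled frames with CLIP and start a new segment whenever neighboring embeddings diverge. The model checkpoint, threshold, and frame sampling are all placeholder choices:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def segment_boundaries(frames, timestamps, threshold=0.85):
    """frames: PIL images sampled from the video; timestamps: their times in seconds."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity

    boundaries = [timestamps[0]]
    for i in range(1, len(frames)):
        # A drop in similarity between neighboring frames suggests a scene change.
        if (emb[i] @ emb[i - 1]).item() < threshold:
            boundaries.append(timestamps[i])
    return boundaries
```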
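The two-step annotation flow could be sketched with the OpenAI Python client as below; the prompts and model name are hypothetical, and the real pipeline additionally relies on structured output and image input (see the `openai.deployment` note under Installation):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate(transcript: str, model: str = "gpt-4o") -> str:
    # Step 1: extract concrete clues (actions, objects, instructions) from the transcript.
    clues = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"List the concrete instructions and actions in this transcript:\n{transcript}",
        }],
    ).choices[0].message.content

    # Step 2: turn the extracted clues into a training annotation.
    return client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Write a short annotation for a video segment based on these clues:\n{clues}",
        }],
    ).choices[0].message.content
```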
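Finally, cutting the clips can reduce to one ffmpeg stream-copy per segment; paths and timestamps here are made up:

```python
import subprocess

def cut_clip(src: str, start: float, end: float, dst: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y",
         "-ss", str(start),       # seek to the segment start
         "-i", src,
         "-t", str(end - start),  # clip duration
         "-c", "copy",            # stream copy: fast, no re-encoding
         dst],
        check=True,
    )

cut_clip("data/videos/abc123.mp4", 12.5, 47.0, "data/clips/abc123_000.mp4")
```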
## 🐬 Installation
- Run `pip install -r requirements.txt`. If it doesn't work, try upgrading the packages with `pip install -U -r requirements.txt`.
- Create a `.env` file with:
  - `OPENAI_API_KEY` for OpenAI
  - `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` for Azure
  - `OPENAI_API_VERSION='2023-07-01-preview'`
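  For example, a `.env` for the Azure setup might look like this (the endpoint and key are placeholders):

  ```
  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
  AZURE_OPENAI_API_KEY=your-key-here
  OPENAI_API_VERSION='2023-07-01-preview'
  ```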
- Set config params in the notebook (see the sketch after this list):
  - `openai.type`: `openai` or `azure`
  - `openai.temperature`: the higher it is, the more random/creative the output
  - `openai.deployment`: model for OpenAI / deployment for Azure. It needs to support structured output and image input; tested with gpt-4o on Azure.
  - `data_dir`: the path where all the results will be saved. Change it for each experiment/dataset.
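Putting those parameters together, the config might look like the sketch below; the actual variable names in the notebook may differ:

```python
# Hypothetical notebook config mirroring the parameters above.
config = {
    "openai": {
        "type": "azure",            # "openai" or "azure"
        "temperature": 0.7,         # higher = more random/creative output
        "deployment": "gpt-4o",     # model (OpenAI) / deployment name (Azure)
    },
    "data_dir": "data/my_experiment",  # change per experiment/dataset
}
```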
## 🐙 Usage

Please refer to [getting_started.ipynb](./getting_started.ipynb).
If you have your own videos with descriptions, you can skip the download and filtering steps and move straight to generating annotations!