{"id":23932886,"url":"https://github.com/tensorsense/vlm_databuilder","last_synced_at":"2025-09-11T15:32:51.157Z","repository":{"id":247465201,"uuid":"823665769","full_name":"tensorsense/vlm_databuilder","owner":"tensorsense","description":"This SDK generates datasets for training Video LLMs from youtube videos.","archived":false,"fork":false,"pushed_at":"2024-08-28T07:34:42.000Z","size":585,"stargazers_count":5,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-08-28T12:45:17.592Z","etag":null,"topics":["data-generation","data-science","llm","video-llms","vlm"],"latest_commit_sha":null,"homepage":"https://tensorsense.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tensorsense.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-03T13:21:39.000Z","updated_at":"2024-08-25T17:25:04.000Z","dependencies_parsed_at":"2024-08-26T12:17:50.564Z","dependency_job_id":null,"html_url":"https://github.com/tensorsense/vlm_databuilder","commit_stats":null,"previous_names":["tensorsense/datagen_sdk","tensorsense/vlm_databuilder"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorsense%2Fvlm_databuilder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorsense%2Fvlm_databuilder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tensorsense%2Fvlm_databuilder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/tensorsense%2Fvlm_databuilder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tensorsense","download_url":"https://codeload.github.com/tensorsense/vlm_databuilder/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232657722,"owners_count":18556888,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-generation","data-science","llm","video-llms","vlm"],"created_at":"2025-01-06T00:29:24.950Z","updated_at":"2025-01-06T00:29:51.326Z","avatar_url":"https://github.com/tensorsense.png","language":"Python","funding_links":[],"categories":["Building"],"sub_categories":["Datasets"],"readme":"# TensorSense Data Generation SDK\n\nThis SDK generates datasets for training Video LLMs from youtube videos. 
More sources coming later!\n\n## 🐠 What it Does\n- Generate search queries with GPT.\n- Search YouTube for videos matching each query using [scrapetube](https://github.com/dermasmid/scrapetube).\n- Download the videos that were found, along with their subtitles, using [yt-dlp](https://github.com/yt-dlp/yt-dlp).\n- Detect segments from each video using CLIP and a fancy manual algorithm.\n- Generate annotations for each segment with GPT using the audio transcript (e.g. instructions) in two steps: first extract clues from the transcript, then generate annotations based on these clues.\n- Aggregate segments with annotations into one file.\n- Cut segments into separate video clips with [ffmpeg](https://ffmpeg.org/).\n\nIn the end you'll have a directory with useful video clips and an annotation file, which you can then train a model on.\n\n## 🐬 Installation\n- Run `pip install -r requirements.txt`. If it doesn't work, try upgrading the packages with `pip install -U -r requirements.txt`.\n- Create a `.env` file with:\n    - `OPENAI_API_KEY` for OpenAI\n    - `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` for Azure\n    - `OPENAI_API_VERSION='2023-07-01-preview'`\n- Set config parameters in the notebook:\n    - `openai.type`: `openai` or `azure`\n    - `openai.temperature`: higher values make the output more random/creative\n    - `openai.deployment`: model name for OpenAI / deployment name for Azure. The model needs to support structured output and image input. Tested with GPT-4o on Azure.\n    - `data_dir`: the path where all the results will be saved. 
Change it for each experiment/dataset.\n\n## 🐙 Usage\n\nPlease refer to [getting_started.ipynb](./getting_started.ipynb).\n\nIf you have your own videos with descriptions, you can skip the download/filtering steps and move straight to generating annotations!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorsense%2Fvlm_databuilder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftensorsense%2Fvlm_databuilder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftensorsense%2Fvlm_databuilder/lists"}