{"id":19279782,"url":"https://github.com/showlab/univtg","last_synced_at":"2025-04-05T23:08:29.499Z","repository":{"id":185042939,"uuid":"646096023","full_name":"showlab/UniVTG","owner":"showlab","description":"[ICCV 2023] UniVTG: Towards Unified Video-Language Temporal Grounding","archived":false,"fork":false,"pushed_at":"2024-05-08T15:15:34.000Z","size":23760,"stargazers_count":345,"open_issues_count":19,"forks_count":32,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-29T22:06:23.971Z","etag":null,"topics":["highlight-detection","moment-retrieval","pretraining","video-grounding","video-language","video-summarization"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2307.16715","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/showlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-27T09:16:48.000Z","updated_at":"2025-03-28T04:49:32.000Z","dependencies_parsed_at":null,"dependency_job_id":"5f7ed0e8-cb60-4fc2-a461-f33d86a8ab83","html_url":"https://github.com/showlab/UniVTG","commit_stats":{"total_commits":131,"total_committers":4,"mean_commits":32.75,"dds":"0.022900763358778664","last_synced_commit":"32659ac7aeba21742a63274f30eba785fc57e247"},"previous_names":["showlab/univtg"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FUniVTG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FUniVTG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/showlab%2FUniVTG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FUniVTG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/showlab","download_url":"https://codeload.github.com/showlab/UniVTG/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247411234,"owners_count":20934653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["highlight-detection","moment-retrieval","pretraining","video-grounding","video-language","video-summarization"],"created_at":"2024-11-09T21:16:06.711Z","updated_at":"2025-04-05T23:08:29.478Z","avatar_url":"https://github.com/showlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":" # UniVTG (ICCV'23)\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/univtg-towards-unified-video-language/highlight-detection-on-qvhighlights)](https://paperswithcode.com/sota/highlight-detection-on-qvhighlights?p=univtg-towards-unified-video-language) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/univtg-towards-unified-video-language/moment-retrieval-on-qvhighlights)](https://paperswithcode.com/sota/moment-retrieval-on-qvhighlights?p=univtg-towards-unified-video-language)\n\n[[arXiv]](https://arxiv.org/abs/2307.16715) \u003ca src=\"https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue\" href=\"https://huggingface.co/spaces/KevinQHLin/UniVTG\"\u003e\n    \u003cimg 
src=\"https://img.shields.io/badge/%F0%9F%A4%97-Open%20in%20Spaces-blue\" alt=\"Open in Spaces\"\u003e\n\u003c/a\u003e\n\u003ca src=\"https://img.shields.io/twitter/url?color=blue\u0026label=Tweet\u0026logo=twitter\u0026url=https%3A%2F%2Ftwitter.com%2FKevinQHLin%2Fstatus%2F1649124447037841408\" href=\"https://twitter.com/KevinQHLin/status/1686223119718006784\"\u003e\n    \u003cimg src=\"https://img.shields.io/twitter/url?color=blue\u0026label=Tweet\u0026logo=twitter\u0026url=https%3A%2F%2Ftwitter.com%2FKevinQHLin%2Fstatus%2F1649124447037841408\" alt=\"Tweet\"\u003e\n\u003c/a\u003e\n \n\u003e **TL;DR:** The first video temporal grounding pretraining model, unifying diverse temporal annotations to power moment retrieval (interval), highlight detection (curve) and video summarization (point).\n\n![UniVTG](figures/univtg_demo.gif)\n\n### 📢 News\n- [2023.10.15] Uploaded the CLIP teacher scripts to create scalable pseudo annotations.\n- [2023.8.22] Cleaned up the code, added training/inference instructions, and uploaded all downstream checkpoints.\n- [2023.8.6] Created the [Hugging Face Space demo](https://huggingface.co/spaces/KevinQHLin/UniVTG)!\n- [2023.7.31] Released the arXiv paper, code, checkpoints, and Gradio demo.\n\n### 📝 Todo\n- [ ] Connect UniVTG with an LLM, e.g., ChatGPT.\n- [x] Upload all downstream checkpoints.\n- [x] Upload all pretraining checkpoints.\n\n## 🌟 Run on your video:\nTo power practical usage, we release the following checkpoints:\n\n*These checkpoints run on a single GPU with less than 4GB of memory and are highly efficient: temporal grounding takes less than 1 second, even on a 10-minute video.*\n\n\u003e \n| Video Enc.  | Text Enc.  
| Pretraining            | Fine-tuning   |  Checkpoints |\n| ------------------ |  ------------------ | ------------------ | ------- | ---- |\n| CLIP-B/16 | CLIP-B/16 | 4M      | -      |   [Google Drive](https://drive.google.com/drive/folders/1-eGata6ZPV0A1BBsZpYyIooos9yjMx2f?usp=sharing)  |\n| CLIP-B/16 | CLIP-B/16 | 4M | QVHL + Charades + NLQ + TACoS + ActivityNet + DiDeMo      |  [Google Drive](https://drive.google.com/drive/folders/1l6RyjGuqkzfZryCC6xwTZsvjWaIMVxIO?usp=sharing)  |\n\nDownload a checkpoint and put it in the `results/omni` directory.\n\nDownload the example videos from [here](https://drive.google.com/drive/folders/1TpMYRmdAx5yx-lQu4ivCnAX67voUfBcL?usp=sharing) and put them under `examples/`.\n\nRun `python3 main_gradio.py --resume ./results/omni/model_best.ckpt`\n\n\u003cdetails open\u003e\u003csummary\u003e[ Youtube video ]\u003c/summary\u003e\u003cimg src=\"./figures/case1.jpg\" alt=\"Youtube video\" style=\"width: 100%; height: auto;\"\u003e\n\u003c/details\u003e\n\u003cdetails open\u003e\u003csummary\u003e[ Egocentric video ]\u003c/summary\u003e\u003cimg src=\"./figures/case3.jpg\" alt=\"Egocentric video\" style=\"width: 100%; height: auto;\"\u003e\n\u003c/details\u003e\n\u003cdetails open\u003e\u003csummary\u003e[ Charades video  ]\u003c/summary\u003e\u003cimg src=\"./figures/case2.jpg\" alt=\"Charades video\" style=\"width: 100%; height: auto;\"\u003e\n\u003c/details\u003e\n\n## ⚙️ Preparation\n\nPlease find instructions in [install.md](install.md) to set up the environment and datasets.\n\n## 📦 Model Zoo\n\nDownload the checkpoints listed in [model.md](model.md) to reproduce the benchmark results.\n\n## 🚀 Training \u0026 Inference\n\u003e We use Slurm to run jobs; if you do not use Slurm, you may need to slightly modify the code to fit your environment.\n### Pretraining (multi-gpu)\n\nLarge-scale pretraining: `bash scripts/pretrain.sh`\n\nMulti-dataset co-training: `bash scripts/cotrain.sh`\n\n### Downstream (single-gpu)\n*Indicate `--resume` to init 
the model with pretrained weights. **Refer to our model zoo for detailed parameter settings.***\n\nTraining: `bash scripts/qvhl_pretrain.sh`\n\n\n*Indicate `--eval_init` and `--n_epoch=0` to evaluate the checkpoint selected by `--resume`.*\n\nInference: `bash scripts/qvhl_inference.sh`\n\n### CLIP teacher to create scalable pseudo labels\n\n1. Download the Open Images V6 class list from `https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv`.\n\n2. Convert it to JSON with `python3 teacher/csv2json.py`, then extract the textual class features with `python3 teacher/label2feature.py`.\n\n3. Run `python3 teacher/clip2labels.py` to generate the pseudo labels. (The video features must already be extracted before this step.)\n\n\n## 🎨 Visualization\n\nTo draw visualizations like those in our paper, run `python3 plot/qvhl.py` on the prediction JSON files (available from the [Model Zoo](https://github.com/showlab/UniVTG/blob/main/model.md)) to generate the corresponding figures.\n\n![visualization](figures/plot_qvhl.jpg)\n\n## 🎓 Citation\nIf you find our work helpful, please cite our paper.\n\n```\n@misc{lin2023univtg,\n      title={UniVTG: Towards Unified Video-Language Temporal Grounding}, \n      author={Kevin Qinghong Lin and Pengchuan Zhang and Joya Chen and Shraman Pramanick and Difei Gao and Alex Jinpeng Wang and Rui Yan and Mike Zheng Shou},\n      year={2023},\n      eprint={2307.16715},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n## ✉️ Contact\nThis repo is maintained by [Kevin](https://qinghonglin.github.io/). 
Questions and discussions are welcome via kevin.qh.lin@gmail.com or by opening an issue.\n\n## 😊 Acknowledgement\n\nThis codebase is based on [moment_detr](https://github.com/jayleicn/moment_detr), [HERO_Video_Feature_Extractor](https://github.com/linjieli222/HERO_Video_Feature_Extractor), and [UMT](https://github.com/tencentarc/umt).\n\nWe thank the authors for their open-source contributions.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Funivtg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshowlab%2Funivtg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Funivtg/lists"}