{"id":13907316,"url":"https://github.com/jy0205/LaVIT","last_synced_at":"2025-07-18T05:31:24.993Z","repository":{"id":199212669,"uuid":"689177560","full_name":"jy0205/LaVIT","owner":"jy0205","description":"LaVIT: Empower the Large Language Model to Understand and Generate Visual Content","archived":false,"fork":false,"pushed_at":"2024-10-06T15:53:07.000Z","size":87535,"stargazers_count":541,"open_issues_count":8,"forks_count":29,"subscribers_count":14,"default_branch":"main","last_synced_at":"2024-11-25T15:52:30.967Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jy0205.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-09T02:21:27.000Z","updated_at":"2024-11-25T14:56:59.000Z","dependencies_parsed_at":null,"dependency_job_id":"aa389495-54f5-49d4-a004-eaf03f37070c","html_url":"https://github.com/jy0205/LaVIT","commit_stats":null,"previous_names":["jy0205/lavit"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jy0205/LaVIT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy0205%2FLaVIT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy0205%2FLaVIT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy0205%2FLaVIT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy0205%2FLaVIT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jy0205","download_url":"https://codel
oad.github.com/jy0205/LaVIT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jy0205%2FLaVIT/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265705342,"owners_count":23814430,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T23:01:53.197Z","updated_at":"2025-07-18T05:31:24.835Z","avatar_url":"https://github.com/jy0205.png","language":"Jupyter Notebook","funding_links":[],"categories":["HarmonyOS","Multimodal Large Models"],"sub_categories":["Windows Manager","Resource Transfer and Download"],"readme":"# LaVIT: Empower the Large Language Model to Understand and Generate Visual Content\n\nThis is the official repository for the multi-modal large language models **LaVIT** and **Video-LaVIT**. The LaVIT project aims to leverage the exceptional capabilities of LLMs to handle visual content. 
The proposed pre-training strategy supports visual understanding and generation within one unified framework.\n\n* Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization, ICLR 2024, [[`arXiv`](https://arxiv.org/abs/2309.04669)] [[`BibTeX`](#Citing)]\n\n* Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization, ICML 2024 Oral, [[`arXiv`](https://arxiv.org/abs/2402.03161)] [[`Project`](https://video-lavit.github.io)] [[`BibTeX`](#Citing)]\n\n\n\n## News and Updates\n\n* ```2024.06.01``` 👏👏👏 Video-LaVIT has been accepted by ICML 2024 as an Oral presentation!\n\n* ```2024.04.21``` 🚀🚀🚀 We have released the pre-trained weights for **Video-LaVIT** on HuggingFace and provided the inference code.\n\n* ```2024.02.05``` 🌟🌟🌟  We have proposed **Video-LaVIT**, an effective multimodal pre-training approach that empowers LLMs to comprehend and generate video content in a unified framework.\n\n* ```2024.01.15``` 👏👏👏 LaVIT has been accepted by ICLR 2024!\n\n* ```2023.10.17``` 🚀🚀🚀  We have released the pre-trained weights for **LaVIT** on HuggingFace and provided inference code for both multi-modal understanding and generation.\n\n\n## Introduction\n**LaVIT** and **Video-LaVIT** are general-purpose multi-modal foundation models that inherit the successful learning paradigm of LLMs: predicting the next visual or textual token in an auto-regressive manner. The core design of the LaVIT series includes a **visual tokenizer** and a **detokenizer**. The visual tokenizer translates non-linguistic visual content (e.g., images, videos) into a sequence of discrete tokens, like a foreign language that an LLM can read. 
The detokenizer maps the discrete tokens generated by the LLM back to continuous visual signals.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"LaVIT/assets/pipeline.png\"/\u003e\n\u003c/div\u003e\u003cbr/\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n  LaVIT Pipeline\n\u003c/div\u003e\u003cbr/\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"VideoLaVIT/assets/pipeline.jpg\"/\u003e\n\u003c/div\u003e\u003cbr/\u003e\n\n\u003cdiv align=\"center\"\u003e\n  Video-LaVIT Pipeline\n\u003c/div\u003e\u003cbr/\u003e\n\nAfter pre-training, LaVIT and Video-LaVIT support:\n\n* Reading image and video content, generating captions, and answering questions.\n* Text-to-image, text-to-video, and image-to-video generation.\n* Generation from multi-modal prompts.\n\n## \u003ca name=\"Citing\"\u003e\u003c/a\u003eCitation\nConsider giving this repository a star and citing LaVIT in your publications if it helps your research.\n\n```\n@inproceedings{jin2024unified,\n  title={Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization},\n  author={Jin, Yang and Xu, Kun and Xu, Kun and Chen, Liwei and Liao, Chao and Tan, Jianchao and Mu, Yadong and others},\n  booktitle={International Conference on Learning Representations},\n  year={2024}\n}\n\n@inproceedings{jin2024video,\n  title={Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization},\n  author={Jin, Yang and Sun, Zhicheng and Xu, Kun and Chen, Liwei and Jiang, Hao and Huang, Quzhe and Song, Chengru and Liu, Yuliang and Zhang, Di and Song, Yang and Gai, Kun and Mu, Yadong},\n  booktitle={International Conference on Machine Learning},\n  pages={22185--22209},\n  year={2024}\n}","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjy0205%2FLaVIT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjy0205%2FLaVIT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjy0205%2FLaVIT/lists"}