{"id":13958478,"url":"https://github.com/sally-sh/vsp-llm","last_synced_at":"2025-07-21T00:31:03.636Z","repository":{"id":223827237,"uuid":"761539619","full_name":"Sally-SH/VSP-LLM","owner":"Sally-SH","description":null,"archived":false,"fork":false,"pushed_at":"2024-05-19T06:49:21.000Z","size":19572,"stargazers_count":298,"open_issues_count":1,"forks_count":25,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-11-07T04:40:31.452Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sally-SH.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-22T02:40:04.000Z","updated_at":"2024-10-17T13:59:08.000Z","dependencies_parsed_at":"2024-05-19T07:47:09.452Z","dependency_job_id":null,"html_url":"https://github.com/Sally-SH/VSP-LLM","commit_stats":null,"previous_names":["sally-sh/vsp-llm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sally-SH%2FVSP-LLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sally-SH%2FVSP-LLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sally-SH%2FVSP-LLM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sally-SH%2FVSP-LLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sally-SH","download_url":"https://codeload.github.com/Sally-SH/VSP-LLM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":226850003,"owners_count":17691896,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-08T13:01:37.599Z","updated_at":"2025-07-21T00:31:03.631Z","avatar_url":"https://github.com/Sally-SH.png","language":"Python","funding_links":[],"categories":["其他_机器视觉"],"sub_categories":["网络服务_其他"],"readme":"# VSP-LLM (Visual Speech Processing incorporated with LLMs)\n\nThis is the PyTorch code for [Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing](https://arxiv.org/abs/2402.15151). This code is built on the code of [AV-HuBERT](https://github.com/facebookresearch/av_hubert).\n\n## Introduction\n\nWe propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform the multiple tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focusing on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. 
Through the proposed deduplication and Low Rank Adaptors (LoRA), VSP-LLM can be trained in a computationally efficient manner.\n\n![vsr-vst](docs/demo.gif)\n\n## Model checkpoint\n\nYou can find the checkpoint of our model [here](https://drive.google.com/drive/folders/1aBnm8XOWlRAGjPwcK2mYEGd8insNCx13?usp=sharing).\nMove the checkpoint to [`checkpoints`](checkpoints/).\n\n## Preparation\n\n```\nconda create -n vsp-llm python=3.9 -y\nconda activate vsp-llm\ngit clone https://github.com/Sally-SH/VSP-LLM.git\ncd VSP-LLM\npip install -r requirements.txt\ncd fairseq\npip install --editable ./\n```\n\n- Download AV-HuBERT pre-trained model `AV-HuBERT Large (LRS3 + VoxCeleb2)` from [here](http://facebookresearch.github.io/av_hubert).\n- Download LLaMA2-7B from [here](https://huggingface.co/meta-llama/Llama-2-7b-hf).\n\nMove the AV-HuBERT pre-trained model checkpoint and the LLaMA2-7B checkpoint to [`checkpoints`](checkpoints/).\n\n## Data preprocessing\nFollow [Auto-AVSR preparation](https://github.com/mpc001/auto_avsr/tree/main/preparation) to preprocess the LRS3 dataset.\\\nThen, follow [AV-HuBERT preparation](https://github.com/facebookresearch/av_hubert/tree/main/avhubert/preparation) from step 3 to create the manifest of the LRS3 dataset.\n\n### Generate visual speech unit and cluster counts file\nFollow the steps in [`clustering`](src/clustering/) to create:\n- `{train,valid}.km` frame-aligned pseudo label files.\nThe `label_rate` is the same as the feature frame rate used for clustering,\nwhich is 25Hz for AV-HuBERT features by default.\n\n### Dataset layout\n\n    .\n    ├── lrs3\n    │     ├── lrs3_video_seg24s               # Preprocessed video and audio data\n    │     └── lrs3_text_seg24s                # Preprocessed text data\n    ├── muavic_dataset                        # Mix of VSR data and VST(En-X) data\n    │     ├── train.tsv                       # List of audio and video path for training\n    │     ├── train.wrd                       # List of target label 
for training\n    │     ├── train.cluster_counts            # List of clusters to deduplicate speech units in training\n    │     ├── test.tsv                        # List of audio and video path for testing\n    │     ├── test.wrd                        # List of target label for testing\n    │     └── test.cluster_counts             # List of clusters to deduplicate speech units in testing\n    └── test_data\n          ├── vsr\n          │    └── en\n          │        ├── test.tsv \n          │        ├── test.wrd  \n          │        └── test.cluster_counts           \n          └── vst\n               └── en\n                   ├── es\n                   :   ├── test.tsv\n                   :   ├── test.wrd \n                   :   └── test.cluster_counts\n                   └── pt\n                       ├── test.tsv\n                       ├── test.wrd \n                       └── test.cluster_counts\n\n### Test data\nThe test manifest is provided in [`labels`](labels/). Replace the LRS3 path in the manifest file with the path to your preprocessed LRS3 dataset using the following command:\n```bash\ncd src/dataset\npython replace_path.py --lrs3 /path/to/lrs3\n```\nThe modified test manifest is then saved in [`dataset`](src/dataset/).\n\n## Training\n\nOpen the training script ([`scripts/train.sh`](https://github.com/Sally-SH/VSP-LLM/blob/main/scripts/train.sh)) and replace these variables:\n\n```bash\n# path to train dataset dir\nDATA_PATH=???\n\n# path where output trained models will be located\nOUT_PATH=???\n```\n\nRun the training script:\n\n```bash\n$ bash scripts/train.sh\n```\n\n## Decoding\n\nOpen the decoding script ([`scripts/decode.sh`](https://github.com/Sally-SH/VSP-LLM/blob/main/scripts/decode.sh)) and replace these variables:\n\n```bash\n# language direction (e.g. 'en' for VSR task / 'en-es' for En to Es VST task)\nLANG=???\n\n# path to the trained model\nMODEL_PATH=???\n\n# path where decoding results and scores will be 
located\nOUT_PATH=???\n```\n\nRun the decoding script:\n\n```bash\n$ bash scripts/decode.sh\n```\n\n## Citation\nIf you find this repository helpful, please use the following BibTeX entry for citation.\n``` BibTeX\n@inproceedings{yeo2024visual,\n  title={Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing},\n  author={Yeo, Jeonghun and Han, Seunghee and Kim, Minsu and Ro, Yong Man},\n  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},\n  pages={11391--11406},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsally-sh%2Fvsp-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsally-sh%2Fvsp-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsally-sh%2Fvsp-llm/lists"}