{"id":26245217,"url":"https://github.com/evolvinglmms-lab/longva","last_synced_at":"2025-04-12T01:59:37.774Z","repository":{"id":245640305,"uuid":"818148529","full_name":"EvolvingLMMs-Lab/LongVA","owner":"EvolvingLMMs-Lab","description":"Long Context Transfer from Language to Vision","archived":false,"fork":false,"pushed_at":"2025-03-18T07:59:52.000Z","size":37910,"stargazers_count":371,"open_issues_count":27,"forks_count":19,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-04-12T01:59:25.123Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EvolvingLMMs-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-21T07:49:26.000Z","updated_at":"2025-04-04T02:59:47.000Z","dependencies_parsed_at":"2024-07-10T05:47:04.623Z","dependency_job_id":"7fdd84f2-b275-462c-8f14-674d0d7624f3","html_url":"https://github.com/EvolvingLMMs-Lab/LongVA","commit_stats":null,"previous_names":["evolvinglmms-lab/longva"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FLongVA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FLongVA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FLongVA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FLongVA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/EvolvingLMMs-Lab","download_url":"https://codeload.github.com/EvolvingLMMs-Lab/LongVA/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248505862,"owners_count":21115354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-13T12:32:12.844Z","updated_at":"2025-04-12T01:59:37.755Z","avatar_url":"https://github.com/EvolvingLMMs-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LongVA \u0026 V-NIAH\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"vision_niah/niah_output/LongVA-7B/heatmap.png\" width=\"800\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n    🌐 \u003ca href=\"https://lmms-lab.github.io/posts/longva/\" target=\"_blank\"\u003eBlog\u003c/a\u003e | 📃 \u003ca href=\"https://arxiv.org/abs/2406.16852\" target=\"_blank\"\u003ePaper\u003c/a\u003e | 🤗 \u003ca href=\"https://huggingface.co/collections/lmms-lab/longva-667538e09329dbc7ea498057\" target=\"_blank\"\u003eHugging Face\u003c/a\u003e | 🎥 \u003ca href=\"https://longva-demo.lmms-lab.com/\" target=\"_blank\"\u003eDemo\u003c/a\u003e\n\n\u003c/p\u003e\n\n![Static Badge](https://img.shields.io/badge/lmms--eval-certified-red?link=https%3A%2F%2Fgithub.com%2FEvolvingLMMs-Lab%2Flmms-eval)  ![Static Badge](https://img.shields.io/badge/llava--next-credit-red?link=https%3A%2F%2Fgithub.com%2FLLaVA-VL%2FLLaVA-NeXT)\n\nLong context capability can **zero-shot transfer** from language to vision.\n\nLongVA can process **2000** frames or over **200K** visual tokens. 
It achieves **state-of-the-art** performance on Video-MME among 7B models.\n\n## News\n\n\n- [2024/08/08] 🔥 Released training code for vision text alignment.\n- [2024/06/24] 🔥 LongVA is released. Training code for vision text alignment is coming soon.\n  \n## Installation \nThis codebase is tested on CUDA 11.8 and A100-SXM-80G.\n```bash\nconda create -n longva python=3.10 -y \u0026\u0026 conda activate longva\npip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118\npip install -e \"longva/.[train]\"\npip install packaging \u0026\u0026  pip install ninja \u0026\u0026 pip install flash-attn==2.5.0 --no-build-isolation --no-cache-dir\npip install -r requirements.txt\n```\n\n\n## Local Demo\n\n```bash\n# For CLI inference\npip install httpx==0.23.3\npython local_demo/longva_backend.py --video_path local_demo/assets/dc_demo.mp4 --question \"What does this video show?\" --num_sampled_frames 32 --device auto\npython local_demo/longva_backend.py --image_path local_demo/assets/lmms-eval.png --question \"What is inside the image?\"\n\n# For multimodal chat demo with gradio UI\npython local_demo/multimodal_chat.py\n```\n\n### Quick Start With HuggingFace\n\n\u003cdetails\u003e\n    \u003csummary\u003eExample Code\u003c/summary\u003e\n    \n```python\nfrom longva.model.builder import load_pretrained_model\nfrom longva.mm_utils import tokenizer_image_token, process_images\nfrom longva.constants import IMAGE_TOKEN_INDEX\nfrom PIL import Image\nfrom decord import VideoReader, cpu\nimport torch\nimport numpy as np\n# fix seed\ntorch.manual_seed(0)\n\nmodel_path = \"lmms-lab/LongVA-7B-DPO\"\nimage_path = \"local_demo/assets/lmms-eval.png\"\nvideo_path = \"local_demo/assets/dc_demo.mp4\"\nmax_frames_num = 16 # you can change this to several thousand as long as your GPU memory can handle it :)\ngen_kwargs = {\"do_sample\": True, \"temperature\": 0.5, \"top_p\": None, \"num_beams\": 1, \"use_cache\": True, \"max_new_tokens\": 1024}\n# you can also set the 
device map to auto to accommodate more frames\ntokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, \"llava_qwen\", device_map=\"cuda:0\")\n\n\n#image input\nprompt = \"\u003c|im_start|\u003esystem\\nYou are a helpful assistant.\u003c|im_end|\u003e\\n\u003c|im_start|\u003euser\\n\u003cimage\u003e\\nDescribe the image in details.\u003c|im_end|\u003e\\n\u003c|im_start|\u003eassistant\\n\"\ninput_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0).to(model.device)\nimage = Image.open(image_path).convert(\"RGB\")\nimages_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)\nwith torch.inference_mode():\n    output_ids = model.generate(input_ids, images=images_tensor, image_sizes=[image.size], modalities=[\"image\"], **gen_kwargs)\noutputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()\nprint(outputs)\nprint(\"-\"*50)\n\n#video input\nprompt = \"\u003c|im_start|\u003esystem\\nYou are a helpful assistant.\u003c|im_end|\u003e\\n\u003c|im_start|\u003euser\\n\u003cimage\u003e\\nGive a detailed caption of the video as if I am blind.\u003c|im_end|\u003e\\n\u003c|im_start|\u003eassistant\\n\"\ninput_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors=\"pt\").unsqueeze(0).to(model.device)\nvr = VideoReader(video_path, ctx=cpu(0))\ntotal_frame_num = len(vr)\nuniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)\nframe_idx = uniform_sampled_frames.tolist()\nframes = vr.get_batch(frame_idx).asnumpy()\nvideo_tensor = image_processor.preprocess(frames, return_tensors=\"pt\")[\"pixel_values\"].to(model.device, dtype=torch.float16)\nwith torch.inference_mode():\n    output_ids = model.generate(input_ids, images=[video_tensor],  modalities=[\"video\"], **gen_kwargs)\noutputs = tokenizer.batch_decode(output_ids, 
skip_special_tokens=True)[0].strip()\nprint(outputs)\n```\n\u003c/details\u003e\n\n\n## V-NIAH Evaluation\nTo begin, download a video longer than one hour to use as the haystack video and save it at vision_niah/data/haystack_videos/movie.mp4. We cannot provide the video ourselves as we use an actual movie in our evaluation.\n\nYou can view all needle questions at [lmms-lab/v_niah_needles](https://huggingface.co/datasets/lmms-lab/v_niah_needles).\n```bash\nhuggingface-cli download lmms-lab/LongVA-7B --local-dir vision_niah/model_weights/LongVA-7B\nsh vision_niah/eval.sh\n```\nResults will be saved to vision_niah/niah_output. We run on V-NIAH using PPL-based evaluation. If you want to use generation-based evaluation, check out a demo at vision_niah/eval_vision_niah_sampling.py. Please refer to Section 4 of our paper for more details.\n\n## LMMs-Eval Evaluation\nWe provide both our video and image evaluation pipeline using [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval). After installing `lmms-eval` and longva, you can use the following script to evaluate on both image and video tasks\n\u003cdetails\u003e\n    \u003csummary\u003eImage evaluation command\u003c/summary\u003e\n\n```bash\naccelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \\\n    --model longva \\\n    --model_args pretrained=lmms-lab/LongVA-7B,conv_template=qwen_1_5,model_name=llava_qwen \\\n    --tasks mme \\\n    --batch_size 1 \\\n    --log_samples \\\n    --log_samples_suffix mme_longva \\\n    --output_path ./logs/\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n    \u003csummary\u003eVideo evaluation command\u003c/summary\u003e\n\n```bash\naccelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \\\n    --model longva \\\n    --model_args pretrained=lmms-lab/LongVA-7B,conv_template=qwen_1_5,video_decode_backend=decord,max_frames_num=32,model_name=llava_qwen \\\n    --tasks videomme \\\n    --batch_size 1 \\\n    --log_samples \\\n    
--log_samples_suffix videomme_longva \\\n    --output_path ./logs/ \n```\n\n\u003c/details\u003e\n\n## Long Text Training\n```bash\nsh text_extend/extend_qwen2.sh\n```\nIt takes around 2 days to train the model on 8 A100 GPUs.\nYou can also download our long-context-pretrained model from huggingface:\n```bash\nhuggingface-cli download lmms-lab/Qwen2-7B-Instrcuct-224K --local-dir text_extend/training_output/Qwen2-7B-Instrcuct-224K\n```\nYou can evaluate the text-niah performance with this command:\n```bash\nsh text_extend/eval.sh\n```\nThe results will be saved to text_extend/niah_output.\n\n## Vision Text Alignment\nPlease refer to [LLaVA-NeXT data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) for data preparation and [longva/scripts](https://github.com/EvolvingLMMs-Lab/LongVA/tree/main/longva/scripts) for training.\n## Citation\n\nIf you find this work useful, please consider citing our paper:\n```\n@article{zhang2024longva,\n  title={Long Context Transfer from Language to Vision},\n  author={Peiyuan Zhang and Kaichen Zhang and Bo Li and Guangtao Zeng and Jingkang Yang and Yuanhan Zhang and Ziyue Wang and Haoran Tan and Chunyuan Li and Ziwei Liu},\n  journal={arXiv preprint arXiv:2406.16852},\n  year={2024},\n  url = {https://arxiv.org/abs/2406.16852}\n}\n```\n\n## Acknowledgement\n- LLaVA: the codebase we built upon. \n- LMMs-Eval: the codebase we used for evaluation. \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Flongva","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevolvinglmms-lab%2Flongva","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Flongva/lists"}