{"id":17637866,"url":"https://github.com/2u1/llama3.2-vision-finetune","last_synced_at":"2025-04-05T05:03:05.172Z","repository":{"id":258887234,"uuid":"863322251","full_name":"2U1/Llama3.2-Vision-Finetune","owner":"2U1","description":"An open-source implementaion for fine-tuning Llama3.2-Vision series by Meta.","archived":false,"fork":false,"pushed_at":"2025-03-31T02:27:41.000Z","size":76,"stargazers_count":146,"open_issues_count":13,"forks_count":21,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-05T05:02:55.850Z","etag":null,"topics":["llama3","multi-modal","vision-language","vision-language-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/2U1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-26T05:21:19.000Z","updated_at":"2025-04-03T05:01:48.000Z","dependencies_parsed_at":"2024-11-15T09:04:54.505Z","dependency_job_id":"d5f71ce7-fe2f-44bf-92dc-e27cb65049f3","html_url":"https://github.com/2U1/Llama3.2-Vision-Finetune","commit_stats":{"total_commits":25,"total_committers":3,"mean_commits":8.333333333333334,"dds":0.07999999999999996,"last_synced_commit":"9c6821e95a6e962600ecc654b7e545d4a3dd316a"},"previous_names":["2u1/llama3.2-vision-finetune"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FLlama3.2-Vision-Finetune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FLlama3.2-Vision-Finetune/tags","releases_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FLlama3.2-Vision-Finetune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FLlama3.2-Vision-Finetune/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/2U1","download_url":"https://codeload.github.com/2U1/Llama3.2-Vision-Finetune/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247289409,"owners_count":20914464,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama3","multi-modal","vision-language","vision-language-model"],"created_at":"2024-10-23T03:06:28.575Z","updated_at":"2025-04-05T05:03:05.142Z","avatar_url":"https://github.com/2U1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fine-tuning Llama3.2-Vision\n\nThis repository contains a script for training [Llama3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) with only using HuggingFace and [Liger-Kernel](https://github.com/linkedin/Liger-Kernel).\n\n## Other projects\n\n**[[Phi3-Vision Finetuning]](https://github.com/2U1/Phi3-Vision-Finetune)**\u003cbr\u003e\n**[[Qwen2-VL Finetuning]](https://github.com/2U1/Qwen2-VL-Finetune)**\u003cbr\u003e\n**[[Molmo Finetuning]](https://github.com/2U1/Molmo-Finetune)**\u003cbr\u003e\n**[[Pixtral Finetune]](https://github.com/2U1/Pixtral-Finetune)**\u003cbr\u003e\n**[[SmolVLM Finetune]](https://github.com/2U1/SmolVLM-Finetune)**\u003cbr\u003e\n**[[Gemma3 Finetune]](https://github.com/2U1/Gemma3-Finetune)**\n\n## 
Update\n\n- [2025/01/24] Add option for using DoRA.\n- [2025/01/24] Fix error in LoRA training.\n- [2025/01/18] 🔥Supports mixed-modality data.\n- [2025/01/11] Updated 8-bit training using ms_amp fp8 with opt_level O3.\n- [2024/11/05] Add memory efficient 8-bit training.\n- [2024/11/05] 🔥Supports training with liger-kernel.\n- [2024/10/04] 🔥Supports text-only data.\n\n## Table of Contents\n\n- [Fine-tuning Llama3.2-Vision](#fine-tuning-llama32-vision)\n  - [Other projects](#other-projects)\n  - [Update](#update)\n  - [Table of Contents](#table-of-contents)\n  - [Supported Features](#supported-features)\n  - [Docker](#docker)\n  - [Installation](#installation)\n    - [Environments](#environments)\n    - [Using `environment.yaml`](#using-environmentyaml)\n  - [Dataset Preparation](#dataset-preparation)\n  - [Training](#training)\n    - [Full Finetuning](#full-finetuning)\n    - [Full Finetuning with 8-bit](#full-finetuning-with-8-bit)\n    - [Finetune with LoRA](#finetune-with-lora)\n    - [Train with video dataset](#train-with-video-dataset)\n      - [Merge LoRA Weights](#merge-lora-weights)\n      - [Issue for libcudnn error](#issue-for-libcudnn-error)\n  - [TODO](#todo)\n  - [Known Issues](#known-issues)\n  - [License](#license)\n  - [Citation](#citation)\n  - [Acknowledgement](#acknowledgement)\n\n## Supported Features\n\n- DeepSpeed\n- LoRA, QLoRA\n- Full-finetuning\n- Multi-image and video training\n\n## Docker\n\nTo simplify the setup process for training, you can use the provided pre-built environment.\u003cbr\u003e\nEverything is set up in the conda env named `train`.\u003cbr\u003e\u003cbr\u003e\nYou can find more information about the image [here](https://hub.docker.com/repository/docker/john119/vlm/general).\n\n```\ndocker pull john119/vlm:v1\ndocker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm:v1 /bin/bash\n```\n\n## Installation\n\n### Environments\n\n- Ubuntu 22.04\n- Nvidia-Driver 550.120\n- CUDA version 
12.4\n\nInstall the required packages using `environment.yaml`.\n\n### Using `environment.yaml`\n\n```bash\nconda env create -f environment.yaml\nconda activate llama\n```\n\n**Note:** Llama3.2-Vision does not support flash-attention2 for now.\n\n## Dataset Preparation\n\nThe script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided `--image_folder`.\u003cbr\u003e\n\n**When using a multi-image dataset, the image tokens should all be `\u003cimage\u003e`, and the image file names should be given as a list.**\n**Please see the examples below and format your data accordingly.**\n\n\u003cdetails\u003e\n\u003csummary\u003eExample for single image dataset\u003c/summary\u003e\n\n```json\n[\n  {\n    \"id\": \"000000033471\",\n    \"image\": \"000000033471.jpg\",\n    \"conversations\": [\n      {\n        \"from\": \"human\",\n        \"value\": \"\u003cimage\u003e\\nWhat are the colors of the bus in the image?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"The bus in the image is white and red.\"\n      },\n      {\n        \"from\": \"human\",\n        \"value\": \"What feature can be seen on the back of the bus?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"The back of the bus features an advertisement.\"\n      },\n      {\n        \"from\": \"human\",\n        \"value\": \"Is the bus driving down the street or pulled off to the side?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"The bus is driving down the street, which is crowded with people and other vehicles.\"\n      }\n    ]\n  }\n  ...\n]\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExample for multi image dataset\u003c/summary\u003e\n\n```json\n[\n  {\n    \"id\": \"000000033471\",\n    \"image\": [\"000000033471.jpg\", 
\"000000033472.jpg\"],\n    \"conversations\": [\n      {\n        \"from\": \"human\",\n        \"value\": \"\u003cimage\u003e\\n\u003cimage\u003e\\nIs the perspective of the camera different?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"Yes, the perspective of the camera is different.\"\n      }\n    ]\n  }\n  ...\n]\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExample for video dataset\u003c/summary\u003e\n\n```json\n[\n  {\n    \"id\": \"sample1\",\n    \"video\": \"sample1.mp4\",\n    \"conversations\": [\n      {\n        \"from\": \"human\",\n        \"value\": \"\u003cvideo\u003e\\nWhat is going on in this video?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"A man is walking down the road.\"\n      }\n    ]\n  }\n  ...\n]\n```\n\n**Note:** Llama3.2-Vision treats a video as a sequence of images.\n\n\u003c/details\u003e\n\n## Training\n\n**Note:** DeepSpeed zero2 is faster than zero3 but consumes more memory. In most cases zero2 is also more stable than zero3.\u003cbr\u003e\u003cbr\u003e\n**Tip:** You can use `adamw_bnb_8bit` as the optimizer to save memory.\n\nTo run the training script, use one of the following commands:\n\n### Full Finetuning\n\n```bash\nbash scripts/finetune.sh\n```\n\n### Full Finetuning with 8-bit\n\n```bash\nbash scripts/finetune_8bit.sh\n```\n\n**You need to install [ms-amp](https://github.com/Azure/MS-AMP) to use this script.**\u003cbr\u003e\nThis script fine-tunes the model with the fp8 model dtype. 
If you run out of VRAM, you can use this script.\u003cbr\u003e\nYou can also combine fp8 training with offloading.\n\n### Finetune with LoRA\n\nIf you want to train only the language model with LoRA and perform full training for the vision model:\n\n```bash\nbash scripts/finetune_lora.sh\n```\n\nIf you want to train both the language model and the vision model with LoRA:\n\n```bash\nbash scripts/finetune_lora_vision.sh\n```\n\n**IMPORTANT:** If you want to tune the `embed_token` with LoRA, you need to tune `lm_head` together with it.\n\n\u003cdetails\u003e\n\u003csummary\u003eTraining arguments\u003c/summary\u003e\n\n- `--deepspeed` (str): Path to DeepSpeed config file (default: \"scripts/zero2.json\").\n- `--data_path` (str): Path to the LLaVA formatted training data (a JSON file). **(Required)**\n- `--image_folder` (str): Path to the images folder as referenced in the LLaVA formatted training data. **(Required)**\n- `--model_id` (str): Path to the Llama3.2-Vision model. **(Required)**\n- `--optim` (str): Optimizer when training (default: `adamw_torch`).\n- `--output_dir` (str): Output directory for model checkpoints.\n- `--num_train_epochs` (int): Number of training epochs (default: 1).\n- `--per_device_train_batch_size` (int): Training batch size per GPU per forward step.\n- `--gradient_accumulation_steps` (int): Gradient accumulation steps (default: 4).\n- `--freeze_vision_tower` (bool): Option to freeze vision_model (default: False).\n- `--tune_merger` (bool): Option to tune projector (default: True).\n- `--num_lora_modules` (int): Number of target modules to add LoRA (-1 means all layers).\n- `--vision_lr` (float): Learning rate for vision_model.\n- `--projector_lr` (float): Learning rate for projector.\n- `--learning_rate` (float): Learning rate for language module.\n- `--bf16` (bool): Option for using bfloat16.\n- `--fp16` (bool): Option for using fp16.\n- `--lora_enable` (bool): Option for enabling LoRA (default: False).\n- `--vision_lora` (bool): Option for including 
vision_tower in the LoRA module. `lora_enable` must be `True` to use this option. (default: False)\n- `--use_dora` (bool): Option for using DoRA instead of LoRA. `lora_enable` must be `True` to use this option. (default: False)\n- `--lora_namespan_exclude` (str): Name spans of modules to exclude from LoRA.\n- `--max_seq_length` (int): Maximum sequence length (default: 128K).\n- `--bits` (int): Quantization bits (default: 16).\n- `--disable_flash_attn2` (bool): Disable Flash Attention 2.\n- `--report_to` (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').\n- `--logging_dir` (str): Logging directory (default: \"./tf-logs\").\n- `--lora_rank` (int): LoRA rank (default: 128).\n- `--lora_alpha` (int): LoRA alpha (default: 256).\n- `--lora_dropout` (float): LoRA dropout (default: 0.05).\n- `--logging_steps` (int): Logging steps (default: 1).\n- `--dataloader_num_workers` (int): Number of data loader workers (default: 4).\n\n**Note:** The learning rate of `vision_model` should be 5x to 10x smaller than that of the `language_model`.\n\n\u003c/details\u003e\n\n### Train with video dataset\n\nYou can train the model using a video dataset. However, Llama3.2-Vision processes videos as a sequence of images, so you’ll need to select specific frames and treat them as multiple images for training. You can also set the LoRA configs to train on video with LoRA.\n\n```bash\nbash scripts/finetune_video.sh\n```\n\nIf you run out of VRAM, you can use [zero3_offload](./scripts/zero3_offload.json) instead of [zero3](./scripts/zero3_offload.json). However, using zero3 is preferred.\n\n#### Merge LoRA Weights\n\n```\nbash scripts/merge_lora.sh\n```\n\n**Note:** Remember to replace the paths in `finetune.sh` or `finetune_lora.sh` with your specific paths. (Also in `merge_lora.sh` when using LoRA.)\n\n#### Issue for libcudnn error\n\n```\nCould not load library libcudnn_cnn_train.so.8. 
Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8\n```\n\nYou can run `unset LD_LIBRARY_PATH` to work around this error.\nSee this [issue](https://github.com/andimarafioti/florence2-finetuning/issues/2) for details.\n\n## TODO\n\n- [x] Support for multi-image \u0026 video data\n- [x] Support for batch_size \u003e 1\n- [x] Handle mixed-modality data\n\n## Known Issues\n\n- [libcudnn issue](#issue-for-libcudnn-error)\n\n## License\n\nThis project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.\n\n## Citation\n\nIf you find this repository useful in your project, please consider giving a :star: and citing:\n\n```bibtex\n@misc{Llama3.2-Vision-Finetuning,\n  author = {Yuwon Lee},\n  title = {Llama3.2-Vision-Finetune},\n  year = {2024},\n  publisher = {GitHub},\n  url = {https://github.com/2U1/Llama3.2-Vision-Ft}\n}\n```\n\n## Acknowledgement\n\nThis project is based on:\n\n- [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT): An amazing open-source project of LMM.\n- [Llama3.2-Vision](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct): Awesome pretrained MLLM by Meta.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2u1%2Fllama3.2-vision-finetune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F2u1%2Fllama3.2-vision-finetune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2u1%2Fllama3.2-vision-finetune/lists"}