{"id":22619013,"url":"https://github.com/2u1/pixtral-finetune","last_synced_at":"2025-04-11T01:02:08.103Z","repository":{"id":266756373,"uuid":"899262206","full_name":"2U1/Pixtral-Finetune","owner":"2U1","description":"An open-source implementaion for fine-tuning Pixtral by MistralAI.","archived":false,"fork":false,"pushed_at":"2025-02-05T22:32:57.000Z","size":60,"stargazers_count":13,"open_issues_count":3,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-24T21:42:32.846Z","etag":null,"topics":["chatbot","mistral","multimodal","pixtral","vision-language-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/2U1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-05T23:09:23.000Z","updated_at":"2025-03-18T04:45:26.000Z","dependencies_parsed_at":"2024-12-06T00:26:55.764Z","dependency_job_id":"da14150a-7f01-4a2c-95b5-d1a7f0c9f083","html_url":"https://github.com/2U1/Pixtral-Finetune","commit_stats":null,"previous_names":["2u1/pixtral-finetune"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FPixtral-Finetune","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FPixtral-Finetune/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FPixtral-Finetune/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/2U1%2FPixtral-Finetune/manifests","owner_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/2U1","download_url":"https://codeload.github.com/2U1/Pixtral-Finetune/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248322600,"owners_count":21084336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatbot","mistral","multimodal","pixtral","vision-language-model"],"created_at":"2024-12-08T21:13:18.518Z","updated_at":"2025-04-11T01:02:08.058Z","avatar_url":"https://github.com/2U1.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fine-tuning Pixtral\n\nThis repository contains a script for training the Transformers-compatible [Pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b).\u003cbr\u003e\n\nHowever, the model only supports **batch size=1**. 
So it could take a long time to fine tune.\n\n## Other projects\n\n**[[Phi3-Vision Finetuning]](https://github.com/2U1/Phi3-Vision-Finetune)**\u003cbr\u003e\n**[[Llama3.2-Vision Finetuning]](https://github.com/2U1/Llama3.2-Vision-Ft)**\u003cbr\u003e\n**[[Qwen2-VL Finetuning]](https://github.com/2U1/Qwen2-VL-Finetune)**\u003cbr\u003e\n**[[Molmo Finetune]](https://github.com/2U1/Molmo-Finetune)**\u003cbr\u003e\n**[[SmolVLM Finetune]](https://github.com/2U1/SmolVLM-Finetune)**\n\n## Update\n\n- [2025/01/24] Add option for using DoRA.\n- [2025/01/24] Fix error in LoRA training.\n- [2025/01/11] Updated 8-bit training using ms_amp fp8 with opt_level O3.\n\n## Table of Contents\n\n- [Fine-tuning Pixtral](#fine-tuning-pixtral)\n  - [Other projects](#other-projects)\n  - [Update](#update)\n  - [Table of Contents](#table-of-contents)\n  - [Supported Features](#supported-features)\n  - [Installation](#installation)\n    - [Using `environment.yaml`](#using-environmentyaml)\n  - [Dataset Preparation](#dataset-preparation)\n  - [Training](#training)\n    - [Full Finetuning](#full-finetuning)\n    - [Full Finetuning with 8-bit](#full-finetuning-with-8-bit)\n    - [Finetune with LoRA](#finetune-with-lora)\n    - [Train with video dataset](#train-with-video-dataset)\n      - [Merge LoRA Weights](#merge-lora-weights)\n      - [Issue for libcudnn error](#issue-for-libcudnn-error)\n  - [TODO](#todo)\n  - [Known Issues](#known-issues)\n  - [License](#license)\n  - [Citation](#citation)\n  - [Acknowledgement](#acknowledgement)\n\n## Supported Features\n\n- Deepspeed\n- LoRA/QLoRA\n- Full-finetuning\n- Enable finetuning `vision_model` while using LoRA.\n- Disable/enable Flash Attention 2\n- Multi-image and video training\n- Training optimized with liger kernel\n\n## Installation\n\nInstall the required packages using `environment.yaml`.\n\n### Using `environment.yaml`\n\n```bash\nconda env create -f environment.yaml\nconda activate pixtral\npip install flash-attn==2.5.8 
--no-build-isolation\n```\n\n**Note:** You should install flash-attn after installing the other packages.\n\n## Dataset Preparation\n\nThe script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided `--image_folder`.\u003cbr\u003e\n\n**When using a multi-image dataset, the image tokens should all be `\u003cimage\u003e`, and the image file names should be given as a list.**\n**Please see the examples below and format your data accordingly.**\n\n\u003cdetails\u003e\n\u003csummary\u003eExample for single image dataset\u003c/summary\u003e\n\n```json\n[\n  {\n    \"id\": \"000000033471\",\n    \"image\": \"000000033471.jpg\",\n    \"conversations\": [\n      {\n        \"from\": \"human\",\n        \"value\": \"\u003cimage\u003e\\nWhat are the colors of the bus in the image?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"The bus in the image is white and red.\"\n      },\n      {\n        \"from\": \"human\",\n        \"value\": \"What feature can be seen on the back of the bus?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"The back of the bus features an advertisement.\"\n      },\n      {\n        \"from\": \"human\",\n        \"value\": \"Is the bus driving down the street or pulled off to the side?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"The bus is driving down the street, which is crowded with people and other vehicles.\"\n      }\n    ]\n  }\n  ...\n]\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExample for multi image dataset\u003c/summary\u003e\n\n```json\n[\n  {\n    \"id\": \"000000033471\",\n    \"image\": [\"000000033471.jpg\", \"000000033472.jpg\"],\n    \"conversations\": [\n      {\n        \"from\": \"human\",\n        \"value\": \"\u003cimage\u003e\\n\u003cimage\u003e\\nIs the 
perspective of the camera different?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"Yes, the perspective of the camera is different.\"\n      }\n    ]\n  }\n  ...\n]\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExample for video dataset\u003c/summary\u003e\n\n```json\n[\n  {\n    \"id\": \"sample1\",\n    \"video\": \"sample1.mp4\",\n    \"conversations\": [\n      {\n        \"from\": \"human\",\n        \"value\": \"\u003cvideo\u003e\\nWhat is going on in this video?\"\n      },\n      {\n        \"from\": \"gpt\",\n        \"value\": \"A man is walking down the road.\"\n      }\n    ]\n  }\n  ...\n]\n```\n\n**Note:** Pixtral doesn't officially support video, but it supports multiple images, so you can treat a video as a sequence of frames.\n\n\u003c/details\u003e\n\n## Training\n\nTo run the training script, use the following command:\n\n### Full Finetuning\n\n```bash\nbash scripts/finetune.sh\n```\n\n### Full Finetuning with 8-bit\n\n```bash\nbash scripts/finetune_8bit.sh\n```\n\n**You need to install [ms-amp](https://github.com/Azure/MS-AMP) to use this script.**\u003cbr\u003e\nThis script will finetune the model with fp8 model dtype. 
If you run out of VRAM, you can use this.\u003cbr\u003e\nYou can also combine fp8 training with offloading.\n\n### Finetune with LoRA\n\nIf you want to train only the language model with LoRA and perform full training for the vision model:\n\n```bash\nbash scripts/finetune_lora.sh\n```\n\nIf you want to train both the language model and the vision model with LoRA:\n\n```bash\nbash scripts/finetune_lora_vision.sh\n```\n\n**IMPORTANT:** If you want to tune the `embed_token` with LoRA, you need to tune `lm_head` as well.\n\n\u003cdetails\u003e\n\u003csummary\u003eTraining arguments\u003c/summary\u003e\n\n- `--deepspeed` (str): Path to DeepSpeed config file (default: \"scripts/zero2.json\").\n- `--data_path` (str): Path to the LLaVA formatted training data (a JSON file). **(Required)**\n- `--image_folder` (str): Path to the images folder as referenced in the LLaVA formatted training data. **(Required)**\n- `--model_id` (str): Path to the Pixtral model. **(Required)**\n- `--output_dir` (str): Output directory for model checkpoints.\n- `--num_train_epochs` (int): Number of training epochs (default: 1).\n- `--per_device_train_batch_size` (int): Training batch size per GPU per forwarding step.\n- `--gradient_accumulation_steps` (int): Gradient accumulation steps (default: 4).\n- `--freeze_vision_tower` (bool): Option to freeze vision_model (default: False).\n- `--freeze_llm` (bool): Option to freeze LLM (default: False).\n- `--tune_merger` (bool): Option to tune projector (default: True).\n- `--num_lora_modules` (int): Number of target modules to add LoRA (-1 means all layers).\n- `--vision_lr` (float): Learning rate for vision_model.\n- `--merger_lr` (float): Learning rate for merger (projector).\n- `--learning_rate` (float): Learning rate for language module.\n- `--max_num_frames` (int): Maximum number of frames for video dataset (default: 10).\n- `--bf16` (bool): Option for using bfloat16.\n- `--fp16` (bool): Option for using fp16.\n- `--min_pixels` (int): Option for minimum 
input tokens.\n- `--max_pixles` (int): Option for maximum input tokens.\n- `--lora_enable` (bool): Option for enabling LoRA (default: False).\n- `--vision_lora` (bool): Option for including vision_tower in the LoRA modules. `lora_enable` must be `True` to use this option. (default: False)\n- `--use_dora` (bool): Option for using DoRA instead of LoRA. `lora_enable` must be `True` to use this option. (default: False)\n- `--lora_namespan_exclude` (str): Exclude modules with namespans to add LoRA.\n- `--max_seq_length` (int): Maximum sequence length (default: 32K).\n- `--bits` (int): Quantization bits (default: 16).\n- `--disable_flash_attn2` (bool): Disable Flash Attention 2.\n- `--report_to` (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').\n- `--logging_dir` (str): Logging directory (default: \"./tf-logs\").\n- `--lora_rank` (int): LoRA rank (default: 128).\n- `--lora_alpha` (int): LoRA alpha (default: 256).\n- `--lora_dropout` (float): LoRA dropout (default: 0.05).\n- `--logging_steps` (int): Logging steps (default: 1).\n- `--dataloader_num_workers` (int): Number of data loader workers (default: 4).\n\n**Note:** The learning rate of `vision_model` should be 5x to 10x smaller than that of the `language_model`.\n\n\u003c/details\u003e\n\n### Train with video dataset\n\nYou can train the model using a video dataset. However, Pixtral doesn't officially support video, so this code processes videos as a sequence of images; you’ll need to select specific frames and treat them as multiple images for training. You can also set the LoRA configs to use LoRA here.\n\n```bash\nbash scripts/finetune_video.sh\n```\n\n**Note**: You should adjust `max_num_frames` based on the available VRAM.\n\nIf you run out of VRAM, you can use [zero3_offload](./scripts/zero3_offload.json) instead of [zero3](./scripts/zero3_offload.json). 
However, using zero3 is preferred.\n\n#### Merge LoRA Weights\n\n```\nbash scripts/merge_lora.sh\n```\n\n**Note:** Remember to replace the paths in `finetune.sh` or `finetune_lora.sh` with your specific paths. (Also in `merge_lora.sh` when using LoRA.)\n\n#### Issue for libcudnn error\n\n```\nCould not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8\n```\n\nYou can run `unset LD_LIBRARY_PATH` to work around this error.\nSee this [issue](https://github.com/andimarafioti/florence2-finetuning/issues/2) for details.\n\n## TODO\n\n- [ ] Support batch size \u003e 1\n\n## Known Issues\n\n- [libcudnn issue](#issue-for-libcudnn-error)\n\n## License\n\nThis project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE) file for details.\n\n## Citation\n\nIf you find this repository useful in your project, please consider giving a :star: and citing:\n\n```bibtex\n@misc{Pixtral-Finetuning,\n  author = {Yuwon Lee},\n  title = {Pixtral-Finetune},\n  year = {2024},\n  publisher = {GitHub},\n  url = {https://github.com/2U1/Pixtral-Finetune}\n}\n```\n\n## Acknowledgement\n\nThis project is based on\n\n- [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT): An amazing open-source project of LMM.\n- [Pixtral-12B](https://huggingface.co/mistral-community/pixtral-12b): Transformers-compatible version of Pixtral-12B.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2u1%2Fpixtral-finetune","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F2u1%2Fpixtral-finetune","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F2u1%2Fpixtral-finetune/lists"}