{"id":16272059,"url":"https://github.com/sshh12/multi_token","last_synced_at":"2025-05-07T20:31:03.381Z","repository":{"id":199837762,"uuid":"703289461","full_name":"sshh12/multi_token","owner":"sshh12","description":"Embed arbitrary modalities (images, audio, documents, etc) into large language models.","archived":false,"fork":false,"pushed_at":"2024-03-27T03:15:12.000Z","size":1282,"stargazers_count":175,"open_issues_count":12,"forks_count":12,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-11T18:16:12.802Z","etag":null,"topics":["large-context","large-language-models","large-multimodal-models","llava","llm","multi-modality","multimodal","vision-language-model"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sshh12.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-10-11T01:03:46.000Z","updated_at":"2024-10-02T00:50:15.000Z","dependencies_parsed_at":"2023-12-15T06:52:18.964Z","dependency_job_id":null,"html_url":"https://github.com/sshh12/multi_token","commit_stats":null,"previous_names":["sshh12/lmm_multi_token","sshh12/multi_token"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sshh12%2Fmulti_token","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sshh12%2Fmulti_token/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sshh12%2Fmulti_token/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sshh12%2Fmulti_token/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sshh12","download_url":"https://codeload.github.com/sshh12/multi_token/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221663983,"owners_count":16859948,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["large-context","large-language-models","large-multimodal-models","llava","llm","multi-modality","multimodal","vision-language-model"],"created_at":"2024-10-10T18:15:58.488Z","updated_at":"2024-10-27T10:38:06.004Z","avatar_url":"https://github.com/sshh12.png","language":"Python","readme":"# multi_token\n\n\u003e Embed arbitrary modalities (images, audio, documents, etc) into large language models.\n\nThis library is designed to be an extension of LLaVA for encoding ✨anything✨ (images, sounds, documents, videos, motion capture, screenshots, voice recordings, ...) into a format that can used in large language models. Its primary contribution is the ability to embed multiple instances and modalities into a single model and a framework for doing so fairly easily.\n\nPotentially with this you could ask Large Multimodal Models (LMMs):\n\n- \u003e Read \\\u003cdocument\\\u003e and give me a summary.\n\n- \u003e Listen to \\\u003caudio\\\u003e and answer the spoke question.\n\n- \u003e Compare and contrast \\\u003cimage\\\u003e and \\\u003cimage\\\u003e\n\n- \u003e Given \\\u003cscreenshot\\\u003e and \\\u003cgame-state\\\u003e, what key should I press?\n\nInterested in how this works? See this [blog post](https://blog.sshh.io/p/large-multimodal-models-lmms).\n\n## Usage\n\n```bash\ngit clone https://github.com/sshh12/multi_token \\\n        \u0026\u0026 cd multi_token \\\n        \u0026\u0026 pip install -r requirements.txt \\\n        \u0026\u0026 pip install -e .\n\npip install flash-attn --no-build-isolation\n```\n\n### Model Zoo\n\n#### ⚠️ If you run into a missing `adapters.bin` see https://github.com/sshh12/multi_token/issues/12. ⚠️\n\n| Base Model                                                | Model | Modality | Notes |\n| - | - | - | - |\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-DocumentGTE-16K-x8](https://huggingface.co/sshh12/Mistral-7B-LoRA-DocumentGTE-16K-x8) | **Long Document** \u003cbr/\u003e \u003cbr/\u003e Encode a document as a series of `\u003cdocument\u003e` and with `documents`. | ⚠️📚 A compression model pretrained on wikipedia and finetuned on LongAlpaca and Long-Data-Collections. Compresses chunks of 512 tokens into 64 using [gte-large](https://huggingface.co/thenlper/gte-large), as expected the results are fairly lossy. It performs similarly to the x128 version suggesting the bottleneck is the embedding model itself. \u003cbr/\u003e\u003cbr/\u003e Compute: ~100 A6000 hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-DocumentGTE-260K-x128](https://huggingface.co/sshh12/Mistral-7B-LoRA-DocumentGTE-260K-x128) | **Long Document** \u003cbr/\u003e \u003cbr/\u003e Encode a document as a series of `\u003cdocument\u003e` and with `documents`. | ⚠️📚 A compression model pretrained on wikipedia and finetuned on LongAlpaca and Long-Data-Collections. Compresses chunks of 512 tokens into only 4 using [gte-large](https://huggingface.co/thenlper/gte-large), as expected the results are fairly lossy. \u003cbr/\u003e\u003cbr/\u003e Compute: ~50 A6000 hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-ImageBind-LLAVA](https://huggingface.co/sshh12/Mistral-7B-LoRA-ImageBind-LLAVA) | **ImageBind (Vision/Audio/Text)** \u003cbr/\u003e \u003cbr/\u003e Encode audio or image filenames as `\u003cimagebind\u003e` and with `imagebinds`. | ⚠️🖼️🔊📚 A model pretrained and finetuned on an augmented LLaVA dataset. Might hallucinate colors from audio and needs explicit mention of if the input is a sound/image/document. \u003cbr/\u003e\u003cbr/\u003e Compute: ~180 4090 hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-VisionCLIP-LLAVA](https://huggingface.co/sshh12/Mistral-7B-LoRA-VisionCLIP-LLAVA) | **Vision** \u003cbr/\u003e \u003cbr/\u003e Encode images as `\u003cimage\u003e` and with `images`. | ⭐🖼️ A model pretrained and finetuned on the LLaVA dataset. This should be comparable to [BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA) and [LLaVA 1.5](https://llava-vl.github.io/). \u003cbr/\u003e\u003cbr/\u003e Compute: ~160 3090 Ti hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-VisionCLIPPool-LLAVA](https://huggingface.co/sshh12/Mistral-7B-LoRA-VisionCLIPPool-LLAVA) | **Vision** \u003cbr/\u003e \u003cbr/\u003e Encode images as `\u003cimage\u003e` and with `images`. | ⭐🖼️ A model pretrained and finetuned on the LLaVA dataset. This should be comparable to [BakLLaVA](https://github.com/SkunkworksAI/BakLLaVA) and [LLaVA 1.5](https://llava-vl.github.io/). Uses the last layer of CLIP encoded as 10-tokens (rather than the orignal 576). \u003cbr/\u003e\u003cbr/\u003e Compute: ~100 A6000 hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA](https://huggingface.co/sshh12/Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA) | **Vision** \u003cbr/\u003e \u003cbr/\u003e Encode images as `\u003cimage\u003e\u003cimage\u003e...` and with `images`. | ⭐🖼️🖼️ A model pretrained and finetuned on the LLaVA dataset and a synthetic multi-image dataset. Images encoded as 10-tokens each and this should support up to 6 images. \u003cbr/\u003e\u003cbr/\u003e Compute: ~100 A6000 hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-CLIP-LoRA-captions-only-demo](https://huggingface.co/sshh12/Mistral-7B-CLIP-LoRA-captions-only-demo) | **Vision** \u003cbr/\u003e \u003cbr/\u003e Encode images as `\u003cimage\u003e` and with `images`. | ⚠️🖼️ This is a __very limited__ image model trained on only a few __caption-only__ examples for the sake of demonstrating a proof of concept. \u003cbr/\u003e\u003cbr/\u003e Compute: ~10 3090 Ti hours |\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-XCLIP](https://huggingface.co/sshh12/Mistral-7B-LoRA-XCLIP) | **Video** \u003cbr/\u003e \u003cbr/\u003e Encode videos as `\u003cvideo\u003e` and with `videos`. | ⚠️🎥 This is a __very limited__ video model. Hard to find good video caption datasets so this model is very undertrained. \u003cbr/\u003e\u003cbr/\u003e Compute: ~50 A6000 hours |\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-AudioWhisper](https://huggingface.co/sshh12/Mistral-7B-LoRA-AudioWhisper) | **Audio (Speech)** \u003cbr/\u003e \u003cbr/\u003e Encode images as `\u003cspeech\u003e` and with `speech_audios`. | ⚠️🔊 A model pretrained on commonvoice and finetuned on a GPT3.5 synthetic dataset. This pretty undertrained and isn't that great (also based on whisper-small) but it kind of works. \u003cbr/\u003e\u003cbr/\u003e Compute: ~60 A6000 hours|\n| [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | [sshh12/Mistral-7B-LoRA-AudioCLAP](https://huggingface.co/sshh12/Mistral-7B-LoRA-AudioCLAP) | **Audio (Sound)** \u003cbr/\u003e \u003cbr/\u003e Encode images as `\u003csound\u003e` and with `sounds`. | ⚠️🔊 A model pretrained on `Chr0my/Epidemic_sounds` and finetuned on a GPT3.5 synthetic dataset. This pretty undertrained and but seems OK. \u003cbr/\u003e\u003cbr/\u003e Compute: ~30 A6000 hours|\n\n⭐ = Useable, ⚠️ = Proof of concept, experimental\n\n### Vision\n\n##### LLaVA-equivalent\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-VisionCLIP-LLAVA \\\n    --load_bits 4 \\\n    --port 7860\n```\n\n```python\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"What are things I should be cautious about when I visit this place? \u003cimage\u003e\"}],\n        \"images\": [\"https://github.com/sshh12/multi_token/raw/main/.demo/llava-view.jpg\"],\n    },\n).json()\n# {'output': 'When visiting this place, which is a lake with a wooden dock, there are a few things to be cautious about. First, be aware of the water depth and the presence of any hidden obstacles, such as rocks or underwater debris, that could pose a risk to your safety. Second, be mindful of the weather conditions, as sudden changes in weather can make the water unpredictable and potentially dangerous. Lastly, be cautious of any wildlife or marine life in the area, as they may pose a threat to your safety or cause damage to the dock.'}\n```\n\n##### Multi Image\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-Multi-VisionCLIPPool-LLAVA \\\n    --port 7860\n```\n\n```python\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"\u003cimage\u003e\u003cimage\u003e What is the difference in color between the images?\"}],\n        \"images\": [\"https://github.com/sshh12/multi_token/raw/main/.demo/wiki-pink-flower.jpg\", \"https://github.com/sshh12/multi_token/raw/main/.demo/wiki-yellow-flower.jpg\"],\n    },\n).json()\n# {'output': 'The first image has a pink flower, while the second image has yellow flowers.'}\n```\n\n### Speech\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-AudioWhisper \\\n    --port 7860\n```\n\n```python\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"What is being said? \u003cspeech\u003e\"}],\n        \"speech_audios\": [\"https://github.com/sshh12/multi_token/raw/main/.demo/test.mp3\"],\n    },\n).json()\n# {'output': 'This is a test.'}\n```\n\n### Sound\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-AudioCLAP \\\n    --port 7860\n```\n\n```python\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"What is making this sound? \u003csound\u003e\"}],\n        \"sounds\": [\"https://github.com/sshh12/multi_token/raw/main/.demo/imagebind-dog-audio.wav\"],\n    },\n).json()\n# {'output': 'The sound is being made by a chihuahua barking.'}\n```\n\n### Video\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-XCLIP \\\n    --port 7860\n```\n\n```python\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"\u003cvideo\u003e What instrument is shown in the video?\"}],\n        \"videos\": [\"https://www.youtube.com/watch?v=3569sBBgVsc\"],\n    },\n).json()\n# {'output': 'a man is playing the piano in a room'}\n```\n\n### ImageBind (Vision/Audio/Text)\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-ImageBind-LLAVA \\\n    --port 7860\n```\n\n```python\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"\u003cimagebind\u003e What is the animal in this sound?\"}],\n        \"imagebinds\": [\"https://github.com/sshh12/multi_token/raw/main/.demo/imagebind-dog-audio.wav\"],\n    },\n).json()\n# {'output': 'The animal in this sound is a dog.'}\n```\n\n### Long Documents\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path sshh12/Mistral-7B-LoRA-DocumentGTE-260K-x128 \\\n    --port 7860\n```\n\n```python\nfrom multi_token.modalities.document_gte import (\n    split_text_into_documents,\n)\n\nwith open(\".demo/llava-paper.txt\", \"r\") as f:\n    docs = split_text_into_documents(f.read())\n\nrequests.post(\n    \"http://localhost:7860/generate\",\n    json={\n        \"messages\": [{\"role\": \"user\", \"content\": \"Read the paper \" + \"\u003cdocument\u003e\" * len(docs) + \". Give me a summary.\"}],\n        \"documents\": docs,\n    },\n).json()\n# {'output': 'Here is a summary of the key points from the paper:\\n\\n- The paper proposes a new dataset called LAML, which contains 100,000 image-text pairs with 100 different languages. The dataset aims to provide a large-scale resource for training multilingual vision-language models.\\n\\n- The authors find that existing multilingual vision-language models struggle to generate high-quality captions for images in languages they have not seen before. This is because the models lack the ability to generate language-specific knowledge...'}\n```\n\n## Training\n\n### Add a Modality\n\nYou can do this by implementing an instance of `multi_token.modalities.base_modality.Modality` (see [CLIP for vision example](https://github.com/sshh12/multi_token/blob/main/multi_token/modalities/vision_clip.py)).\n\n\u003cdetails\u003e\n\u003csummary\u003eSee annotated example\u003c/summary\u003e\n\n```python\nclass MyModality(Modality):\n    def __init__(\n        self,\n    ):\n        # ...\n\n    def build_projector(self, lm_hidden_size: int) -\u003e nn.Module:\n        # a pytorch module that converts a preprocessed item (after `forward`) into a tensor `(batch size x token width x lm_hidden_size)`\n\n    @property\n    def name(self) -\u003e str:\n        # the name/ID for this modality\n        return \"my_modality\"\n\n    @property\n    def token(self) -\u003e str:\n        # the token you'll use in text to represent this\n        return \"\u003cmy-modality\u003e\"\n\n    @property\n    def data_key(self) -\u003e str:\n        # the key in your dataset rows for raw instances of this\n        return \"my_modality_items\"\n\n    @property\n    def token_width(self) -\u003e int:\n        # how many tokens should we use to present instances of this?\n        # too small and it's not descriptive enough, too large and you are using up the context window\n        return 1\n\n    def preprocess_rows(self, row: List[Dict]) -\u003e List[Optional[Any]]:\n        # convert raw dataset rows into an arbitrary tensor to pass to `forward`\n\n    @torch.no_grad()\n    def forward(self, encoded_values: List[Any]) -\u003e List[torch.Tensor]:\n        # encode `preprocess_rows` output values into the format that will be fed into the projector\n```\n\n\u003c/details\u003e\n\nRegister this new modality by adding it to `multi_token.modalities.MODALITY_BUILDERS`.\n\n```python\nMODALITY_BUILDERS = {\n    ...,\n    \"my_modality\": lambda: [MyModality()],\n}\n```\n\n### Dataset\n\nYou can see some of the existing [scripts](https://github.com/sshh12/multi_token/tree/main/scripts) for putting things into the correct dataset format.\n\nSchema:\n```javascript\n// LLaVA/CLIP example\n{\n    \"id\": \"arbitrary-id-123\",\n    \"images\": [\"/path/to/image.png\"],\n    \"messages\": [{\"role\": \"user\", \"content\": \"Describe \u003cimage\u003e\"}, {\"role\": \"assistant\", \"content\": \"This is a potato.\"}],\n}\n\n// Custom\n{\n    \"id\": \"arbitrary-id-123\",\n    \"my_modality_items\": [\"/path/to/data OR just the full document\"],\n    \"messages\": [{\"role\": \"user\", \"content\": \"Describe \u003cmy-modality\u003e\"}, {\"role\": \"assistant\", \"content\": \"This is ...\"}],\n}\n```\n\nThen save with `dataset.save_to_disk(output_folder)`.\n\n### Pretraining\n\nUse this command with standard huggingface training arguments:\n\n```\ndeepspeed scripts/train_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_cls MistralLMMForCausalLM \\\n    --modality_builder vision_clip \\\n    --dataset_path /data/llava-chat-captions \\\n    --output_dir /data/output/my_lmm_pretrain \\\n    --pretrain_projectors \\\n    --lora_enable True \\\n    --bf16 True \\\n    --tf32 True \\\n    --num_train_epochs 1 \\\n    --gradient_checkpointing True \\\n    --per_device_train_batch_size 1 \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps 32 \\\n    --model_max_length 2048 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"steps\" \\\n    --save_steps 1000 \\\n    --save_total_limit 1 \\\n    --learning_rate 1e-5 \\\n    --weight_decay 0. \\\n    --warmup_ratio 0.03 \\\n    --lr_scheduler_type \"cosine\" \\\n    --dataloader_num_workers 2 \\\n    --logging_steps 1 \\\n    --report_to wandb \\\n    --deepspeed ./configs/zero2.json\n```\n\nThe key arguments are:\n* `--modality_builder`: the name of the modality builder to use (see `MODALITY_BUILDERS`)\n* `--pretrain_projectors`: freeze the language model and only train the projectors\n* `--model_cls`: the model class to use (this should match your base model)\n\n### Finetuning\n\nUse this command with standard huggingface training arguments:\n\n```\ndeepspeed scripts/train_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_cls MistralLMMForCausalLM \\\n    --modality_builder vision_clip \\\n    --pretrained_projectors_path /data/output/my_lmm_pretrain/checkpoint-4000/non_lora_trainables.bin \\\n    --dataset_path /data/llava-chat-captions \\\n    --output_dir /data/output/my_lmm_pretrain \\\n    --pretrain_projectors \\\n    --lora_enable True \\\n    --bf16 True \\\n    --tf32 True \\\n    --num_train_epochs 1 \\\n    --gradient_checkpointing True \\\n    --per_device_train_batch_size 1 \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps 32 \\\n    --model_max_length 2048 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"steps\" \\\n    --save_steps 1000 \\\n    --save_total_limit 1 \\\n    --learning_rate 1e-5 \\\n    --weight_decay 0. \\\n    --warmup_ratio 0.03 \\\n    --lr_scheduler_type \"cosine\" \\\n    --dataloader_num_workers 2 \\\n    --logging_steps 1 \\\n    --report_to wandb \\\n    --deepspeed ./configs/zero2.json\n```\n\nThe key arguments are:\n* `--modality_builder`: the name of the modality builder to use (see `MODALITY_BUILDERS`)\n* `--pretrained_projectors_path`: the path to the pretrained projectors (from the pretraining step)\n* `--model_cls`: the model class to use (this should match your base model)\n\nYou can also omit `pretrained_projectors_path` to just train the full model from scratch. According to the LLaVA paper, this is not as good as training the projectors first (but it will work).\n\n### Inference\n\nUse the following to run a local flask server for inference:\n\n```\npython scripts/serve_model.py \\\n    --model_name_or_path mistralai/Mistral-7B-Instruct-v0.1 \\\n    --model_lora_path /data/output/lmm_just_trained_folder \\\n    --port 7860\n```\n\nYou can use this utility to upload your model to huggingface:\n\n```\npython scripts/upload_model.py \\\n    -r username/my-new-lmm \\\n    -m /data/output/lmm_just_trained_folder\n```\n\n## Comparision to LLaVA\n\n\u003e LLaVA: Large Language and Vision Assistant\n\u003e\n\u003e [[Project Page](https://llava-vl.github.io/)] [[Demo](https://llava.hliu.cc/)]  [[Data](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md)] [[Model Zoo](https://github.com/haotian-liu/LLaVA/blob/main/docs/MODEL_ZOO.md)]\n\u003e \n\u003e **Improved Baselines with Visual Instruction Tuning** [[Paper](https://arxiv.org/abs/2310.03744)] \u003cbr\u003e\n\u003e [Haotian Liu](https://hliu.cc), [Chunyuan Li](https://chunyuan.li/), [Yuheng Li](https://yuheng-li.github.io/), [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/)\n\u003e \n\u003e **Visual Instruction Tuning** (NeurIPS 2023, **Oral**) [[Paper](https://arxiv.org/abs/2304.08485)]\u003cbr\u003e\n\u003e [Haotian Liu*](https://hliu.cc), [Chunyuan Li*](https://chunyuan.li/), [Qingyang Wu](https://scholar.google.ca/citations?user=HDiw-TsAAAAJ\u0026hl=en/), [Yong Jae Lee](https://pages.cs.wisc.edu/~yongjaelee/) (*Equal Contribution)\n\nThe inspiration and much of the source code for this project comes from the original [LLaVA](https://github.com/haotian-liu/LLaVA/) implementation (apache 2.0). \n\n### Core Differences\n\n* This library is designed to be more modular for adding custom encoders/projectors. In some areas, the LLaVA implementation was simplified (e.g. stripped out a lot of the eval, preprocessing code, and non-LLAMA parts) and in others more complex (handling multiple types of modalities).\n* The tokenization and injection of projected encodings into the language model's token space are written from scratch, but, _in theory_, do the exact same thing. I like to think this project's `prepare_inputs_labels_for_multimodal` is a bit easier to grok and manipulate than the original.\n* You can use multiple instances of tokens from the same or different modalities (where as LLaVA was only for a single image). For example, `Given \u003cimage\u003e and \u003cimage\u003e, answer the question asked in \u003caudio\u003e`. \n\nIf one were to train a model using this library with the same base model and projection config as LLaVA-1.5, I would expect nearly identical performance (barring any bugs in this implementation).\n\n## TODOs\n\n* Multi-GPU support\n* Full (non-LoRA training)\n* Training quantization (QLoRA)\n* Efficient batch preprocessing\n* Efficient batch projection\n* Efficient batch collation (based on example lengths)\n* Efficient batch inference\n* Allow for non-`INST` based instruction formats and system tokens\n* Support more base language models\n\n## Development\n\n### Windows Docker Dev\n\nMy local dev setup is Windows + WSL + Docker + 3090 Ti (24GB VRAM). `F:/` is configured to be a large data drive that I share among containers.\n\n1. `docker build -t multi-token-dev .`\n2. `docker run -it --gpus all -p 7860:7860 --mount type=bind,source=F:/docker-hf-cache,target=/root/.cache/huggingface --mount type=bind,source=F:/docker-data,target=/data --name multi-token-dev multi-token-dev`\n\n### Vast.ai Dev\n\nFor some models, I'm using cheapish GPU instances on [vast.ai](https://cloud.vast.ai/).\n\n1. `vastai create instance $ID --image pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel --disk 512`\n2. `ssh -p $PORT root@$HOST`\n3. `curl -o- https://raw.githubusercontent.com/sshh12/multi_token/main/scripts/vastai_setup.sh | bash`\n\nWhile training I run: `source ./scripts/vastai_sync.sh $INSTANCE_ID` to sync the output folder to my local machine.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsshh12%2Fmulti_token","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsshh12%2Fmulti_token","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsshh12%2Fmulti_token/lists"}