{"id":25304293,"url":"https://github.com/evolvinglmms-lab/otter","last_synced_at":"2025-12-13T20:51:22.490Z","repository":{"id":159196845,"uuid":"622202906","full_name":"EvolvingLMMs-Lab/Otter","owner":"EvolvingLMMs-Lab","description":"🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.","archived":false,"fork":false,"pushed_at":"2024-03-05T15:56:06.000Z","size":7753,"stargazers_count":3253,"open_issues_count":62,"forks_count":212,"subscribers_count":81,"default_branch":"main","last_synced_at":"2025-05-14T23:07:29.548Z","etag":null,"topics":["artificial-inteligence","chatgpt","deep-learning","embodied-ai","foundation-models","gpt-4","instruction-tuning","large-scale-models","machine-learning","multi-modality","visual-language-learning"],"latest_commit_sha":null,"homepage":"https://otter-ntu.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EvolvingLMMs-Lab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-01T12:31:49.000Z","updated_at":"2025-05-12T10:11:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"e74a82aa-5f05-4a04-acd8-c3e66ea488a2","html_url":"https://github.com/EvolvingLMMs-Lab/Otter","commit_stats":null,"previous_names":["evolvinglmms-lab/otter","luodian/otter"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FOtter
","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FOtter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FOtter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvolvingLMMs-Lab%2FOtter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EvolvingLMMs-Lab","download_url":"https://codeload.github.com/EvolvingLMMs-Lab/Otter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254243362,"owners_count":22038046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-inteligence","chatgpt","deep-learning","embodied-ai","foundation-models","gpt-4","instruction-tuning","large-scale-models","machine-learning","multi-modality","visual-language-learning"],"created_at":"2025-02-13T08:07:21.861Z","updated_at":"2025-12-13T20:51:22.434Z","avatar_url":"https://github.com/EvolvingLMMs-Lab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"https://i.postimg.cc/mksBCbV9/brand-title.png\"  width=\"80%\" 
height=\"80%\"\u003e\n\u003c/p\u003e\n\n---\n![](https://img.shields.io/badge/otter-v0.3-darkcyan)\n[![Twitter](https://img.shields.io/twitter/url/https/twitter.com/cloudposse.svg?style=social\u0026label=Follow%20%40Us)](https://twitter.com/BoLi68567011)\n![](https://img.shields.io/github/stars/luodian/otter?style=social)\n[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2FLuodian%2Fotter\u0026count_bg=%23FFA500\u0026title_bg=%23555555\u0026icon=\u0026icon_color=%23E7E7E7\u0026title=visitors\u0026edge_flat=false)](https://hits.seeyoufarm.com)\n[![litellm](https://img.shields.io/badge/%20%F0%9F%9A%85%20liteLLM-OpenAI%7CAzure%7CAnthropic%7CPalm%7CCohere-blue?color=green)](https://github.com/BerriAI/litellm)\n\n[Project Credits](https://github.com/Luodian/Otter/blob/main/docs/credits.md) | [Otter Paper](https://arxiv.org/abs/2305.03726) | [OtterHD Paper](https://arxiv.org/abs/2311.04219) | [MIMIC-IT Paper](https://arxiv.org/abs/2306.05425)\n\n**Checkpoints:**\n\n- [luodian/OTTER-Image-MPT7B](https://huggingface.co/luodian/OTTER-Image-MPT7B)\n- [luodian/OTTER-Video-LLaMA7B-DenseCaption](https://huggingface.co/luodian/OTTER-Video-LLaMA7B-DenseCaption)\n\nFor users in mainland China: [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/YuanhanZhang/OTTER-Image-MPT7B) | [![Open in OpenXLab](https://cdn-static.openxlab.org.cn/header/openxlab_models.svg)](https://openxlab.org.cn/models/detail/YuanhanZhang/OTTER-Video-LLaMA7B-DenseCaption)\n\n**Disclaimer:** The code may not be perfectly polished and refactored, but **all open-sourced code is tested and runnable**, as we also use the code to support our research. If you have any questions, please feel free to open an issue. 
We are eagerly looking forward to suggestions and PRs to improve the code quality.\n\n## 🦾 Update\n\n**[2023-11]: Supporting GPT4V's Evaluation on 8 Benchmarks; Announcing OtterHD-8B, improved from Fuyu-8B. Check out [OtterHD](./docs/OtterHD.md) for details.**\n\n\u003cdiv style=\"text-align:center\"\u003e\n\u003cimg src=\"https://i.postimg.cc/dtxQQzt6/demo0.png\"  width=\"100%\" height=\"100%\"\u003e\n\u003c/div\u003e\n\n1. 🦦 Added [OtterHD](./docs/OtterHD.md), a multimodal model fine-tuned from [Fuyu-8B](https://huggingface.co/adept/fuyu-8b) to facilitate fine-grained interpretation of high-resolution visual input *without an explicit vision encoder module*. All image patches are linearly transformed and processed together with text tokens. This is an innovative and elegant exploration. Fascinated by it, we followed this path: we open-sourced the fine-tuning script for Fuyu-8B and improved training throughput by 4-5 times with [Flash-Attention-2](https://github.com/Dao-AILab/flash-attention). Try our fine-tuning script at [OtterHD](./docs/OtterHD.md).\n2. 🔍 Added [MagnifierBench](./docs/OtterHD.md), an evaluation benchmark tailored to assess whether the model can identify information about tiny objects (1% of the image size) and their spatial relationships.\n3. Improved pipeline for [Pretrain](pipeline/train/pretraining.py) | [SFT](pipeline/train/instruction_following.py) | [RLHF]() with (part of) current leading LMMs.\n   1. **Models**: [Otter](https://arxiv.org/abs/2305.03726) | [OpenFlamingo](https://arxiv.org/abs/2308.01390) | [Idefics](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) | [Fuyu](https://huggingface.co/adept/fuyu-8b)\n   2. **Training Datasets Interface: (Pretrain)** MMC4 | LAION2B | CC3M | CC12M, **(SFT)** MIMIC-IT | M3IT | LLAVAR | LRV | SVIT...\n        - *We tested the above datasets for both pretraining and instruction tuning with OpenFlamingo and Otter. We also tested the datasets with Idefics and Fuyu for instruction tuning. 
We will open-source the training scripts gradually.*\n   3. [**Benchmark Interface**](https://huggingface.co/Otter-AI): MagnifierBench/MMBench/MM-VET/MathVista/POPE/MME/ScienceQA/SeedBench. They can be run in one click; please see [Benchmark](./docs/benchmark_eval.md) for details.\n    ```yaml\n        datasets:\n          - name: magnifierbench\n            split: test\n            prompt: Answer with the option's letter from the given choices directly.\n            api_key: [Your API Key] # GPT4 or GPT3.5 key, used to evaluate the answers against the ground truth.\n            debug: true # debug: true saves the model responses to a log file.\n          - name: mme\n            split: test\n            debug: true\n          - name: mmbench\n            split: test\n            debug: true\n\n        models:\n          - name: gpt4v\n            api_key: [Your API Key] # used to call the GPT4V model.\n    ```\n   4. **Code refactoring** to **organize multiple groups of datasets with an integrated YAML file**; see details at [managing datasets in MIMIC-IT format](docs/mimicit_format.md). For example, \n    ```yaml\n        IMAGE_TEXT: # Group name should be in [IMAGE_TEXT, TEXT_ONLY, IMAGE_TEXT_IN_CONTEXT]\n            LADD: # Dataset name can be anything you want\n                mimicit_path: azure_storage/json/LA/LADD_instructions.json # Path of the instruction json file\n                images_path: azure_storage/Parquets/LA.parquet # Path of the image parquet file\n                num_samples: -1 # Number of samples to use; -1 (the default if unset) means use all samples.\n            M3IT_CAPTIONING:\n                mimicit_path: azure_storage/json/M3IT/captioning/coco/coco_instructions.json\n                images_path: azure_storage/Parquets/coco.parquet\n                num_samples: 20000\n    ```\n   *This is a major change that may break previously working code; please check the details.*\n\n**[2023-08]**\n\n1. 
Added support for using Azure, Anthropic, Palm, and Cohere models for Self-Instruct with the Syphus pipeline. To use them, modify [this line](https://github.com/Luodian/Otter/blob/16d73b399fac6352ebff7504b1acb1f228fbf3f4/mimic-it/syphus/file_utils.py#L53) with your selected model and set your API keys in the environment. For more information, see [LiteLLM](https://github.com/BerriAI/litellm/).\n\n**[2023-07]: Announcing MIMIC-IT dataset for multiple interleaved image-text/video instruction tuning.**\n\n1. 🤗 Check out [MIMIC-IT](https://huggingface.co/datasets/pufanyi/MIMICIT) on Hugging Face datasets.\n2. 🥚 Updated the [Eggs](./mimic-it/README.md/#eggs) section for downloading the MIMIC-IT dataset.\n3. 🥃 Contact us **if you wish to develop Otter for your scenarios** (for satellite images or funny videos?). We aim to support and assist with Otter's diverse use cases. OpenFlamingo and Otter are strong models built on [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model)'s excellently designed architecture, which accepts multiple images/videos or other modality inputs. Let's build more interesting models together.\n\n**[2023-06]**\n\n1. 🧨 [Download MIMIC-IT Dataset](https://entuedu-my.sharepoint.com/:f:/g/personal/libo0013_e_ntu_edu_sg/Eo9bgNV5cjtEswfA-HfjNNABiKsjDzSWAl5QYAlRZPiuZA?e=M9isDT). For more details on navigating the dataset, please refer to [MIMIC-IT Dataset README](mimic-it/README.md).\n2. 🏎️ [Run Otter Locally](./pipeline/demo). You can run our model locally with at least 16 GB of GPU memory for tasks like image/video tagging and captioning and identifying harmful content. We fixed a bug related to video inference where `frame tensors` were mistakenly unsqueezed to a wrong `vision_x`.\n   \u003e Make sure to adjust the `sys.path.append(\"../..\")` correctly to access `otter.modeling_otter` in order to launch the model.\n3. 🤗 Check our [paper](https://arxiv.org/abs/2306.05425) introducing MIMIC-IT in detail. 
Meet MIMIC-IT, the first multimodal in-context instruction tuning dataset with 2.8M instructions! From general scene understanding to spotting subtle differences and enhancing egocentric view comprehension for AR headsets, our MIMIC-IT dataset has it all.\n\n## 🦦 Why In-Context Instruction Tuning?\n\nLarge Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.\n\nMotivated by the upstream interleaved format pretraining of the Flamingo model, we present 🦦 Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train Otter with in-context instruction tuning on our proposed **M**ult**I**-**M**odal **I**n-**C**ontext **I**nstruction **T**uning (**MIMIC-IT**) dataset. Otter showcases improved instruction-following and in-context learning ability on both images and videos.\n\n## 🗄 MIMIC-IT Dataset Details\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"https://i.postimg.cc/yYMm1G5X/mimicit-logo.png\"  width=\"80%\" height=\"80%\"\u003e\n\u003c/p\u003e\n\nMIMIC-IT enables egocentric visual assistant models that can answer questions like **Hey, do you think I left my keys on the table?** 
Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"https://i.postimg.cc/RCGp0vQ1/syphus.png\"  width=\"80%\" height=\"80%\"\u003e\n\u003c/p\u003e\n\nWe also introduce **Syphus**, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.\n\nFor more details, please check the [MIMIC-IT dataset](mimic-it/README.md).\n\n## 🤖 Otter Model Details\n\n\u003cdiv style=\"text-align:center\"\u003e\n\u003cimg src=\"https://i.postimg.cc/CKgQ2PP7/otter-teaser.png\"  width=\"100%\" height=\"100%\"\u003e\n\u003c/div\u003e\n\nOtter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.\n\nWe train Otter on the MIMIC-IT dataset with approximately 2.8 million in-context instruction-response pairs, which are structured into a cohesive template to facilitate various tasks. 
Otter supports video inputs (frames are arranged as in the original Flamingo implementation) and multiple image inputs as in-context examples, making it **the first multi-modal instruction-tuned model** to support them.\n\nThe following template encompasses images, user instructions, and model-generated responses, utilizing the `User` and `GPT` role labels to enable seamless user-assistant interactions.\n\n```python\nprompt = f\"\u003cimage\u003eUser: {instruction} GPT:\u003canswer\u003e {response}\u003c|endofchunk|\u003e\"\n```\n\nTraining the Otter model on the MIMIC-IT dataset allows it to acquire different capacities, as demonstrated by the LA and SD tasks. Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.\n\n```python\n# multi-round of conversation\nprompt = f\"\u003cimage\u003eUser: {first_instruction} GPT:\u003canswer\u003e {first_response}\u003c|endofchunk|\u003eUser: {second_instruction} GPT:\u003canswer\u003e\"\n```\n\nRegarding the concept of organizing visual-language in-context examples, we demonstrate here the acquired ability of the Otter model to follow inter-contextual instructions after training on the LA-T2T task. The organized input data format is as follows:\n\n```python\n# Multiple in-context example with similar instructions\nprompt = f\"\u003cimage\u003eUser:{ict_first_instruction} GPT: \u003canswer\u003e{ict_first_response}\u003c|endofchunk|\u003e\u003cimage\u003eUser:{ict_second_instruction} GPT: \u003canswer\u003e{ict_second_response}\u003c|endofchunk|\u003e\u003cimage\u003eUser:{query_instruction} GPT: \u003canswer\u003e\"\n```\n\nFor more details, please refer to our [paper](https://arxiv.org/abs/2306.05425)'s appendix for other tasks.\n\n## 🗂️ Environments\n\n1. Compare the CUDA versions reported by `nvidia-smi` and `nvcc --version`. They need to match, or at least the version reported by `nvcc --version` should be \u003c= the version reported by `nvidia-smi`.\n2. 
Install the PyTorch build that matches your CUDA version (e.g. CUDA 11.7 with torch 2.0.0). We have successfully run this code on CUDA 11.1 with torch 1.10.1 and CUDA 11.7 with torch 2.0.0. You can refer to PyTorch's documentation, [Latest](https://pytorch.org/) or [Previous](https://pytorch.org/get-started/previous-versions/).\n3. You may install the dependencies via `conda env create -f environment.yml`. In particular, make sure that `transformers\u003e=4.28.0` and `accelerate\u003e=0.18.0`.\n\nAfter configuring the environment, you can use the 🦩 Flamingo model / 🦦 Otter model as a 🤗 Hugging Face model with only a few lines! One click, and the model configs/weights are downloaded automatically. Please refer to [Huggingface Otter/Flamingo](./docs/huggingface_compatible.md) for details.\n\n## ☄️ Training\n\nOtter is trained based on OpenFlamingo. You may need to use converted weights at [luodian/OTTER-9B-INIT](https://huggingface.co/luodian/OTTER-9B-INIT) or [luodian/OTTER-MPT7B-Init](https://huggingface.co/luodian/OTTER-MPT7B-Init). They are converted from [OpenFlamingo-LLaMA7B-v1](https://huggingface.co/openflamingo/OpenFlamingo-9B) and [OpenFlamingo-MPT7B-v2](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), respectively; we added an `\u003canswer\u003e` token for Otter's downstream instruction tuning.\n\nYou may also start your training from any of our trained Otter weights; see them at [Otter Weights](https://huggingface.co/luodian). 
You can refer to [MIMIC-IT](https://github.com/Luodian/Otter/tree/main/mimic-it) for preparing image/instruction/train JSON files.\n\n```bash\nexport PYTHONPATH=.\nRUN_NAME=\"Otter_MPT7B\"\nGPU=8\nWORKERS=$((${GPU}*2))\n\necho \"Using ${GPU} GPUs and ${WORKERS} workers\"\necho \"Running ${RUN_NAME}\"\n\naccelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_zero3.yaml \\\n    --num_processes=${GPU} \\\n    pipeline/train/instruction_following.py \\\n    --pretrained_model_name_or_path=luodian/OTTER-MPT7B-Init \\\n    --model_name=otter \\\n    --instruction_format=simple \\\n    --training_data_yaml=./shared_scripts/Demo_Data.yaml \\\n    --batch_size=8 \\\n    --num_epochs=3 \\\n    --report_to_wandb \\\n    --wandb_entity=ntu-slab \\\n    --external_save_dir=./checkpoints \\\n    --run_name=${RUN_NAME} \\\n    --wandb_project=Otter_MPTV \\\n    --workers=${WORKERS} \\\n    --lr_scheduler=cosine \\\n    --learning_rate=2e-5 \\\n    --warmup_steps_ratio=0.01 \\\n    --save_hf_model \\\n    --max_seq_len=1024\n```\n\n## 📑 Citation\n\nIf you find this repository useful, please consider citing:\n\n```\n@article{li2023otter,\n  title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},\n  author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},\n  journal={arXiv preprint arXiv:2305.03726},\n  year={2023}\n}\n\n@article{li2023mimicit,\n    title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},\n    author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},\n    year={2023},\n    eprint={2306.05425},\n    archivePrefix={arXiv},\n    primaryClass={cs.CV}\n}\n```\n\n### 👨‍🏫 Acknowledgements\n\nWe thank [Jack Hessel](https://jmhessel.com/) for the advice and support, as well as the [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) team for their great contribution to the open source 
community.\n\nHuge accolades to the [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model) and [OpenFlamingo](https://github.com/mlfoundations/open_flamingo) teams for their work on this great architecture.\n\n### 📝 Related Projects\n\n- [LLaVA: Visual Instruction Tuning](https://github.com/haotian-liu/LLaVA)\n- [Instruction Tuning with GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Fotter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevolvinglmms-lab%2Fotter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevolvinglmms-lab%2Fotter/lists"}