{"id":13569432,"url":"https://github.com/mlfoundations/open_flamingo","last_synced_at":"2025-04-09T00:29:11.687Z","repository":{"id":148505491,"uuid":"554523373","full_name":"mlfoundations/open_flamingo","owner":"mlfoundations","description":"An open-source framework for training large multimodal models.","archived":false,"fork":false,"pushed_at":"2024-08-31T23:11:03.000Z","size":7719,"stargazers_count":3872,"open_issues_count":49,"forks_count":300,"subscribers_count":48,"default_branch":"main","last_synced_at":"2025-04-01T23:20:10.999Z","etag":null,"topics":["computer-vision","deep-learning","flamingo","in-context-learning","language-model","multimodal-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlfoundations.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-10-20T00:32:35.000Z","updated_at":"2025-04-01T20:50:18.000Z","dependencies_parsed_at":"2024-01-17T01:34:58.894Z","dependency_job_id":"461ae301-5a46-4917-ba87-850cb36571fe","html_url":"https://github.com/mlfoundations/open_flamingo","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Fopen_flamingo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Fopen_flamingo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Fopen_flamingo/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/mlfoundations%2Fopen_flamingo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlfoundations","download_url":"https://codeload.github.com/mlfoundations/open_flamingo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247949066,"owners_count":21023267,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","deep-learning","flamingo","in-context-learning","language-model","multimodal-learning","pytorch"],"created_at":"2024-08-01T14:00:39.942Z","updated_at":"2025-04-09T00:29:11.671Z","avatar_url":"https://github.com/mlfoundations.png","language":"Python","funding_links":[],"categories":["Python","Open Source LLM","LLM-List","多模态大模型","Optimized Computation","Vision LLM for Generation","LLM","Multimodal Models","2023 Jan. to Mar.","7. Resources","Training and Fine-Tuning"],"sub_categories":["Open-LLM","网络服务_其他","AI Agents Stack","LangManus","7.3 Code Repositories \u0026 Tools","Libraries"],"readme":"# 🦩 OpenFlamingo\n\n[![PyPI version](https://badge.fury.io/py/open_flamingo.svg)](https://badge.fury.io/py/open_flamingo)\n\n[Paper](https://arxiv.org/abs/2308.01390) | Blog posts: [1](https://laion.ai/blog/open-flamingo/), [2](https://laion.ai/blog/open-flamingo-v2/) | [Demo](https://huggingface.co/spaces/openflamingo/OpenFlamingo)\n\nWelcome to our open source implementation of DeepMind's [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model)! 
\n\nIn this repository, we provide a PyTorch implementation for training and evaluating OpenFlamingo models.\nIf you have any questions, please feel free to open an issue. We also welcome contributions!\n\n# Table of Contents\n- [Installation](#installation)\n- [Approach](#approach)\n  * [Model architecture](#model-architecture)\n- [Usage](#usage)\n  * [Initializing an OpenFlamingo model](#initializing-an-openflamingo-model)\n  * [Generating text](#generating-text)\n- [Training](#training)\n  * [Dataset](#dataset)\n- [Evaluation](#evaluation)\n- [Future plans](#future-plans)\n- [Team](#team)\n- [Acknowledgments](#acknowledgments)\n- [Citing](#citing)\n\n# Installation\n\nTo install the package in an existing environment, run \n```\npip install open-flamingo\n```\n\nor to create a conda environment for running OpenFlamingo, run\n```\nconda env create -f environment.yml\n```\n\nTo install training or eval dependencies, run one of the first two commands. To install everything, run the third command.\n```\npip install open-flamingo[training]\npip install open-flamingo[eval]\npip install open-flamingo[all]\n```\n\nThere are three `requirements.txt` files: \n- `requirements.txt` \n- `requirements-training.txt`\n- `requirements-eval.txt`\n\nDepending on your use case, you can install any of these with `pip install -r \u003crequirements-file.txt\u003e`. The base file contains only the dependencies needed for running the model.\n\n## Development\n\nWe use pre-commit hooks to align formatting with the checks in the repository. \n1. To install pre-commit, run\n    ```\n    pip install pre-commit\n    ```\n    or use brew for MacOS\n    ```\n    brew install pre-commit\n    ```\n2. Check the version installed with\n    ```\n    pre-commit --version\n    ```\n3. Then at the root of this repository, run\n    ```\n    pre-commit install\n    ```\nThen every time we run git commit, the checks are run. 
If the files are reformatted by the hooks, run `git add` for your changed files and `git commit` again.\n\n# Approach\nOpenFlamingo is a multimodal language model that can be used for a variety of tasks. It is trained on a large multimodal dataset (e.g. Multimodal C4) and can be used to generate text conditioned on interleaved images/text. For example, OpenFlamingo can be used to generate a caption for an image, or to generate a question given an image and a text passage. The benefit of this approach is that we are able to rapidly adapt to new tasks using in-context learning.\n\n## Model architecture\nOpenFlamingo combines a pretrained vision encoder and a language model using cross-attention layers. The model architecture is shown below.\n\n![OpenFlamingo architecture](docs/flamingo.png) \nCredit: [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model)\n\n# Usage\n## Initializing an OpenFlamingo model\nWe support pretrained vision encoders from the [OpenCLIP](https://github.com/mlfoundations/open_clip) package, which includes OpenAI's pretrained models. 
\nWe also support pretrained language models from the `transformers` package, such as [MPT](https://huggingface.co/models?search=mosaicml%20mpt), [RedPajama](https://huggingface.co/models?search=redpajama), [LLaMA](https://huggingface.co/models?search=llama), [OPT](https://huggingface.co/models?search=opt), [GPT-Neo](https://huggingface.co/models?search=gpt-neo), [GPT-J](https://huggingface.co/models?search=gptj), and [Pythia](https://huggingface.co/models?search=pythia) models.\n\n``` python\nfrom open_flamingo import create_model_and_transforms\n\nmodel, image_processor, tokenizer = create_model_and_transforms(\n    clip_vision_encoder_path=\"ViT-L-14\",\n    clip_vision_encoder_pretrained=\"openai\",\n    lang_encoder_path=\"anas-awadalla/mpt-1b-redpajama-200b\",\n    tokenizer_path=\"anas-awadalla/mpt-1b-redpajama-200b\",\n    cross_attn_every_n_layers=1,\n    cache_dir=\"PATH/TO/CACHE/DIR\"  # Defaults to ~/.cache\n)\n```\n\n## Released OpenFlamingo models\nWe have trained the following OpenFlamingo models so far.\n\n|# params|Language model|Vision encoder|Xattn interval*|COCO 4-shot CIDEr|VQAv2 4-shot Accuracy|Weights|\n|------------|--------------|--------------|----------|-----------|-------|----|\n|3B| anas-awadalla/mpt-1b-redpajama-200b | openai CLIP ViT-L/14 | 1 | 77.3 | 45.8 |[Link](https://huggingface.co/openflamingo/OpenFlamingo-3B-vitl-mpt1b)|\n|3B| anas-awadalla/mpt-1b-redpajama-200b-dolly | openai CLIP ViT-L/14 | 1 | 82.7 | 45.7 |[Link](https://huggingface.co/openflamingo/OpenFlamingo-3B-vitl-mpt1b-langinstruct)|\n|4B| togethercomputer/RedPajama-INCITE-Base-3B-v1 | openai CLIP ViT-L/14 | 2 | 81.8 | 49.0 | [Link](https://huggingface.co/openflamingo/OpenFlamingo-4B-vitl-rpj3b)|\n|4B| togethercomputer/RedPajama-INCITE-Instruct-3B-v1 | openai CLIP ViT-L/14 | 2 | 85.8 | 49.0 | [Link](https://huggingface.co/openflamingo/OpenFlamingo-4B-vitl-rpj3b-langinstruct)|\n|9B| anas-awadalla/mpt-7b | openai CLIP ViT-L/14 | 4 | 89.0 | 54.8 | 
[Link](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b)|\n\n*\\* Xattn interval refers to the `--cross_attn_every_n_layers` argument.*\n\nNote: as part of our v2 release, we have deprecated a previous LLaMA-based checkpoint. However, you can continue to use our older checkpoint using the new codebase.\n\n## Downloading pretrained weights\n\nTo instantiate an OpenFlamingo model with one of our released weights, initialize the model as above and use the following code.\n\n```python\n# grab model checkpoint from huggingface hub\nfrom huggingface_hub import hf_hub_download\nimport torch\n\ncheckpoint_path = hf_hub_download(\"openflamingo/OpenFlamingo-3B-vitl-mpt1b\", \"checkpoint.pt\")\nmodel.load_state_dict(torch.load(checkpoint_path), strict=False)\n```\n\n## Generating text\nBelow is an example of generating text conditioned on interleaved images/text. In particular, let's try few-shot image captioning.\n\n``` python\nfrom PIL import Image\nimport requests\nimport torch\n\n\"\"\"\nStep 1: Load images\n\"\"\"\ndemo_image_one = Image.open(\n    requests.get(\n        \"http://images.cocodataset.org/val2017/000000039769.jpg\", stream=True\n    ).raw\n)\n\ndemo_image_two = Image.open(\n    requests.get(\n        \"http://images.cocodataset.org/test-stuff2017/000000028137.jpg\",\n        stream=True\n    ).raw\n)\n\nquery_image = Image.open(\n    requests.get(\n        \"http://images.cocodataset.org/test-stuff2017/000000028352.jpg\", \n        stream=True\n    ).raw\n)\n\n\n\"\"\"\nStep 2: Preprocessing images\nDetails: For OpenFlamingo, we expect the image to be a torch tensor of shape \n batch_size x num_media x num_frames x channels x height x width. 
\n In this case batch_size = 1, num_media = 3, num_frames = 1,\n channels = 3, height = 224, width = 224.\n\"\"\"\nvision_x = [image_processor(demo_image_one).unsqueeze(0), image_processor(demo_image_two).unsqueeze(0), image_processor(query_image).unsqueeze(0)]\nvision_x = torch.cat(vision_x, dim=0)\nvision_x = vision_x.unsqueeze(1).unsqueeze(0)\n\n\"\"\"\nStep 3: Preprocessing text\nDetails: In the text we expect an \u003cimage\u003e special token to indicate where an image is.\n We also expect an \u003c|endofchunk|\u003e special token to indicate the end of the text \n portion associated with an image.\n\"\"\"\ntokenizer.padding_side = \"left\" # For generation padding tokens should be on the left\nlang_x = tokenizer(\n    [\"\u003cimage\u003eAn image of two cats.\u003c|endofchunk|\u003e\u003cimage\u003eAn image of a bathroom sink.\u003c|endofchunk|\u003e\u003cimage\u003eAn image of\"],\n    return_tensors=\"pt\",\n)\n\n\n\"\"\"\nStep 4: Generate text\n\"\"\"\ngenerated_text = model.generate(\n    vision_x=vision_x,\n    lang_x=lang_x[\"input_ids\"],\n    attention_mask=lang_x[\"attention_mask\"],\n    max_new_tokens=20,\n    num_beams=3,\n)\n\nprint(\"Generated text: \", tokenizer.decode(generated_text[0]))\n```\n\n# Training\nWe provide training scripts in `open_flamingo/train`. 
We provide an example Slurm script in `open_flamingo/scripts/run_train.py`, as well as the following example command:\n```\ntorchrun --nnodes=1 --nproc_per_node=4 open_flamingo/train/train.py \\\n  --lm_path anas-awadalla/mpt-1b-redpajama-200b \\\n  --tokenizer_path anas-awadalla/mpt-1b-redpajama-200b \\\n  --cross_attn_every_n_layers 1 \\\n  --dataset_resampled \\\n  --batch_size_mmc4 32 \\\n  --batch_size_laion 64 \\\n  --train_num_samples_mmc4 125000\\\n  --train_num_samples_laion 250000 \\\n  --loss_multiplier_laion 0.2 \\\n  --workers=4 \\\n  --run_name OpenFlamingo-3B-vitl-mpt1b \\\n  --num_epochs 480 \\\n  --warmup_steps  1875 \\\n  --mmc4_textsim_threshold 0.24 \\\n  --laion_shards \"/path/to/shards/shard-{0000..0999}.tar\" \\\n  --mmc4_shards \"/path/to/shards/shard-{0000..0999}.tar\" \\\n  --report_to_wandb\n```\n\n*Note: The MPT-1B [base](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b)  and [instruct](https://huggingface.co/mosaicml/mpt-1b-redpajama-200b-dolly) modeling code does not accept the `labels` kwarg or compute cross-entropy loss directly within `forward()`, as expected by our codebase. We suggest using a modified version of the MPT-1B models found [here](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b) and [here](https://huggingface.co/anas-awadalla/mpt-1b-redpajama-200b-dolly).*\n\nFor more details, see our [training README](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/train).\n\n\n# Evaluation\nAn example evaluation script is at `open_flamingo/scripts/run_eval.sh`. 
Please see our [evaluation README](https://github.com/mlfoundations/open_flamingo/tree/main/open_flamingo/eval) for more details.\n\n\nTo run evaluations on OKVQA, you will first need to run the following Python snippet:\n```python\nimport nltk\nnltk.download('wordnet')\n```\n\n\n# Future plans\n- [ ] Add support for video input\n\n# Team\n\nOpenFlamingo is developed by:\n\n[Anas Awadalla*](https://anas-awadalla.streamlit.app/), [Irena Gao*](https://i-gao.github.io/), [Joshua Gardner](https://homes.cs.washington.edu/~jpgard/), [Jack Hessel](https://jmhessel.com/), [Yusuf Hanafy](https://www.linkedin.com/in/yusufhanafy/), [Wanrong Zhu](https://wanrong-zhu.com/), [Kalyani Marathe](https://sites.google.com/uw.edu/kalyanimarathe/home?authuser=0), [Yonatan Bitton](https://yonatanbitton.github.io/), [Samir Gadre](https://sagadre.github.io/), [Shiori Sagawa](https://cs.stanford.edu/~ssagawa/), [Jenia Jitsev](https://scholar.google.de/citations?user=p1FuAMkAAAAJ\u0026hl=en), [Simon Kornblith](https://simonster.com/), [Pang Wei Koh](https://koh.pw/), [Gabriel Ilharco](https://gabrielilharco.com/), [Mitchell Wortsman](https://mitchellnw.github.io/), [Ludwig Schmidt](https://people.csail.mit.edu/ludwigs/).\n\nThe team is primarily from the University of Washington, Stanford, AI2, UCSB, and Google.\n\n# Acknowledgments\nThis code is based on Lucidrains' [flamingo implementation](https://github.com/lucidrains/flamingo-pytorch) and David Hansmair's [flamingo-mini repo](https://github.com/dhansmair/flamingo-mini). Thank you for making your code public! 
We also thank the [OpenCLIP](https://github.com/mlfoundations/open_clip) team as we use their data loading code and take inspiration from their library design.\n\nWe would also like to thank [Jean-Baptiste Alayrac](https://www.jbalayrac.com) and [Antoine Miech](https://antoine77340.github.io) for their advice, [Rohan Taori](https://www.rohantaori.com/), [Nicholas Schiefer](https://nicholasschiefer.com/), [Deep Ganguli](https://hai.stanford.edu/people/deep-ganguli), [Thomas Liao](https://thomasliao.com/), [Tatsunori Hashimoto](https://thashim.github.io/), and [Nicholas Carlini](https://nicholas.carlini.com/) for their help with assessing the safety risks of our release, and to [Stability AI](https://stability.ai) for providing us with compute resources to train these models.\n\n# Citing\nIf you found this repository useful, please consider citing:\n\n```\n@article{awadalla2023openflamingo,\n  title={OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models},\n  author={Anas Awadalla and Irena Gao and Josh Gardner and Jack Hessel and Yusuf Hanafy and Wanrong Zhu and Kalyani Marathe and Yonatan Bitton and Samir Gadre and Shiori Sagawa and Jenia Jitsev and Simon Kornblith and Pang Wei Koh and Gabriel Ilharco and Mitchell Wortsman and Ludwig Schmidt},\n  journal={arXiv preprint arXiv:2308.01390},\n  year={2023}\n}\n```\n\n```\n@software{anas_awadalla_2023_7733589,\n  author = {Awadalla, Anas and Gao, Irena and Gardner, Joshua and Hessel, Jack and Hanafy, Yusuf and Zhu, Wanrong and Marathe, Kalyani and Bitton, Yonatan and Gadre, Samir and Jitsev, Jenia and Kornblith, Simon and Koh, Pang Wei and Ilharco, Gabriel and Wortsman, Mitchell and Schmidt, Ludwig},\n  title = {OpenFlamingo},\n  month        = mar,\n  year         = 2023,\n  publisher    = {Zenodo},\n  version      = {v0.1.1},\n  doi          = {10.5281/zenodo.7733589},\n  url          = {https://doi.org/10.5281/zenodo.7733589}\n}\n```\n\n```\n@article{Alayrac2022FlamingoAV,\n  
title={Flamingo: a Visual Language Model for Few-Shot Learning},\n  author={Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},\n  journal={ArXiv},\n  year={2022},\n  volume={abs/2204.14198}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlfoundations%2Fopen_flamingo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlfoundations%2Fopen_flamingo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlfoundations%2Fopen_flamingo/lists"}