{"id":13935508,"url":"https://github.com/lucidrains/flamingo-pytorch","last_synced_at":"2025-05-15T14:08:41.382Z","repository":{"id":37647181,"uuid":"486656484","full_name":"lucidrains/flamingo-pytorch","owner":"lucidrains","description":"Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of Deepmind, in Pytorch","archived":false,"fork":false,"pushed_at":"2022-10-18T21:44:45.000Z","size":217,"stargazers_count":1237,"open_issues_count":9,"forks_count":62,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-04-07T21:07:51.602Z","etag":null,"topics":["artificial-intelligence","attention-mechanism","deep-learning","transformers","visual-question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-28T15:47:33.000Z","updated_at":"2025-04-02T18:13:55.000Z","dependencies_parsed_at":"2022-07-11T05:45:58.321Z","dependency_job_id":null,"html_url":"https://github.com/lucidrains/flamingo-pytorch","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fflamingo-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fflamingo-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fflamingo-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fflamingo-pytorch/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/flamingo-pytorch/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254355335,"owners_count":22057354,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanism","deep-learning","transformers","visual-question-answering"],"created_at":"2024-08-07T23:01:49.679Z","updated_at":"2025-05-15T14:08:36.364Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cimg src=\"./flamingo.png\" width=\"500px\"\u003e\u003c/img\u003e\n\n## 🦩 Flamingo - Pytorch\n\nImplementation of \u003ca href=\"https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model\"\u003eFlamingo\u003c/a\u003e, state-of-the-art few-shot visual question answering attention net, in Pytorch. 
It will include the perceiver resampler (including the scheme where the learned queries contribute keys / values to be attended to, in addition to media embeddings), the specialized masked cross attention blocks, and finally the tanh gating at the ends of the cross attention + corresponding feedforward blocks.\n\n\u003ca href=\"https://youtu.be/smUHQndcmOY?t=30\"\u003eYannic Kilcher presentation\u003c/a\u003e\n\n## Install\n\n```bash\n$ pip install flamingo-pytorch\n```\n\n## Usage\n\n```python\nimport torch\nfrom flamingo_pytorch import PerceiverResampler\n\nperceive = PerceiverResampler(\n    dim = 1024,\n    depth = 2,\n    dim_head = 64,\n    heads = 8,\n    num_latents = 64,    # the number of latents to shrink your media sequence to, perceiver style\n    num_time_embeds = 4  # say you have 4 images maximum in your dialogue\n)\n\nmedias = torch.randn(1, 2, 256, 1024) # (batch, time, sequence length, dimension)\nperceived = perceive(medias) # (1, 2, 64, 1024) - (batch, time, num latents, dimension)\n```\n\nThen you insert the `GatedCrossAttentionBlock` at different intervals in your giant language model. 
Your text would then attend to the perceived media from above\n\nThe recommended way to derive the `media_locations` boolean tensor would be to allocate a special token id to the media, and then, at the start of your large language model, do `media_locations = text_id == media_token_id`\n\n```python\nimport torch\nfrom flamingo_pytorch import GatedCrossAttentionBlock\n\ncross_attn = GatedCrossAttentionBlock(\n    dim = 1024,\n    dim_head = 64,\n    heads = 8\n)\n\ntext = torch.randn(1, 512, 1024)\nperceived = torch.randn(1, 2, 64, 1024)\n\nmedia_locations = torch.randint(0, 2, (1, 512)).bool()\n\ntext = cross_attn(\n    text,\n    perceived,\n    media_locations = media_locations\n)\n```\n\nThat's it!\n\nAttention is all you need.\n\n## Full working example with Flamingo + PaLM 🌴🦩🌴\n\nIntegration with \u003ca href=\"https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html\"\u003ePaLM\u003c/a\u003e\n\nFirst install `vit-pytorch` for the vision encoder\n\n```bash\n$ pip install vit-pytorch\n```\n\nThen\n\n```python\nfrom vit_pytorch.vit import ViT\nfrom vit_pytorch.extractor import Extractor\n\nvit = ViT(\n    image_size = 256,\n    patch_size = 32,\n    num_classes = 1000,\n    dim = 1024,\n    depth = 6,\n    heads = 16,\n    mlp_dim = 2048,\n    dropout = 0.1,\n    emb_dropout = 0.1\n)\n\nvit = Extractor(vit, return_embeddings_only = True)\n\n# first take your trained image encoder and wrap it in an adapter that returns the image embeddings\n# here we use the ViT from the vit-pytorch library\n\nimport torch\nfrom flamingo_pytorch import FlamingoPaLM\n\n# a PaLM language model, the 540 billion parameter model from google that shows signs of general intelligence\n\nflamingo_palm = FlamingoPaLM(\n    num_tokens = 20000,          # number of tokens\n    dim = 1024,                  # dimensions\n    depth = 12,                  # depth\n    heads = 8,                   # attention heads\n    dim_head = 64,               # dimension per attention 
head\n    img_encoder = vit,           # plug in your image encoder (this can be optional if you pass in the image embeddings separately, but you probably want to train end to end given the perceiver resampler)\n    media_token_id = 3,          # the token id representing the [media] or [image]\n    cross_attn_every = 3,        # how often to cross attend\n    perceiver_num_latents = 64,  # perceiver number of latents, should be smaller than the sequence length of the image tokens\n    perceiver_depth = 2          # perceiver resampler depth\n)\n\n# train your PaLM as usual\n\ntext = torch.randint(0, 20000, (2, 512))\n\npalm_logits = flamingo_palm(text)\n\n# after much training off the regular PaLM logits\n# now you are ready to train Flamingo + PaLM\n# by passing in images, it automatically freezes everything but the perceiver and cross attention blocks, as in the paper\n\ndialogue = torch.randint(0, 20000, (4, 512))\nimages = torch.randn(4, 2, 3, 256, 256)\n\nflamingo_logits = flamingo_palm(dialogue, images)\n\n# do your usual cross entropy loss\n```\n\nIt is quite evident where this is all headed if you think beyond just images.\n\n## Inception\n\nFor factual correctness, just imagine where this system would stand if one were to use \u003ca href=\"https://github.com/lucidrains/retro-pytorch\"\u003ea state of the art retrieval language model\u003c/a\u003e as the base.\n\n## Citations\n\n```bibtex\n@article{Alayrac2022Flamingo,\n    title   = {Flamingo: a Visual Language Model for Few-Shot Learning},\n    author  = {Jean-Baptiste Alayrac et al},\n    year    = {2022}\n}\n```\n\n```bibtex\n@inproceedings{Chowdhery2022PaLMSL,\n    title   = {PaLM: Scaling Language Modeling with Pathways},\n    author  = {Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and 
Abhishek Rao and Parker Barnes and Yi Tay and Noam M. Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Benton C. Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garc{\\'i}a and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Oliveira Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark Diaz and Orhan Firat and Michele Catasta and Jason Wei and Kathleen S. Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},\n    year    = {2022}\n}\n```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fflamingo-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fflamingo-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fflamingo-pytorch/lists"}