{"id":34028541,"url":"https://github.com/multimodal/multimodal","last_synced_at":"2026-03-11T13:39:15.572Z","repository":{"id":40688886,"uuid":"247125886","full_name":"multimodal/multimodal","owner":"multimodal","description":"A collection of multimodal datasets, and visual features for VQA and captionning in pytorch. Just run \"pip install multimodal\"","archived":false,"fork":false,"pushed_at":"2022-02-25T12:38:58.000Z","size":2319,"stargazers_count":83,"open_issues_count":3,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-09-05T03:41:04.396Z","etag":null,"topics":["datasets","deep-learning","embeddings","reasoning","vision-and-language","visual-features"],"latest_commit_sha":null,"homepage":"https://multimodal.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/multimodal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-03-13T17:23:34.000Z","updated_at":"2025-08-25T21:43:20.000Z","dependencies_parsed_at":"2022-07-27T15:54:25.249Z","dependency_job_id":null,"html_url":"https://github.com/multimodal/multimodal","commit_stats":null,"previous_names":["cdancette/multimodal"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/multimodal/multimodal","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multimodal%2Fmultimodal","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multimodal%2Fmultimodal/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multimodal%2Fmultimodal/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/multimodal%2Fmultimodal/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/multimodal","download_url":"https://codeload.github.com/multimodal/multimodal/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/multimodal%2Fmultimodal/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30382673,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-11T12:49:11.341Z","status":"ssl_error","status_checked_at":"2026-03-11T12:46:41.342Z","response_time":84,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","deep-learning","embeddings","reasoning","vision-and-language","visual-features"],"created_at":"2025-12-13T17:11:25.168Z","updated_at":"2026-03-11T13:39:15.529Z","avatar_url":"https://github.com/multimodal.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multimodal\n\n[![PyPI](https://img.shields.io/pypi/v/multimodal.svg)](https://pypi.python.org/pypi/multimodal/)\n[![Documentation Status](https://readthedocs.org/projects/multimodal/badge/?version=latest)](https://multimodal.readthedocs.io/en/latest/?badge=latest) [![Downloads](https://pepy.tech/badge/multimodal/week)](https://pepy.tech/project/multimodal) \n[![Join the chat at 
https://gitter.im/multimodal-learning/multimodal](https://badges.gitter.im/multimodal-learning/multimodal.svg)](https://gitter.im/multimodal-learning/multimodal?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n\nA collection of multimodal (vision and language) datasets and visual features for deep learning research. See the [Documentation](https://multimodal.readthedocs.io/en/latest/).\n\n**Pretrained models**\n\n- ALBEF\n\n```python\nfrom multimodal.models import ALBEF\nalbef = ALBEF.from_pretrained()\n```\n\n\n**Visual Features**\n\nCurrently it supports the following visual features (downloaded automatically): \n- COCO [Bottom-Up Top-Down](https://github.com/peteanderson80/bottom-up-attention) features (10-100)\n- COCO [Bottom-Up Top-Down](https://github.com/peteanderson80/bottom-up-attention) features (36)\n\n**Datasets**\n\nIt also supports the following datasets, with their evaluation metric ([VQA evaluation metric](https://visualqa.org/evaluation.html)) \n- VQA v1\n- VQA v2\n- VQA-CP v1\n- VQA-CP v2\n- AdVQA [https://adversarialvqa.github.io](https://adversarialvqa.github.io)\n\n- [CLEVR dataset](https://cs.stanford.edu/people/jcjohns/clevr/)\n\nNote that instantiating these datasets may download large files. You can always pass the `dir_data` argument when instantiating, or set the environment variable `MULTIMODAL_DATA_DIR` so that all data goes to the specified directory.\n\n**Models**\n\n- Bottom-Up and Top-Down attention (UpDown)\n- ALBEF (pretrained model)\n\n\n**WordEmbeddings**\n\nWord embeddings are also provided (either trained from scratch, or pretrained from torchtext and fine-tunable).\n\n\n## Simple Usage\n\nTo install the library, run `pip install multimodal`. 
It supports Python 3.6 and 3.7.\n\n\n### Visual Features\n\nThe available feature class is `COCOBottomUpFeatures`:\n\n```python\n\u003e\u003e\u003e from multimodal.features import COCOBottomUpFeatures\n\u003e\u003e\u003e bottomup = COCOBottomUpFeatures(features=\"trainval_36\", dir_data=\"/tmp\")\n\u003e\u003e\u003e image_id = 13455\n\u003e\u003e\u003e feats = bottomup[image_id]\n\u003e\u003e\u003e print(feats.keys())\n['image_w', 'image_h', 'num_boxes', 'boxes', 'features']\n\u003e\u003e\u003e print(feats[\"features\"].shape)  # numpy array\n(36, 2048)\n```\n\n### Datasets\n\n**VQA**\n\nAvailable VQA datasets are VQA, VQA v2, VQA-CP, VQA-CP v2, and their associated [pytorch-lightning](https://pytorch-lightning.readthedocs.io/en/stable/datamodules.html) data modules.\n\nYou can run a simple evaluation of predictions using the following commands. \nData will be downloaded and processed if necessary. Predictions must follow the official VQA result format (see https://visualqa.org/evaluation.html).\n```bash\n# vqa 1.0\npython -m multimodal vqa-eval -p \u003cpath/to/predictions\u003e -s \"val\"\n# vqa 2.0\npython -m multimodal vqa2-eval -p \u003cpath/to/predictions\u003e -s \"val\"\n# vqa-cp 1.0\npython -m multimodal vqacp-eval -p \u003cpath/to/predictions\u003e -s \"val\"\n# vqa-cp 2.0\npython -m multimodal vqacp2-eval -p \u003cpath/to/predictions\u003e -s \"val\"\n```\n\nTo use the datasets for your training runs, use the following:\n\n```python\n# Visual Question Answering\nimport torch\nfrom multimodal.datasets import VQA, VQA2, VQACP, VQACP2\n\ndataset = VQA(split=\"train\", features=\"coco-bottomup\", dir_data=\"/tmp\")\nitem = dataset[0]\n\ndataloader = torch.utils.data.DataLoader(dataset, collate_fn=VQA.collate_fn)\n\nfor batch in dataloader:\n    out = model(batch)  # `model` is your own torch.nn.Module\n    # training code...\n```\nWe also provide a pytorch_lightning datamodule, available here: `multimodal.datasets.lightning.VQADataModule` and similarly for other VQA datasets.\nSee 
documentation.\n\n**CLEVR**\n\n```python\nfrom multimodal.datasets import CLEVR\n\n# Warning: this will download an 18 GB file. \n# You can specify the multimodal data directory \n#   by providing the dir_data argument\nclevr = CLEVR(split=\"train\") \n```\n\n### Pretrained Tokenizer and Word embeddings\n\nWord embeddings are implemented as pytorch modules. Thus, they are trainable if needed, but can also be frozen.\n\nPretrained embedding weights are downloaded with torchtext. The following pretrained embeddings are available: \n    charngram.100d, fasttext.en.300d, fasttext.simple.300d, glove.42B.300d, glove.6B.100d, glove.6B.200d, glove.6B.300d, glove.6B.50d, glove.840B.300d, glove.twitter.27B.100d, glove.twitter.27B.200d, glove.twitter.27B.25d, glove.twitter.27B.50d\n\nUsage\n\n```python\nfrom multimodal.text import PretrainedWordEmbedding\nfrom multimodal.text import BasicTokenizer\n\n# tokenizer converts words to tokens, and to token_ids. Pretrained tokenizers \n# save token_ids from an existing vocabulary.\ntokenizer = BasicTokenizer.from_pretrained(\"pretrained-vqa\")\n\n# Pretrained word embedding, frozen. The token list is used to initialize the embeddings.\nwemb = PretrainedWordEmbedding.from_pretrained(\"glove.840B.300d\", tokens=tokenizer.tokens, freeze=True)\n\nembeddings = wemb(tokenizer([\"Inputs are batched, and padded. This is the first batch item\", \"This is the second batch item.\"]))\n```\n\n\n### Models\n\nThe Bottom-Up and Top-Down Attention for VQA model is implemented. 
\nTo train, run `python multimodal/models/updown.py --dir-data \u003cpath_to_multimodal_data\u003e --dir-exp logs/vqa2/updown`\n\nIt uses pytorch lightning, with the class `multimodal.models.updown.VQALightningModule`\n\nYou can check the code to see other parameters.\n\nYou can train the model manually:\n\n```python\nimport torch\nimport torch.nn.functional as F\nfrom multimodal.models import UpDownModel\nfrom multimodal.datasets import VQA2\nfrom multimodal.text import BasicTokenizer\nvqa_tokenizer = BasicTokenizer.from_pretrained(\"pretrained-vqa2\")\n\ntrain_dataset = VQA2(split=\"train\", features=\"coco-bottomup\", dir_data=\"/tmp\")\ntrain_loader = torch.utils.data.DataLoader(train_dataset, collate_fn=VQA2.collate_fn)\n\nupdown = UpDownModel(num_ans=len(train_dataset.answers))\noptimizer = torch.optim.Adam(updown.parameters())  # any torch optimizer works\n\nfor batch in train_loader:\n    batch[\"question_tokens\"] = vqa_tokenizer(batch[\"question\"])\n    out = updown(batch)\n    logits = out[\"logits\"]\n    loss = F.binary_cross_entropy_with_logits(logits, batch[\"label\"])\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n```\n\nOr train it with PyTorch Lightning:\n\n```python\nfrom multimodal.datasets.lightning import VQA2DataModule\nfrom multimodal.models import UpDownModel\nfrom multimodal.models.lightning import VQALightningModule\nfrom multimodal.text import BasicTokenizer\nimport pytorch_lightning as pl\n\ntokenizer = BasicTokenizer.from_pretrained(\"pretrained-vqa2\")\n\nvqa2 = VQA2DataModule(\n    features=\"coco-bottomup-36\",\n    batch_size=512,\n    num_workers=4,\n)\n\nvqa2.prepare_data()\nnum_ans = len(vqa2.num_ans)\n\nupdown = UpDownModel(\n    num_ans=num_ans,\n    tokens=tokenizer.tokens,  # to init word embeddings\n)\n\nlightningmodel = VQALightningModule(\n    updown,\n    train_dataset=vqa2.train_dataset,\n    val_dataset=vqa2.val_dataset,\n    tokenizer=tokenizer,\n)\n\ntrainer = pl.Trainer(\n    gpus=1,\n    max_epochs=30,\n    gradient_clip_val=0.25,\n    default_root_dir=\"logs/updown\",\n)\n\ntrainer.fit(lightningmodel, datamodule=vqa2)\n```\n\n\n### API \n\n#### Features\n\n```python\nfeatures = 
COCOBottomUpFeatures(\n    features=\"test2014_36\",   # one of [trainval2014, trainval2014_36, test2014, test2014_36, test2015, test2015_36]\n    dir_data=None             # directory for multimodal data. By default, in the application directory for multimodal.\n)\n```\n\nThen, to get the features for a specific image: \n```python\nfeats = features[image_id]\n```\n\nThe features have the following keys: \n```python\n{\n    \"image_id\": int,\n    \"image_w\": int,\n    \"image_h\": int,\n    \"num_boxes\": int,\n    \"boxes\": np.array(N, 4),\n    \"features\": np.array(N, 2048),\n}\n```\n\n#### Datasets\n```python\n# Datasets\ndataset = VQA(\n    dir_data=None,       # dir where multimodal data will be downloaded. Default is HOME/.multimodal\n    features=None,       # which visual features should be used. Choices: coco-bottomup or coco-bottomup-36\n    split=\"train\",       # \"train\", \"val\" or \"test\"\n    min_ans_occ=8,       # Minimum occurrences to keep an answer.\n    dir_features=None,   # Specific directory for features. By default, they will be located in dir_data/features.\n    label=\"multilabel\",  # \"multilabel\", or \"best\". This changes the shape of the ground truth label (class number for best, or tensor of scores for multilabel)\n)\nitem = dataset[0]\n```\n\nThe `item` will contain the following keys: \n```python\n\u003e\u003e\u003e print(item.keys())\n{'image_id',\n'question_id',\n'question_type',\n'question',                 # full question (not tokenized, tokenization is done in the WordEmbedding class)\n'answer_type',              # yes/no, number or other\n'multiple_choice_answer',\n'answers',\n'label',                    # either class label (if label=\"best\") or target class scores (tensor of N classes).\n'scores',                   # VQA scores for every answer\n}\n```\n\n\n\n#### Word embeddings\n\n```python\n# Word embedding from scratch, and trainable.\nwemb = WordEmbedding(\n    tokens,   # Token list. 
We recommend using torchtext basic_english tokenizer.\n    dim=50,   # Dimension for word embeddings.\n    freeze=False   # freeze=True means that word embeddings will be set with `requires_grad=False`. \n)\n\n\n\nwemb = WordEmbedding.from_pretrained(\n    name=\"glove.840B.300d\", # embedding name (from torchtext)\n    tokens=tokens,          # tokens to load from the word embedding.\n    max_tokens=None,        # if set to N, only the N most common tokens will be loaded.\n    freeze=True,            # same parameter as default model. \n    dir_data=None,          # dir where data will be downloaded. Default is multimodal directory in apps dir.\n)\n\n# Forward pass\nsentences = [\"How many people are in the picture?\", \"What color is the car?\"]\nwemb(\n    sentences, \n    tokenized=False  # set tokenized to True if sentence is already tokenized.\n)\n\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultimodal%2Fmultimodal","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmultimodal%2Fmultimodal","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmultimodal%2Fmultimodal/lists"}