{"id":13547682,"url":"https://github.com/YehLi/xmodaler","last_synced_at":"2025-04-02T20:30:39.943Z","repository":{"id":39163453,"uuid":"380299157","full_name":"YehLi/xmodaler","owner":"YehLi","description":"X-modaler is a versatile and high-performance codebase for cross-modal analytics(e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval).","archived":false,"fork":false,"pushed_at":"2023-02-27T18:28:38.000Z","size":12754,"stargazers_count":970,"open_issues_count":16,"forks_count":105,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-03-27T22:12:24.916Z","etag":null,"topics":["cross-modal-retrieval","image-captioning","pretraining","tden","video-captioning","vision-and-language","visual-question-answering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YehLi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2021-06-25T16:42:09.000Z","updated_at":"2025-03-05T16:31:36.000Z","dependencies_parsed_at":"2024-01-14T03:46:36.784Z","dependency_job_id":"61418e44-3ecf-4af1-b6b2-c5f19dedf263","html_url":"https://github.com/YehLi/xmodaler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YehLi%2Fxmodaler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YehLi%2Fxmodaler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YehLi%2Fxmo
daler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YehLi%2Fxmodaler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YehLi","download_url":"https://codeload.github.com/YehLi/xmodaler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246887780,"owners_count":20850140,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-modal-retrieval","image-captioning","pretraining","tden","video-captioning","vision-and-language","visual-question-answering"],"created_at":"2024-08-01T12:00:59.576Z","updated_at":"2025-04-02T20:30:38.081Z","avatar_url":"https://github.com/YehLi.png","language":"Python","funding_links":[],"categories":["New Large-Scale Datasets"],"sub_categories":["Libraries"],"readme":"# X-modaler\n[X-modaler](https://xmodaler.readthedocs.io/en/latest/) is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval). 
This codebase unifies comprehensive high-quality modules in state-of-the-art vision-language techniques, which are organized in a standardized and user-friendly fashion.\n\nThe original paper can be found [here](https://arxiv.org/pdf/2108.08217.pdf).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/task.jpg\" width=\"800\"/\u003e\n\u003c/p\u003e\n\n## Installation\nSee [installation instructions](https://xmodaler.readthedocs.io/en/latest/tutorials/installation.html).\n\n### Requirements\n* Linux or macOS with Python ≥ 3.6\n* PyTorch ≥ 1.8 and a torchvision version that matches the PyTorch installation. Install them together from pytorch.org to ensure they match.\n* fvcore\n* pytorch_transformers\n* jsonlines\n* pycocotools\n\n## Getting Started\nSee [Getting Started with X-modaler](https://xmodaler.readthedocs.io/en/latest/tutorials/getting_started.html).\n\n### Training \u0026 Evaluation in Command Line\n\nWe provide a script, \"train_net.py\", that can train all the configs provided in X-modaler. 
You may want to use it as a reference to write your own training script.\n\nTo train a model (e.g., UpDown) with \"train_net.py\", first set up the corresponding datasets following [datasets](xmodaler/datasets/README.md), then run:\n```\n# Teacher forcing\npython train_net.py --num-gpus 4 \\\n \t--config-file configs/image_caption/updown.yaml\n\n# Reinforcement learning\npython train_net.py --num-gpus 4 \\\n \t--config-file configs/image_caption/updown_rl.yaml\n```\n\n## Model Zoo and Baselines\nA large set of baseline results and trained models are available [here](https://xmodaler.readthedocs.io/en/latest/notes/benchmarks.html).\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd colspan=\"4\" align=\"center\"\u003e\u003cfont size=3\u003e\u003cb\u003eImage Captioning\u003c/b\u003e\u003c/font\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eAttention\u003c/td\u003e\n    \u003ctd\u003e Show, attend and tell: Neural image caption generation with visual attention \u003c/td\u003e\n    \u003ctd\u003eICML\u003c/td\u003e\n    \u003ctd\u003e2015\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eLSTM-A3\u003c/td\u003e\n    \u003ctd\u003e Boosting image captioning with attributes \u003c/td\u003e\n    \u003ctd\u003eICCV\u003c/td\u003e\n    \u003ctd\u003e2017\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eUp-Down\u003c/td\u003e\n    \u003ctd\u003e Bottom-up and top-down attention for image captioning and visual question answering \u003c/td\u003e\n    \u003ctd\u003eCVPR\u003c/td\u003e\n    \u003ctd\u003e2018\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eGCN-LSTM\u003c/td\u003e\n    \u003ctd\u003e Exploring visual relationship for image captioning \u003c/td\u003e\n    \u003ctd\u003eECCV\u003c/td\u003e\n    \u003ctd\u003e2018\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTransformer\u003c/td\u003e\n    \u003ctd\u003e Conceptual captions: A cleaned, hypernymed, 
image alt-text dataset for automatic image captioning \u003c/td\u003e\n    \u003ctd\u003eACL\u003c/td\u003e\n    \u003ctd\u003e2018\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eMeshed-Memory\u003c/td\u003e\n    \u003ctd\u003e Meshed-Memory Transformer for Image Captioning \u003c/td\u003e\n    \u003ctd\u003eCVPR\u003c/td\u003e\n    \u003ctd\u003e2020\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eX-LAN\u003c/td\u003e\n    \u003ctd\u003e X-Linear Attention Networks for Image Captioning \u003c/td\u003e\n    \u003ctd\u003eCVPR\u003c/td\u003e\n    \u003ctd\u003e2020\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd colspan=\"4\" align=\"center\"\u003e\u003cfont size=3\u003e\u003cb\u003eVideo Captioning\u003c/b\u003e\u003c/font\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eMP-LSTM\u003c/td\u003e\n    \u003ctd\u003e Translating Videos to Natural Language Using Deep Recurrent Neural Networks \u003c/td\u003e\n    \u003ctd\u003eNAACL HLT\u003c/td\u003e\n    \u003ctd\u003e2015\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTA\u003c/td\u003e\n    \u003ctd\u003e Describing Videos by Exploiting Temporal Structure \u003c/td\u003e\n    \u003ctd\u003eICCV\u003c/td\u003e\n    \u003ctd\u003e2015\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTransformer\u003c/td\u003e\n    \u003ctd\u003e Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning \u003c/td\u003e\n    \u003ctd\u003eACL\u003c/td\u003e\n    \u003ctd\u003e2018\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTDConvED\u003c/td\u003e\n    \u003ctd\u003e Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning \u003c/td\u003e\n    \u003ctd\u003eAAAI\u003c/td\u003e\n    \u003ctd\u003e2019\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd colspan=\"4\" align=\"center\"\u003e\u003cfont 
size=3\u003e\u003cb\u003eVision-Language Pretraining\u003c/b\u003e\u003c/font\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eUniter\u003c/td\u003e\n    \u003ctd\u003e UNITER: UNiversal Image-TExt Representation Learning \u003c/td\u003e\n    \u003ctd\u003eECCV\u003c/td\u003e\n    \u003ctd\u003e2020\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTDEN\u003c/td\u003e\n    \u003ctd\u003e Scheduled Sampling in Vision-Language Pretraining\nwith Decoupled Encoder-Decoder Network \u003c/td\u003e\n    \u003ctd\u003eAAAI\u003c/td\u003e\n    \u003ctd\u003e2021\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n\n#### Image Captioning on MSCOCO (Cross-Entropy Loss)\n| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| LSTM-A3 | [GoogleDrive](https://drive.google.com/file/d/13fJVIK7ZgQnNMWzIbFicETDx6AgLg0NH/view?usp=sharing)| 75.3 | 59.0 | 45.4 | 35.0 | 26.7 | 55.6 | 107.7|  19.7 |\n| Attention | [GoogleDrive](https://drive.google.com/file/d/1aw8lPcDlf8C8UPsphwqbMAsq5-YSHIEf/view?usp=sharing) | 76.4 | 60.6 | 46.9 | 36.1 | 27.6 | 56.6 | 113.0 | 20.4 |\n| Up-Down | [GoogleDrive](https://drive.google.com/file/d/1giOJ5llaNjXz2JClN3Mqe93VIy1Fu5pq/view?usp=sharing) | 76.3 | 60.3 | 46.6 | 36.0 | 27.6 | 56.6 | 113.1 | 20.7 |\n| GCN-LSTM | [GoogleDrive](https://drive.google.com/file/d/1eLZqt2xS32lUOQibxEDclwANMtska4L9/view?usp=sharing) |76.8 | 61.1 | 47.6 | 36.9 | 28.2 | 57.2 | 116.3 | 21.2 |\n| Transformer | [GoogleDrive](https://drive.google.com/file/d/1Q6Tt2z_NKmnr0ai0uRRNyap2-DxxM7Wy/view?usp=sharing) | 76.4 | 60.3 | 46.5 | 35.8|28.2|56.7| 116.6| 21.3 |\n| Meshed-Memory | [GoogleDrive](https://drive.google.com/file/d/1i4JZ8rbLiWRGtCs8wdRG047pbZA-BL2x/view?usp=sharing) | 76.3 | 60.2 | 46.4 | 35.6 | 28.1 | 56.5 | 116.0 | 21.2 |\n| X-LAN | 
[GoogleDrive](https://drive.google.com/file/d/1zgUWEDD7EiRyih8G_DyE6unshjKjeKjV/view?usp=sharing) | 77.5 | 61.9 | 48.3 | 37.5 | 28.6 | 57.6 | 120.7 | 21.9 |\n| TDEN | [GoogleDrive](https://drive.google.com/file/d/19alfPj-gIudoL5CHsS4VwhfnU-FhTXW3/view?usp=sharing) | 75.5 | 59.4 | 45.7 | 34.9 | 28.7 | 56.7 | 116.3 | 22.0 |\n\n#### Image Captioning on MSCOCO (CIDEr Score Optimization)\n| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| LSTM-A3 | [GoogleDrive](https://drive.google.com/file/d/1KELHgYpBh5lsIiQ9yb9o127tea8_nbHo/view?usp=sharing)| 77.9 | 61.5| 46.7| 35.0| 27.1| 56.3| 117.0| 20.5 |\n| Attention | [GoogleDrive](https://drive.google.com/file/d/1m04qezTUJpdkBI3oIo_5Y9fIZG7_jZ2S/view?usp=sharing) | 79.4| 63.5| 48.9| 37.1| 27.9| 57.6| 123.1| 21.3 |\n| Up-Down | [GoogleDrive](https://drive.google.com/file/d/1tHM06k413ANuAr7a5jCAtKeN_lQ-ieBk/view?usp=sharing) | 80.1| 64.3| 49.7| 37.7| 28.0| 58.0| 124.7| 21.5 |\n| GCN-LSTM | [GoogleDrive](https://drive.google.com/file/d/1qwilTeK2WQCZEDXcJAmmteLZfLOEhg7P/view?usp=sharing) | 80.2| 64.7| 50.3| 38.5| 28.5| 58.4| 127.2| 22.1 |\n| Transformer | [GoogleDrive](https://drive.google.com/file/d/1y3E4t5pQUuvN_gB_tgBVX9HvzM5QSex5/view?usp=sharing) | 80.5| 65.4| 51.1| 39.2| 29.1| 58.7| 130.0| 23.0 |\n| Meshed-Memory | [GoogleDrive](https://drive.google.com/file/d/1GkvwhTzjGQG4fUbCl1-N_TFd8HowOnfy/view?usp=sharing) | 80.7| 65.5| 51.4| 39.6| 29.2| 58.9| 131.1| 22.9 |\n| X-LAN | [GoogleDrive](https://drive.google.com/file/d/13b6nhbnq4h8JKbS0oQB_F2tnRUiUt5g-/view?usp=sharing) | 80.4| 65.2| 51.0| 39.2| 29.4| 59.0| 131.0| 23.2 |\n| TDEN | [GoogleDrive](https://drive.google.com/file/d/1GTbbwfbJHIu6uDmcLY-pedCiuWHyR7nK/view?usp=sharing) | 81.3| 66.3| 52.0| 40.1| 29.6| 59.8| 132.6| 23.4 |\n\n#### Video Captioning on MSVD\n| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |\n| 
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| MP-LSTM | [GoogleDrive](https://drive.google.com/file/d/1NDjaCyBntQZI3ehQ8QyUMTMrb1e6Dgsp/view?usp=sharing)| 77.0 | 65.6 | 56.9 | 48.1 | 32.4 | 68.1 | 73.1 | 4.8 |\n| TA | [GoogleDrive](https://drive.google.com/file/d/1SqvugATqHU3Le1jtTQKnL3FADJ7kbJK0/view?usp=sharing)| 80.4 | 68.9 | 60.1 | 51.0 | 33.5 | 70.0 | 77.2 | 4.9 | \n| Transformer | [GoogleDrive](https://drive.google.com/file/d/1NlwZrAhGE9RPbWdypVz-Tkirt4u8E1t0/view?usp=sharing)| 79.0 | 67.6 | 58.5 | 49.4 | 33.3 | 68.7 | 80.3 | 4.9 |\n| TDConvED | [GoogleDrive](https://drive.google.com/file/d/1Th9FJe8o_4bMULuoCKqDHP_4Faa0RabZ/view?usp=sharing)| 81.6 | 70.4 | 61.3 | 51.7 | 34.1 | 70.4 | 77.8 | 5.0 |\n\n#### Video Captioning on MSR-VTT\n| Name | Model | BLEU@1 | BLEU@2 | BLEU@3 | BLEU@4 | METEOR | ROUGE-L | CIDEr-D | SPICE |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| MP-LSTM | [GoogleDrive](https://drive.google.com/file/d/1OBhtruTexuYV_MbiUL4obfUoNKZbEiUd/view?usp=sharing)| 73.6 | 60.8 | 49.0 | 38.6 | 26.0 | 58.3 | 41.1 | 5.6  |\n| TA | [GoogleDrive](https://drive.google.com/file/d/126nPL9lC6_Qa6_hMs32V1zSsJSDxpR9-/view?usp=sharing)| 74.3 | 61.8 | 50.3 | 39.9 | 26.4 | 59.4 | 42.9 | 5.8  | \n| Transformer | [GoogleDrive](https://drive.google.com/file/d/1OEYQb4521fYlr40uQRn0sQb4eMsrtoNR/view?usp=sharing) | 75.4 | 62.3 | 50.0 | 39.2 | 26.5 | 58.7 | 44.0 | 5.9  |\n| TDConvED | [GoogleDrive](https://drive.google.com/file/d/1A3OGvjCpXUI6p1vy1qbNTVGLy5a0b3Dc/view?usp=sharing)| 76.4 | 62.3 | 49.9 | 38.9 | 26.3 | 59.0 | 40.7 | 5.7  |\n\n#### Visual Question Answering\n| Name | Model | Overall | Yes/No | Number | Other |\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| Uniter | [GoogleDrive](https://drive.google.com/file/d/1cjBAeYSuSEN_IlQCnqtIoalkATMSQs87/view?usp=sharing) | 70.1 | 86.8 | 53.7 | 59.6 |\n| TDEN | 
[GoogleDrive](https://drive.google.com/file/d/1hwcDUboyCXghETamS_APJL8eGKY9OgFD/view?usp=sharing) | 71.9 | 88.3 | 54.3 | 62.0 |\n\n#### Caption-Based Image Retrieval on Flickr30k\n| Name | Model | R1 | R5 | R10 |\n| :---: | :---: | :---: | :---: | :---: |\n| Uniter | [GoogleDrive](https://drive.google.com/file/d/1hvoWMmHjSvxp3zqW10L7PoBQGbxM9MiF/view?usp=sharing) | 61.6 | 87.7 | 92.8 |\n| TDEN | [GoogleDrive](https://drive.google.com/file/d/1SqYscN6UCbifxhMJ-ScpiLgWepMSx7uq/view?usp=sharing) | 62.0 | 86.6 | 92.4 |\n\n#### Visual Commonsense Reasoning\n| Name | Model | Q -\u003e A | QA -\u003e R | Q -\u003e AR |\n| :---: | :---: | :---: | :---: | :---: |\n| Uniter | [GoogleDrive](https://drive.google.com/file/d/1Edx9uorwDgI5nZRf9M3XJDRIIoRa5TmP/view?usp=sharing) | 73.0 | 75.3 | 55.4 |\n| TDEN | [GoogleDrive](https://drive.google.com/file/d/1WZfvo_PyHQwdO-DU_GRWWjbKSzwfyBFO/view?usp=sharing) | 75.0 | 76.5 | 57.7 |\n\n## License\nX-modaler is released under the [Apache License, Version 2.0](LICENSE).\n\n## Citing X-modaler\nIf you use X-modaler in your research, please use the following BibTeX entry.\n\n```BibTeX\n@inproceedings{Xmodaler2021,\n  author =       {Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei},\n  title =        {X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics},\n  booktitle =    {Proceedings of the 29th ACM International Conference on Multimedia},\n  year =         {2021}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYehLi%2Fxmodaler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FYehLi%2Fxmodaler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYehLi%2Fxmodaler/lists"}