{"id":28795558,"url":"https://github.com/ukplab/mmt-retrieval","last_synced_at":"2025-08-24T19:24:12.716Z","repository":{"id":57442464,"uuid":"348761736","full_name":"UKPLab/MMT-Retrieval","owner":"UKPLab","description":null,"archived":false,"fork":false,"pushed_at":"2022-12-10T07:08:17.000Z","size":162,"stargazers_count":131,"open_issues_count":2,"forks_count":14,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-06-18T03:09:18.670Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UKPLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-17T15:35:29.000Z","updated_at":"2025-02-09T12:59:00.000Z","dependencies_parsed_at":"2023-01-26T04:31:31.064Z","dependency_job_id":null,"html_url":"https://github.com/UKPLab/MMT-Retrieval","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/UKPLab/MMT-Retrieval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2FMMT-Retrieval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2FMMT-Retrieval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2FMMT-Retrieval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2FMMT-Retrieval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UKPLab","download_url":"https://codeload.github.com/UKPLab/MMT-Retrieval/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UKPLab%2FM
MT-Retrieval/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260477931,"owners_count":23015066,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-18T03:09:18.506Z","updated_at":"2025-08-24T19:24:12.686Z","avatar_url":"https://github.com/UKPLab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MMT-Retrieval: Image Retrieval and more using Multimodal Transformers (OSCAR, UNITER, M3P \u0026 Co)\n\nThis project provides an easy way to use the recent pre-trained multimodal Transformers \nlike [OSCAR](https://github.com/microsoft/Oscar), [UNITER/ VILLA](https://github.com/zhegan27/VILLA) or [M3P (multilingual!)](https://github.com/microsoft/M3P)\nfor image search and more.\n\nThe code is primarily written for image-text retrieval.\nStill, many other Vision+Language tasks, beside image-text retrieval, should work out of the box using our code or require just small changes.\n\nThere is currently no unified approach for how the visual input is handled and each model uses their own slightly different approach.\nWe provide a common interface for all models and support for multiple feature file formats.\nThis greatly simplifies the process of running the models.\n\nOur project allows you to run a model in a few lines of code and offers easy fine-tuning of your own custom models.\n\nWe also provide our fine-tuned image-text-retrieval models for download, so you can get directly started.\nCheck out [our example for Image Search on MSCOCO using our fine-tuned models 
here](examples/applications/Image_Search.ipynb).\n\n## Citing \u0026 Authors\nIf you find this repository helpful, feel free to cite our publication [Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval](https://arxiv.org/abs/2103.11920):\n```\n@article{geigle:2021:arxiv,\n  author    = {Gregor Geigle and \n                Jonas Pfeiffer and \n                Nils Reimers and \n                Ivan Vuli\\'{c} and \n                Iryna Gurevych},\n  title     = {Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval},\n  journal   = {arXiv preprint},\n  volume    = {abs/2103.11920},\n  year      = {2021},\n  url       = {http://arxiv.org/abs/2103.11920},\n  archivePrefix = {arXiv},\n  eprint    = {2103.11920}\n}\n```\n\n\u003e **Abstract:** \n\u003e Current state-of-the-art approaches to cross-modal retrieval process text and \n\u003e visual input jointly, relying on Transformer-based architectures with \n\u003e cross-attention mechanisms that attend over all words and objects in an image. \n\u003e While offering unmatched retrieval performance, such models: \\textbf{1)} \n\u003e are typically pretrained from scratch and thus less scalable, \\textbf{2)} \n\u003e suffer from huge retrieval latency and inefficiency issues, which makes \n\u003e them impractical in realistic applications. To address these crucial gaps \n\u003e towards both improved and efficient cross-modal retrieval, we propose a novel \n\u003e fine-tuning framework which turns any pretrained text-image multi-modal model\n\u003e into an efficient retrieval model. The framework is based on a cooperative \n\u003e retrieve-and-rerank approach which combines: \\textbf{1)} twin networks to\n\u003e separately encode all items of a corpus, enabling efficient initial \n\u003e retrieval, and \\textbf{2)} a cross-encoder component for a more nuanced\n\u003e (i.e., smarter) ranking of the retrieved small set of items. 
\n\u003e We also propose to jointly fine-tune the two components with shared weights, \n\u003e yielding a more parameter-efficient model. Our experiments on a series of \n\u003e standard cross-modal retrieval benchmarks in monolingual, multilingual, \n\u003e and zero-shot setups, demonstrate improved accuracy and huge efficiency \n\u003e benefits over the state-of-the-art cross-encoders.\n\n\nDon't hesitate to send me an e-mail or report an issue, if something is broken or if you have further questions or feedback.\n\n\n\nContact person: Gregor Geigle, [gregor.geigle@gmail.com](mailto:gregor.geigle@gmail.com)\n\nhttps://www.ukp.tu-darmstadt.de/\n\nhttps://www.tu-darmstadt.de/\n\n\u003eThis repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.\n\n\n## Installation\nWe recommend **Python 3.6** or higher, **[PyTorch 1.6.0](https://pytorch.org/get-started/locally/)** or higher,\n**[transformers v4.1.1](https://github.com/huggingface/transformers)** or higher,\nand **[sentence-transformer 0.4.1](https://github.com/UKPLab/sentence-transformers)** or higher up to 1.2.1.\n\n\n**Install with pip**\n\nInstall `mmt-retrieval` with `pip`: \n```\npip install mmt-retrieval\n```\n\n**Install from sources**\n\nAlternatively, you can also clone the latest version from the [repository](https://github.com/UKPLab/MMT-Retrieval) and install it directly from the source code:\n````\npip install -e .\n```` \n\n**PyTorch with CUDA**\nIf you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA Version. 
Follow\n[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for further details how to install PyTorch.\n\n\n\n## Getting Started\nWith our repository, you can get started using the multimodal Transformers in a few lines of code.\nCheck out [our example for Image Search on MSCOCO using our fine-tuned models here](examples/applications/Image_Search.ipynb).\nOr go along with the following steps to get started with your own project.\n\n\n### Select the Model\nWe provide our fine-tuned Image-Text Retrieval models for download.\nWe also provide links to where to download the pre-trained models and models that are fine-tuned for other tasks.\n\nAlternatively, you can fine-tune your own model, too. See [here](#training) for more.\n#### Our Fine-Tuned Image-Text Retrieval Models\nWe publish our jointly trained fine-tuned models.\nThey can be used both to encode images and text in a multimodal embedding space \nand to cross-encode pairs for a pairwise similarity.\n\n| Model | URL |\n|-------|-----|\n| OSCAR (Flickr30k) | https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/oscar_join_flickr30k.zip |\n| OSCAR (MSCOCO) | https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/oscar_join_mscoco.zip |\n| M3P (Multi30k - en, de fr, cs) | https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/m3p_join_multi30k.zip |\n\n\n\n#### Other Pre-Trained or Fine-Tuned Transformer\nWe currently do not directly support downloading of the different pre-trained Transformer models.\nPlease manually download them using the links in the respective repositories:\n[OSCAR](https://github.com/microsoft/Oscar), [UNITER/ VILLA](https://github.com/zhegan27/VILLA), [M3P](https://github.com/microsoft/M3P).\nWe present [here](#training) examples on how to initialize your own models with the pre-trained Transformers.\n\nOSCAR provides many already fine-tuned models for different tasks for download (see their 
MODEL_ZOO.md).\nWe provide the ability to convert those models to our framework so you can quickly start using them.\n````python\nfrom mmt_retrieval.util import convert_finetuned_oscar\n\ndownloaded_folder_path = \".../oscar-base-ir-finetune/checkpoint-29-132780\"\nconverted_model = convert_finetuned_oscar(downloaded_folder_path)\nconverted_model.save(\"new_save_location_for_converted_model\")\n````\n\n\n### Step 0: Image Feature Pre-Processing\nAll currently supported models require a pre-processing step\nwhere we extract the regions of interest (which serve as image input analogous to tokens for the language input) from the images using a Faster R-CNN object detection model.\n\nWhich detection model is needed depends on the model that you are using.\nCheck out [our guide](documentation/image_features.md) where we have gathered all needed information to get started.\n\nIf available, we also point to already pre-processed image features that can be downloaded for a quicker start.\n\n#### Loading Features and Image Input\nWe load image features in a dictionary-like object (`model.image_dict`) at the start.\nWe support various storage formats for the features (see the guide above).\nEach image is uniquely identified by its image id in this dictionary.\n\nThe advantage of the dictionary approach is that we can designate the image input by its id, which is then internally\nresolved to the features.\n\n\n#### Loading Features Just-In-Time (RAM Constraints)\nThe image features require a lot of additional memory.\nFor this reason, we support just-in-time loading of the features from disc.\nThis requires one feature file for each image. 
\nMany of the downloadable features are saved in a single file.\nWe provide code to split those big files into separate files, one for each image.\n\n````python\nfrom mmt_retrieval.util import split_oscar_image_feature_file_to_npz, split_tsv_features_to_npz\n````\n\n\n### Step 1: Getting Started\nThe following is an example showcasing all steps needed to get started encoding multimodal inputs with our code.\n\n````python\nfrom mmt_retrieval import MultimodalTransformer\n\n# Loading a jointly trained model that can both embed and cross-encode multimodal input\nmodel_path = \"https://public.ukp.informatik.tu-darmstadt.de/reimers/mmt-retrieval/models/v1/oscar_join_flickr30k.zip\"\nmodel = MultimodalTransformer(model_name_or_path=model_path)\n\n# Image ids are the unique identifier number (as string) of each image. If you save the image features separately for each image, this would be the file name\nimage_ids = [\"0\", \"1\", \"5\"]\n# We must load the image features in some way before we can use the model\n# Refer to Step 0 for more details on how to generate the features\nfeature_folder = \"path/to/processed/features\"\n# Directly load the features from disc. Requires more memory. 
\n# Increase max_workers for more concurrent threads for faster loading with many features\n# Remove select to load the entire folder\nmodel.image_dict.load_features_folder(feature_folder, max_workers=1, select=image_ids)\n## OR\n# Only load the file paths so that features are loaded later just-in-time when they are required.\n# Recommended with restricted memory and/or a lot of images\n# Remove select to load the entire folder\nmodel.image_dict.load_file_names(feature_folder, select=image_ids)\n\nsentences = [\"The red brown fox jumped over the fence\", \"A dog being good\"]\n\n# Get Embeddings (as a list of numpy arrays)\nsentence_embeddings = model.encode(sentences=sentences, convert_to_numpy=True) # convert_to_numpy=True is default\nimage_embeddings = model.encode(images=image_ids, convert_to_numpy=True)\n\n# Get Pairwise Similarity Matrix (as a tensor)\nsimilarities = model.encode(sentences=sentences, images=image_ids, output_value=\"logits\", convert_to_tensor=True, cross_product_input=True)\nsimilarities = similarities[:,-1].reshape(len(image_ids), len(sentences))\n````\n\n\n## Experiments and Training\n\u003ca name=\"training\"\u003e\u003c/a\u003e\n\nSee [our examples](examples/experiments/README.md) to learn how to fine-tune and evaluate the multimodal Transformers.\nWe provide instructions for fine-tuning your own models with our image-text retrieval setup, show how to replicate our experiments,\nand give pointers on how to train your own models, potentially beyond image-text retrieval.\n\n\n### Expected Results with our Fine-Tuned Models\nWe report the JOIN+CO (i.e., retrieve \u0026 re-rank with a jointly trained model) results of our published models.\nRefer to our publications for more detailed results.\n\nImage Retrieval for MSCOCO/ Flickr30k:\n\n| Model                | Dataset  | R@1  | R@5  | R@10 |\n|----------------------|----------|------|------|------|\n| oscar-join-mscoco    |    
MSCOCO (5k images) | 54.7 | 81.3 | 88.9 |\n| oscar-join-flickr30k | Flickr30k (1k images) | 76.4 | 93.6 | 96.2 |\n\nMultilingual Image Retrieval for Multi30k (in mR):\n\n| Model                | en        | de   | fr   | cs   |\n|----------------------|-----------|------|------|------|\n| m3p-join-multi30k    |        83.0 | 79.2 | 75.9 |   74 |","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fukplab%2Fmmt-retrieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fukplab%2Fmmt-retrieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fukplab%2Fmmt-retrieval/lists"}