{"id":13575128,"url":"https://github.com/kohjingyu/fromage","last_synced_at":"2026-02-02T13:55:52.907Z","repository":{"id":65624870,"uuid":"593850163","full_name":"kohjingyu/fromage","owner":"kohjingyu","description":"🧀 Code and models for the ICML 2023 paper \"Grounding Language Models to Images for Multimodal Inputs and Outputs\".","archived":false,"fork":false,"pushed_at":"2023-10-30T17:11:45.000Z","size":41888,"stargazers_count":477,"open_issues_count":5,"forks_count":35,"subscribers_count":12,"default_branch":"main","last_synced_at":"2024-11-05T10:45:49.502Z","etag":null,"topics":["computer-vision","large-language-models","machine-learning","natural-language-processing"],"latest_commit_sha":null,"homepage":"https://jykoh.com/fromage","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kohjingyu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-01-27T01:03:38.000Z","updated_at":"2024-11-05T08:51:02.000Z","dependencies_parsed_at":"2023-09-29T03:30:08.613Z","dependency_job_id":"b35a36a5-7191-49fd-89d4-0729d334f123","html_url":"https://github.com/kohjingyu/fromage","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kohjingyu%2Ffromage","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kohjingyu%2Ffromage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kohjingyu%2Ffromage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/kohjingyu%2Ffromage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kohjingyu","download_url":"https://codeload.github.com/kohjingyu/fromage/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247237672,"owners_count":20906325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","large-language-models","machine-learning","natural-language-processing"],"created_at":"2024-08-01T15:00:58.485Z","updated_at":"2026-02-02T13:55:52.879Z","avatar_url":"https://github.com/kohjingyu.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"readme":"# Grounding Language Models to Images for Multimodal Inputs and Outputs\n\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"FROMAGe model architecture\" src=\"./teaser.png\" width=\"90%\"\u003e\n\u003cbr/\u003e\u003cbr/\u003e\n\u003cimg alt=\"FROMAGe chat animation\" src=\"./teaser_gif.gif\" width=\"40%\"\u003e\n\u003c/p\u003e\n\nThis repository hosts the code and model weights for FROMAGe.\n\n[Paper](https://arxiv.org/abs/2301.13823) | [Project Webpage](https://jykoh.com/fromage) | [Demo](https://huggingface.co/spaces/jykoh/fromage)\n\n\n## Setup instructions\n\n### Environment\nSet up a new virtualenv, and install required libraries:\n```\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\nAdd the `fromage` library to PYTHONPATH:\n```\nexport PYTHONPATH=$PYTHONPATH:/home/path/to/fromage/\n```\n\n### Pretrained Checkpoints\n\nThe FROMAGe model weights 
(linear layers and [RET] embedding) are small (around 11MB), and are included in this Git repo. They will be in the `fromage_model/` folder after cloning. The checkpoint and model config in `fromage_model/` reproduce the results reported in our paper.\n\nWe have also included a second model trained with a stronger visual linear layer (4 visual tokens instead of 1), located at `fromage_model/fromage_vis4`. This model generally does better on dialogue settings and does not require as much tuning of inference time hyperparameters, as it is able to better represent more complex images.\n\n### Precomputed Embeddings For Image Retrieval\n\nThe visual embeddings for Conceptual Captions images with valid URLs are precomputed and stored at this [URL](https://drive.google.com/file/d/1wMojZNqEwApNlsCZVvSgQVtZLgbeLoKi/view?usp=share_link). These are used to enable the model to retrieve images. The embeddings take up around 3GB, and are compatible with both model configs we provide. Download the files and place `cc3m_embeddings.pkl` into the `fromage_model/` directory.\n\nIf you wish to precompute these embeddings for a different set of image URLs or for a different model, edit `fromage/extract_img_embs.py` with the list of image URLs and run it:\n\n```python fromage/extract_img_embs.py```\n\n\n## Inference\n\nCheck out `FROMAGe_example_notebook.ipynb` for examples on calling the model for inference. Several of the figures presented in the paper are reproduced in this notebook using greedy decoding of the model. Note that there may be minor differences in image outputs due to CC3M images being lost over time.\n\n\n## Training\n\n### Preparing CC3M\n\nOur model is trained on the [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions) dataset. 
After following the instructions on the website to download the captions and images, format them into a `.tsv` file as follows:\n\n```\ncaption image\nA picture of a cat  cat.png\nMountains  mountain.png\n```\nwhere each line contains the caption followed by the filename of the image file. Save these `.tsv` files into the `dataset/` folder (the default names expected are `cc3m_train.tsv` and `cc3m_val.tsv`). The repo contains two placeholder files, and you will have to replace them with the appropriate data.\n\nThe corresponding image files should be saved in the `data/` directory. The directory can be changed with the `--image-dir` runtime flag.\n\n\n### Training FROMAGe\n\nAfter preparing CC3M as detailed above, you can start a new training job with the following command:\n\n```\nrandport=$(shuf -i8000-9999 -n1)  # Generate a random port number\npython -u main.py \\\n    --dist-url \"tcp://127.0.0.1:${randport}\" --dist-backend 'nccl' \\\n    --multiprocessing-distributed --world-size 1 --rank 0 \\\n    --dataset=cc3m  --val-dataset=cc3m \\\n    --opt-version='facebook/opt-6.7b' --visual-model='openai/clip-vit-large-patch14' \\\n    --exp_name='fromage_exp' --image-dir='data/'  --log-base-dir='runs/' \\\n    --batch-size=180  --val-batch-size=100  --learning-rate=0.0003 --precision='bf16'  --print-freq=100\n```\n\nOn a single A6000 GPU, the model converges within 24 hours (with a batch size of 180). For GPUs with less memory, you might need to reduce the batch size, enable gradient accumulation, or adjust hyperparameters to get good performance. You may also have to disable NCCL P2P with `export NCCL_P2P_DISABLE=1` if you run into issues.\n\n\n### Pruning Model Weights\n\nAs FROMAGe only consists of a few pretrained linear layers and the `[RET]` embedding, we can discard most of the pretrained weights to save on disk space. 
If you have trained a new model, and wish to do so, you can use `fromage/prune_model_ckpt.py` to prune the model weights. We used the same script to create the weights in the `fromage_model` directory.\n\n\n### Unit Tests\n\nYou can also test that the code runs locally by running the unit test with `pytest -x`. This runs a short training and evaluation job, with smaller models, to ensure the code works. The test should complete within approximately 90s.\n\nNote that because of exception catching (to ensure data errors don't terminate training), the test will silently fail and not terminate if there is an I/O error when reading data. Hence, we recommend running the Python command above for debugging data preprocessing.\n\n\n## Evaluation\n\nWe provide an evaluation script to reproduce our results on contextual image retrieval on Visual Storytelling (results of Table 1 of our paper). The script can be run from `evals/eval_vist_retrieval.py`. There is also an IPython notebook version (`VIST_Contextual_Image_Retrieval.ipynb`) in the same directory.\n\nSimilarly, we provide scripts to reproduce the text generation and image retrieval results on VisDial (presented in Table 2 of our paper). The script for VisDial text generation can be run from `evals/eval_visdial_generation.py` (or through the notebook version, `VisDial_Inference_IT2T_Generation.ipynb`). 
This reports the NDCG, MRR, and R@k scores for VisDial.\n\nThe results for image retrieval can be reproduced by running the `evals/eval_visdial_retrieval.py` script (or through the notebook version `VisDial_Inference_T2I_Retrieval.ipynb`), which reports R@k scores.\n\n\n## Gradio Demo\n\nYou can launch your own version of the Gradio demo locally by running `python demo/app.py`, or by duplicating the [HuggingFace space](https://huggingface.co/spaces/jykoh/fromage).\n\nCheck out other unofficial HuggingFace spaces for FROMAGe:\n- [alvanlii FROMAGe demo](https://huggingface.co/spaces/alvanlii/FROMAGe)\n\n\n## Citation\n\nIf you find this work useful, please consider citing:\n\n```\n@inproceedings{koh2023grounding,\n  title={Grounding Language Models to Images for Multimodal Inputs and Outputs},\n  author={Koh, Jing Yu and Salakhutdinov, Ruslan and Fried, Daniel},\n  booktitle={ICML},\n  year={2023}\n}\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkohjingyu%2Ffromage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkohjingyu%2Ffromage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkohjingyu%2Ffromage/lists"}