{"id":17187496,"url":"https://github.com/davidmchan/caption-by-committee","last_synced_at":"2025-04-13T19:08:00.433Z","repository":{"id":67163703,"uuid":"577991958","full_name":"DavidMChan/caption-by-committee","owner":"DavidMChan","description":"Using LLMs and pre-trained caption models for super-human performance on image captioning.","archived":false,"fork":false,"pushed_at":"2023-10-13T21:43:38.000Z","size":7788,"stargazers_count":40,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T09:46:10.945Z","etag":null,"topics":["ai","captioning","chatgpt","deep-learning","image","machine-learning","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DavidMChan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-12-14T01:59:10.000Z","updated_at":"2024-07-25T13:48:25.000Z","dependencies_parsed_at":"2023-10-14T22:32:21.192Z","dependency_job_id":null,"html_url":"https://github.com/DavidMChan/caption-by-committee","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidMChan%2Fcaption-by-committee","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidMChan%2Fcaption-by-committee/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidMChan%2Fcaption-by-committee/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DavidMChan%2Fcaption-by-committee/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/DavidMChan","download_url":"https://codeload.github.com/DavidMChan/caption-by-committee/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248766733,"owners_count":21158301,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","captioning","chatgpt","deep-learning","image","machine-learning","python"],"created_at":"2024-10-15T01:06:32.102Z","updated_at":"2025-04-13T19:08:00.401Z","avatar_url":"https://github.com/DavidMChan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IC3: Image Captioning by Committee Consensus\n\n![Method overview diagram](https://raw.githubusercontent.com/DavidMChan/caption-by-committee/main/assets/method-v2.png)\n\nThis is the implementation of the paper [IC3: Image Captioning by Committee Consensus](https://arxiv.org/abs/2302.01328).\n\n## Installation\n\nThe library can be installed with:\n\n```bash\n# Install LAVIS for BLIP/BLIP2 support\n$ pip install salesforce-lavis\n# Install the local directory with setuptools\n$ pip install .\n# For the metrics, we need to download and install a spaCy model\n$ python -m spacy download en_core_web_lg\n```\n\nNext, set up environment variables with API keys for any API-based models you want to use:\n\n```bash\n# For OpenAI-based models, specify the following keys:\nexport OPENAI_API_KEY=\u003capi key\u003e\nexport OPENAI_API_ORG=\u003corg\u003e\n\n# For Huggingface Inference engine models, specify the following keys:\nexport HUGGINGFACE_API_KEY=\u003capi 
key\u003e\n```\n\nThe repository can be tested by running `cbc caption test/test_image.jpg`, which should produce a sample caption using\nthe OFA and GPT-2 models.\n\n## Running the model using the CLI\n\nTo run the model using the CLI, you can use:\n\n```bash\n$ cbc caption \u003cimage path\u003e\n```\n\nIf you have a full dataset of examples, you can use:\n\n```bash\n$ cbc evaluate-dataset \u003cdataset json\u003e\n```\n\nwhere the JSON file (minimally) looks like:\n\n```json\n[\n    {\n        \"references\": [\"List\", \"of\", \"references\"],\n        \"image_path\": \"Relative path to image\"\n    },\n    ...\n]\n```\n\nFor more details on these commands, see `cbc caption --help` and `cbc evaluate-dataset --help`.\n\n## Using the Python API\n\nTo use the Python API, see the following minimal example using GPT-3 and OFA:\n\n```python\nfrom PIL import Image\n\nfrom cbc.caption import OFACaptionEngine\nfrom cbc.caption_by_committee import caption_by_committee\nfrom cbc.lm import GPT3Davinci3\n\ndef run_caption() -\u003e None:\n    # Load the image\n    image = Image.open(\"coco_test_images/COCO_val2014_000000165547.jpg\").convert(\"RGB\")\n\n    # Construct a captioning engine (see: cbc/caption/__init__.py for available engines)\n    caption_engine = OFACaptionEngine(device=\"cuda:1\")\n\n    # Construct a language model engine (see cbc/lm/__init__.py for available engines)\n    lm_engine = GPT3Davinci3()\n\n    # Generate the caption\n    caption = caption_by_committee(\n        image,\n        caption_engine=caption_engine,\n        lm_engine=lm_engine,\n        caption_engine_temperature=1.0,\n        n_captions=15,\n    )\n\n    print(caption)\n\n```\n\n## Available Captioning/LM Engines\n\nThe following captioning and language models are available for use with this library:\n\n### Captioning\n\nBLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation\n\n-   \"blip\"\n-   \"blip-base\"\n\nBLIP-2: Bootstrapping Language-Image 
Pre-training with Frozen Image Encoders and Large Language Models\n\n-   \"blip2\"\n-   \"blip2-base\"\n-   \"blip2-t5\"\n-   \"blip2-t5-xl\"\n\nOFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework\n\n-   \"ofa\"\n\nSocratic Models: Composing Zero-Shot Multimodal Reasoning with Language\n\n-   \"socratic-models\"\n\nChatCaptioner: ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions ([url](https://github.com/Vision-CAIR/ChatCaptioner/tree/main/ChatCaptioner))\n- \"chatcaptioner\"\n\n\n\n### Language Modeling\n\nOpenAI (Requires setting the `OPENAI_API_KEY` and `OPENAI_API_ORG` environment variables):\n\n-   \"gpt4\" (GPT-4 Chat model)\n-   \"gpt432k\" (GPT-4 32k Context Chat model)\n-   \"chatgpt\" (GPT-3.5-Turbo Chat model)\n-   \"gpt3_davinci3\" (GPT-3 Davinci v3 Completion model)\n-   \"gpt3_davinci2\" (GPT-3 Davinci v2 Completion model)\n-   \"gpt3_curie\" (GPT-3 Curie Completion model)\n-   \"gpt3_babbage\" (GPT-3 Babbage Completion model)\n-   \"gpt3_ada\" (GPT-3 Ada Completion model)\n\nHuggingface (Requires setting the `HUGGINGFACE_API_KEY` environment variable):\n\n-   \"bloom\" (Bloom 175B model)\n-   \"opt\" (OPT 66B model)\n\nHuggingface (No API key required):\n\n-   \"gpt2\" (GPT-2 model)\n-   \"gpt2_med\" (GPT-2 Medium model)\n-   \"gpt2_lg\" (GPT-2 Large model)\n-   \"gpt2_xl\" (GPT-2 XL model)\n-   \"distilgpt2\" (DistilGPT-2 model)\n-   \"gpt_neo_125m\" (GPT-Neo 125M model)\n-   \"gpt_neo_1b\" (GPT-Neo 1.3B model)\n-   \"gpt_neo_2b\" (GPT-Neo 2.7B model)\n-   \"gpt_j_6b\" (GPT-J 6B model)\n\nSummary Models:\n\n-   \"t5_small\" (T5 Small model)\n-   \"pegasus\" (Pegasus model)\n\nLLaMA: Open and Efficient Foundation Language Models (Requires setting the `HUGGINGFACE_LLAMA_WEIGHTS_ROOT` environment variable and preprocessing the weights according to [this url](https://huggingface.co/docs/transformers/main/model_doc/llama).):\n\n-   \"llama_7B\" (LLaMA 7B 
model)\n-   \"llama_13B\" (LLaMA 13B model)\n-   \"llama_30B\" (LLaMA 30B model)\n-   \"llama_65B\" (LLaMA 65B model)\n\nAlpaca: A Strong, Replicable Instruction-Following Model (Requires setting the `HUGGINGFACE_ALPACA_WEIGHTS_ROOT` environment variable and preprocessing the weights according to [this url](https://github.com/tatsu-lab/stanford_alpaca#recovering-alpaca-weights).):\n\n-   \"alpaca_7B\" (Alpaca 7B)\n\nKoala: A Dialogue Model for Academic Research (Requires setting the `HUGGINGFACE_KOALA_WEIGHTS_ROOT` environment variable and preprocessing the weights according to [this url](https://github.com/young-geng/EasyLM/blob/main/docs/koala.md).):\n\n-   \"koala_7B\" (Koala 7B)\n-   \"koala_13B_v1\" (Koala 13B V1)\n-   \"koala_13B_v2\" (Koala 13B V2)\n\nVicuna: An Open Chatbot Impressing GPT-4 (Requires setting the `HUGGINGFACE_VICUNA_WEIGHTS_ROOT` environment variable and preprocessing the weights according to [this url](https://github.com/lm-sys/FastChat#vicuna-weights).):\n\n-   \"vicuna_7B\" (Vicuna 7B)\n-   \"vicuna_13B\" (Vicuna 
13B)\n\nStableLM: Stability AI Language Models\n\n-   \"stable_lm_3B\" (StableLM Chat Tuned 3B model)\n-   \"stable_lm_7B\" (StableLM Chat Tuned 7B model)\n-   \"stable_lm_base_3B\" (StableLM Completion 3B model)\n-   \"stable_lm_base_7B\" (StableLM Completion 7B model)\n\nBard (Requires setting the `GOOGLE_BARD_SESSION_ID` environment variable. To get its value, go to [https://bard.google.com/](https://bard.google.com/), log in, press F12 to open the console, go to the \"Application\" tab, then \"Cookies\", and copy the value of the \"\\_\\_Secure-1PSID\" cookie.):\n\n-   \"bard\" (Bard model)\n\nPaLM (Requires the [Vertex AI client libraries](https://cloud.google.com/vertex-ai/docs/start/client-libraries) from Google and a GCP project set up with the Vertex AI API enabled):\n\n-   \"palm\" (PaLM model)\n\nClaude (Requires setting the `ANTHROPIC_API_KEY` environment variable):\n\n- \"claude\" (claude-1 model)\n- \"claude_100k\" (claude-100k-1 model)\n- \"calude_instant\" (claude-instant-1 model)\n- \"claude_instant_100k\" (claude-100k-instant-1 model)\n\n\n## Running the demos\n\nTo load the demos, install the library, and then use streamlit to run the demo:\n\n_Single Image End-to-End Demo:_ `streamlit run demos/single_image.py`\n\n## References\n\nIf you found this work useful, cite us:\n\n```\n@misc{\n  https://doi.org/10.48550/arxiv.2302.01328,\n  doi = {10.48550/ARXIV.2302.01328},\n  url = {https://arxiv.org/abs/2302.01328},\n  author = {Chan, David M. and Myers, Austin and Vijayanarasimhan, Sudheendra and Ross, David A. 
and Canny, John},\n  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},\n  title = {IC3: Image Captioning by Committee Consensus},\n  publisher = {arXiv},\n  year = {2023},\n  copyright = {arXiv.org perpetual, non-exclusive license}\n}\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidmchan%2Fcaption-by-committee","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidmchan%2Fcaption-by-committee","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidmchan%2Fcaption-by-committee/lists"}