{"id":19988567,"url":"https://github.com/mapluisch/llava-cli-with-multiple-images","last_synced_at":"2025-06-20T09:35:38.641Z","repository":{"id":210082293,"uuid":"725678678","full_name":"mapluisch/LLaVA-CLI-with-multiple-images","owner":"mapluisch","description":"LLaVA inference with multiple images at once for cross-image analysis.","archived":false,"fork":false,"pushed_at":"2024-03-25T18:05:54.000Z","size":25147,"stargazers_count":50,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-12T22:06:38.295Z","etag":null,"topics":["image-concatenation","image-processing","inference","llama2","llama2-13b","llava","lmm","lmms","pillow","python","python3","pytorch","visual-question-answering","vqa"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mapluisch.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-30T16:40:05.000Z","updated_at":"2025-06-11T08:45:04.000Z","dependencies_parsed_at":"2024-03-25T19:37:23.052Z","dependency_job_id":null,"html_url":"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images","commit_stats":{"total_commits":18,"total_committers":1,"mean_commits":18.0,"dds":0.0,"last_synced_commit":"674538c0b5cc4edaa47cf92c469c49b9ff17ba37"},"previous_names":["mapluisch/llava-cli-with-multiple-images"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mapluisch/LLaVA-CLI-with-multiple-images","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mapluis
ch%2FLLaVA-CLI-with-multiple-images","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mapluisch%2FLLaVA-CLI-with-multiple-images/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mapluisch%2FLLaVA-CLI-with-multiple-images/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mapluisch%2FLLaVA-CLI-with-multiple-images/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mapluisch","download_url":"https://codeload.github.com/mapluisch/LLaVA-CLI-with-multiple-images/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mapluisch%2FLLaVA-CLI-with-multiple-images/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260919538,"owners_count":23082783,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["image-concatenation","image-processing","inference","llama2","llama2-13b","llava","lmm","lmms","pillow","python","python3","pytorch","visual-question-answering","vqa"],"created_at":"2024-11-13T04:43:27.117Z","updated_at":"2025-06-20T09:35:33.629Z","avatar_url":"https://github.com/mapluisch.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eLLaVA CLI with multiple images\u003c/h1\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/88cd732a-fe3f-4ba4-94a5-3935371c5d34\" alt=\"banner\" style=\"width:50%\"\u003e\n  \u003cp\u003e\u003cem\u003eLLaVA 
inference combining multiple images into one for streamlined processing and cross-image analysis.\u003c/em\u003e\u003c/p\u003e\n\u003c/div\u003e\n\n\n## Setup\n0. You should follow the LLaVA tutorial, so that you have the pretrained model / checkpoint shards ready.\n1. Then, `cd` into your LLaVA root directory.\n2. Clone my repo (and optionally remove the test-images):\n```\ngit clone https://github.com/mapluisch/LLaVA-CLI-with-multiple-images.git \u0026\u0026 \\\n(cd LLaVA-CLI-with-multiple-images \u0026\u0026 \\\nrm -rf test-images \u0026\u0026 \\\ncp -a . ../) \u0026\u0026 \\\nrm -rf LLaVA-CLI-with-multiple-images\n```\n\nThis command simply clones the repo, removes the test-images folder, copies all the files into the actual working directory (your LLaVA root directory), and finally removes the repo's directory.\n\n## Usage\nWhile in your LLaVA directory, first activate the conda environment via `conda activate llava`.\nThen, simply call my script via `python` or `python3` with your preferred arguments.\n```\npython llava-multi-images.py [ARGS]\n```\n\n### Arguments\n\nGiven that this project is based on LLaVA's `cli.py`, the following base arguments can be specified:\n```\n--model-path, default=\"liuhaotian/llava-v1.6-vicuna-13b\"\n--model-base, default=None\n--device, default=\"cuda\"\n--conv-mode, default=None\n--temperature, default=0.2\n--max-new-tokens, default=512\n--load-8bit, action=\"store_true\"\n--load-4bit, action=\"store_true\"\n--debug, action=\"store_true\"\n```\n\nAdditionally added args:\n```\n--images\n--save-image, action=\"store_true\"\n--concat-strategy, default=\"vertical\", choices=[\"vertical\", \"horizontal\", \"grid\"]\n--dist-images, default=20\n--grid-resolution, default=2560,1440\n```\n\nUsing `--images /some/img1.jpg /some/img2.jpg /some/img_n.jpg`, or alternatively `--images /some/img{1-n}.jpg` if the images have the same location and prefix, you can specify as many images as you want for inference. 
These input images are concatenated into a single image using PIL.\n\nUsing `--save-image`, the resulting concatenated image is saved in the LLaVA directory as `concat-image.jpg`.\n\nUsing `--concat-strategy`, you can specify the arrangement of the concatenated images (see [Examples](#examples)).\n\nUsing `--dist-images`, you can specify the spacing between images.\n\nUsing `--grid-resolution`, you can specify the output image's resolution when using grid placement.\n\n## Examples\n\n### Vertical Placement\n\n#### Input\nFor these two input images, `img1.jpg` \u0026 `img2.jpg`:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/66bda775-156b-404b-9171-41fa21e3c0c6\" height=\"250\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/51719905-0265-4333-8e5a-798598f7f79b\" height=\"250\"\u003e   \n\u003c/p\u003e\n\n#### Image Concatenation\n```\npython3 llava-multi-images.py --load-4bit --save-image --images test-images/img1.jpg test-images/img2.jpg --dist-images 250 --concat-strategy vertical\n```\n\nresults in this concatenated image being used for LLaVA's CLI conversation (empty space in the lower right due to resolution mismatch):\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/cdd3573f-b3e1-4e87-af8c-4266583691a4\" width=\"250\"\u003e\n\u003c/p\u003e\n\n#### LLaVA Response\nLLaVA can now answer questions about the merged image, understanding that it consists of separate images:\n\n\u003e USER: Analyze and explain which image is the most visually intriguing.\n\n\u003e ASSISTANT: Both images are visually striking, but they present different types of scenes and evoke different emotions.\n\u003e \n\u003e The top image captures a dramatic and serene landscape. 
It features a snow-covered mountain under a sky with the Aurora Borealis, also known as the Northern Lights. The aurora creates a vibrant, almost otherworldly glow that contrasts with the cool, white snow on the mountain. This image evokes a sense of tranquility and awe at the natural beauty of the scene.\n\u003e \n\u003e The bottom image shows a solitary tree in a field with dry grass. The tree stands out against the flat landscape, and the dry grass suggests a season like autumn or early spring. This image conveys a sense of solitude and the quiet beauty of nature.\n\u003e \n\u003e Both images are visually intriguing, but they appeal to different aspects of the viewer's experience. The top image is more dramatic and otherworldly, while the bottom image is more grounded and evokes a sense of solitude. The choice of which image is more visually intriguing would depend on personal preference and the emotions one wishes to evoke.\n\n\n### Horizontal Placement\n\n#### Input\nGiven these two input images, `img3.jpg` \u0026 `img4.jpg`:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/04b32564-78f1-4cbe-85cd-9eb113968e72\" height=\"250\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/26a852e2-583c-47ca-9d97-7ccac7e2596e\" height=\"250\"\u003e\n\u003c/p\u003e\n\n\n#### Image Concatenation\n```\npython3 llava-multi-images.py --load-4bit --save-image --images test-images/img3.jpg test-images/img4.jpg --dist-images 250 --concat-strategy horizontal\n```\n\nLLaVA receives this image (again, empty space in the lower right due to resolution mismatch):\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/01873626-9b5a-40b0-90b8-2eb96160f5d6\" width=\"500\"\u003e\n\u003c/p\u003e\n\n#### LLaVA Response\n...and can once again answer 
promptly:\n\n\u003e USER: Describe the main parts of each image.\n\n\u003e ASSISTANT: The image on the left shows a building with a sign that reads \"HOTEL\" prominently displayed. The building has a distinctive architectural style with a tower-like structure and a series of windows. The sky is overcast, and the lighting suggests it could be early morning or late afternoon.\n\u003e \n\u003e The image on the right depicts a road lined with trees on both sides, creating a tunnel-like effect. The road appears to be a highway or a major road, given the presence of a guardrail and the absence of any buildings or structures close to the road. The trees are lush and green, indicating that the photo was likely taken during a time of year when trees are in full leaf. The sky is clear and blue, suggesting a bright and sunny day.\n\n\n### Grid Placement\n#### Input\nFinally, given e.g. nine input images, in this case `img{5-13}.jpg`:\n\n\u003cp align=\"center\" style=\"display: flex; flex-wrap: nowrap; justify-content: center; align-items: center;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/e0bb537d-fb00-4a78-b60d-08ad9dbb412f\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/f908be2d-c57c-4894-acfa-49883575b9db\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/c87e66af-6b90-4225-91ec-e8646be80ec4\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/9ffd6d1f-e1e4-4f2a-8c9c-f7a47c4ccd99\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/8775d67d-9de6-4012-a121-34410d884f09\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg 
src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/4763442f-57df-4ea8-86d4-ea3c4e1c140a\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/3e394f62-4827-4069-912e-39d5b84d05e1\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/e70c11db-41c1-4cc3-bc99-27b155490f29\" height=\"100\" style=\"margin-right: 10px;\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/6eb64fb0-8d28-4c11-afb1-2529ea20002b\" height=\"100\"\u003e\n\u003c/p\u003e\n\n#### Image Concatenation\n```\npython3 llava-multi-images.py --load-4bit --save-image --images test-images/img{5-13}.jpg --dist-images 150 --concat-strategy grid --temperature 0.7\n```\n\nLLaVA receives this new image with a size of `--grid-resolution`, which defaults to 2560x1440, consisting of a grid of images:\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/mapluisch/LLaVA-CLI-with-multiple-images/assets/31780571/53a0de86-f777-431b-ae9d-f8b0830daf7d\" width=\"500\"\u003e\n\u003c/p\u003e\n\n#### LLaVA Response\nAgain, you can now pose questions to LLaVA, e.g., whether it can tell us a short rhyme about the different cars:\n\n\u003e USER: Write a rhyme about how many different cars you can see.\n\n\u003e ASSISTANT: In this collection, you'll see quite a few,\u003cbr\u003e\n\u003e A variety of wheels, with styles to pursue.\u003cbr\u003e\n\u003e Nine in total, each unique and grand,\u003cbr\u003e\n\u003e Ranging from sports to luxury and beyond.\u003cbr\u003e\n\u003e\n\u003e Some are sleek, some are chunky, some are fast,\u003cbr\u003e\n\u003e All with designs that'll leave you amazed.\u003cbr\u003e\n\u003e Brand names that shine like stars at night,\u003cbr\u003e\n\u003e Mercedes, Audi, and BMW, they all take 
flight.\u003cbr\u003e\n\u003e\n\u003e They're parked and zooming, on dirt and on tar,\u003cbr\u003e\n\u003e Each model a masterpiece, a work of art. \u003cbr\u003e\n\u003e From city streets to mountain paths, \u003cbr\u003e\n\u003e These cars are ready to conquer every stretch. \u003cbr\u003e\n\u003e\n\u003e So here's to the beauty and the power, \u003cbr\u003e\n\u003e To the speed and style, and every hour. \u003cbr\u003e\n\u003e For each model, a story, each one's a dream, \u003cbr\u003e\n\u003e In this collage, cars are the stars of the scene.\n\n## Disclaimer\nThis project is a prototype and serves as a basic example of using LLaVA CLI inference with multiple images at once. \u003cb\u003eI have not tested this extensively\u003c/b\u003e - I've tried both LLaVA v1.5 and v1.6 13b with 4-bit quantization. Results may and probably will vary depending on the model and quantization you choose. Feel free to create a PR.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmapluisch%2Fllava-cli-with-multiple-images","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmapluisch%2Fllava-cli-with-multiple-images","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmapluisch%2Fllava-cli-with-multiple-images/lists"}