{"id":18885535,"url":"https://github.com/togethercomputer/dragonfly","last_synced_at":"2025-08-10T06:05:56.230Z","repository":{"id":243120089,"uuid":"809224155","full_name":"togethercomputer/Dragonfly","owner":"togethercomputer","description":null,"archived":false,"fork":false,"pushed_at":"2024-10-17T21:54:13.000Z","size":17239,"stargazers_count":75,"open_issues_count":2,"forks_count":12,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-19T12:09:44.547Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/togethercomputer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-02T04:21:48.000Z","updated_at":"2025-04-06T17:28:30.000Z","dependencies_parsed_at":"2024-06-06T21:10:50.791Z","dependency_job_id":"adfd4519-4bdb-4462-bf58-bdd18cca33fd","html_url":"https://github.com/togethercomputer/Dragonfly","commit_stats":null,"previous_names":["togethercomputer/dragonfly"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/togethercomputer/Dragonfly","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2FDragonfly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2FDragonfly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2FDragonfly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2FDragonfly/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/togethercomputer","download_url":"https://codeload.github.com/togethercomputer/Dragonfly/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/togethercomputer%2FDragonfly/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269683182,"owners_count":24458628,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T07:19:39.187Z","updated_at":"2025-08-10T06:05:56.194Z","avatar_url":"https://github.com/togethercomputer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/dragonfly_icon.png\" alt=\"Dragonfly\" style=\"width: 150px; display: block; margin-left: auto; margin-right: auto;\" /\u003e\n  \u003ch1\u003eDragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models\u003c/h1\u003e\n\u003c/div\u003e\n\n## 🔥 News\n- **Note**: We updated our codebase and arXiv paper with an improved version of the Dragonfly architecture. 
If you still want to use the old version of the code, it remains available on the [`dragonfly-v1` branch](https://github.com/togethercomputer/Dragonfly/tree/dragonfly-v1).\n- [Our paper](https://arxiv.org/abs/2406.00977) is out on arXiv.\n- Our model checkpoints are out on Hugging Face 🤗 🚀: \n    - General: [`togethercomputer/Llama-3.1-8B-Dragonfly-v2`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-v2) \n    - Biomed: [`togethercomputer/Llama-3.1-8B-Dragonfly-Med-v2`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-Med-v2)\n\n\n## 📖 Introduction\n\n![Dragonfly framework](assets/model_overview.png)\n\nRecent advances in vision-language models (VLMs) have demonstrated the advantages of processing images at higher resolutions and utilizing multi-crop features to preserve native resolution details. However, despite these improvements, existing vision transformers (ViTs) still struggle to capture fine-grained details from less prominent objects, charts, and embedded text, limiting their effectiveness in certain tasks. In this paper, we go beyond recent high-resolution and multi-crop techniques by not only preserving the native resolution, but zooming in beyond it and extracting features from a large number of image sub-crops. This enhancement allows our model to better capture fine-grained details, overcoming the limitations of current ViTs. To manage the increased token count and computational complexity, we demonstrate that a simple mean-pooling aggregation over tokens is effective. Our model, Dragonfly, achieves competitive performance on general-domain tasks such as ScienceQA and AI2D, and excels in tasks requiring fine-grained image understanding, including TextVQA and ChartQA. Among models in the 7-8B parameter range, Dragonfly consistently ranks at the top across ten general-domain benchmarks, achieving the highest or second-highest scores in most cases, outperforming models that are significantly larger or trained on larger datasets. 
Our biomedical version, Dragonfly-Med, sets new benchmarks on several medical tasks, achieving 91.6% accuracy on SLAKE (compared to 84.8% for Med-Gemini), 67.1% token F1 score on Path-VQA (compared to 62.7% for Med-PaLM M), and attains state-of-the-art results across the majority of performance metrics. Overall, our work highlights the persistent challenge of engineering visual representations with fixed-resolution ViTs, and proposes a simple yet effective solution to address this issue and boost performance in both general and specialized domains. \n\n![Example Generations](assets/examples.png)\n\n\n# 📖 Table of Contents\n- [📖 Table of Contents](#-table-of-contents)\n  - [💿 Installation](#-installation)\n  - [🏁 Checkpoint](#-checkpoint)\n  - [🧠 Inference](#-inference)\n  - [📊 Dataset](#-dataset)\n  - [🏋️‍♂️ Training](#️️-training)\n    - [Stage 1](#stage-1)\n    - [Stage 2](#stage-2)\n  - [🏆 Credits](#-credits)\n  - [📚 BibTeX](#-bibtex)\n  - [🪪 License](#-license)\n\n\u003ca name=\"installation\"/\u003e\n\n## 💿 Installation\n\nCreate a conda environment and install necessary packages\n```bash\nconda env create -f environment.yml\nconda activate dragonfly_env\n```\n\nInstall flash attention\n```bash\npip install flash-attn --no-build-isolation\n```\n\nAs a final step, please run the following command. \n```bash\npip install --upgrade -e .\n```\n\n\u003ca name=\"checkpoint\"/\u003e\n\n## 🏁 Checkpoint\n\n*Note: These models are released under [Llama 3.1 Community License Agreement](LICENSE)*\n\nWe release two huggingface model checkpoints: [`togethercomputer/Llama-3.1-8B-Dragonfly-v2`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-v2) and [`togethercomputer/Llama-3.1-8B-Dragonfly-Med-v2`](https://huggingface.co/togethercomputer/Llama-3.1-8B-Dragonfly-Med-v2). Please follow the script [`test_dragonfly.py`](test_dragonfly.py) for more details. 
We provide a brief description of how to use them below.\n\n\u003ca name=\"inference\"/\u003e\n\n## 🧠 Inference\n\nIf you have successfully completed the [Installation](#installation) process, then you should be able to follow the steps below. \n\nWe provide two test examples inside [`assets`](assets). \n\nQuestion: What is so funny about this image?\n\n![Monalisa Dog](assets/monalisa_dog.jpg)\n\nLoad the necessary packages.\n```python\nimport torch\nfrom PIL import Image\nfrom transformers import AutoProcessor, AutoTokenizer\n\nfrom dragonfly.models.modeling_dragonfly import DragonflyForCausalLM\nfrom dragonfly.models.processing_dragonfly import DragonflyProcessor\nfrom pipeline.train.train_utils import random_seed\n```\n\nInstantiate the tokenizer, processor, and model. \n```python\ndevice = torch.device(\"cuda:0\")\n\ntokenizer = AutoTokenizer.from_pretrained(\"togethercomputer/Llama-3.1-8B-Dragonfly-v2\")\nclip_processor = AutoProcessor.from_pretrained(\"openai/clip-vit-large-patch14-336\")\nimage_processor = clip_processor.image_processor\nprocessor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style=\"llava-hd\")\n\nmodel = DragonflyForCausalLM.from_pretrained(\"togethercomputer/Llama-3.1-8B-Dragonfly-v2\")\nmodel = model.to(torch.bfloat16)\nmodel = model.to(device)\n```\n\nNow, let's load the image and process it.\n```python\nimage = Image.open(\"./assets/monalisa_dog.jpg\")\nimage = image.convert(\"RGB\")\nimages = [image]\n# images = [None] # if you do not want to pass any images\n\ntext_prompt = \"\u003c|start_header_id|\u003euser\u003c|end_header_id|\u003e\\n\\nWhat is so funny about this image?\u003c|eot_id|\u003e\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\\n\\n\"\n\ninputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors=\"pt\", is_generate=True)\ninputs = inputs.to(device)\n```\n\nFinally, generate the response from the model.\n```python\ntemperature = 
0\n\nwith torch.inference_mode():\n    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode(\"\u003c|eot_id|\u003e\"), do_sample=temperature \u003e 0, temperature=temperature, use_cache=True)\n\ngeneration_text = processor.batch_decode(generation_output, skip_special_tokens=False)\n```\n\nAn example response.\n```plaintext\nThe humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci. The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer 's expectations and familiarity with the original paintings. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a humerous effect that is likely to elicit laughter\u003c|eot_id|\u003e\n```\n\n\u003ca name=\"dataset\"/\u003e\n\n## 📊 Dataset\n\nWe will release it soon on HF hub. 
\n\n\u003ca name=\"training\"/\u003e\n\n## 🏋️‍♂️ Training\n\n*Note: This training recipe is specifically for our general domain model.*\n\nWe adopt a two-stage training process.\n\n### Stage 1\nIn this stage, we only train our projection layer, so that the model learns to map the embeddings from the vision encoder into the LLM space.\n\n```bash\nsh train_dragonfly_stage1.sh\n```\n\n### Stage 2\nIn this stage, we train our vision encoder, projection layer, and LLM jointly on image and text data.\n\n```bash\nsh train_dragonfly_stage2.sh\n```\n\nMake sure to update the paths inside the bash scripts to match your local file paths.\n\nFor both stages, each dataset record is formatted as follows:\n\n```json\n{\n  \"image_url\": \"\u003cpath_to_image\u003e\",\n  \"conversations\": \"\u003ctext_data_formatted\u003e\",\n  \"source\": \"\u003cdata_source\u003e\"\n}\n```\n\nThe conversation text follows the standard Llama 3 chat format: \n\n```plaintext\n\u003c|start_header_id|\u003euser\u003c|end_header_id|\u003e\n\nDescribe the content in the image.\u003c|eot_id|\u003e\u003c|start_header_id|\u003eassistant\u003c|end_header_id|\u003e\n```\n\n\n## 🏆 Credits\n\nWe would like to acknowledge the following resources that were instrumental in the development of Dragonfly:\n\n- [Meta Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B): We utilized the Llama 3 model as our foundational language model.\n- [CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336): Our vision backbone is the CLIP model from OpenAI. 
\n- Our codebase is built upon the following two codebases:\n  - [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter)\n  - [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD)\n\n\u003ca name=\"bibtex\"/\u003e\n\n## 📚 BibTeX\n\n```bibtex\n@misc{thapa2024dragonfly,\n      title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, \n      author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},\n      year={2024},\n      eprint={2406.00977},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\u003ca name=\"license\"/\u003e\n\n## 🪪 License\n\n[META LLAMA 3](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftogethercomputer%2Fdragonfly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftogethercomputer%2Fdragonfly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftogethercomputer%2Fdragonfly/lists"}