{"id":28206269,"url":"https://github.com/bytedance-seed/sail","last_synced_at":"2025-07-21T08:03:33.958Z","repository":{"id":292077167,"uuid":"968924125","full_name":"ByteDance-Seed/SAIL","owner":"ByteDance-Seed","description":"Implementation for \"The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer\"","archived":false,"fork":false,"pushed_at":"2025-06-28T04:13:22.000Z","size":338,"stargazers_count":45,"open_issues_count":1,"forks_count":3,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-28T05:23:34.105Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ByteDance-Seed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-19T02:04:51.000Z","updated_at":"2025-06-28T04:13:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"d9687bc8-2ac8-410c-a9e4-70220b52cb55","html_url":"https://github.com/ByteDance-Seed/SAIL","commit_stats":null,"previous_names":["bytedance-seed/sail"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ByteDance-Seed/SAIL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FSAIL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FSAIL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FSAIL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDanc
e-Seed%2FSAIL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ByteDance-Seed","download_url":"https://codeload.github.com/ByteDance-Seed/SAIL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ByteDance-Seed%2FSAIL/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266263056,"owners_count":23901355,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-17T10:09:24.002Z","updated_at":"2025-07-21T08:03:33.949Z","avatar_url":"https://github.com/ByteDance-Seed.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n 👋 Hi, everyone! 
\n    \u003cbr\u003e\n    We are \u003cb\u003eByteDance Seed team.\u003c/b\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n  You can get to know us better through the following channels👇\n  \u003cbr\u003e\n  \u003ca href=\"https://team.doubao.com/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Website-%231e37ff?style=for-the-badge\u0026logo=bytedance\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/user-attachments/assets/93481cda-a7f3-47f3-b333-fe6b3da86b78\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/WeChat-07C160?style=for-the-badge\u0026logo=wechat\u0026logoColor=white\"\u003e\u003c/a\u003e\n \u003ca href=\"https://www.xiaohongshu.com/user/profile/668e7e15000000000303157d?xsec_token=ABl2-aqekpytY6A8TuxjrwnZskU-6BsMRE_ufQQaSAvjc%3D\u0026xsec_source=pc_search\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/Xiaohongshu-%23FF2442?style=for-the-badge\u0026logo=xiaohongshu\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.zhihu.com/org/dou-bao-da-mo-xing-tuan-dui/\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/zhihu-%230084FF?style=for-the-badge\u0026logo=zhihu\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n![seed logo](https://github.com/user-attachments/assets/c42e675e-497c-4508-8bb9-093ad4d1f216)\n\n# The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer (SAIL)\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/bytedance/flux\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/SAIL-Project Page-yellow\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://arxiv.org/abs/2504.10462\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/SAIL-Tech Report-red\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/ByteDance-Seed/SAIL-7B\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/SAIL-Hugging Face-orange\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"LICENSE\"\u003e\n    \u003cimg src=\"https://img.shields.io/badge/License-Apache2.0-blue\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nWe are extremely delighted to release SAIL, a **S**ingle tr**A**nsformer model for v**I**sion and **L**anguage. SAIL is a unified multimodal large language model (MLLM) that seamlessly integrates raw pixel encoding and language decoding within a single architecture. **Without relying on pre-trained vision encoders**, SAIL achieves competitive performance across a wide range of vision-language tasks and demonstrates strong visual representation, rivaling state-of-the-art vision models in tasks like semantic segmentation.\n\n## Model \u0026 Micro Design\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/sail_model.jpg\" alt=\"model\" style=\"height: 300; width: auto;\"\u003e\n\u003c/div\u003e\n\n## An Overview of Comparison\n(A) Data scaling curve for Modular Multimodal Large Language Model (MLLM) and SAIL, our Single Transformer-based MLLM. As pretraining data increases, SAIL shows a sharper performance gain, demonstrating its superior data scalability.\n(B) Comparison to existing Single Transformer-based MLLMs: SAIL pushes the performance boundaries on both vision tasks and vision-language tasks.\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/perf_cmp.jpg\" alt=\"cmp\" style=\"height: 250; width: auto;\"\u003e\n\u003c/div\u003e\n\n# News\n- [2025/06/26] SAIL has been accepted to ICCV 2025.\n- [2025/04/02]🔥We release the SAIL models and technical report.\n\n\n# Getting started\n### Preparation\n```bash\npip3 install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1\npip3 install einops transformers==4.42.0\n```\n\n### Example\nFirst, clone the SAIL repo:\n```bash\ngit clone https://github.com/ByteDance-Seed/SAIL\ncd SAIL\n```\n\nThen simply run `example.py`:\n```bash\npython3 example.py\n```\n\nor refer to the following code block:\n```python\nfrom example import *\n\nNON_VISION_TOKEN_ID = -1\nPATH_TO_MODEL = 
\"path to model\"\nPATH_TO_TOKENIZER = \"path to tokenizer\"\nIMAGE_PATH = \"path to image\"\nPROMPT = \"content of prompt\"\n\nmodel, tokenizer = get_transformer_and_tokenizer(\n    PATH_TO_MODEL,\n    PATH_TO_TOKENIZER\n)\nmodel = model.cuda()\n\nimage_processor = lambda x: convert_image_base64_to_patches(load_image_to_base64(x), model.config.vision_patch_size, fix_res_size=None)\nprompt_inp = tokenizer.bos_token + '[INST] {} [/INST]'.format(PROMPT)\nimage_path = IMAGE_PATH   \nimage_patches = image_processor(image_path)\nnh, nw = image_patches.shape[:2]\nimage_tokens, image_tokens_len = prepare_image_textual_seq_norowsep(nh, nw, tokenizer, add_cls=False)\n\ninput_tokens = image_tokens + prompt_inp\ninput_ids = tokenizer(input_tokens, add_special_tokens=False, return_tensors=\"pt\").input_ids\nvision_patch_indices = torch.full_like(input_ids, fill_value=NON_VISION_TOKEN_ID)\nvision_patches = image_patches.view(nh * nw, -1)\nassert (input_ids == tokenizer.vis_patch_tok_id).sum() == vision_patches.size(0)\nassert (input_ids \u003e= tokenizer.vis_beg_tok_id).sum() == image_tokens_len\n\nvision_patch_indices[input_ids==tokenizer.vis_patch_tok_id] = torch.arange(vision_patches.size(0))\nattention_mask = create_single_prefix_mask(image_tokens_len, input_ids.size(-1)).unsqueeze(0).unsqueeze(0)\nposition_ids = generate_mm_pos_ids_singleit(input_ids.squeeze(0).numpy().tolist(), tokenizer.vis_patch_tok_id, nh, nw).unsqueeze(1)\n\ninput_ids = input_ids.long().cuda()\nvision_patch_indices = vision_patch_indices.long().cuda()\nvision_patches = vision_patches.to(torch.bfloat16).cuda()\nposition_ids = position_ids.long().cuda()\nattention_mask = attention_mask.cuda()\n\npadding_attention_mask = torch.ones_like(input_ids).cuda()\n\ninputs = dict(\n    input_ids = input_ids,\n    position_ids = position_ids,\n    attention_mask = padding_attention_mask,\n    vision_patches = vision_patches,\n    vision_patch_indices = vision_patch_indices,\n    use_cache=True\n)\n\ncached_inputs 
= dict(\n    input_ids = input_ids[:, :image_tokens_len],\n    position_ids = position_ids[:, :, :image_tokens_len],\n    attention_mask = attention_mask[:,:, :image_tokens_len, :image_tokens_len],\n    vision_patches = vision_patches,\n    vision_patch_indices = vision_patch_indices[:, :image_tokens_len],\n    use_cache=True\n)\n\nprefix_cache = DynamicCache()\nwith torch.no_grad():\n    prefix_cache = model.forward(**cached_inputs, past_key_values=prefix_cache).past_key_values\n\npast_key_values = copy.deepcopy(prefix_cache)\ngenerate_config = GenerationConfig(\n    max_new_tokens=1024,\n    return_dict_in_generate=True,\n    output_attentions=False\n)\ngenerated = model.generate(\n    **inputs,\n    past_key_values=past_key_values,\n    generation_config=generate_config\n)\ngenerated_ids = generated['sequences'][:, input_ids.size(1):]\nresponse = tokenizer.batch_decode(\n    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False\n)[0]\n\nprint(f\"\\nModel Response: ===\\n{response}\\n===\")\n```\n\n# Features\n- SAIL as an MLLM: check out our model on [Hugging Face](https://huggingface.co/ByteDance-Seed/SAIL-7B).\n- SAIL as a Vision Encoder: check out our model on [Hugging Face](https://huggingface.co/models/ByteDance-Seed/SAIL-7B-PT).\n- Explore [Pixel SAIL](https://github.com/magic-research/Sa2VA), which uses SAIL for pixel-grounded understanding.\n\n\n# Acknowledgement\nPart of our code is built upon [SOLO](https://github.com/Yangyi-Chen/SOLO/tree/main).\nWe thank the authors for their impressive contribution.\n\n# License\nThis project is licensed under Apache 2.0. See the [LICENSE](LICENSE) file for details.\n\n# Citation\nIf you find SAIL useful for your research and applications, feel free to give us a star ⭐ or cite us using:\n\n```bibtex\n@article{lei2025sail,\n  title={The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer},\n  author={Lei, Weixian and Wang, Jiacong and Wang, Haochen and Li, Xiangtai and Liew, Jun Hao and Feng, Jiashi and Huang, Zilong},\n  journal={arXiv preprint arXiv:2504.10462},\n  year={2025}\n}\n```\n\n# About [ByteDance Seed Team](https://team.doubao.com/)\n\nFounded in 2023, ByteDance Seed Team is dedicated to crafting the industry's most advanced AI foundation models. The team aspires to become a world-class research team and make significant contributions to the advancement of science and society.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance-seed%2Fsail","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbytedance-seed%2Fsail","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbytedance-seed%2Fsail/lists"}