{"id":20161907,"url":"https://github.com/ailab-cvc/seed-x","last_synced_at":"2025-05-15T23:06:43.086Z","repository":{"id":230532842,"uuid":"779567940","full_name":"AILab-CVC/SEED-X","owner":"AILab-CVC","description":"Multimodal Models in Real World","archived":false,"fork":false,"pushed_at":"2025-02-24T11:57:44.000Z","size":54758,"stargazers_count":493,"open_issues_count":25,"forks_count":21,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-04-25T03:37:16.514Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AILab-CVC.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"License_Seed-X.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-03-30T07:03:38.000Z","updated_at":"2025-04-22T12:10:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"7f999696-a044-4ec2-8672-7f3c92ac8f83","html_url":"https://github.com/AILab-CVC/SEED-X","commit_stats":null,"previous_names":["ailab-cvc/seed-x"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FSEED-X","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FSEED-X/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FSEED-X/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AILab-CVC%2FSEED-X/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AILab-CVC","download_url":"https://codeload.github.com
/AILab-CVC/SEED-X/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254436944,"owners_count":22070946,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T00:21:46.347Z","updated_at":"2025-05-15T23:06:38.054Z","avatar_url":"https://github.com/AILab-CVC.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SEED-X\n[![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/abs/2404.14396)\n[![Demo](https://img.shields.io/badge/ARC-Demo-blue)](https://arc.tencent.com/en/ai-demos/multimodal)\n[![Static Badge](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/AILab-CVC/SEED-X-17B)\n[![arXiv](https://img.shields.io/badge/arXiv-2405.04007-b31b1b.svg)](https://arxiv.org/abs/2405.04007)\n[![Static Badge](https://img.shields.io/badge/Dataset-Huggingface-yellow)](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit)\n\nWe introduce [SEED-X](https://arxiv.org/abs/2404.14396), a unified and versatile foundation model, which can serve as various multimodal AI assistants **in the real world** after different instruction tuning, capable of responding to a variety of user needs through unifying **multi-granularity comprehension and generation**.\n\nAll models, instruction tuning code and inference code are released! 
\n\n## News\n**2024-07-12** :hugs: We release [SEED-Story](https://github.com/TencentARC/SEED-Story), an MLLM capable of generating multimodal long stories based on the pre-trained SEED-X (an earlier version). We also release StoryStream, a large-scale dataset designed for training and benchmarking multimodal story generation.\n\n**2024-05-21** :hugs: A new online [demo](https://arc.tencent.com/en/ai-demos/multimodal) of the general instruction-tuned model SEED-X-I is available, with faster inference speed than the demo using Zero GPU on huggingface.\n\n**2024-05-03** :hugs: We release **3.7M image editing data** [SEED-Data-Edit](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit), which includes (1) Large-scale high-quality editing data produced by an **automatic pipeline**, (2) **Real-world scenario data** scraped from the internet that more accurately reflects user image editing intentions, (3) High-precision **multi-turn** editing data annotated by Photoshop experts.\n\n**2024-05-02** :hugs: We release the **training code** for instruction tuning from the pre-trained foundation model **SEED-X**. Our codebase supports (a) large-scale multi-node training with deepspeed zero-2 and zero-3, (b) highly-efficient multiple training datapipes. To the best of our knowledge, our SEED series is the first open-source work on training MLLM that unifies multimodal comprehension and generation.\n\n**2024-04-27** :hugs: We release the [models](https://huggingface.co/AILab-CVC/SEED-X-17B) including the pre-trained foundation model **SEED-X**, the general instruction-tuned model **SEED-X-I**, the editing model **SEED-X-Edit**, and our de-tokenizer, which can generate realistic images from ViT features (w/o or w/ a condition image).\n\n**2024-04-22** :hugs: We release an online [gradio demo](https://huggingface.co/spaces/tttoaster/SEED-X-17B) of a general instruction-tuned model SEED-X-I. 
SEED-X-I can follow multimodal instruction (including images with dynamic resolutions) and make responses with images, texts and bounding boxes in multi-turn conversation. SEED-X-I **does not support image manipulation**. If you want to experience SEED-X-Edit for high-precision image editing, the inference code and model will be released soon.\n\n## TODOs\n- [x] Release the multimodal foundation model SEED-X.\n- [x] Release the instruction-tuned model SEED-X-Edit for high-precision image editing.\n- [x] Release 3.7M in-house image editing data.\n- [x] Release training code for instruction tuning.\n\n## Introduction\n![image](https://github.com/AILab-CVC/SEED-X/blob/main/demos/teaser.jpg?raw=true)\n\n![image](https://github.com/AILab-CVC/SEED-X/blob/main/demos/case_example.jpg?raw=true)\nThe introduced SEED-X, a unified and versatile foundation model, can serve as various multimodal AI assistants **in the real world** after different instruction tuning, capable of responding to a variety of user needs through unifying\n**multi-granularity comprehension and generation**. Our instruction tuned models can\nfunction as an interactive designer, generating images without descriptive captions while illustrating\ncreative intent, and showcasing visualizations of modified images based on user’s intent. They can act\nas knowledgeable personal assistants, comprehending images of arbitrary sizes and offering relevant\nsuggestions in multi-turn conversations.\n\n![image](https://github.com/AILab-CVC/SEED-X/blob/main/demos/combination_seed_x.jpg?raw=true)\nSEED-X is able to take multiple images as input, and follow its aesthetic vision and transform mediocre (or even low-resolution and low-quality) photos into something more impressive (In the example above, some of the input images have a resolution lower than 200). 
Feel free to try it by yourself in [demo](https://arc.tencent.com/en/ai-demos/multimodal) (Set \"Force Image Generation\" as True).\n\n## [SEED-Data-Edit](https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit)\n![image](https://github.com/AILab-CVC/SEED-X/blob/main/demos/SEED-Data-Edit.jpg?raw=true)\nData examples of instruction-guided image editing in SEED-Data-Edit, which includes (1)\nHigh-quality editing data produced by an **automatic pipeline** (first row), (2) **Real-world scenario\ndata** scraped from the internet that more accurately reflects user image editing intentions (second\nrow), (3) High-precision **multi-turn** editing data annotated by Photoshop experts (third row).\n\n## [SEED-Story](https://github.com/TencentARC/SEED-Story)\nThe introduced SEED-Story, powered by SEED-X, is capable of **generating multimodal long stories** from user-provided images and texts as the beginning of the story. The generated story consists of rich and coherent narrative texts, along with images that are consistent in characters and style. 
The story can span up to 25 multimodal sequences, even though we only use a maximum of 10 sequences during training.\n\n![image](https://github.com/TencentARC/SEED-Story/blob/master/assets/teaser.jpg)\n\n## Usage\n\n### Dependencies\n- Python \u003e= 3.8 (we recommend using [Anaconda](https://www.anaconda.com/download/#linux))\n- [PyTorch \u003e=2.0.1](https://pytorch.org/)\n- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)\n\n### Installation\nClone the repo and install dependent packages\n\n  ```bash\n  git clone https://github.com/AILab-CVC/SEED-X.git\n  cd SEED-X\n  pip install -r requirements.txt\n  ```\n\n### Model Weights\nWe release the pretrained De-Tokenizer, the pre-trained foundation model **SEED-X**, the general instruction-tuned model **SEED-X-I**, the editing model **SEED-X-Edit** in [SEED-X-17B Hugging Face](https://huggingface.co/AILab-CVC/SEED-X-17B).\n\nYou can also download them separately as below,\n- Check the SEED-X de-tokenizer weights in [AILab-CVC/seed-x-17b-de-tokenizer](https://huggingface.co/AILab-CVC/seed-x-17b-de-tokenizer)\n- Check the pre-trained foundation model **SEED-X** weights in [AILab-CVC/seed-x-17b-pretrain](https://huggingface.co/AILab-CVC/seed-x-17b-pretrain)\n- Check the general instruction-tuned model **SEED-X-I** weights in [AILab-CVC/seed-x-17b-instruct](https://huggingface.co/AILab-CVC/seed-x-17b-instruct)\n- Check the editing model **SEED-X-Edit** weights in [AILab-CVC/seed-x-17b-edit](https://huggingface.co/AILab-CVC/seed-x-17b-edit)\n\nPlease download the checkpoints and save them under the folder `./pretrained`. For example, `./pretrained/seed_x`.\n\nYou also need to download [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat), and save them under the folder `./pretrained`. 
Please use the following script to extract the weights of visual encoder in Qwen-VL-Chat.\n```bash\npython3 src/tools/reload_qwen_vit.py\n```\n### Inference\n#### Inference with SEED-X De-tokenizer\n```bash\n# For image reconstruction with ViT image features\npython3 src/inference/eval_seed_x_detokenizer.py\n# For image reconstruction with ViT image features and conditional image\npython3 src/inference/eval_seed_x_detokenizer_with_condition.py\n```\n\n#### Inference with pre-trained model SEED-X\n```bash\n# For image comprehension and detection\npython3 src/inference/eval_img2text_seed_x.py\n# For image generation\npython3 src/inference/eval_text2img_seed_x.py\n```\n\n#### Inference with the general instruction-tuned model SEED-X-I\n```bash\n# For image comprehension and detection\npython3 src/inference/eval_img2text_seed_x_i.py\n# For image generation\npython3 src/inference/eval_text2img_seed_x_i.py\n```\n\n#### Inference with the editing model SEED-X-Edit\n```bash\n# For image editing\npython3 src/inference/eval_img2edit_seed_x_edit.py\n```\n\n### Instruction Tuning\n#### Training\n1. Prepare the pretrained models including the pre-trained foundation model **SEED-X** and the visual encoder of Qwen-VL-Chat (See Model Weights).\n2. Prepare the instruction tuning data. 
For example, for \"build_llava_jsonl_datapipes\" dataloader, each folder stores a number of jsonl files, each jsonl file contains 10K pieces of content, with an example of the content as follows:\n```bash\n{\"image\": \"coco/train2017/000000033471.jpg\", \"data\": [\"What are the colors of the bus in the image?\", \"The bus in the image is white and red.\", \"What feature can be seen on the back of the bus?\", \"The back of the bus features an advertisement.\", \"Is the bus driving down the street or pulled off to the side?\", \"The bus is driving down the street, which is crowded with people and other vehicles.\"]}\n```\n\nFor \"build_caption_datapipes_with_pixels\" dataloader, each folder stores a number of .tar files and reads image-text pairs in the form of webdataset.\n\nFor \"build_single_turn_edit_datapipes\" dataloader, each folder stores a number of jsonl files, each jsonl file contains 10K pieces of content, with an example of the content as follows:\n```bash\n{\"source_image\": \"source_images/f6f4d0669694df5b.jpg\", \"target_image\": \"target_images/f6f4d0669694df5b.jpg\", \"instruction\": \"Erase the car that is parked in front of the Roebuck building.\"}\n```\n3. Run the following script.\n\n```bash\n# For general instruction tuning for multimodal comprehension and generation\nsh scripts/train_seed_x_sft_comp_gen.sh\n```\n\n```bash\n# For training language-guided image editing\nsh scripts/train_seed_x_sft_edit.sh\n```\n#### Inference with your own model\n1. Obtain \"pytorch_model.bin\" with the following script.\n```bash\ncd train_output/seed_x_sft_comp_gen/checkpoint-xxxx\npython3 zero_to_fp32.py . pytorch_model.bin\n```\n2. Change \"pretrained_model_path\" in \"configs/clm_models/agent_seed_x.yaml\" with the new checkpoint. For example,\n```bash\npretrained_model_path: train_output/seed_x_sft_comp_gen/checkpoint-4000/pytorch_model.bin\n```\n3. 
Change the \"llm_cfg_path\" and \"agent_cfg_path\" in the inference script (See below), which will automatically load the trained LoRA weights onto the pretrained model SEED-X.\n```bash\nllm_cfg_path = 'configs/clm_models/llm_seed_x_lora.yaml'\nagent_cfg_path = 'configs/clm_models/agent_seed_x.yaml'\n```\n4. Run the inference script,\n```bash\n# For image comprehension\npython3 src/inference/eval_img2text_seed_x_i.py\n# For image generation\npython3 src/inference/eval_text2img_seed_x_i.py\n# For image editing\npython3 src/inference/eval_img2edit_seed_x_edit.py\n```\n\n\n## Citation\nIf you find the work helpful, please consider citing:\n```bash\n@article{ge2024seed,\n  title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},\n  author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},\n  journal={arXiv preprint arXiv:2404.14396},\n  year={2024}\n}\n```\n\n\n## License\n`SEED` is licensed under the Apache License Version 2.0 except for the third-party components listed in [License](License_Seed-X.txt). \n\nDuring training SEED-X, we freeze the original parameters of LLaMA2 and optimize the LoRA module.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Failab-cvc%2Fseed-x","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Failab-cvc%2Fseed-x","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Failab-cvc%2Fseed-x/lists"}