{"id":19279781,"url":"https://github.com/showlab/visorgpt","last_synced_at":"2025-04-22T00:33:01.068Z","repository":{"id":168567397,"uuid":"643519135","full_name":"showlab/VisorGPT","owner":"showlab","description":"[NeurIPS 2023] Customize spatial layouts for conditional image synthesis models, e.g., ControlNet, using GPT","archived":false,"fork":false,"pushed_at":"2024-05-04T01:51:17.000Z","size":126274,"stargazers_count":136,"open_issues_count":4,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-20T01:33:14.090Z","etag":null,"topics":["controlnet","diffusion-models","gpt","image-generation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/showlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-21T12:29:06.000Z","updated_at":"2025-02-10T05:58:20.000Z","dependencies_parsed_at":"2024-05-04T02:43:54.247Z","dependency_job_id":null,"html_url":"https://github.com/showlab/VisorGPT","commit_stats":null,"previous_names":["sierkinhane/visorgpt","showlab/visorgpt"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FVisorGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FVisorGPT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FVisorGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/showlab%2FVisorGPT/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/showlab","download_url":"https://codeload.github.com/showlab/VisorGPT/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250158021,"owners_count":21384334,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["controlnet","diffusion-models","gpt","image-generation"],"created_at":"2024-11-09T21:16:06.694Z","updated_at":"2025-04-22T00:33:01.046Z","avatar_url":"https://github.com/showlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[//]: # (\u003cdiv align=center\u003e)\r\n\r\n[//]: # (\u003cimg src=\"visorgpt_title.png\" width=\"400\"\u003e)\r\n\r\n[//]: # (\u003c/div\u003e)\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n\u003ch1\u003eVisorGPT 🎨 (NeurIPS 2023)\u003c/h1\u003e\r\n\u003ch3\u003eLearning Visual Prior via Generative Pre-Training\u003c/h3\u003e\r\n\r\n\r\n[Jinheng Xie](https://sierkinhane.github.io/)\u003csup\u003e1\u003c/sup\u003e\u0026nbsp; Kai Ye\u003csup\u003e2\u003c/sup\u003e\u0026nbsp; Yudong Li\u003csup\u003e2\u003c/sup\u003e\u0026nbsp; Yuexiang Li\u003csup\u003e3\u003c/sup\u003e\u0026nbsp; Yefeng Zheng\u003csup\u003e3\u003c/sup\u003e Linlin Shen\u003csup\u003e2\u003c/sup\u003e\u0026nbsp; [Mike Zheng Shou](https://scholar.google.com/citations?hl=zh-CN\u0026user=h1-3lSoAAAAJ\u0026view_op=list_works\u0026sortby=pubdate)\u003csup\u003e1\u003c/sup\u003e \r\n\r\n\u003csup\u003e1\u003c/sup\u003e National University of Singapore\u0026nbsp; \u003csup\u003e2\u003c/sup\u003e Shenzhen University\u0026nbsp; 
\u003csup\u003e3\u003c/sup\u003e Jarvis Research Center, Tencent YouTu Lab\r\n\r\n[![arXiv](https://img.shields.io/badge/arXiv-\u003c2305.13777\u003e-\u003cCOLOR\u003e.svg)](http://arxiv.org/abs/2305.13777) [![demo](https://img.shields.io/badge/demo-\u003chuggingface\u003e-\u003cCOLOR\u003e.svg)](https://huggingface.co/spaces/szukevin/VISOR-GPT) [![video](https://img.shields.io/badge/video-\u003cyoutube\u003e-\u003cCOLOR\u003e.svg)](https://www.youtube.com/watch?v=8FDoBfxSY8I) [![webpage](https://img.shields.io/badge/webpage-\u003cgithub.io\u003e-\u003cCOLOR\u003e.svg)](https://sierkinhane.github.io/visor-gpt/)\r\n\r\n\u003c/div\u003e\r\n\r\n\u003cimg src=\"demo.gif\" width=\"1000\"\u003e\r\n\r\n## Updates\r\n\r\n- [2023/05/23] Paper is available.\r\n- [2023/05/28] Gradio demo is available.\r\n- [2023/05/30] [Hugging Face demo is available](https://huggingface.co/spaces/szukevin/VISOR-GPT).\r\n- [2023/06/13] Training code and data are available.\r\n- [2023/09/22] VisorGPT was accepted to **NeurIPS 2023**.\r\n\r\n## Quick Start\r\n\r\n### Step 1\r\n\r\n```\r\n# clone the repo\r\ngit clone https://github.com/Sierkinhane/VisorGPT.git\r\n\r\n# go to the repo directory\r\ncd VisorGPT\r\n\r\n# create a new environment\r\nconda create -n visorgpt python=3.8\r\n\r\n# activate the new environment\r\nconda activate visorgpt\r\n\r\n# install the base dependencies\r\npip3 install -r requirements.txt\r\n\r\n# install controlnet and gligen\r\ncd demo/ControlNet\r\npip3 install -v -e .\r\ncd ../GLIGEN\r\npip3 install -v -e .\r\n```\r\n\r\n### Step 2 - Download pre-trained weights\r\n\r\nDownload [visorgpt](https://drive.google.com/file/d/1Pk4UPNKBMH-0uRLmK5COYTca7FUrN8XY/view?usp=sharing), [controlnet-pose2img](https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_openpose.pth), [controlnet-sd](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.safetensors), 
[gligen-bbox2img](https://huggingface.co/gligen/gligen-generation-text-box/blob/main/diffusion_pytorch_model.bin), and place them as follows:\r\n\r\n```\r\n├── demo/\r\n|   ├── ckpts\r\n|   |   ├── controlnet\r\n|   |   |   ├── control_v11p_sd15_openpose.pth\r\n|   |   |   ├── v1-5-pruned-emaonly.safetensors\r\n|   |   ├── gligen\r\n|   |   |   ├── diffusion_pytorch_model_box.bin\r\n|   |   ├── visorgpt\r\n|   |   |   ├── visorgpt_dagger_ta_tb.pt\r\n```\r\n\r\n### Step 3 - Run demo\r\n\r\n```\r\nCUDA_VISIBLE_DEVICES=0 python3 gradio_demo.py\r\n```\r\n\r\n## Training\r\n1. Download the preprocessed json files from [here](https://drive.google.com/drive/folders/1PL3RMPLUT3bFB-RHtMBzVkOLbQu_rDJF?usp=sharing).\r\n2. Process them into text corpora, e.g.,\r\n```\r\n# box type\r\npython3 preprocess_coord.py --input_path path/to/coco_train.json --data_type box --output_dir txt_train\r\n# keypoint type\r\npython3 preprocess_coord.py --input_path path/to/cocokeypoints_train.json --data_type keypoint --output_dir txt_train\r\n# mask type\r\npython3 preprocess_coord.py --input_path path/to/coco_train.json --data_type mask --output_dir txt_train\r\n```\r\n3. If you have produced several `.txt` files, merge them into one `.txt` file, e.g.,\r\n```\r\npython3 utiles/merge_files.py --file_dir txt_train --output_file_path train.txt\r\n```\r\n4. Tokenize the text corpora.\r\n```\r\ncd train/\r\npython3 preprocess.py --corpus_path ../train.txt \\\r\n                      --vocab_path models/google_uncased_en_coord_vocab.txt \\\r\n                      --dataset_path train.pt --processes_num 8 \\\r\n                      --seq_length 1024 --tgt_seq_length 1024 --data_processor lm\r\n```\r\n5. Train the GPT-2-based model. 
Training requires 8 V100 (32 GB) GPUs.\r\n```\r\ndeepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \\\r\n                    --dataset_path train.pt \\\r\n                    --vocab_path models/google_uncased_en_coord_vocab.txt \\\r\n                    --config_path models/gpt2/config.json \\\r\n                    --output_model_path train.bin \\\r\n                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\\r\n                    --total_steps 200000 --save_checkpoint_steps 5000 --report_steps 100 \\\r\n                    --learning_rate 5e-5 --batch_size 16\r\n```\r\nAlternatively, download the tokenized data directly from [here](https://drive.google.com/file/d/1VVw7zypNtkiMwJa3exGVZ31XnZCjYU6f/view?usp=sharing) (around 340K sequences), put it in the `train/` directory, and train on it:\r\n```\r\ndeepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \\\r\n                    --dataset_path visorgpt_dagger_train_seq.pt \\\r\n                    --vocab_path models/google_uncased_en_coord_vocab.txt \\\r\n                    --config_path models/gpt2/config.json \\\r\n                    --output_model_path models/visorgpt_dagger_train_seq.bin \\\r\n                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\\r\n                    --total_steps 200000 --save_checkpoint_steps 10000 --report_steps 100 \\\r\n                    --learning_rate 5e-5 --batch_size 16\r\n```\r\n\r\n## Inference\r\n```\r\nCUDA_VISIBLE_DEVICES=0 python3 scripts/generate_lm_multiple.py --load_model_path models/visorgpt_dagger_train_seq.bin/200000/mp_rank_00_model_states.pt \\\r\n                               --vocab_path models/google_uncased_en_coord_vocab.txt \\\r\n                               --test_path beginning.txt --prediction_path generated_sentence.txt \\\r\n                               --config_path models/gpt2/config.json --seq_length 512\r\n                               \r\nor \r\nCUDA_VISIBLE_DEVICES=0 
python3 scripts/generate_lm_multiple.py --load_model_path models/visorgpt_dagger_train_seq.bin \\\r\n                               --vocab_path models/google_uncased_en_coord_vocab.txt \\\r\n                               --test_path beginning.txt --prediction_path generated_sentence.txt \\\r\n                               --config_path models/gpt2/config.json --seq_length 512\r\n```\r\n## Visualization\r\n```\r\ncd ../\r\npython utils/seq2coord.py --file_path path/to/your/inference/txt --visualize\r\n```\r\nThe visualization results will be saved at `./debug`\r\n\r\nIf you are using our code, please consider citing our paper.\r\n\r\n```\r\n@inproceedings{xie2023learning,\r\ntitle={Learning Visual Prior via Generative Pre-Training},\r\nauthor={Jinheng Xie and Kai Ye and Yudong Li and Yuexiang Li and Kevin Qinghong Lin and Yefeng Zheng and Linlin Shen and Mike Zheng Shou},\r\nbooktitle={Thirty-seventh Conference on Neural Information Processing Systems},\r\nyear={2023},\r\n}\r\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Fvisorgpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshowlab%2Fvisorgpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshowlab%2Fvisorgpt/lists"}