{"id":13467228,"url":"https://github.com/nerdyrodent/VQGAN-CLIP","last_synced_at":"2025-03-26T01:30:39.526Z","repository":{"id":37076521,"uuid":"382313011","full_name":"nerdyrodent/VQGAN-CLIP","owner":"nerdyrodent","description":"Just playing with getting VQGAN+CLIP running locally, rather than having to use colab.","archived":false,"fork":false,"pushed_at":"2022-10-02T12:22:31.000Z","size":33247,"stargazers_count":2643,"open_issues_count":26,"forks_count":429,"subscribers_count":53,"default_branch":"main","last_synced_at":"2025-03-24T00:18:45.591Z","etag":null,"topics":["text-to-image","text2image"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nerdyrodent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-07-02T10:35:57.000Z","updated_at":"2025-03-18T12:59:16.000Z","dependencies_parsed_at":"2022-06-24T18:21:42.422Z","dependency_job_id":null,"html_url":"https://github.com/nerdyrodent/VQGAN-CLIP","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerdyrodent%2FVQGAN-CLIP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerdyrodent%2FVQGAN-CLIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerdyrodent%2FVQGAN-CLIP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nerdyrodent%2FVQGAN-CLIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nerdyrodent","download_url":"https://codeload.github.com/nerdyrodent/VQGAN-CLIP/tar.
gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245571702,"owners_count":20637377,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["text-to-image","text2image"],"created_at":"2024-07-31T15:00:54.263Z","updated_at":"2025-03-26T01:30:39.518Z","avatar_url":"https://github.com/nerdyrodent.png","language":"Python","funding_links":[],"categories":["Python","Applications"],"sub_categories":["GAN"],"readme":"# VQGAN-CLIP Overview\n\nA repo for running VQGAN+CLIP locally. This started out as a Katherine Crowson VQGAN+CLIP derived Google colab notebook.\n\n\u003ca href=\"https://replicate.ai/nerdyrodent/vqgan-clip\"\u003e\u003cimg src=\"https://img.shields.io/static/v1?label=Replicate\u0026message=Demo and Docker Image\u0026color=blue\"\u003e\u003c/a\u003e\n\nOriginal notebook: [![Open In Colab][colab-badge]][colab-notebook]\n\n[colab-notebook]: \u003chttps://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ\u003e\n[colab-badge]: \u003chttps://colab.research.google.com/assets/colab-badge.svg\u003e\n\nSome example images:\n\n\u003cimg src=\"./samples/Cartoon3.png\" width=\"256px\"\u003e\u003c/img\u003e\u003cimg src=\"./samples/Cartoon.png\" width=\"256px\"\u003e\u003c/img\u003e\u003cimg src=\"./samples/Cartoon2.png\" width=\"256px\"\u003e\u003c/img\u003e\n\u003cimg src=\"./samples/Bedroom.png\" width=\"256px\"\u003e\u003c/img\u003e\u003cimg src=\"./samples/DemonBiscuits.png\" width=\"256px\"\u003e\u003c/img\u003e\u003cimg src=\"./samples/Football.png\" width=\"256px\"\u003e\u003c/img\u003e\n\u003cimg 
src=\"./samples/Fractal_Landscape3.png\" width=\"256px\"\u003e\u003c/img\u003e\u003cimg src=\"./samples/Games_5.png\" width=\"256px\"\u003e\u003c/img\u003e\n\nEnvironment:\n\n* Tested on Ubuntu 20.04\n* GPU: Nvidia RTX 3090\n* Typical VRAM requirements:\n  * 24 GB for a 900x900 image\n  * 10 GB for a 512x512 image\n  * 8 GB for a 380x380 image\n\nYou may also be interested in [CLIP Guided Diffusion](https://github.com/nerdyrodent/CLIP-Guided-Diffusion)\n\n## Set up\n\nThis example uses [Anaconda](https://www.anaconda.com/products/individual#Downloads) to manage virtual Python environments.\n\nCreate a new virtual Python environment for VQGAN-CLIP:\n\n```sh\nconda create --name vqgan python=3.9\nconda activate vqgan\n```\n\nInstall PyTorch in the new environment:\n\nNote: This installs the CUDA version of PyTorch. If you want to use an AMD graphics card, read the [AMD section below](#using-an-amd-graphics-card) instead.\n\n```sh\npip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html\n```\n\nInstall other required Python packages:\n\n```sh\npip install ftfy regex tqdm omegaconf pytorch-lightning IPython kornia imageio imageio-ffmpeg einops torch_optimizer setuptools==59.5.0\n```\n\nOr use the `requirements.txt` file, which includes version numbers.\n\nClone required repositories:\n\n```sh\ngit clone 'https://github.com/nerdyrodent/VQGAN-CLIP'\ncd VQGAN-CLIP\ngit clone 'https://github.com/openai/CLIP'\ngit clone 'https://github.com/CompVis/taming-transformers'\n```\n\nNote: In my development environment both CLIP and taming-transformers are present in the local directory, and so aren't present in the `requirements.txt` or `vqgan.yml` files.\n\nAlternatively, you can pip install taming-transformers and CLIP.\n\nYou will also need at least one pretrained VQGAN model. 
For example:\n\n```sh\nmkdir checkpoints\n\ncurl -L -o checkpoints/vqgan_imagenet_f16_16384.yaml -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml\u0026dl=1' #ImageNet 16384\ncurl -L -o checkpoints/vqgan_imagenet_f16_16384.ckpt -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt\u0026dl=1' #ImageNet 16384\n```\n\nNote that users of `curl` on Microsoft Windows should use double quotes instead of single quotes around the URLs.\n\nThe `download_models.sh` script is an optional way to download a number of models. By default, it will download just one model.\n\nSee \u003chttps://github.com/CompVis/taming-transformers#overview-of-pretrained-models\u003e for more information about VQGAN pre-trained models, including download links.\n\nBy default, the model .yaml and .ckpt files are expected in the `checkpoints` directory.\nSee \u003chttps://github.com/CompVis/taming-transformers\u003e for more information on datasets and models.\n\nVideo guides are also available:\n* Linux - https://www.youtube.com/watch?v=1Esb-ZjO7tw\n* Windows - https://www.youtube.com/watch?v=XH7ZP0__FXs\n\n### Using an AMD graphics card\n\nNote: This hasn't been tested yet.\n\nROCm can be used for AMD graphics cards instead of CUDA. 
You can check if your card is supported here:\n\u003chttps://github.com/RadeonOpenCompute/ROCm#supported-gpus\u003e\n\nInstall ROCm according to the instructions, and don't forget to add your user to the `video` group:\n\u003chttps://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html\u003e\n\nThe setup and usage instructions above are the same, except for the line where you install PyTorch.\nInstead of `pip install torch==1.9.0+cu111 ...`, use the one or two lines displayed here (select Pip -\u003e Python -\u003e ROCm):\n\u003chttps://pytorch.org/get-started/locally/\u003e\n\n### Using the CPU\n\nIf no graphics card can be found, the CPU is used automatically and a warning is displayed.\n\nEven if a graphics card is available, the CPU can be used by adding this command-line argument: `-cd cpu`\n\nThis works with the CUDA version of PyTorch, even without CUDA drivers installed, but doesn't seem to work with ROCm as of now.\n\n### Uninstalling\n\nRemove the Python environment:\n\n```sh\nconda remove --name vqgan --all\n```\n\nand delete the `VQGAN-CLIP` directory.\n\n## Run\n\nTo generate images from text, specify your text prompt as shown in the example below:\n\n```sh\npython generate.py -p \"A painting of an apple in a fruit bowl\"\n```\n\n\u003cimg src=\"./samples/A_painting_of_an_apple_in_a_fruitbowl.png\" width=\"256px\"\u003e\u003c/img\u003e\n\n## Multiple prompts\n\nText and image prompts can be split using the pipe symbol (`|`) to allow multiple prompts.\nYou can also use a colon followed by a number to set a weight for that prompt. For example:\n\n```sh\npython generate.py -p \"A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25\"\n```\n\n\u003cimg src=\"./samples/Apple_weird.png\" width=\"256px\"\u003e\u003c/img\u003e\n\nImage prompts can be split in the same way. 
For example:\n\n```sh\npython generate.py -p \"A picture of a bedroom with a portrait of Van Gogh\" -ip \"samples/VanGogh.jpg | samples/Bedroom.png\"\n```\n\n### Story mode\n\nSets of text prompts can be separated using the caret symbol (`^`) to generate a sort of story mode, where `-cpe` sets how often (in iterations) the prompt changes. For example:\n\n```sh\npython generate.py -p \"A painting of a sunflower|photo:-1 ^ a painting of a rose ^ a painting of a tulip ^ a painting of a daisy flower ^ a photograph of daffodil\" -cpe 1500 -zvid -i 6000 -zse 10 -vl 20 -zsc 1.005 -opt Adagrad -lr 0.15 -se 6000\n```\n\n## \"Style Transfer\"\n\nAn input image with style text and a low number of iterations can be used to create a sort of \"style transfer\" effect. For example:\n\n```sh\npython generate.py -p \"A painting in the style of Picasso\" -ii samples/VanGogh.jpg -i 80 -se 10 -opt AdamW -lr 0.25\n```\n\n| Output                                                        | Style       |\n| ------------------------------------------------------------- | ----------- |\n| \u003cimg src=\"./samples/vvg_picasso.png\" width=\"256px\"\u003e\u003c/img\u003e     | Picasso     |\n| \u003cimg src=\"./samples/vvg_sketch.png\" width=\"256px\"\u003e\u003c/img\u003e      | Sketch      |\n| \u003cimg src=\"./samples/vvg_psychedelic.png\" width=\"256px\"\u003e\u003c/img\u003e | Psychedelic |\n\nA video style transfer effect can be achieved by specifying a directory of video frames in `video_style_dir`. Output will be saved in the steps directory, using the original video frame filenames. You can also use this as a sort of \"batch mode\" if you have a directory of images you want to apply a style to. 
This can also be combined with Story Mode if you don't want to apply the same style to every image, but instead roll through a list of styles.\n\n## Feedback example\n\nBy feeding back the generated images and making slight changes, some interesting effects can be created.\n\nThe example `zoom.sh` script shows this by applying a zoom and rotation to each generated image before feeding it back in.\nTo use `zoom.sh`, specify a text prompt, an output filename and the number of frames. For example:\n\n```sh\n./zoom.sh \"A painting of a red telephone box spinning through a time vortex\" Telephone.png 150\n```\n\nIf you don't have ImageMagick installed, you can install it with `sudo apt install imagemagick`.\n\n\u003cimg src=\"./samples/zoom.gif\" width=\"256px\"\u003e\u003c/img\u003e\n\nThere is also a simple zoom video option (`-zvid`) built into `generate.py`. For example:\n\n```sh\npython generate.py -p \"The inside of a sphere\" -zvid -i 4500 -zse 20 -vl 10 -zsc 0.97 -opt Adagrad -lr 0.15 -se 4500\n```\n\n## Random text example\n\nUse `random.sh` to make a batch of images from random text. 
Edit the text and number of generated images to your taste!\n\n```sh\n./random.sh\n```\n\n## Advanced options\n\nTo view the available options, use \"-h\".\n\n```sh\npython generate.py -h\n```\n\n```sh\nusage: generate.py [-h] [-p PROMPTS] [-ip IMAGE_PROMPTS] [-i MAX_ITERATIONS] [-se DISPLAY_FREQ]\n[-s SIZE SIZE] [-ii INIT_IMAGE] [-in INIT_NOISE] [-iw INIT_WEIGHT] [-m CLIP_MODEL]\n[-conf VQGAN_CONFIG] [-ckpt VQGAN_CHECKPOINT] [-nps [NOISE_PROMPT_SEEDS ...]]\n[-npw [NOISE_PROMPT_WEIGHTS ...]] [-lr STEP_SIZE] [-cuts CUTN] [-cutp CUT_POW] [-sd SEED]\n[-opt {Adam,AdamW,Adagrad,Adamax,DiffGrad,AdamP,RAdam,RMSprop}] [-o OUTPUT] [-vid] [-zvid]\n[-zs ZOOM_START] [-zse ZOOM_FREQUENCY] [-zsc ZOOM_SCALE] [-cpe PROMPT_FREQUENCY]\n[-vl VIDEO_LENGTH] [-ofps OUTPUT_VIDEO_FPS] [-ifps INPUT_VIDEO_FPS] [-d]\n[-aug {Ji,Sh,Gn,Pe,Ro,Af,Et,Ts,Cr,Er,Re} [{Ji,Sh,Gn,Pe,Ro,Af,Et,Ts,Cr,Er,Re} ...]]\n[-cd CUDA_DEVICE]\n```\n\n```sh\noptional arguments:\n  -h, --help            show this help message and exit\n  -p PROMPTS, --prompts PROMPTS\n                        Text prompts\n  -ip IMAGE_PROMPTS, --image_prompts IMAGE_PROMPTS\n                        Image prompts / target image\n  -i MAX_ITERATIONS, --iterations MAX_ITERATIONS\n                        Number of iterations\n  -se DISPLAY_FREQ, --save_every DISPLAY_FREQ\n                        Save image iterations\n  -s SIZE SIZE, --size SIZE SIZE\n                        Image size (width height) (default: [512, 512])\n  -ii INIT_IMAGE, --init_image INIT_IMAGE\n                        Initial image\n  -in INIT_NOISE, --init_noise INIT_NOISE\n                        Initial noise image (pixels or gradient)\n  -iw INIT_WEIGHT, --init_weight INIT_WEIGHT\n                        Initial weight\n  -m CLIP_MODEL, --clip_model CLIP_MODEL\n                        CLIP model (e.g. 
ViT-B/32, ViT-B/16)\n  -conf VQGAN_CONFIG, --vqgan_config VQGAN_CONFIG\n                        VQGAN config\n  -ckpt VQGAN_CHECKPOINT, --vqgan_checkpoint VQGAN_CHECKPOINT\n                        VQGAN checkpoint\n  -nps [NOISE_PROMPT_SEEDS ...], --noise_prompt_seeds [NOISE_PROMPT_SEEDS ...]\n                        Noise prompt seeds\n  -npw [NOISE_PROMPT_WEIGHTS ...], --noise_prompt_weights [NOISE_PROMPT_WEIGHTS ...]\n                        Noise prompt weights\n  -lr STEP_SIZE, --learning_rate STEP_SIZE\n                        Learning rate\n  -cuts CUTN, --num_cuts CUTN\n                        Number of cuts\n  -cutp CUT_POW, --cut_power CUT_POW\n                        Cut power\n  -sd SEED, --seed SEED\n                        Seed\n  -opt, --optimiser {Adam,AdamW,Adagrad,Adamax,DiffGrad,AdamP,RAdam,RMSprop}\n                        Optimiser\n  -o OUTPUT, --output OUTPUT\n                        Output file\n  -vid, --video         Create video frames?\n  -zvid, --zoom_video   Create zoom video?\n  -zs ZOOM_START, --zoom_start ZOOM_START\n                        Zoom start iteration\n  -zse ZOOM_FREQUENCY, --zoom_save_every ZOOM_FREQUENCY\n                        Save zoom image iterations\n  -zsc ZOOM_SCALE, --zoom_scale ZOOM_SCALE\n                        Zoom scale\n  -cpe PROMPT_FREQUENCY, --change_prompt_every PROMPT_FREQUENCY\n                        Prompt change frequency\n  -vl VIDEO_LENGTH, --video_length VIDEO_LENGTH\n                        Video length in seconds\n  -ofps OUTPUT_VIDEO_FPS, --output_video_fps OUTPUT_VIDEO_FPS\n                        Create an interpolated video (Nvidia GPU only) with this fps (min 10. 
best set to 30 or 60)\n  -ifps INPUT_VIDEO_FPS, --input_video_fps INPUT_VIDEO_FPS\n                        When creating an interpolated video, use this as the input fps to interpolate from (\u003e0 \u0026 \u003cofps)\n  -d, --deterministic   Enable cudnn.deterministic?\n  -aug, --augments {Ji,Sh,Gn,Pe,Ro,Af,Et,Ts,Cr,Er,Re} [{Ji,Sh,Gn,Pe,Ro,Af,Et,Ts,Cr,Er,Re} ...]\n                        Enabled augments\n  -cd CUDA_DEVICE, --cuda_device CUDA_DEVICE\n                        Cuda device to use\n```\n\n## Troubleshooting\n\n### CUSOLVER_STATUS_INTERNAL_ERROR\n\nFor example:\n\n`RuntimeError: cusolver error: CUSOLVER_STATUS_INTERNAL_ERROR, when calling cusolverDnCreate(handle)`\n\nMake sure you have specified the correct size for the image.\n\n### RuntimeError: CUDA out of memory\n\nFor example:\n\n`RuntimeError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 23.70 GiB total capacity; 21.31 GiB already allocated; 78.56 MiB free; 21.70 GiB reserved in total by PyTorch)`\n\nYour request doesn't fit into your GPU's VRAM. 
Reduce the image size (`-s`) and/or the number of cuts (`-cuts`).\n\n\n## Citations\n\n```bibtex\n@misc{unpublished2021clip,\n    title  = {CLIP: Connecting Text and Images},\n    author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},\n    year   = {2021}\n}\n```\n\n```bibtex\n@misc{esser2020taming,\n      title={Taming Transformers for High-Resolution Image Synthesis}, \n      author={Patrick Esser and Robin Rombach and Björn Ommer},\n      year={2020},\n      eprint={2012.09841},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\nKatherine Crowson - \u003chttps://github.com/crowsonkb\u003e\n\nPublic Domain images from Open Access Images at the Art Institute of Chicago - \u003chttps://www.artic.edu/open-access/open-access-images\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnerdyrodent%2FVQGAN-CLIP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnerdyrodent%2FVQGAN-CLIP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnerdyrodent%2FVQGAN-CLIP/lists"}