https://github.com/laion-ai/ldm-finetune
Home of `erlich` and `ongo`. Finetune latent-diffusion/glid-3-xl text2image on your own data.
- Host: GitHub
- URL: https://github.com/laion-ai/ldm-finetune
- Owner: LAION-AI
- License: MIT
- Created: 2022-06-01T15:28:27.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-08-05T14:58:27.000Z (almost 3 years ago)
- Last Synced: 2025-05-07T18:13:38.825Z (about 1 month ago)
- Language: Python
- Size: 4.39 MB
- Stars: 181
- Watchers: 5
- Forks: 19
- Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE
# `ldm-finetune`
CompVis `latent-diffusion` finetuned on art (ongo), logo (erlich) and pixel-art (puck) generation.
This repo is modified from [glid-3-xl](https://github.com/jack000/glid-3-xl). Aesthetic CLIP embeds are provided by [aesthetic-predictor](https://github.com/LAION-AI/aesthetic-predictor).
- [`ldm-finetune`](#ldm-finetune)
- [Quick start (docker required)](#quick-start-docker-required)
- [Setup](#setup)
- [Prerequisites](#prerequisites)
- [PyTorch](#pytorch)
- [Install ldm-finetune](#install-ldm-finetune)
- [Checkpoints](#checkpoints)
- [Foundation/Backbone models:](#foundationbackbone-models)
- [Latent Diffusion Stage 2 (diffusion)](#latent-diffusion-stage-2-diffusion)
- [(recommended) jack000 - `inpaint.pt`](#recommended-jack000---inpaintpt)
- [LAION Finetuning Checkpoints](#laion-finetuning-checkpoints)
- [Erlich](#erlich)
- [Ongo](#ongo)
- [LAION - `puck.pt`](#laion---puckpt)
- [Other](#other)
- [Generating images](#generating-images)
- [Docker/cog](#dockercog)
- [Flask API](#flask-api)
- [Python](#python)
- [Autoedit](#autoedit)
- [Finetuning](#finetuning)

## Quick start (docker required)
- Install [docker](https://docs.docker.com/get-docker/)
- Install [cog](https://github.com/replicate/cog/)

The following command will download all weights and run a prediction with your inputs inside a proper docker container.
```sh
cog predict r8.im/laion-ai/erlich \
-i prompt="an armchair in the form of an avocado" \
-i negative="" \
-i init_image=@path/to/image \
-i mask=@path/to/mask \
-i guidance_scale=5.0 \
-i steps=100 \
-i batch_size=4 \
-i width=256 \
-i height=256 \
-i init_skip_fraction=0.0 \
-i aesthetic_rating=9 \
-i aesthetic_weight=0.5 \
-i seed=-1 \
-i intermediate_outputs=False
```

Valid remote image URLs are:
- `r8.im/laion-ai/erlich`
- `r8.im/laion-ai/ongo`
- `r8.im/laion-ai/puck`

## Setup
### Prerequisites
Please ensure the following dependencies are installed prior to building this repo:
- build-essential
- libopenmpi-dev
- liblzma-dev
- zlib1g-dev

### PyTorch
It's a good idea to use a virtual environment or a conda environment.
```bash
python3 -m venv venv
source venv/bin/activate
(venv) $
```

Before installing this repo, install PyTorch manually by following the instructions at [pytorch.org](https://pytorch.org/get-started/locally/):
```bash
(venv) $ pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
```

To check your CUDA version, run `nvidia-smi`.
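If you prefer to verify the install from Python rather than `nvidia-smi`, a quick sanity check like the following (a generic sketch, not part of this repo) confirms that the PyTorch build you installed can actually see a GPU:

```python
import torch

# Print the installed torch version, the CUDA toolkit it was built against,
# and whether a GPU is visible to this environment.
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```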
### Install ldm-finetune
You can now install this repo by running `pip install -e .` in the project directory.
```bash
(venv) $ git clone https://github.com/laion-ai/ldm-finetune.git
(venv) $ cd ldm-finetune
(venv) $ pip install -e .
(venv) $ pip install -r requirements.txt
```

## Checkpoints
### Foundation/Backbone models:
```sh
# OpenAI CLIP ViT-L/14
wget -P /root/.cache/clip "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt"

# BERT Text Encoder
wget --continue https://dall-3.com/models/glid-3-xl/bert.pt

# kl-f8 VAE backbone
wget --continue https://dall-3.com/models/glid-3-xl/kl-f8.pt
```

### Latent Diffusion Stage 2 (diffusion)
There are several stage 2 checkpoints to choose from:

### (recommended) jack000 - `inpaint.pt`
The second finetune from jack000's [glid-3-xl](https://github.com/jack000/glid-3-xl) adds support for inpainting and can be used for unconditional output as well by setting the inpaint `image_embed` to zeros. It is additionally finetuned to use the CLIP text embed via cross-attention (similar to unCLIP).
```sh
wget --continue https://dall-3.com/models/glid-3-xl/inpaint.pt
```
### LAION Finetuning Checkpoints
LAION also finetuned `inpaint.pt` with the aim of improving logo and painting generation.
#### Erlich
`erlich` is [inpaint.pt](https://dall-3.com/models/glid-3-xl/inpaint.pt) finetuned on a dataset collected from LAION-5B named `Large Logo Dataset`. It consists of roughly 100K images of logos with captions generated via BLIP using aggressive re-ranking and filtering.

```sh
wget --continue -O erlich.pt https://huggingface.co/laion/erlich/resolve/main/model/ema_0.9999_120000.pt
```> ["You know aviato?"](https://www.youtube.com/watch?v=7Q9nQXdzNd0&t=39s)
#### Ongo
Ongo is [inpaint.pt](https://dall-3.com/models/glid-3-xl/inpaint.pt) finetuned on the WikiArt dataset, consisting of about 100K paintings with captions generated via BLIP using aggressive re-ranking and filtering. We also make use of the original captions, which contain the author name and the painting title.

```sh
wget https://huggingface.co/laion/ongo/resolve/main/ongo.pt
```> ["Ongo Gablogian, the art collector. Charmed, I'm sure."](https://www.youtube.com/watch?v=CuMO5q1Syek)
#### LAION - `puck.pt`
`puck` has been trained on pixel art. While the underlying kl-f8 encoder seems to struggle somewhat with pixel art, results are still interesting.
```sh
wget https://huggingface.co/laion/puck/resolve/main/puck.pt
```

#### Other
```sh
# CompVis - `diffusion.pt`
# The original checkpoint from CompVis trained on LAION-400M. May output watermarks.
wget --continue https://dall-3.com/models/glid-3-xl/diffusion.pt

# jack000 - `finetune.pt`
# The first finetune from jack000's glid-3-xl (https://github.com/jack000/glid-3-xl).
# Modified to accept a CLIP text embed and finetuned on curated data to help with
# watermarks. Doesn't support inpainting.
# wget https://dall-3.com/models/glid-3-xl/finetune.pt
```

## Generating images
You can run prediction via Python or docker. Currently, the docker method is the best supported.
### Docker/cog
If you have access to a Linux machine (or WSL 2 on Windows 11) with docker installed, you can easily run models by installing `cog`:
```sh
sudo curl -o /usr/local/bin/cog -L https://github.com/replicate/cog/releases/latest/download/cog_`uname -s`_`uname -m`
sudo chmod +x /usr/local/bin/cog
```

Modify the `MODEL_PATH` in `cog_sample.py`:
```python
MODEL_PATH = "erlich.pt" # Can be erlich, ongo, puck, etc.
```

Now you can run predictions via a docker container using:
```sh
cog predict -i prompt="a logo of a fox made of fire"
```

Output will be returned as a base64 string at the end of generation and is also saved locally at `current_{batch_idx}.png`.
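If you capture the returned string programmatically, a minimal sketch like the one below can decode it back into a PNG. It assumes the string is standard base64, optionally prefixed with a `data:image/png;base64,` header; the helper name is our own, not part of this repo:

```python
import base64

def save_base64_png(b64_string: str, out_path: str = "output.png") -> None:
    """Decode a base64-encoded PNG (optionally a data URI) and write it to disk."""
    # Strip a "data:image/png;base64," style prefix if present
    # (assumption: the string may be returned as a data URI).
    if b64_string.startswith("data:") and "," in b64_string:
        b64_string = b64_string.split(",", 1)[1]
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(b64_string))
```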
### Flask API
If you'd like to stand up your own ldm-finetune Flask API, you can run:
```sh
cog build -t my_ldm_image
docker run -d -p 5000:5000 --gpus all my_ldm_image
```

Predictions can then be accessed via HTTP:
```sh
curl http://localhost:5000/predictions -X POST \
-H 'Content-Type: application/json' \
-d '{"input": {"prompt": "a logo of a fox made of fire"}}'
```

The output from the API will be a list of base64 strings representing your generations.
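For convenience, here is a minimal Python client sketch for the same endpoint. It assumes the base64 images are returned under an `output` key in the response JSON (the usual cog convention); inspect the raw response if your build differs:

```python
import base64
import requests

payload = {"input": {"prompt": "a logo of a fox made of fire"}}
resp = requests.post("http://localhost:5000/predictions", json=payload, timeout=600)
resp.raise_for_status()

# Assumption: the generations live under an "output" key as base64 strings.
for i, b64_image in enumerate(resp.json().get("output", [])):
    data = b64_image.split(",", 1)[1] if b64_image.startswith("data:") else b64_image
    with open(f"generation_{i}.png", "wb") as f:
        f.write(base64.b64decode(data))
```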
### Python
You can also use the standalone Python scripts from `glid-3-xl`.
```bash
# fast PLMS sampling
(venv) $ python sample.py --model_path erlich.pt --batch_size 6 --num_batches 6 --text "a cyberpunk girl with a scifi neuralink device on her head"

# sample with an init image
(venv) $ python sample.py --init_image picture.jpg --skip_timesteps 10 --model_path ongo.pt --batch_size 6 --num_batches 6 --text "a cyberpunk girl with a scifi neuralink device on her head"
```

### Autoedit
> Autoedit uses the inpaint model to give the LDM an image prompting function (which works differently from `--init_image`).
> It continuously edits random parts of the image to maximize the CLIP score for the text prompt.

```bash
(venv) $ python autoedit.py \
--model_path inpaint.pt --kl_path kl-f8.pt --bert_path bert.pt \
--text "high quality professional pixel art" --negative "" --prefix autoedit_generations \
--batch_size 16 --width 256 --height 256 --iterations 25 \
--starting_threshold 0.6 --ending_threshold 0.5 \
--starting_radius 5 --ending_radius 0.1 \
--seed -1 --guidance_scale 5.0 --steps 30 \
--aesthetic_rating 9 --aesthetic_weight 0.5 --wandb_name my_autoedit_wandb_artifact
```

## Finetuning
See the script below for an example of finetuning your own model from one of the available checkpoints.

Finetuning tips and tricks:
- An NVIDIA GPU is required. You will need an A100 or better to use a batch size of 64; using a smaller batch size may cause stability issues.
- Monitor the `grad_norm` in the output log. If it ever goes above 1.0 the checkpoint may be ruined due to exploding gradients.
  - To fix this, try reducing the learning rate or decreasing the batch size.
- Train in 32-bit precision.
- Resume with a saved optimizer state when possible.

```bash
#!/bin/bash
# Finetune glid-3-xl inpaint.pt on your own webdataset.
# Note: like all one-off scripts, this is likely to become out of date at some point.
# Running `python scripts/image_train_inpaint.py --help` will give you more info.

# Model flags
use_fp16=False # TODO: fp16 can cause more trouble than it's worth.
MODEL_FLAGS="--dropout 0.1 --attention_resolutions 32,16,8 --class_cond False --diffusion_steps 1000 --image_size 32 --learn_sigma False --noise_schedule linear --num_channels 320 --num_heads 8 --num_res_blocks 2 --resblock_updown False --use_fp16 $use_fp16 --use_scale_shift_norm False"

# Checkpoint flags
resume_checkpoint="inpaint.pt"
kl_model="kl-f8.pt"
bert_model="bert.pt"

# Training flags
epochs=80
shard_size=512
batch_size=32
microbatch=-1
lr=1e-6 # lr=1e-5 seems to be stable. going above 3e-5 is not stable.
ema_rate=0.9999 # TODO you may want to lower this to 0.999, 0.99, 0.95, etc.
random_crop=False
random_flip=False
cache_dir="cache"
image_key="jpg"
caption_key="txt"
data_dir=/my/custom/webdataset/ # TODO: set this to a real path

# Interval flags
sample_interval=100
log_interval=1
save_interval=2000

CKPT_FLAGS="--kl_model $kl_model --bert_model $bert_model --resume_checkpoint $resume_checkpoint"
INTERVAL_FLAGS="--sample_interval $sample_interval --log_interval $log_interval --save_interval $save_interval"
TRAIN_FLAGS="--epochs $epochs --shard_size $shard_size --batch_size $batch_size --microbatch $microbatch --lr $lr --random_crop $random_crop --random_flip $random_flip --cache_dir $cache_dir --image_key $image_key --caption_key $caption_key --data_dir $data_dir"
COMBINED_FLAGS="$MODEL_FLAGS $CKPT_FLAGS $TRAIN_FLAGS $INTERVAL_FLAGS"
export OPENAI_LOGDIR=./erlich_on_pixel_logs_run6_part2/
export TOKENIZERS_PARALLELISM=false

# TODO: uncomment one of the lines below to train on either a single GPU or multiple GPUs.
# single GPU
# python scripts/image_train_inpaint.py $COMBINED_FLAGS

# or multi-GPU
# mpirun -n 8 python scripts/image_train_inpaint.py $COMBINED_FLAGS
```
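The script above reads training samples from a webdataset whose entries are keyed by `jpg` (image bytes) and `txt` (caption), matching the `image_key`/`caption_key` flags. As a rough illustration of how such a shard can be packed with the `webdataset` library, here is a minimal sketch; the file paths and captions are hypothetical placeholders, and packing details (shard size, naming) are up to you:

```python
import webdataset as wds

# Hypothetical (image path, caption) pairs; replace with your own data.
samples = [
    ("images/logo_0001.jpg", "a minimalist fox logo"),
    ("images/logo_0002.jpg", "a retro pixel art spaceship"),
]

sink = wds.TarWriter("my-dataset-000000.tar")
for idx, (image_path, caption) in enumerate(samples):
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    sink.write({
        "__key__": f"sample{idx:06d}",  # unique key per sample
        "jpg": image_bytes,             # matches image_key="jpg"
        "txt": caption,                 # matches caption_key="txt"
    })
sink.close()
```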