# Aphantasia

This is a collection of text-to-image tools, evolved from the [artwork] of the same name.
Based on the [CLIP] model and the [Lucent] library, with FFT/DWT/RGB parameterizers (no-GAN generation).
*Updated: **Illustrip** (text-to-video with motion and depth) is added.*
*Updated: DWT (wavelets) parameterization is added.*
*Updated: Check also colabs below, with VQGAN and SIREN+FFM generators.*
Tested on Python 3.7 with PyTorch 1.7.1 or 1.8.

*[Aphantasia] is the inability to visualize mental images, the deprivation of visual dreams.
The image in the header is generated by the tool from this word.*

**Please be kind enough to credit this project if you employ it for your masterpieces.**

## Features
* generating massive detailed textures, a la deepdream
* fullHD/4K resolutions and above
* various CLIP models (including multi-language from [SBERT])
* continuous mode to process phrase lists (e.g. illustrating lyrics)
* pan/zoom motion with smooth interpolation
* direct RGB pixels optimization (very stable)
* depth-based 3D look (courtesy of [deKxi], based on [AdaBins])
* complex queries:
  * text and/or image as main prompts
  * separate text prompts for style and for topics to subtract (avoid)
* starting/resuming process from saved parameters or from an image

Set up [CLIP] and the other dependencies:
```
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
```
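As a quick optional sanity check (assuming the packages installed cleanly), you can list the CLIP models available for download:
```
python -c "import clip; print(clip.available_models())"
```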

## Operations

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eps696/aphantasia/blob/master/Aphantasia.ipynb)

* Generate an image from the text prompt (set the size as you wish):
```
python clip_fft.py -t "the text" --size 1280-720
```
* Reproduce an image:
```
python clip_fft.py -i theimage.jpg --sync 0.4
```
If the `--sync X` argument is greater than 0, [LPIPS] loss is added to keep the composition similar to the original image.

You can combine both text and image prompts.
For non-English languages use either `--multilang` (multi-language CLIP model, trained with ViT) or `--translate` (Google translation, works with any visual model).
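For instance, a sketch combining a text prompt with an image prompt (the prompt text and file name are placeholders); append `--translate` or `--multilang` if the prompt is not in English:
```
python clip_fft.py -t "pearl clouds over the sea" -i reference.jpg --sync 0.3 --size 1280-720
```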

* Set a more specific query like this:
```
python clip_fft.py -t "topic sentence" -t2 "style description" -t0 "avoid this" --size 1280-720
```
* Other options:
Text inputs understand syntax with weights, like `good prompt :1 | also good prompt :1 | bad prompt :-0.5`.
`--model M` selects one of the released CLIP visual models: `ViT-B/32` (default), `ViT-B/16`, `RN50`, `RN50x4`, `RN50x16`, `RN101`.
One can also set `--dualmod` to use `ViT-B/32` and `ViT-B/16` at once (preferable).
`--dwt` switches to the DWT (wavelets) generator instead of FFT. There are a few wavelet types, chosen by `--wave X`, e.g. `db2`, `db3`, `coif1`, `coif2`, etc.
`--align XX` controls composition (or sampling distribution, to be more precise): `uniform` is maybe the most adequate; `overscan` can make semi-seamless tileable textures.
`--steps N` sets the iteration count. 100-200 is enough for a start; 500-1000 elaborates the image more thoroughly.
`--samples N` sets the number of image cuts (samples) processed per step. With more samples you can use fewer iterations for a similar result (and vice versa). 200/200 is a good guess. NB: GPU memory consumption depends mostly on this count (not on resolution)!
`--aest X` enforces overall cuteness by employing an [aesthetic loss](https://github.com/LAION-AI/aesthetic-predictor). Try various values (it may be negative).
`--decay X` (compositional softness), `--colors X` (saturation) and `--contrast X` may be useful, especially for ResNet models (they tend to burn the colors).
`--sharp X` may help increase sharpness if the image becomes "myopic" after increasing `decay`. It affects the other color parameters, so better tweak them all together!
Current defaults are `--decay 1.5 --colors 1.8 --contrast 1.1 --sharp 0`.
`--transform X` applies augmentations, usually enhancing the result (but slower). There are a few choices; `fast` seems optimal.
`--optimizer` can be `adam`, `adamw`, `adam_custom` or `adamw_custom`. The custom options are noisier but stable; pure `adam` is softer, but may tend towards colored blurring.
`--invert` negates the whole criterion, if you fancy checking the "totally opposite".
`--save_pt myfile.pt` saves the FFT/DWT parameters, so the next query can be resumed with `--resume myfile.pt`. One can also start/resume directly from an image file.
`--opt_step N` saves every Nth frame (useful with a high iteration count; default is 1).
`--verbose` (on by default) enables some printouts and a realtime image preview.
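Putting a few of the flags above together (the prompt text and file names are illustrative, not prescribed by the project):
```
python clip_fft.py -t "ancient forest :1 | misty mountains :0.5 | watermark :-0.5" --size 1280-720 --dualmod --align uniform --steps 300 --samples 200 --save_pt forest.pt
python clip_fft.py -t "ancient forest in autumn" --resume forest.pt --steps 200
```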
* Some experimental tricks with less definite effects:
`--enforce X` adds more details by boosting similarity between two parallel samples. A good start is ~0.1.
`--expand X` boosts diversity by enforcing difference between previous/next samples. A good start is ~0.3.
`--noise X` adds some noise to the parameters, possibly making the composition less clogged (to a degree).
`--macro X` (from 0 to 1) shifts generation towards bigger forms and a less dispersed composition. It should not be too close to 1, since quality depends on the variety of samples.
`--prog` sets a progressive learning rate (from 0.1x to 2x of the value set by `--lrate`). It may boost the creation of macro forms in some cases (see more [here](https://github.com/eps696/aphantasia/issues/2)).
`--lrate` controls the learning rate. The range is quite wide (tested at least within 0.001 to 10).
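A sketch combining some of these experimental flags, using the starting values suggested above (the prompt is a placeholder):
```
python clip_fft.py -t "the text" --size 1280-720 --enforce 0.1 --expand 0.3 --macro 0.4 --prog --lrate 1
```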

## Text-to-video [continuous mode]

Here are two ways of making a video from text file(s), processing them line by line in one shot.

### Illustrip

A newer method, interpolating between topics as a constant flow with permanent pan/zoom motion and an optional 3D look.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eps696/aphantasia/blob/master/IllusTrip3D.ipynb)

* Make video from two text files, processing them line by line, rendering 100 frames per line:
```
python illustrip.py --in_txt mycontent.txt --in_txt2 mystyles.txt --size 1280-720 --steps 100
```
* Make video from two phrases, with total length 500 frames:
```
python illustrip.py --in_txt "my super content" --in_txt2 "my super style" --size 1280-720 --steps 500
```
Prefixes (`-pre`), postfixes (`-post`) and "stop words" (`--in_txt0`) may be loaded as phrases or text files as well.
All text inputs understand syntax with weights, like `good prompt :1 | also good prompt :1 | bad prompt :-0.5` (within one line).
One can also use image(s) as references with the `--in_img` argument. Explore other arguments for more explicit control.
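For example, a sketch combining a prefix, stop words and an image reference (file names and phrases are placeholders):
```
python illustrip.py --in_txt mycontent.txt -pre "oil painting of" --in_txt0 "blurry, watermark" --in_img myref.jpg --size 1280-720 --steps 100
```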
This method works best with direct RGB pixels optimization, but can also be used with FFT parameterization:
```
python illustrip.py ... --gen FFT --smooth --align uniform --colors 1.8 --contrast 1.1
```

To add a 3D look, download the [AdaBins model] to the main directory and add `--depth 0.01` to the command.
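For example, extending the first command above:
```
python illustrip.py --in_txt mycontent.txt --in_txt2 mystyles.txt --size 1280-720 --steps 100 --depth 0.01
```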

### Illustra

Generates a separate image for every text line (with sequences and training videos, as in the single-image mode above), then renders the final video from those images (mixing them in FFT space), with a duration of `length` seconds.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eps696/aphantasia/blob/master/Illustra.ipynb)

* Make video from a text file, processing it line by line:
```
python illustra.py -t mysong.txt --size 1280-720 --length 155
```
The `--keep X` parameter controls how closely the generation for the next line/image follows the previous one. 0 means it is randomly initiated; the higher the value, the more strictly it keeps the original composition. Safe values are 1~2 (much higher numbers may cause the imagery to get stuck).
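For example, to keep the compositions closely related from line to line:
```
python illustra.py -t mysong.txt --size 1280-720 --length 155 --keep 1.5
```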

* Make video from a directory with saved *.pt snapshots (just interpolate them):
```
python interpol.py -i mydir --length 155
```

## Other generators

* VQGAN from [Taming Transformers](https://github.com/CompVis/taming-transformers)
One of the best methods for colors/tones/details (especially with the newer Gumbel-F8 model); resolution is quite limited though (~800x600 max on Colab).
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eps696/aphantasia/blob/master/CLIP_VQGAN.ipynb)
* Continuous mode with VQGAN (analog of Illustra)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1H7vHyS6mRU7gW59lfi2Qa6M-RXKIJ8Si)

* CPPN + [export to HLSL shaders](https://github.com/wxs/cppn-to-glsl)
One of the very first methods, with exports for TouchDesigner, vvvv, Shadertoy, etc.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Kbbbwoet3igHPJ4KpNh8z3V-RxtstAcz)
```
python cppn.py -v -t "the text" --aest 0.5
```

* SIREN + [Fourier feature modulation](https://github.com/tancik/fourier-feature-networks)
Another early method, not so interesting on its own.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1L14q4To5rMK8q2E6whOibQBnPnVbRJ_7)

## Credits

Based on the [CLIP] model by OpenAI ([paper]).
FFT encoding is taken from the [Lucent] library; 3D depth processing was made by [deKxi].

Thanks to [Ryan Murdock], [Jonathan Fly], [Hannu Toyryla], [@eduwatch2], [torridgristle] for ideas.

[artwork]:
[Aphantasia]:
[CLIP]: https://github.com/openai/CLIP
[SBERT]:
[Lucent]:
[AdaBins]:
[AdaBins model]:
[LPIPS]:
[Taming Transformers]: https://github.com/CompVis/taming-transformers
[Ryan Murdock]:
[Jonathan Fly]:
[Hannu Toyryla]:
[@eduwatch2]:
[torridgristle]:
[deKxi]:
[paper]: