{"id":13422698,"url":"https://github.com/aleju/cat-generator","last_synced_at":"2025-07-22T20:33:04.758Z","repository":{"id":66066013,"uuid":"46431803","full_name":"aleju/cat-generator","owner":"aleju","description":"Generate cat images with neural networks","archived":false,"fork":false,"pushed_at":"2017-11-06T08:53:46.000Z","size":1174,"stargazers_count":374,"open_issues_count":1,"forks_count":65,"subscribers_count":13,"default_branch":"master","last_synced_at":"2025-05-20T00:03:08.538Z","etag":null,"topics":["cat","cats","dcgan","deep-learning","gan","machine-learning","torch"],"latest_commit_sha":null,"homepage":"","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aleju.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2015-11-18T16:29:37.000Z","updated_at":"2025-03-23T10:31:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"8a6d2538-7e9c-40be-97d2-af8308d05b36","html_url":"https://github.com/aleju/cat-generator","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aleju/cat-generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fcat-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fcat-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fcat-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fcat-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aleju","download_url":"https:/
/codeload.github.com/aleju/cat-generator/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fcat-generator/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266567583,"owners_count":23949382,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-22T02:00:09.085Z","response_time":66,"last_error":null,"robots_txt_status":null,"robots_txt_updated_at":null,"robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cat","cats","dcgan","deep-learning","gan","machine-learning","torch"],"created_at":"2024-07-30T23:00:50.630Z","updated_at":"2025-07-22T20:33:04.723Z","avatar_url":"https://github.com/aleju.png","language":"Lua","funding_links":[],"categories":["Cool Projects"],"sub_categories":[],"readme":"# About\n\nThis script generates new images of cats using the technique of generative adversarial networks (GAN), as described in [the paper](http://arxiv.org/abs/1406.2661) by [Goodfellow](https://github.com/goodfeli) et al.\nThe images are enhanced with the [laplacian pyramid technique](http://arxiv.org/abs/1506.05751) from Denton and [Soumith](https://github.com/soumith) Chintala et 
al., implemented as a single G (generator) as described in the [blog post](http://torch.ch/blog/2015/11/13/gan.html) by Anders Boesen Lindbo Larsen and Søren Kaae Sønderby.\nMost of the code is based on facebook's [eyescream project](https://github.com/facebook/eyescream).\nThe script also uses code from other repositories for [spatial transformers](https://github.com/Moodstocks/gtsrb.torch/blob/master/networks.lua), [weight initialization](https://github.com/e-lab/torch-toolbox/blob/master/Weight-init/weight-init.lua) and [LeakyReLUs](https://github.com/nagadomi/waifu2x/blob/master/lib/LeakyReLU.lua).\n\n\n# Images\n\nThe following images were generated by networks trained with:\n* Model G32up, color: `th train.lua --D_iterations=2`. This model is currently *not* the default for G, i.e. must be manually activated in `models.lua`.\n* Model G32up, grayscale: `th train.lua --colorSpace=\"y\"`. See above.\n* Model G32up-c, color: `th train.lua`. This model is currently the default model/architecture for G.\n\nThe difference between model G32up and G32up-c is simply that G32up-c is about one layer deeper and has more convolution kernels.\n\n## Model G32up-c (currently default)\n\n![256 random color images](images/random_color_256_g32upc.jpg?raw=true \"256 random color images\")\n\n*256 randomly generated 32x32 cat images. (Model G32up-c)*\n\n![64 color images rated as good](images/best_color_g32upc.jpg?raw=true \"64 color images rated as good\")\n\n*64 generated 32x32 cat images, rated by D as the best images among 1024 randomly generated ones. (Model G32up-c)*\n\n![Nearest neighbours of generated 32x32 images](images/nearest_neighbours_g32upc.jpg?raw=true \"Nearest neighbours of generated 32x32 images\")\n\n*16 generated images (each pair left) and their nearest neighbours from the training set (each pair right). Distance was measured by 2-Norm (`torch.dist()`). 
The 16 selected images were the \"best\" ones among 1024 images according to the rating by D, hence some similarity with the training set is expected. (Model G32up-c)*\n\n[![Training progress video](images/youtube-embedded-image-g32upc.jpg?raw=true)](https://youtu.be/JRBscukr7ew)\n\n*Training progress of the network while learning to generate color images. Epoch 1 to 750 as a [youtube video](https://youtu.be/JRBscukr7ew). (Model G32up-c)*\n\n## Model G32up\n\n![256 random color images](images/random_color_256.jpg?raw=true \"256 random color images\")\n\n*256 randomly generated 32x32 cat images. (Model G32up)*\n\n![64 color images rated as good](images/best_color.jpg?raw=true \"64 color images rated as good\")\n\n*64 generated 32x32 cat images, rated by D as the best images among 1024 randomly generated ones. (Model G32up)*\n\n![1024 generated grayscale images](images/random_grayscale_1024.jpg?raw=true \"1024 random grayscale images\")\n\n*1024 randomly generated 32x32 grayscale cat images. (Model G32up)*\n\n![64 grayscale images rated as good](images/best_grayscale.jpg?raw=true \"64 grayscale images rated as good\")\n\n*64 generated 32x32 grayscale cat images, rated by D as the best images among 1024 randomly generated ones. (Model G32up)*\n\n![Nearest neighbours of generated 32x32 images](images/nearest_neighbours.jpg?raw=true \"Nearest neighbours of generated 32x32 images\")\n\n*16 generated images (each pair left) and their nearest neighbours from the training set (each pair right). Distance was measured by 2-Norm (`torch.dist()`). The 16 selected images were the \"best\" ones among 1024 images according to the rating by D, hence some similarity with the training set is expected. (Model G32up)*\n\n[![Training progress video](images/youtube-embedded-image.jpg?raw=true)](https://youtu.be/2lf2Caz8CDM)\n\n*Training progress of the network while learning to generate color images. Epoch 1 to 690 as a [youtube video](https://youtu.be/2lf2Caz8CDM). 
(Model G32up)*\n\n\n# Background Knowledge\n\nThe basic principle of GANs is to train two networks in a kind of forger-police relationship.\nThe forger is called G (generator) and the police D (discriminator).\nIt is D's job to take a look at an image and estimate whether it is a fake or a real image (where \"real\" is synonymous with \"from the training set\").\nNaturally it's G's job to generate images that trick D into believing that they are from the training set.\nWith a large enough training set and some regularization strategies, D cannot just memorize the training set.\nAs a result, D must learn the general rules that govern the look of images from the training set (i.e. a generalizing function).\nSimilarly, G must learn how to \"paint\" new images that look like the ones from the training set, otherwise it would not be able to trick D.\n\nThe previously mentioned laplacian pyramid technique for GANs is pretty straightforward:\nInstead of training G and D on full-sized images (e.g. 64x64 pixels) you train them on smaller ones (e.g. 8x8 pixels).\nAfterwards you increase the size of the generated images in multiple steps to the final size, e.g. from 8x8 to 16x16 to 32x32 to 64x64.\nFor each of these steps you train another pair of G and D\\*, but in the case of these upscaling steps they are trained to learn good refinements of the upscaled (and hence blurry) images.\nThat means that D gets fed refined/sharpened images and must tell whether these were real images (i.e. blurry images from the training set with optimal refinements) or fake images from G (i.e. 
blurry images from the training set, but the refinement was done by G).\nAgain, G must learn to generate good refinements and D must learn what good refined images look like.\nThe image below (taken from the paper) shows the process (they start with the full-sized images, the one on the far right could be generated by a GAN).\nNote that this training methodology is similar to how one would naturally paint images: You start with a rough sketch (low resolution image) and then progressively add more and more details (increases in resolution).\n\n\\*) This project actually uses a technique that merges the laplacian pyramid into one pair of G and D. The basic principle however stays the same.\n\n![Laplacian pyramid](images/laplacian_pyramid.png?raw=true \"Laplacian pyramid\")\n\n\n# Requirements\n\n* [Torch](http://torch.ch/) with the following libraries (most of them are probably already installed by default):\n  * `nn` (`luarocks install nn`)\n  * `pl` (`luarocks install pl`)\n  * `paths` (`luarocks install paths`)\n  * `image` (`luarocks install image`)\n  * `optim` (`luarocks install optim`)\n  * `cutorch` (`luarocks install cutorch`)\n  * `cunn` (`luarocks install cunn`)\n  * `cudnn` (`luarocks install cudnn`)\n  * `dpnn` (`luarocks install dpnn`)\n  * `stn` ([see here](https://github.com/qassemoquab/stnbhwd))\n  * [display](https://github.com/szym/display)\n* Python 2.7 (only tested with that version)\n  * scipy\n  * numpy\n  * scikit-image\n* [10k cats dataset](https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html)\n* CUDA capable GPU (4GB memory or more) with cudnn3\n\n\n# Usage\n\nPreparation steps:\n* Install all requirements as listed above.\n* Download and extract the [10k cats dataset](https://web.archive.org/web/20150520175645/http://137.189.35.203/WebUI/CatDatabase/catData.html) into a directory, e.g. `/foo/bar`. 
That folder should then contain the subfolders `CAT_00` to `CAT_06`.\n* Clone the repository.\n* Switch to the repository's subdirectory `dataset` via `cd dataset` and convert your downloaded cat images into a normalized and augmented set of ~100k cat faces with `python generate_dataset.py --path=\"/foo/bar\"`. This may take a good two hours or so to run through, as it performs lots of augmentations.\n\nTraining and Sampling:\n* Start display with `th -ldisplay.start`\n* Open `http://localhost:8000/` in your browser (plotting interface by display).\n* Train V for a few epochs with `th train_v.lua`. (Wait for a `saving network to \u003cpath\u003e` message, then stop manually.)\n* Pretrain G for a few epochs with `th pretrain_g.lua`. (Wait for a `saving network to \u003cpath\u003e` message, then stop manually.) (This step can be skipped.)\n* Train a network with `th train.lua` for 200 epochs or more. You might have to add `--D_iterations=2` to get good results.\n* Sample images (random, best, worst images) to directory `samples/` with `th sample.lua`. Add `--neighbours` if you also want to sample nearest neighbours (from the training set) of generated images (takes a long time). Add e.g. `--run=10` to sample 10 groups of images.\n\nAdd `--colorSpace=\"y\"` to each script to work with grayscale images.\n\nNote: During training, images are saved in `logs/images`, `logs/images_good` and `logs/images_bad`. They will not get deleted automatically and can accumulate over time.\n\n\n# V\n\nV (the Validator) is intended to be a half-decent replacement for validation scores, which you don't have in GANs. V's architecture is, like D's, a convolutional neural network.\nJust like D, V creates fake/real judgements for images, i.e. it rates how fake images look. V gets fed images generated by G and rates them. The mean of that rating can be used\nas the mentioned validation score replacement.\nV is trained once before the generator network. 
During its training, V sees real images from the dataset as well as synthetically generated fake images. The methods to generate the synthetic\nimages are roughly:\n* Random mixing of two images.\n* Random warping of an image (i.e. move parts of the image around, causing distortions).\n* Random stamping of an image (i.e. replace parts of the image by parts from somewhere else in the image).\n* Randomly throw random pixel values together (with some gaussian blurring technique, so that it's not just gaussian noise).\n\nThese techniques are then sometimes combined with each other, e.g. one image is modified by warping, another by stamping and then both are mixed into one final synthetic image.\n\nV seems to be capable of often spotting really bad images. It is however rather bad at distinguishing the quality of good images.\nSo long as the image looks roughly like a cat, V will tend to produce a good rating. The images start to look good after epoch 50 or so, which is when V's rating isn't helpful anymore.\n\n\n# Architectures\n\nAll networks are optimized for 32x32 images. They should work with 16x16 images too. 
Anything else will likely result in errors.\nMost of the activations were PReLUs, because they perform better than ReLUs in my experience.\nNetworks with LeakyReLUs seemed to blow up more frequently, so I didn't use them very much.\n\n## G\n\nThe architecture of G (version G32up-c) is mostly copied from the [blog post](http://torch.ch/blog/2015/11/13/gan.html) by Anders Boesen Lindbo Larsen and Søren Kaae Sønderby.\nIt is basically a full laplacian pyramid in one network.\nThe network starts with a small linear layer, which roughly generates 4x4 images.\nThat is followed by upsampling layers, which increase the image size to 8x8 then 16x16 and then 32x32 pixels.\n\n```lua\nlocal model = nn.Sequential()\n-- 4x4\nmodel:add(nn.Linear(noiseDim, 512*4*4))\nmodel:add(nn.PReLU(nil, nil, true))\nmodel:add(nn.View(512, 4, 4))\n\n-- 4x4 -\u003e 8x8\nmodel:add(nn.SpatialUpSamplingNearest(2))\nmodel:add(cudnn.SpatialConvolution(512, 512, 3, 3, 1, 1, (3-1)/2, (3-1)/2))\nmodel:add(nn.SpatialBatchNormalization(512))\nmodel:add(nn.PReLU(nil, nil, true))\n\n-- 8x8 -\u003e 16x16\nmodel:add(nn.SpatialUpSamplingNearest(2))\nmodel:add(cudnn.SpatialConvolution(512, 256, 3, 3, 1, 1, (3-1)/2, (3-1)/2))\nmodel:add(nn.SpatialBatchNormalization(256))\nmodel:add(nn.PReLU(nil, nil, true))\n\n-- 16x16 -\u003e 32x32\nmodel:add(nn.SpatialUpSamplingNearest(2))\nmodel:add(cudnn.SpatialConvolution(256, 128, 5, 5, 1, 1, (5-1)/2, (5-1)/2))\nmodel:add(nn.SpatialBatchNormalization(128))\nmodel:add(nn.PReLU(nil, nil, true))\n\nmodel:add(cudnn.SpatialConvolution(128, dimensions[1], 3, 3, 1, 1, (3-1)/2, (3-1)/2))\nmodel:add(nn.Sigmoid())\n```\nwhere `dimensions[1]` is 3 for color and 1 for grayscale mode. 
`noiseDim` is 100, i.e. the input to G is a noise vector of size 100 with values sampled from a uniform distribution between -1 and +1.\n\nA different version of G is G32up, which is shown in some of the images at the top.\nIt is mostly identical to G32up-c, just a bit smaller:\n\n```lua\nlocal model = nn.Sequential()\nmodel:add(nn.Linear(noiseDim, 128*8*8))\nmodel:add(nn.View(128, 8, 8))\nmodel:add(nn.PReLU(nil, nil, true))\n\nmodel:add(nn.SpatialUpSamplingNearest(2))\nmodel:add(cudnn.SpatialConvolution(128, 256, 5, 5, 1, 1, (5-1)/2, (5-1)/2))\nmodel:add(nn.SpatialBatchNormalization(256))\nmodel:add(nn.PReLU(nil, nil, true))\n\nmodel:add(nn.SpatialUpSamplingNearest(2))\nmodel:add(cudnn.SpatialConvolution(256, 128, 5, 5, 1, 1, (5-1)/2, (5-1)/2))\nmodel:add(nn.SpatialBatchNormalization(128))\nmodel:add(nn.PReLU(nil, nil, true))\n\nmodel:add(cudnn.SpatialConvolution(128, dimensions[1], 3, 3, 1, 1, (3-1)/2, (3-1)/2))\nmodel:add(nn.Sigmoid())\n```\n\n## D\n\nD is a convolutional network with multiple branches.\nIt uses a spatial transformer at the start to remove rotations.\nThree of the four branches also have spatial transformers (for rotation, translation and scaling).\nAs such they can learn to focus on specific areas of the image. (I don't know if they really did learn that.)\nThe fourth branch is intended to analyze the whole image.\n\nI reused this architecture from a previous project where it seemed to improve performance slightly.\nI did not test a \"normal\" convnet architecture for this project, though such a structure performed well when I used it to generate skies, so it might work here too.\n\n![Architecture of D](images/D.png?raw=true \"Architecture of D\")\n\nAll convolutions were size-preserving. All localization networks of the spatial transformers used the same architecture.\nThe last hidden layer ended up a bit small to counteract the large concat. 
Might be worthwhile to test an architecture with a pooling layer in front of it and then 1024 neurons.\n\n## V\n\nThe validator is a standard convolutional network.\n```lua\nlocal model = nn.Sequential()\nlocal activation = nn.LeakyReLU\n\nmodel:add(nn.SpatialConvolution(dimensions[1], 128, 3, 3, 1, 1, (3-1)/2))\nmodel:add(activation())\nmodel:add(nn.SpatialMaxPooling(2, 2))\nmodel:add(nn.SpatialConvolution(128, 128, 3, 3, 1, 1, (3-1)/2))\nmodel:add(nn.SpatialBatchNormalization(128))\nmodel:add(activation())\nmodel:add(nn.SpatialMaxPooling(2, 2))\nmodel:add(nn.Dropout())\n\nmodel:add(nn.SpatialConvolution(128, 256, 3, 3, 1, 1, (3-1)/2))\nmodel:add(activation())\nmodel:add(nn.SpatialConvolution(256, 256, 3, 3, 1, 1, (3-1)/2))\nmodel:add(nn.SpatialBatchNormalization(256))\nmodel:add(activation())\nmodel:add(nn.SpatialMaxPooling(2, 2))\nmodel:add(nn.SpatialDropout())\nlocal imgSize = 0.25 * 0.25 * 0.25 * dimensions[2] * dimensions[3]\nmodel:add(nn.View(256 * imgSize))\n\nmodel:add(nn.Linear(256 * imgSize, 1024))\nmodel:add(nn.BatchNormalization(1024))\nmodel:add(activation())\nmodel:add(nn.Dropout())\n\nmodel:add(nn.Linear(1024, 1024))\nmodel:add(nn.BatchNormalization(1024))\nmodel:add(activation())\nmodel:add(nn.Dropout())\n\nmodel:add(nn.Linear(1024, 2))\nmodel:add(nn.SoftMax())\n```\nwhere `dimensions[1]` is 3 (color) or 1 (grayscale). `dimensions[2]` and `dimensions[3]` are both 32.\n\n(A 1-neuron sigmoid output would have probably been more logical.)\n\n\n# Dataset preprocessing\n\nAs a preprocessing step, all faces must be extracted from the 10k cats dataset.\nThe dataset contains facial keypoints for each image (ears, eyes, nose), so extracting the faces isn't too hard.\nEach of the faces gets rotated so that the eyeline is parallel to the x axis (i.e. 
rotations are removed).\nThat was necessary as many cat images tend to be heavily rotated, making the learning task significantly harder (though that might work now with the addition of Spatial Transformers in D).\nAfter that normalization step, the images are augmented by introducing (now small) rotations, translations, scalings, brightness changes, by flipping them horizontally and by adding minor gaussian noise.\nThe data set size is increased by that to roughly 100k images (however these images are often only marginally different, so it's not 100k images worth of information).\n\n\n# Other\n\n* Adam was used as the optimizer.\n* Batch size was 32, i.e. D would get 16 fake and 16 real images, while G would get 32 attempts to mess with D.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faleju%2Fcat-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faleju%2Fcat-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faleju%2Fcat-generator/lists"}