https://github.com/addo561/stable-diffusion
A from-scratch PyTorch implementation of Stable Diffusion focused on understanding the mathematics, architecture, and engineering behind latent diffusion models. Built by manually implementing the UNet, attention mechanisms, schedulers, CFG, and latent denoising pipeline while supporting pretrained weight injection.
- Host: GitHub
- URL: https://github.com/addo561/stable-diffusion
- Owner: addo561
- Created: 2026-05-15T16:53:56.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-22T19:41:49.000Z (24 days ago)
- Last Synced: 2026-05-22T22:28:01.240Z (24 days ago)
- Topics: clip, cross-attention, sampling, unet, vae
- Language: Jupyter Notebook
- Homepage:
- Size: 2.1 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🎨 Stable Diffusion Pipeline
A hands-on implementation of Stable Diffusion v1.4 inference with custom DDIM sampling,
classifier-free guidance, and inpainting — built piece by piece to understand what's actually happening under the hood.
---
## 🖼️ Inpainting Result
> Full walkthrough in [`in-painting.ipynb`](./in-painting.ipynb)

*Mask-based latent blending: the original image is preserved outside the mask, and new content is diffused inside it.*
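
The blending described above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: NumPy arrays stand in for PyTorch tensors, `blend_latents` is a hypothetical helper name, and the mask convention (1 = regenerate, 0 = keep) is an assumption. In a full sampling loop, the original latents would typically be re-noised to the current timestep before each blend so that both sides sit at the same noise level.

```python
import numpy as np

def blend_latents(generated, original, mask):
    """Use generated latents where mask == 1 and original latents where mask == 0."""
    mask = np.asarray(mask, dtype=generated.dtype)
    return mask * generated + (1.0 - mask) * original

# Toy latents with the SD layout (batch, 4 channels, height, width).
rng = np.random.default_rng(0)
original = rng.standard_normal((1, 4, 8, 8))
generated = rng.standard_normal((1, 4, 8, 8))

# Single-channel mask, broadcast over the 4 latent channels.
mask = np.zeros((1, 1, 8, 8))
mask[..., 2:6, 2:6] = 1.0  # regenerate only the central 4x4 patch

blended = blend_latents(generated, original, mask)
```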
---
## 🚀 What This Project Does
Generates images from text prompts with a custom-built sampling pipeline, and supports inpainting on top of it.
---
## 🛠️ What I Built From Scratch
- **DDIM Sampler** — complete noise scheduling and denoising loop
- **Classifier-Free Guidance** — custom conditional/unconditional steering
- **VAE Interface** — latent encoding/decoding with proper scaling
- **CLIP Text Pipeline** — tokenization and embedding extraction
- **Inpainting Logic** — mask-based latent blending (`in-painting.ipynb`)
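
Two of the items above reduce to very small formulas, sketched here under stated assumptions. NumPy arrays stand in for PyTorch tensors, and the function names are hypothetical rather than taken from the repository. The constant 0.18215 is the latent scaling factor used by the Stable Diffusion v1 VAE.

```python
import numpy as np

VAE_SCALING_FACTOR = 0.18215  # latent scaling used by the Stable Diffusion v1 VAE

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction from the unconditional
    output toward (and past) the prompt-conditioned output."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def to_model_latents(vae_latents):
    """Scale raw VAE encoder output into the range the UNet expects."""
    return vae_latents * VAE_SCALING_FACTOR

def to_vae_latents(model_latents):
    """Undo the scaling before passing latents to the VAE decoder."""
    return model_latents / VAE_SCALING_FACTOR
```

With `guidance_scale = 1.0` the combination returns the conditional prediction unchanged, and with `0.0` it returns the unconditional one. Larger values extrapolate beyond the conditional prediction.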
## 🤝 What's Integrated
- **UNet Backbone** — `UNet2DConditionModel` from 🤗 Diffusers (pre-trained weights)
---
## 💡 Why This Approach
Initial work focused on injecting pretrained weights into a fully custom UNet architecture.
684 of 686 layers loaded successfully, but architectural mismatches (GEGLU vs. GELU activations,
upsampling order) prevented coherent outputs. Rather than paper over the issue,
the pragmatic call was to use the proven Diffusers UNet as a stable backbone while keeping
every other component custom, which preserves output quality without discarding what was learned.
> See [`stable-diffusion.ipynb`](./stable-diffusion.ipynb) for that experiment.
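
To illustrate why a GEGLU vs. GELU mismatch matters, the sketch below contrasts the two feed-forward variants. It is an assumption-laden illustration, not the notebook's code: NumPy replaces PyTorch, the function names are hypothetical, and GELU uses the tanh approximation. A GEGLU projection produces twice as many features and gates one half with the other, so its weights have a different shape and compute a different function than a plain GELU layer.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu_ff(x, w, b):
    # Plain feed-forward projection: dim -> inner, then GELU.
    return gelu(x @ w + b)

def geglu_ff(x, w, b):
    # Gated variant: dim -> 2 * inner, split into value and gate halves.
    value, gate = np.split(x @ w + b, 2, axis=-1)
    return value * gelu(gate)

rng = np.random.default_rng(0)
dim, inner = 8, 16
x = rng.standard_normal((2, dim))

w_gelu, b_gelu = rng.standard_normal((dim, inner)), np.zeros(inner)
w_geglu, b_geglu = rng.standard_normal((dim, 2 * inner)), np.zeros(2 * inner)

out_gelu = gelu_ff(x, w_gelu, b_gelu)      # shape (2, 16)
out_geglu = geglu_ff(x, w_geglu, b_geglu)  # shape (2, 16), but from (8, 32) weights
```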
---
## 🏗️ Architecture
### Inference Loop

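As a rough sketch of the loop, the code below runs deterministic DDIM sampling (eta = 0) over a reduced set of timesteps. Several things here are assumptions rather than the repository's code: NumPy stands in for PyTorch, `predict_noise` is a hypothetical callable representing the UNet call plus classifier-free guidance, and the evenly spaced timestep selection is one common choice among several. The beta schedule is the "scaled linear" schedule used by Stable Diffusion v1 checkpoints. The returned latents would still need to be rescaled and decoded by the VAE.

```python
import numpy as np

def make_alphas_cumprod(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    # "Scaled linear" schedule: linear in sqrt(beta), then squared.
    betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_train_steps) ** 2
    return np.cumprod(1.0 - betas)

def ddim_sample(predict_noise, latents, alphas_cumprod, num_steps=35):
    """Deterministic DDIM loop. `predict_noise(latents, t)` is a hypothetical
    stand-in for the guided UNet noise prediction."""
    stride = len(alphas_cumprod) // num_steps
    timesteps = np.arange(num_steps)[::-1] * stride  # assumed spacing, high -> low
    for i, t in enumerate(timesteps):
        a_t = alphas_cumprod[t]
        # After the final step no noise remains, so alpha_bar is taken as 1.
        a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = predict_noise(latents, t)
        # Clean latents implied by the current noise prediction.
        x0_pred = (latents - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # Re-noise that estimate to the lower noise level of the next timestep.
        latents = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return latents
```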
---
### Custom UNet vs. Diffusers UNet
Prompt: *"an astronaut riding a horse"* — 35 steps each
| My Custom UNet (weight injection attempt) | Diffusers UNet (final pipeline) |
|:-----------------------------------------:|:-------------------------------:|
| *Garbled / incoherent output* | *Coherent, prompt-following output* |
---
## 📦 Features
- ✅ Text-to-image generation
- ✅ Configurable steps and guidance scale
- ✅ Custom DDIM sampling loop
- ✅ Inpainting with custom masks (`in-painting.ipynb`)
---
## 🔧 Usage
```bash
python inference.py -c "your prompt" -s 50 -g 7.5
```
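
For readers wondering how flags like these might be wired up, here is a hypothetical `argparse` sketch. Only the short flags `-c`, `-s`, and `-g` come from the command above. The destination names, defaults, and help strings are assumptions, and the actual `inference.py` may parse its arguments differently.

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the short flags shown in the usage line.
    parser = argparse.ArgumentParser(description="Text-to-image sampling (illustrative)")
    parser.add_argument("-c", dest="prompt", required=True, help="text prompt")
    parser.add_argument("-s", dest="steps", type=int, default=50,
                        help="number of sampling steps (default assumed)")
    parser.add_argument("-g", dest="guidance_scale", type=float, default=7.5,
                        help="classifier-free guidance scale (default assumed)")
    return parser

# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-c", "an astronaut riding a horse", "-s", "35"])
```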