https://github.com/addo561/stable-diffusion

A from-scratch PyTorch implementation of Stable Diffusion focused on understanding the mathematics, architecture, and engineering behind latent diffusion models. Built by manually implementing the UNet, attention mechanisms, schedulers, CFG, and latent denoising pipeline while supporting pretrained weight injection.

Host: GitHub
URL: https://github.com/addo561/stable-diffusion
Owner: addo561
Created: 2026-05-15T16:53:56.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-05-22T19:41:49.000Z (24 days ago)
Last Synced: 2026-05-22T22:28:01.240Z (24 days ago)
Topics: clip, cross-attention, sampling, unet, vae
Language: Jupyter Notebook
Homepage:
Size: 2.1 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# 🎨 Stable Diffusion Pipeline

A hands-on implementation of Stable Diffusion v1.4 inference with custom DDIM sampling,
classifier-free guidance, and inpainting — built piece by piece to understand what's actually happening under the hood.

---

## 🖼️ Inpainting Result

> Full walkthrough in [`in-painting.ipynb`](./in-painting.ipynb)

[Image: inpainting result, `__results___9_12`]

*Mask-based latent blending: the original image is preserved outside the mask, and new content is diffused inside it.*
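
The blending described in the caption can be sketched in a few lines. This is a minimal illustration rather than the notebook's code: the function name `blend_latents`, the mask convention (1 = regenerate, 0 = keep), and the toy shapes are all assumptions, and NumPy stands in for PyTorch tensors to keep the example dependency-light. In a typical latent inpainting loop the original latents would also be re-noised to the current timestep before blending; that step is omitted here.

```python
import numpy as np

def blend_latents(generated, original, mask):
    """Keep `original` where mask == 0 and `generated` where mask == 1.

    Assumed convention (not confirmed by the README): the mask is 1 inside
    the region to repaint and broadcasts over the channel dimension.
    """
    return mask * generated + (1.0 - mask) * original

# Toy latents with the 4-channel layout used by SD v1 (batch, channels, h, w).
rng = np.random.default_rng(0)
original = rng.standard_normal((1, 4, 8, 8))
generated = rng.standard_normal((1, 4, 8, 8))

# Repaint only a 4x4 square in the middle of the latent grid.
mask = np.zeros((1, 1, 8, 8))
mask[..., 2:6, 2:6] = 1.0

blended = blend_latents(generated, original, mask)
```

Applied once per denoising step, this keeps the unmasked region tied to the source image while the masked region follows the diffusion trajectory.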

---

## 🚀 What This Project Does

Generates images from text prompts using a custom-built sampling pipeline, with working inpainting on top.

---

## 🛠️ What I Built From Scratch

- **DDIM Sampler** — complete noise scheduling and denoising loop
- **Classifier-Free Guidance** — custom conditional/unconditional steering
- **VAE Interface** — latent encoding/decoding with proper scaling
- **CLIP Text Pipeline** — tokenization and embedding extraction
- **Inpainting Logic** — mask-based latent blending (`in-painting.ipynb`)
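
For the first two items in the list above, the core arithmetic is short enough to sketch. The block below is an illustration under assumptions, not the repo's code: the function names `apply_cfg` and `ddim_step` are hypothetical, the DDIM update is the deterministic variant (eta = 0), and NumPy arrays stand in for PyTorch tensors.

```python
import numpy as np

def apply_cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional estimate and toward the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0).

    alpha_bar_* are cumulative products of (1 - beta) at the current
    and the previous (less noisy) timestep.
    """
    # Estimate the clean latent implied by the predicted noise.
    pred_x0 = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the previous timestep's noise level.
    return np.sqrt(alpha_bar_prev) * pred_x0 + np.sqrt(1.0 - alpha_bar_prev) * eps

# Sanity check on toy data: noise a known x0, then step with the true noise.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 4, 8, 8))
noise = rng.standard_normal((1, 4, 8, 8))
alpha_bar_t = 0.5
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

# With the exact noise and alpha_bar_prev = 1, the step lands back on x0
# (up to floating-point rounding).
x_prev = ddim_step(x_t, noise, alpha_bar_t, alpha_bar_prev=1.0)
```

In a real loop, `eps` would be the output of `apply_cfg` on two UNet predictions (empty prompt and actual prompt) rather than the true noise.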

## 🤝 What's Integrated

- **UNet Backbone** — `UNet2DConditionModel` from 🤗 Diffusers (pre-trained weights)

---

## 💡 Why This Approach

Initial work focused on injecting the pretrained weights into a fully custom UNet architecture.
684 of 686 layers loaded successfully, but architectural mismatches (GEGLU vs. GELU activations,
upsampling order) prevented coherent outputs. Rather than paper over the issue,
I switched to the proven Diffusers UNet as a stable backbone and kept
every other component custom, which preserves output quality without discarding what was learned.

> See [`stable-diffusion.ipynb`](./stable-diffusion.ipynb) for that experiment.
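
To make the GEGLU vs. GELU mismatch concrete, here is a small sketch of the two feed-forward variants. It is not taken from the repo: the function names and toy dimensions are assumptions, and the tanh approximation of GELU is used only to avoid extra dependencies. The point is that GEGLU projects to twice the hidden width and gates one half with the other, so the weight shapes and the computation both differ from a plain GELU layer.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU (chosen here for a dependency-free sketch)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gelu_ff(x, w, b):
    """Plain feed-forward activation: one projection, then GELU."""
    return gelu(x @ w + b)

def geglu_ff(x, w, b):
    """GEGLU: project to 2 * hidden, split, and gate one half with GELU of the other."""
    value, gate = np.split(x @ w + b, 2, axis=-1)
    return value * gelu(gate)

dim, hidden = 8, 32  # toy sizes, not the real UNet widths
rng = np.random.default_rng(0)
x = rng.standard_normal((2, dim))

w_gelu = rng.standard_normal((dim, hidden))
b_gelu = np.zeros(hidden)
w_geglu = rng.standard_normal((dim, 2 * hidden))  # twice as wide
b_geglu = np.zeros(2 * hidden)

out_gelu = gelu_ff(x, w_gelu, b_gelu)
out_geglu = geglu_ff(x, w_geglu, b_geglu)
```

Both outputs have the same shape, which is why swapping one block for the other yields a network that runs without error while producing different activations.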

---

## 🏗️ Architecture

### Inference Loop

[Diagram: inference loop, `sd_inference_loop`]
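
The skeleton of such a loop can be sketched as follows. The README does not state the schedule parameters, so the values here are assumptions based on the configuration commonly used with Stable Diffusion v1 (a "scaled linear" beta schedule from 0.00085 to 0.012 over 1000 training steps); the function names are hypothetical, and the UNet call is left as a comment.

```python
import numpy as np

def make_alphas_cumprod(num_train_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta) for a scaled-linear beta schedule.

    The default values are assumed, not read from the repo.
    """
    betas = np.linspace(beta_start**0.5, beta_end**0.5, num_train_steps) ** 2
    return np.cumprod(1.0 - betas)

def ddim_timesteps(num_inference_steps, num_train_steps=1000):
    """Evenly spaced training timesteps, in descending (noisy to clean) order."""
    stride = num_train_steps // num_inference_steps
    return np.arange(0, num_inference_steps)[::-1] * stride

alphas_cumprod = make_alphas_cumprod()
timesteps = ddim_timesteps(35)

pairs = []  # (alpha_bar_t, alpha_bar_prev) for each denoising step
for i, t in enumerate(timesteps):
    alpha_bar_t = alphas_cumprod[t]
    # After the last step there is no earlier timestep, so treat it as noise-free.
    if i + 1 < len(timesteps):
        alpha_bar_prev = alphas_cumprod[timesteps[i + 1]]
    else:
        alpha_bar_prev = 1.0
    # Here the real loop would: run the UNet on the latents for the empty and
    # the actual prompt, combine the two predictions with the guidance scale,
    # and apply the DDIM update using alpha_bar_t and alpha_bar_prev.
    pairs.append((alpha_bar_t, alpha_bar_prev))

# After the loop, SD v1 pipelines divide the latents by the VAE scaling
# factor (0.18215) before decoding them to pixels.
```

Each step moves to a strictly less noisy level, which is what the `alpha_bar_prev > alpha_bar_t` ordering expresses.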

---

### Custom UNet vs. Diffusers UNet

Prompt: *"an astronaut riding a horse"* — 35 steps each

---

## 📦 Features

- ✅ Text-to-image generation
- ✅ Configurable steps and guidance scale
- ✅ Custom DDIM sampling loop
- ✅ Inpainting with custom masks (`in-painting.ipynb`)

---

## 🔧 Usage

```bash
python inference.py -c "your prompt" -s 50 -g 7.5
```
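
For reference, a command-line interface matching the flags above could be parsed roughly like this. Only the short flags `-c`, `-s`, and `-g` come from the usage line; the long option names, the defaults, and the reading of `-c` as the prompt, `-s` as the step count, and `-g` as the guidance scale are assumptions, and the actual `inference.py` may differ.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags shown in the usage line."""
    parser = argparse.ArgumentParser(description="Text-to-image inference (sketch)")
    # Long names and defaults below are illustrative, not taken from the repo.
    parser.add_argument("-c", "--caption", required=True, help="text prompt")
    parser.add_argument("-s", "--steps", type=int, default=50, help="number of sampling steps")
    parser.add_argument("-g", "--guidance", type=float, default=7.5, help="guidance scale")
    return parser

# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["-c", "an astronaut riding a horse", "-s", "35", "-g", "7.5"])
```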
