https://koushiksrivats.github.io/robust-concept-erasing/

Official implementation of the paper "STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models"

ai generative-ai safety-ai stable-diffusion

Host: GitHub
URL: https://koushiksrivats.github.io/robust-concept-erasing/
Owner: koushiksrivats
Created: 2024-06-28T12:44:05.000Z (12 months ago)
Default Branch: main
Last Pushed: 2024-09-07T20:22:23.000Z (10 months ago)
Last Synced: 2024-09-07T21:32:39.922Z (10 months ago)
Topics: ai, generative-ai, safety-ai, stable-diffusion
Homepage:
Size: 9.28 MB
Stars: 14
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# ***STEREO***: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models

Koushik Srivatsan
Fahad Shamshad
Muzammal Naseer
Karthik Nandakumar

Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), UAE.

# :rocket: Release
* **(September 02, 2024)**
  * Paper uploaded on arXiv.

## Abstract

The rapid proliferation of large-scale text-to-image generation (T2IG) models has led to concerns about their potential misuse in generating harmful content. Though many methods have been proposed for erasing undesired concepts from T2IG models, they only provide a false sense of security, as recent works demonstrate that concept-erased models (CEMs) can be easily deceived into generating the erased concept through adversarial attacks. The problem of adversarially robust concept erasing without significant degradation to model utility (ability to generate benign concepts) remains an unresolved challenge, especially in the white-box setting where the adversary has access to the CEM. To address this gap, we propose an approach called ***STEREO*** that involves two distinct stages. The first stage **S**earches **T**horoughly **E**nough (**STE**) for strong and diverse adversarial prompts that can regenerate an erased concept from a CEM, by leveraging robust optimization principles from adversarial training. In the second **R**obustly **E**rase **O**nce (**REO**) stage, we introduce an anchor-concept-based compositional objective to robustly erase the target concept in one go, while attempting to minimize the degradation of model utility. By benchmarking the proposed ***STEREO*** approach against four state-of-the-art concept erasure methods under three adversarial attacks, we demonstrate its ability to achieve a better robustness vs. utility trade-off.

## Highlights
Large-scale diffusion models for text-to-image generation are susceptible to adversarial attacks that can regenerate harmful concepts despite erasure efforts. We introduce ***STEREO***, a robust approach designed to prevent this regeneration while preserving the model's ability to generate benign content.

***Overview of STEREO***.
We propose a novel two-stage framework for adversarially robust concept erasing from pre-trained text-to-image generation models without significantly affecting the utility for benign concepts.

**Stage 1 (top)**: Search Thoroughly Enough (STE) follows the robust optimization framework of adversarial training (AT) and formulates concept erasing as a min-max optimization problem in order to discover strong adversarial prompts that can regenerate target concepts from erased models. Note that the core novelty of our approach is that we employ AT not as a final solution, but only as an intermediate step to search thoroughly enough for strong adversarial prompts.
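
Since the code has not been released yet, the following is only a minimal sketch of how an alternating min-max search of this kind could be structured, not the authors' implementation. The function name `search_adversarial_prompts`, the `erase_step` and `attack_step` callables, and the toy stand-ins are all hypothetical; in a real setting the erase step would fine-tune a diffusion model and the attack step would optimize a prompt against it.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")
Prompt = TypeVar("Prompt")


def search_adversarial_prompts(
    model: Model,
    target_prompt: Prompt,
    erase_step: Callable[[Model, Prompt], Model],
    attack_step: Callable[[Model], Prompt],
    num_rounds: int,
) -> Tuple[Model, List[Prompt]]:
    """Alternate erase (min) and attack (max) steps, collecting the prompts found.

    Assumption: each round erases the concept as expressed by the most
    recent prompt, then searches for a new prompt that regenerates it.
    """
    collected: List[Prompt] = []
    prompt = target_prompt
    for _ in range(num_rounds):
        # Minimization step: erase the concept under the current prompt.
        model = erase_step(model, prompt)
        # Maximization step: find a prompt that brings the concept back.
        prompt = attack_step(model)
        collected.append(prompt)
    return model, collected


# Toy stand-ins (hypothetical): the "model" is just the set of erased prompts.
def toy_erase(erased: frozenset, prompt: str) -> frozenset:
    return erased | {prompt}


def toy_attack(erased: frozenset) -> str:
    return f"adv-{len(erased)}"


_, prompts = search_adversarial_prompts(
    frozenset(), "target", toy_erase, toy_attack, num_rounds=3
)
print(prompts)  # ['adv-1', 'adv-2', 'adv-3']
```

In this sketch the model returned by the loop is discarded and only the collected prompts are kept, which mirrors the statement above that AT serves as an intermediate search step and not as the final solution.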

**Stage 2 (bottom)**: Robustly Erase Once (REO) fine-tunes the model using an anchor concept and the set of strong adversarial prompts from Stage 1 via a compositional objective, maintaining high-fidelity generation of benign concepts while robustly erasing the target concept.
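
To give a rough feel for what a compositional objective can look like, the sketch below combines noise predictions in the generic composable-guidance style: a positive term pulls toward the anchor concept and negative terms push away from each adversarial prompt. This is an assumption for illustration only and is not claimed to be the exact objective used in the paper; `compositional_target`, its parameters, and the weights are hypothetical, and plain lists of floats stand in for noise-prediction tensors.

```python
from typing import List, Sequence


def compositional_target(
    uncond: Sequence[float],
    anchor: Sequence[float],
    adversarial: Sequence[Sequence[float]],
    anchor_weight: float = 1.0,
    erase_weight: float = 1.0,
) -> List[float]:
    """Combine predictions: toward the anchor, away from adversarial prompts.

    Assumed form (generic composable guidance, not the paper's equation):
        target = uncond
                 + anchor_weight * (anchor - uncond)
                 - erase_weight * sum_i (adversarial_i - uncond)
    """
    target: List[float] = []
    for i, base in enumerate(uncond):
        # Positive guidance toward the anchor concept.
        pull = anchor_weight * (anchor[i] - base)
        # Negative guidance away from every adversarial prompt.
        push = erase_weight * sum(adv[i] - base for adv in adversarial)
        target.append(base + pull - push)
    return target


example = compositional_target(
    uncond=[0.0, 0.0],
    anchor=[1.0, 2.0],
    adversarial=[[1.0, 0.0], [0.0, 1.0]],
    erase_weight=0.5,
)
print(example)  # [0.5, 1.5]
```

A fine-tuning loop would then regress the model's prediction for the target concept onto such a combined target; the choice of weights controls the balance between erasure strength and preservation of benign generations.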

## 🔜 Code and Models Coming Soon !!

## Citation
If you find our work and this repository useful, please consider giving our repo a star and citing our paper as follows:
```bibtex
@article{srivatsan2024stereo,
title={STEREO: Towards Adversarially Robust Concept Erasing from Text-to-Image Generation Models},
author={Srivatsan, Koushik and Shamshad, Fahad and Naseer, Muzammal and Nandakumar, Karthik},
journal={arXiv preprint arXiv:2408.16807},
year={2024}
}
```
## Contact
If you have any questions, please create an issue on this repository or contact us at [email protected].

## Acknowledgement :pray:
Our code is built on top of the [ESD](https://github.com/rohitgandikota/erasing) repository. We thank the authors for releasing their code.
