https://github.com/tiger-ai-lab/imagenworld

Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Host: GitHub
URL: https://github.com/tiger-ai-lab/imagenworld
Owner: TIGER-AI-Lab
Created: 2025-09-28T17:40:49.000Z (9 months ago)
Default Branch: main
Last Pushed: 2026-01-18T21:09:59.000Z (5 months ago)
Last Synced: 2026-01-19T05:48:41.229Z (5 months ago)
Topics: genai, generation, image
Language: Python
Homepage: https://tiger-ai-lab.github.io/ImagenWorld/
Size: 18 MB
Stars: 23
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# 🖼️ ImagenWorld
[![Preprint](https://img.shields.io/badge/Preprint-Available-blue.svg)](https://github.com/TIGER-AI-Lab/ImagenWorld/blob/a3200b87c1714b106bf2c55daae346634a8e9cbf/static/preprint.pdf)

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

**ImagenWorld** is a large-scale, human-centric benchmark designed to stress-test image generation models in real-world scenarios.
- **Broad coverage across 6 domains:** Artworks, Photorealistic Images, Information Graphics, Textual Graphics, Computer Graphics, and Screenshots.
- **Rich supervision:** ~3.6K condition sets and ~20K fine-grained human annotations enable comprehensive, reproducible evaluation.
- **Explainable evaluation pipeline:** We decompose generated outputs via object/segment extraction to identify entities (objects, fine-grained regions), supporting both scalar ratings and object-/segment-level failure tags.
- **Diverse model suite:** We evaluate **14 models** in total — **4 unified** (GPT-Image-1, Gemini 2.0 Flash, BAGEL, OmniGen2) and **10 task-specific** baselines (SDXL, Flux.1-Krea-dev, Flux.1-Kontext-dev, Qwen-Image, Infinity, Janus Pro, UNO, Step1X-Edit, IC-Edit, InstructPix2Pix).

[🌐 Project Page] [📄 Preprint] [💾 Datasets] [🏛️ ImagenWorld-Visualizer]

## 📰 News
* 2026 Jan 25: Accepted to ICLR 2026!
* 2025 Oct 16: Featured on the ComfyUI Blog: [https://blog.comfy.org/p/introducing-imagenworld](https://blog.comfy.org/p/introducing-imagenworld)
* 2025 Oct 13: Preprint released on GitHub.

## 📖 Introduction

This repository contains the code for the paper "ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks".
In this paper, we introduce **ImagenWorld**, a large-scale, human-centric benchmark designed to stress-test image generation models in real-world scenarios. Unlike prior evaluations that focus on isolated tasks or narrow domains, ImagenWorld is organized into six domains: Artworks, Photorealistic Images, Information Graphics, Textual Graphics, Computer Graphics, and Screenshots, and six tasks: Text-to-Image Generation (TIG), Single-Reference Image Generation (SRIG), Multi-Reference Image Generation (MRIG), Text-to-Image Editing (TIE), Single-Reference Image Editing (SRIE), and Multi-Reference Image Editing (MRIE). The benchmark includes 3.6K condition sets and 20K fine-grained human annotations, providing a comprehensive testbed for generative models. To support explainable evaluation, ImagenWorld applies object- and segment-level extraction to generated outputs, identifying entities such as objects and fine-grained regions. This structured decomposition enables human annotators to provide not only scalar ratings but also detailed tags of object-level and segment-level failures.

## 🚀 Quick Start — Inference

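**Tasks:** `TIG` (Text→Image Generation), `TIE` (Text→Image Editing), `SRIG` (Single-Reference Image Generation), `SRIE` (Single-Reference Image Editing), `MRIG` (Multi-Reference Image Generation), `MRIE` (Multi-Reference Image Editing)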
**Datasets:** assumes `ImagenWorld/<TASK>/...` layout (adjust `--task_path` as needed)

---

### Open-Source Models

**Directory:** `inference/open-source/`
**Entrypoint:** `main.py`
**Model registry:** `inference/open-source/config.py`
**Batch helper:** `open_models.sh`

All open-source and closed-source runners follow a unified CLI:
```bash
python main.py --task <TASK> --model <MODEL> --task_path <PATH> --limit <N> --verbose
```
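
For readers who want to see how this flag set fits together, here is a minimal `argparse` sketch. It is an illustration only and not the repository's `main.py`: the task abbreviations come from the Tasks line above, while `build_parser`, the argument types, the defaults, and the help strings are assumptions.

```python
import argparse

# Task abbreviations listed in the Quick Start section.
TASKS = ["TIG", "TIE", "SRIG", "SRIE", "MRIG", "MRIE"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the unified CLI flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Run one model on one ImagenWorld task.")
    parser.add_argument("--task", required=True, choices=TASKS, help="Task abbreviation.")
    parser.add_argument("--model", required=True, help="Model key from the registry.")
    parser.add_argument("--task_path", required=True, help="Directory with the task's samples.")
    # Assumption: --limit caps the number of samples; None means no cap.
    parser.add_argument("--limit", type=int, default=None, help="Maximum number of samples.")
    parser.add_argument("--verbose", action="store_true", help="Print per-sample logs.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--task", "TIG", "--model", "UNO",
     "--task_path", "/path/to/ImagenWorld/TIG", "--limit", "5", "--verbose"]
)
print(args.task, args.model, args.limit, args.verbose)
```

Restricting `--task` with `choices` makes a mistyped task name fail at parse time rather than partway through a run.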

#### 🔹 Example: TIG (Text→Image Generation) with UNO
```bash
cd inference/open-source

python main.py --task TIG --model UNO --task_path /path/to/ImagenWorld/TIG --limit 5 --verbose
```

**Explanation**
- Loads the **UNO** open-source generator from the registry (`config.py`)
- Runs the **TIG** (Text→Image Generation) task using samples from `/path/to/ImagenWorld/TIG`
- Saves results to `model_outputs/model_name.png`
- Prints per-sample logs if `--verbose` is enabled

---

### Closed-Source Models

**Directory:** `inference/closed-source/`
**Entrypoint:** `main.py`
**Model registry:** `inference/closed-source/config.py`
**Batch helper:** `closed_models.sh`

Available closed-source APIs and outputs:
- `GPT-Image-1` → saves `gpt-image-1.png`
- `Gemini2Flash` → saves `gemini.png`

#### 🔧 Setup Environment
Set your API keys before running:
```bash
export OPENAI_API_KEY="sk-..." # for GPT-Image-1
export GEMINI_API_KEY="..." # for Gemini 2.5 Flash Image Preview
```

#### 🔹 Example: TIE (Text→Image Editing) with Gemini 2.5 Flash
```bash
cd inference/closed-source

python main.py --task TIE --model Gemini2Flash --task_path /path/to/ImagenWorld/TIE --limit 5 --verbose
```

**Explanation**
- Loads the selected **closed-source API model** (via OpenAI or Gemini)
- Runs the specified task on samples from `/path/to/ImagenWorld/`
- Stores generated images (e.g., `gpt-image-1.png`, `gemini.png`)

---

### Batch Execution (Optional)

Each inference type includes a shell helper for multi-task runs:

```bash
# open-source batch
cd inference/open-source
bash open_models.sh

# closed-source batch
cd inference/closed-source
bash closed_models.sh
```

In both scripts:
- Set `BASE_PATH` → dataset root (e.g., `/path/to/ImagenWorld`)
- Define `TASK_MODELS` to map each task to a model
- Set API keys for closed-source models
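
The same task-to-model mapping can be sketched in Python to preview the commands a batch run would issue. This is an illustration under stated assumptions, not a port of `open_models.sh` or `closed_models.sh`: `build_commands` is a hypothetical helper, the example mapping is arbitrary, and task folders are assumed to sit directly under the dataset root.

```python
import shlex
from pathlib import PurePosixPath

# Hypothetical values: replace with your dataset root and task-to-model choices.
BASE_PATH = PurePosixPath("/path/to/ImagenWorld")
TASK_MODELS = {"TIG": "GPT-Image-1", "TIE": "Gemini2Flash"}


def build_commands(base_path, task_models, limit=None):
    """Return one argument list per task, following the unified CLI."""
    commands = []
    for task, model in task_models.items():
        cmd = ["python", "main.py", "--task", task, "--model", model,
               "--task_path", str(base_path / task)]
        if limit is not None:
            cmd += ["--limit", str(limit)]
        commands.append(cmd)
    return commands


# Print shell-quoted commands for inspection; nothing is executed here.
for cmd in build_commands(BASE_PATH, TASK_MODELS, limit=5):
    print(shlex.join(cmd))
```

Keeping each command as an argument list means it can be passed to `subprocess.run` without shell quoting issues once the preview looks right.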

## Citation

If you find our work useful for your research, please consider citing our paper:

```bibtex
@misc{imagenworld2025,
title = {ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks},
author = {Samin Mahdizadeh Sani and Max Ku and Nima Jamali and Matina Mahdizadeh Sani and Paria Khoshtab and Wei-Chieh Sun and Parnian Fazel and Zhi Rui Tam and Thomas Chong and Edisy Kin Wai Chan and Donald Wai Tong Tsang and Chiao-Wei Hsu and Ting Wai Lam and Ho Yin Sam Ng and Chiafeng Chu and Chak-Wing Mak and Keming Wu and Hiu Tung Wong and Yik Chun Ho and Chi Ruan and Zhuofeng Li and I-Sheng Fang and Shih-Ying Yeh and Ho Kei Cheng and Ping Nie and Wenhu Chen},
year = {2025},
doi = {10.5281/zenodo.17344183},
url = {https://zenodo.org/records/17344183},
projectpage = {https://tiger-ai-lab.github.io/ImagenWorld/},
blogpost = {https://blog.comfy.org/p/introducing-imagenworld},
note = {Community-driven dataset and benchmark release, Temporarily archived on Zenodo while arXiv submission is under moderation review.},
}
```

```bibtex
@article{imagenworld2026,
title={ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks},
url={http://dx.doi.org/10.36227/techrxiv.176800878.82723313/v1},
DOI={10.36227/techrxiv.176800878.82723313/v1},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Sani, Samin Mahdizadeh and Ku, Max and Jamali, Nima and Sani, Matina Mahdizadeh and Khoshtab, Paria and Sun, Wei-Chieh and Fazel, Parnian and Tam, Zhi Rui and Chong, Thomas and Chan, Edisy Kin Wai and Tsang, Donald Wai Tong and Hsu, Chiao-Wei and Lam, Ting Wai and Ng, Ho Yin Sam and Chu, Chiafeng and Mak, Chak-Wing and Wu, Keming and Wong, Hiu Tung and Ho, Yik Chun and Ruan, Chi and Li, Zhuofeng and Fang, I-Sheng and Yeh, Shih-Ying and Cheng, Ho Kei and Nie, Ping and Chen, Wenhu},
year={2026},
month=jan }
```
