https://github.com/rzem-ai/rzem-ai-inference-engine

Text-to-image inference engine with job queue, automatic VRAM management, and LoRA support. Unified API across FLUX, Z-Image, and Qwen-Image model families.

Host: GitHub
URL: https://github.com/rzem-ai/rzem-ai-inference-engine
Owner: rzem-ai
License: gpl-3.0
Created: 2026-02-09T11:58:09.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-02-25T23:28:26.000Z (4 months ago)
Last Synced: 2026-02-26T01:47:05.327Z (4 months ago)
Topics: desktop-application, flux1-dev, flux2-dev, image-generation, qwen-image, z-image
Language: Python
Homepage:
Size: 310 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# RZEM AI Inference Engine

A Python text-to-image inference engine with job queue, event callbacks, and automatic VRAM management. Supports multiple model families with a unified API that hides architectural differences while exposing full parameter control.

## Supported Models

| Model Family | Transformer | Text Encoder | VAE | Default Steps |
|---|---|---|---|---|
| **FLUX.1 Dev** | `FluxTransformer2DModel` | CLIP + T5-XXL | 16-ch AutoencoderKL | 20 |
| **FLUX.2 Dev** | `Flux2Transformer2DModel` | Qwen3 (multi-layer) | 32-ch AutoencoderKL + BN | 20 |
| **Z-Image** | `ZImageTransformer2DModel` (S3-DiT) | Qwen3-4B | 16-ch AutoencoderKL | 9 |
| **Qwen-Image** | `QwenImageTransformer2DModel` (20B MMDiT) | Qwen3 | 16-ch AutoencoderKL | 50 |
| **FAL.ai Cloud** | Remote (fal-ai endpoints) | N/A | N/A | Endpoint-dependent |

All local models support LoRA weight patching with automatic format detection (Kohya, Diffusers/PEFT, XLabs, AIToolkit, OneTrainer).

FAL.ai cloud generation delegates to remote endpoints (e.g. `fal-ai/flux/dev`, `fal-ai/flux-pro/v1.1`, `fal-ai/flux-2`) — no local models are loaded.

## Installation

```bash
uv sync    # or: pip install -e .
```

Requires Python 3.10+ and PyTorch 2.0+. A GPU with at least 24 GB of VRAM is recommended. Apple Silicon (M3 or later) is supported through the MPS backend with PyTorch 2.3+.

## Distribution

### Standalone Executable

Build a standalone executable that bundles all dependencies (models are still user-provided):

```bash
# Build server variant (generate + serve commands, ~3-4 GB)
bash scripts/build_executable.sh server

# Build CLI variant (generate only, ~2.5-3.5 GB)
bash scripts/build_executable.sh cli

# Build both variants (default)
bash scripts/build_executable.sh all

# Run the executables
./dist/rzem-ai-inference-engine-server/rzem-ai-inference-engine-server generate --help
./dist/rzem-ai-inference-engine-server/rzem-ai-inference-engine-server serve --help
./dist/rzem-ai-inference-engine-cli/rzem-ai-inference-engine-cli generate --help
```

See [packaging/README.md](packaging/README.md) for detailed build instructions, platform requirements, and distribution options.

**Note:** Executables do not include model weights. Models must be downloaded separately via HuggingFace Hub or provided as local paths.

## Usage

### Python API

```python
from rzem_ai_inference_engine import InferenceEngine, JobParams, EventType, TransformerType

engine = InferenceEngine()

# Listen for events
engine.on(EventType.JOB_PROGRESS, lambda e: print(f"Step {e.step}/{e.total_steps}"))
engine.on(EventType.JOB_COMPLETED, lambda e: e.image.save("output.png"))

# Submit a job
job_id = engine.submit(JobParams(
    prompt="a cat sitting on a windowsill, golden hour lighting",
    transformer_model="black-forest-labs/FLUX.1-dev",
    transformer_type=TransformerType.FLUX1_DEV,
    clip_tokenizer="black-forest-labs/FLUX.1-dev",
    clip_encoder="black-forest-labs/FLUX.1-dev",
    t5_tokenizer="black-forest-labs/FLUX.1-dev",
    t5_encoder="black-forest-labs/FLUX.1-dev",
    vae_model="black-forest-labs/FLUX.1-dev",
    steps=20,
    cfg_scale=3.5,
    width=1024,
    height=1024,
    seed=42,
))

# ... engine processes the job in a background thread ...
# Call engine.shutdown() when done
```

### CLI

```bash
# FLUX.1 Dev: all models from Black Forest Labs
rzem-ai-inference-engine generate \
    --prompt "a cat sitting on a windowsill, golden hour lighting" \
    --transformer-model black-forest-labs/FLUX.1-dev \
    --transformer-type flux1_dev \
    --clip-tokenizer black-forest-labs/FLUX.1-dev \
    --clip-encoder black-forest-labs/FLUX.1-dev \
    --t5-tokenizer black-forest-labs/FLUX.1-dev \
    --t5-encoder black-forest-labs/FLUX.1-dev \
    --vae-model black-forest-labs/FLUX.1-dev \
    --steps 20 --cfg-scale 3.5 \
    --width 1024 --height 1024 \
    --seed 42 \
    --output output.png

# Z-Image Turbo
rzem-ai-inference-engine generate \
    --prompt "mountain landscape at sunset" \
    --transformer-model Tongyi-MAI/Z-Image-Turbo \
    --transformer-type z_image \
    --qwen3-tokenizer Qwen/Qwen3-4B \
    --qwen3-encoder Qwen/Qwen3-4B \
    --vae-model black-forest-labs/FLUX.1-dev \
    --steps 9 --cfg-scale 1.0 \
    --output output.png

# With LoRAs (format: path:strength)
rzem-ai-inference-engine generate \
    ... \
    --lora ./models/anime.safetensors:0.8 \
    --lora ./models/detail.safetensors:0.5 \
    --output output.png
```
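
The `path:strength` form used by `--lora` can be parsed in a few lines. The sketch below is illustrative only and is not the engine's own parser: `LoraSpec`, `parse_lora_spec`, and the default strength of `1.0` for a missing suffix are assumptions made for this example.

```python
from typing import NamedTuple


class LoraSpec(NamedTuple):
    path: str
    strength: float


def parse_lora_spec(spec: str, default_strength: float = 1.0) -> LoraSpec:
    """Split a 'path:strength' string into its two parts.

    The split happens at the last colon so that paths which themselves
    contain colons (for example Windows drive letters) stay intact.
    """
    path, sep, tail = spec.rpartition(":")
    if sep and path:
        try:
            return LoraSpec(path, float(tail))
        except ValueError:
            pass  # The text after the last colon is not a number.
    # No usable strength suffix: treat the whole string as the path.
    # (Assumption: the real CLI may instead reject such input.)
    return LoraSpec(spec, default_strength)
```

Splitting at the last colon rather than the first keeps a value such as `C:\models\a.safetensors:0.5` from being cut at the drive letter.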

### Model Paths

Model path arguments accept three formats:

| Format | Example | Behavior |
|---|---|---|
| Local path | `./models/flux.safetensors` | Used directly |
| HuggingFace repo | `black-forest-labs/FLUX.1-dev` | Downloads full repo, auto-resolves subfolders |
| HuggingFace repo + file | `city96/FLUX.1-dev-gguf/flux1-dev-Q8_0.gguf` | Downloads only the specified file |

Multi-component HF repos (like `black-forest-labs/FLUX.1-dev`) are handled automatically — the engine resolves subfolders like `transformer/`, `text_encoder/`, `vae/`, etc.
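
One way to tell the three formats apart is sketched below. This is a guess at the kind of rule involved, not the engine's actual resolver: `classify_model_path` is a hypothetical helper, and the order of the checks (filesystem first, then segment count) is an assumption.

```python
from pathlib import Path


def classify_model_path(value: str) -> str:
    """Return 'local', 'hf_repo', or 'hf_repo_file' for a model path argument."""
    # Anything written as an explicit filesystem path, or that already
    # exists on disk, is treated as local.
    if value.startswith(("./", "../", "/", "~")) or Path(value).expanduser().exists():
        return "local"
    parts = value.split("/")
    if len(parts) == 2 and all(parts):
        return "hf_repo"        # owner/name
    if len(parts) >= 3 and all(parts):
        return "hf_repo_file"   # owner/name/path/to/file
    # Bare names and anything else fall back to a local path.
    return "local"
```

With this rule, the three example values from the table map to `local`, `hf_repo`, and `hf_repo_file` respectively, provided no directory named `black-forest-labs/FLUX.1-dev` or `city96/...` exists relative to the working directory.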

## REST API Server

Start the server to accept jobs over HTTP and receive real-time updates via WebSocket:

```bash
# Listen on all interfaces (enables LAN discovery)
rzem-ai-inference-engine serve --host 0.0.0.0 --port 8000 --device auto --output-dir ./output

# Or run as a background daemon
bash scripts/server.sh start --port 8000 --device cuda
bash scripts/server.sh status
bash scripts/server.sh stop
```

| Endpoint | Description |
|---|---|
| `POST /jobs` | Submit a generation job |
| `GET /jobs` | List all jobs |
| `GET /jobs/{id}` | Get job details |
| `GET /jobs/{id}/image` | Download generated image |
| `DELETE /jobs/{id}` | Cancel a job |
| `GET /models` | List locally cached HuggingFace models |
| `GET /health` | Health check with queue stats |
| `WS /ws` | WebSocket: real-time job event broadcasts (JSON) |
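
A minimal standard-library client for `POST /jobs` might look like the sketch below. The request body schema is not documented here, so the example assumes the JSON fields mirror the `JobParams` fields and that the server replies with JSON; `build_job_request` and `send_json` are hypothetical helper names.

```python
import json
import urllib.request


def build_job_request(base_url: str, params: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /jobs request with a JSON body."""
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jobs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires a running server; field names are assumed):
# request = build_job_request(
#     "http://127.0.0.1:8000",
#     {"prompt": "mountain landscape at sunset", "steps": 9},
# )
# print(send_json(request))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything goes over the network.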

### Network Announcement (LAN Discovery)

When the server binds to a non-localhost address (e.g. `--host 0.0.0.0`), it automatically announces itself on the local network via **mDNS/DNS-SD** — the same protocol used by AirPlay, Chromecast, and network printers. Client applications can discover running servers without manual configuration.

- **Service type**: `_rzem-ai._tcp.local.`

- **TXT record**: `version`, `device`, `api=rest`, `ws=/ws`

- **Disable**: `--no-announce`

When bound to `127.0.0.1` (the default), announcement is skipped since the server is only reachable locally.
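
The announcement rule described above can be expressed as a small predicate. This is a sketch of the decision only, under the assumption that "non-localhost" means "not a loopback address"; `should_announce` is a hypothetical name and the engine's real check may differ.

```python
import ipaddress


def should_announce(host: str, no_announce: bool = False) -> bool:
    """Return True when the bind address is reachable from other machines."""
    if no_announce:
        return False  # Mirrors the --no-announce flag.
    if host == "localhost":
        return False
    try:
        return not ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (e.g. a hostname): assume it is externally reachable.
        return True
```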

**Client-side discovery example:**

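```python
from zeroconf import ServiceBrowser, ServiceListener, Zeroconf

class Listener(ServiceListener):
    def add_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        info = zc.get_service_info(type_, name)
        if info:
            addr = info.parsed_addresses()[0]
            port = info.port
            print(f"Found server at {addr}:{port}")

    # The browser also calls these when a service changes or disappears.
    def update_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        pass

    def remove_service(self, zc: Zeroconf, type_: str, name: str) -> None:
        pass

zc = Zeroconf()
browser = ServiceBrowser(zc, "_rzem-ai._tcp.local.", Listener())
try:
    # Discovery runs in a background thread, so keep the process alive.
    input("Browsing for servers; press Enter to exit...\n")
finally:
    zc.close()
```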

## VRAM Management

The engine manages a two-tier model cache:

- **VRAM (hot)**: Models actively on the GPU for fast inference

- **RAM (warm)**: Models offloaded to CPU, moved back to GPU on demand

When VRAM is insufficient for the next model, the cache evicts the **smallest unlocked** model first. A 1 GB working memory buffer is reserved for intermediate tensors.

For FLUX.1 Dev on GPUs with ~32 GB VRAM, the pipeline automatically sequences model loading: the text encoders run first and are released before the 22 GB transformer is loaded.
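
The eviction rule (smallest unlocked model first, with a 1 GB working buffer) can be sketched as follows. This is an illustration of the policy as described, not the engine's cache code: `CachedModel`, `plan_evictions`, and the choice to raise `MemoryError` when not enough can be freed are all assumptions.

```python
from dataclasses import dataclass

GIB = 1024 ** 3
WORKING_BUFFER = 1 * GIB  # Reserved for intermediate tensors.


@dataclass
class CachedModel:
    name: str
    size_bytes: int
    locked: bool = False  # Locked models are in use and must stay on the GPU.


def plan_evictions(
    hot: list[CachedModel], free_bytes: int, needed_bytes: int
) -> list[CachedModel]:
    """Pick models to offload to RAM, smallest unlocked first.

    Returns the models to evict, or raises MemoryError if evicting every
    unlocked model still would not make enough room.
    """
    required = needed_bytes + WORKING_BUFFER
    victims: list[CachedModel] = []
    unlocked = sorted((m for m in hot if not m.locked), key=lambda m: m.size_bytes)
    for model in unlocked:
        if free_bytes >= required:
            break
        victims.append(model)
        free_bytes += model.size_bytes
    if free_bytes < required:
        raise MemoryError(f"need {required} bytes, only {free_bytes} can be freed")
    return victims
```

Evicting the smallest models first keeps large, expensive-to-reload models on the GPU for as long as possible, at the cost of sometimes evicting several small models where one large eviction would have sufficed.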

### Dtype Selection

All pipelines resolve dtype automatically via `preferred_dtype(device)`:

| Device | Dtype | Notes |
|---|---|---|
| CUDA (Ampere+) | bfloat16 | Native tensor-core support |
| MPS (M3+) | bfloat16 | Native GPU ALU support (PyTorch 2.3+) |
| CPU | float32 | No reduced-precision benefit |
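
The table can be read as a simple lookup. The sketch below returns dtype names as strings so it runs without PyTorch; it is not the engine's `preferred_dtype` implementation. The function name `preferred_dtype_name`, the `float16` fallback for pre-Ampere CUDA GPUs, and the assumption that the caller has already confirmed an M3 or later chip for MPS are all choices made for this example.

```python
def preferred_dtype_name(device_type: str, cuda_capability_major: int = 8) -> str:
    """Map a device type to the dtype name listed in the table above."""
    if device_type == "cuda":
        # Ampere corresponds to CUDA compute capability 8.x.
        if cuda_capability_major >= 8:
            return "bfloat16"
        # Assumption: the table does not cover pre-Ampere GPUs.
        return "float16"
    if device_type == "mps":
        # Assumes the caller has verified an M3+ chip and PyTorch 2.3+.
        return "bfloat16"
    return "float32"
```

With PyTorch installed, the major capability can be read from `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple, and the name can be turned into a dtype with `getattr(torch, name)`.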

### Apple Silicon Notes

- On MPS with unified memory, cache eviction is effectively a no-op — all models share the same physical RAM. Use `--vram-limit` to constrain memory if needed.

- GGUF quantized models are slower on MPS than full-precision bfloat16 due to unoptimized dequantization kernels. Use full-precision models for best performance.

## Events

| Event | Payload | When |
|---|---|---|
| `JOB_QUEUED` | `QueuedEvent` | Job added to queue |
| `JOB_STARTED` | `StartedEvent` | Processing begins |
| `JOB_PROGRESS` | `ProgressEvent` | Each denoising step (includes step/total_steps) |
| `JOB_COMPLETED` | `CompletedEvent` | Success (includes PIL Image and seed) |
| `JOB_FAILED` | `FailedEvent` | Error (includes message and traceback) |
| `JOB_CANCELLED` | `CancelledEvent` | Job cancelled before processing |
| `MODEL_LOADING` | `ModelLoadingEvent` | Model load starting |
| `MODEL_LOADED` | `ModelLoadedEvent` | Model loaded (includes size and device) |
| `MODEL_UNLOADED` | `ModelUnloadedEvent` | Model evicted from VRAM |
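
Because jobs run in a background thread, a script usually needs to block until a job finishes. One possible pattern, sketched below, bridges the callbacks to a `threading.Event`. It relies only on the `engine.on(event_type, callback)` call shown earlier; `OutcomeWaiter` is a hypothetical helper, and it reacts to the first completion or failure from any job rather than filtering by job ID.

```python
import threading
from typing import Any, Optional, Tuple


class OutcomeWaiter:
    """Blocks until a completion or failure event arrives."""

    def __init__(self, engine: Any, completed_type: Any, failed_type: Any) -> None:
        self._done = threading.Event()
        self._status: Optional[str] = None
        self._event: Any = None
        # Register before submitting so no event can be missed.
        engine.on(completed_type, lambda e: self._finish("completed", e))
        engine.on(failed_type, lambda e: self._finish("failed", e))

    def _finish(self, status: str, event: Any) -> None:
        # Only the first outcome is recorded.
        if not self._done.is_set():
            self._status, self._event = status, event
            self._done.set()

    def wait(self, timeout: Optional[float] = None) -> Tuple[str, Any]:
        """Return (status, event), or raise TimeoutError if nothing arrives."""
        if not self._done.wait(timeout):
            raise TimeoutError("no completion or failure event received")
        return self._status, self._event


# Example with the real engine (not run here):
# waiter = OutcomeWaiter(engine, EventType.JOB_COMPLETED, EventType.JOB_FAILED)
# engine.submit(JobParams(...))
# status, event = waiter.wait(timeout=600)
# engine.shutdown()
```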

## Test Scripts

```bash
bash scripts/test_flux1.sh            # FLUX.1 Dev (BFL repo)
bash scripts/test_flux1_alt.sh        # FLUX.1 Dev (separate repos)
bash scripts/test_flux1_gguf.sh       # FLUX.1 Dev (GGUF Q8_0 transformer)
bash scripts/test_zimage.sh           # Z-Image Turbo
bash scripts/test_flux1_lora.sh       # FLUX.1 Dev + LoRA (bf16)
bash scripts/test_flux1_gguf_lora.sh  # FLUX.1 Dev + LoRA (GGUF Q8_0)
```

## Dependencies

- **PyTorch** >= 2.0 (CUDA recommended, MPS supported on M3+ with PyTorch 2.3+)

- **diffusers** >= 0.32

- **transformers** >= 4.40

- accelerate, safetensors, huggingface-hub

- pydantic >= 2.0, Pillow, click, einops, sentencepiece, loguru

- **fal-client** >= 0.5, httpx (FAL.ai cloud generation)

- **zeroconf** >= 0.131 (mDNS network announcement)

- fastapi >= 0.110, uvicorn[standard] >= 0.27 (REST API server)
