# modelsig

**Compare LLM architectures without downloading weights.**

`modelsig` extracts a multi-layer structural fingerprint from any HuggingFace model and tells you whether two models are architecturally equivalent — so the smaller one can act as a valid proxy for testing the larger one.

[![Weekly Validation](https://github.com/joe0731/modelsig/actions/workflows/weekly-validation.yml/badge.svg)](https://github.com/joe0731/modelsig/actions/workflows/weekly-validation.yml)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)

---

## What problem does it solve?

Testing inference engines (vLLM, TensorRT-LLM, SGLang, llama.cpp, ONNX Runtime, etc.) against every large model is prohibitively expensive. `modelsig` answers:

> *"Can I test Qwen3-72B correctness using Qwen3-7B instead?"*
> *"Is Nemotron-120B-FP4 architecturally equivalent to the BF16 variant?"*
> *"Does this ONNX export match the original safetensors model?"*

It compares structural fingerprints — shape ratios, operator sets, KV cache patterns, layer topology — without ever downloading a single weight tensor.

---

## Key Features

- **Zero weight download** — safetensors header via HTTP Range (an 8-byte length prefix plus the JSON header, typically kilobytes per shard), ONNX graph-only (no `.onnx_data`), or config-only fast mode
- **5-layer fingerprint** — static weights, arch config, op types, KV cache pattern, layer-level I/O signatures
- **3-phase isomorphism comparison** — key overlap, substructure, algebraic scaling
- **Substitution verdicts** — `FULL_SUBSTITUTE / PARTIAL_SUBSTITUTE / NO_SUBSTITUTE`
- **4-level multi-fidelity test plan** — maps models to test coverage levels L1–L4
- **Wide model support** — dense decoder, GQA, MoE, vision-language, speech, ONNX classification
- **Both HF and local models** — supports `local:/path/to/model`
- **JSON / table / markdown output** — CI-friendly JSON, human-readable table, shareable markdown

---

## Installation

### From PyPI (recommended)

```bash
uv add modelsig # add to a uv project
# or
uv tool install modelsig # install as a standalone CLI tool
```

### From source

```bash
git clone https://github.com/joe0731/modelsig
cd modelsig
uv sync # install all deps + editable package
uv run modelsig --help
```

### Still using pip?

```bash
pip install modelsig
```

**Dependencies (all installed by default):**

| Package | Purpose |
|---------|---------|
| `requests` | HTTP Range fetching for safetensors headers |
| `huggingface_hub` | Model file listing, downloads, auth |
| `onnx` | ONNX graph parsing (falls back to built-in protobuf if unavailable) |
| `transformers` | AutoConfig normalization, layer signature capture |
| `torch` | Meta-device forward pass for layer I/O shape collection |
| `safetensors` | Local safetensors file parsing |

---

## Quick Start

```bash
# Analyze a single model (-m / --model flag)
modelsig -m Qwen/Qwen3-7B --output table

# Compare two models (proxy-test decision)
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table

# Fast mode for large models (config only, no download)
modelsig -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 --fast --output table

# ONNX model
modelsig -m onnx-community/Qwen3.5-0.8B-ONNX --output json

# Skip layer-level I/O signature capture (faster, no torch needed)
modelsig -m Qwen/Qwen3-7B --no-layer-sig --output json

# Private/gated model
modelsig -m org/private-model --token hf_xxx
# or: export HF_TOKEN=hf_xxx
```

---

## How It Works

### Zero-Weight-Download

For **safetensors** models, only the file header is fetched via HTTP Range requests: an 8-byte length prefix followed by the JSON header, typically kilobytes per shard. No weights are transferred.

For **ONNX** models, only the `.onnx` graph file is downloaded (typically 1–5 MB). The paired `.onnx_data` weight file (which can be GBs) is never touched.

For **fast mode** (`--fast`), only `config.json` is fetched (a few KB). No tensors at all.
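
The safetensors path can be sketched in a few lines. The file layout (an 8-byte little-endian length followed by a JSON header) comes from the safetensors format; everything else here is an illustration, not `modelsig`'s actual code. The function names are hypothetical, and `urllib` stands in for `requests` only to keep the sketch standard-library-only.

```python
import json
import struct
import urllib.request


def parse_safetensors_header(blob: bytes) -> dict:
    """Parse the JSON header from the leading bytes of a safetensors file."""
    # The first 8 bytes hold the header length as an unsigned little-endian u64.
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8 : 8 + header_len])
    # "__metadata__" is free-form; every other entry describes one tensor.
    return {name: info for name, info in header.items() if name != "__metadata__"}


def fetch_range(url: str, start: int, end: int, timeout: float = 30.0) -> bytes:
    """Fetch bytes start..end (inclusive) with an HTTP Range request."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def fetch_remote_header(url: str) -> dict:
    """Two small requests: the length prefix first, then exactly the header."""
    prefix = fetch_range(url, 0, 7)
    (header_len,) = struct.unpack("<Q", prefix)
    return parse_safetensors_header(prefix + fetch_range(url, 8, 7 + header_len))


# Offline demonstration with a hand-built blob, so no network access is needed.
demo_header = json.dumps(
    {
        "__metadata__": {"format": "pt"},
        "model.layers.0.self_attn.q_proj.weight": {
            "dtype": "BF16",
            "shape": [4096, 4096],
            "data_offsets": [0, 33554432],
        },
    }
).encode()
demo_blob = struct.pack("<Q", len(demo_header)) + demo_header
tensors = parse_safetensors_header(demo_blob)
```

Each entry carries `dtype`, `shape`, and `data_offsets`, which is enough to build a shape-level signature without reading any tensor data.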

### 5-Layer Signature System

| Layer | What it captures | Source |
|-------|-----------------|--------|
| **L1** Static weight signature | Per-tensor `{abstract_key → shape, dtype, layer_type}` — layer indices normalized to `.N.` | safetensors header / ONNX initializers |
| **L2** Architecture fingerprint | `hidden_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `intermediate_size`, `head_dim`, MoE config | `config.json` via AutoConfig |
| **L3** Op type set | Canonical operator vocabulary: `aten/mm`, `attention`, `rms_norm`, `rope`, `silu`, `topk/router` … | tensor key patterns / ONNX opset |
| **L4** KV cache shape pattern | `[batch, num_kv_heads, seq_len, head_dim]` | derived from L2 |
| **L5** Layer I/O signatures | Per-module `{input: [{dtype, shape}], output: [{dtype, shape}]}` on meta device | torch forward hooks (`--no-layer-sig` to skip) |
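
As a rough illustration of L1 and L4, the sketch below collapses numeric layer indices to `N` and formats a KV cache pattern from two config values. The helper names `normalize_key` and `kv_cache_pattern` are hypothetical and the regex is an assumption about how the normalization could be done; the project's own `norm_key` may differ.

```python
import re

# A run of digits between two dots is treated as a layer (or expert) index.
_INDEX = re.compile(r"\.\d+(?=\.)")


def normalize_key(key: str) -> str:
    """Replace every numeric path segment with 'N'."""
    return _INDEX.sub(".N", key)


def kv_cache_pattern(num_kv_heads: int, head_dim: int) -> str:
    """Format the symbolic KV cache shape from two config fields."""
    return f"[batch, {num_kv_heads}, seq_len, {head_dim}]"


raw_keys = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.mlp.experts.17.gate_proj.weight",
]
# All layers share one abstract key per tensor role.
abstract_keys = {normalize_key(k) for k in raw_keys}
pattern = kv_cache_pattern(num_kv_heads=8, head_dim=128)
```

Because the abstract key set no longer depends on depth, a 28-layer model and an 80-layer model with the same block design produce the same keys.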

### 3-Phase Isomorphism Comparison

```
Phase 1 — Key coverage    : normalized key set overlap ≥ 80%
Phase 2 — Substructure    : attention / FFN / norm submodules match
Phase 3 — Algebraic scale : hidden_size / intermediate_size / head_dim ratios uniform within 20%
```

Result: `ISOMORPHIC` / `SCALE_ONLY` / `DIFFERENT_ARCH`
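
Phases 1 and 3 are threshold checks, and one possible reading of them is sketched below. Two points are assumptions, not taken from the tool: "overlap" is computed as Jaccard similarity, and "uniform within 20%" is read as "the per-dimension scale factors between the two models differ by at most 20% of the largest one". The function names and the toy configs are hypothetical.

```python
DIMS = ("hidden_size", "intermediate_size", "head_dim")


def key_overlap(keys_a: set, keys_b: set) -> float:
    """Jaccard overlap of two abstract key sets (assumed definition)."""
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 1.0


def scale_factors(small: dict, large: dict) -> list:
    """Per-dimension ratio large/small for the three phase-3 dimensions."""
    return [large[d] / small[d] for d in DIMS]


def ratios_uniform(small: dict, large: dict, tolerance: float = 0.20) -> bool:
    """True when the scale factors spread by at most `tolerance` (relative)."""
    factors = scale_factors(small, large)
    return (max(factors) - min(factors)) / max(factors) <= tolerance


# Toy configs, not real models.
small = {"hidden_size": 1024, "intermediate_size": 4096, "head_dim": 64}
scaled = {"hidden_size": 2048, "intermediate_size": 8192, "head_dim": 128}
skewed = {"hidden_size": 2048, "intermediate_size": 4096, "head_dim": 64}

phase1_ok = key_overlap({"a", "b", "c", "d", "e"}, {"a", "b", "c", "d"}) >= 0.80
phase3_scaled = ratios_uniform(small, scaled)
phase3_skewed = ratios_uniform(small, skewed)
```

Phase 2 is omitted because matching attention, FFN, and norm submodules depends on the per-layer template, which cannot be reduced to a single threshold.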

### Substitution Verdict

| Verdict | Meaning |
|---------|---------|
| `FULL_SUBSTITUTE` | All 3 phases pass + shape ratios uniform + layer_type_coverage ≥ 95% |
| `PARTIAL_SUBSTITUTE` | Phase 1+2 pass or op coverage ≥ 80% |
| `NO_SUBSTITUTE` | Different arch, MoE vs Dense mismatch, or key divergence |
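
The table can be read as a small decision function. The sketch below is one interpretation: the `ComparisonResult` fields are hypothetical names, and the order in which the rules are checked (MoE/dense mismatch first, then full, then partial) is an assumption, since the table does not state precedence.

```python
from dataclasses import dataclass


@dataclass
class ComparisonResult:
    phase1: bool                 # key coverage passed
    phase2: bool                 # substructure passed
    phase3: bool                 # algebraic scaling passed
    shape_ratios_uniform: bool
    layer_type_coverage: float   # 0..1
    op_coverage: float           # 0..1
    moe_dense_mismatch: bool = False


def verdict(r: ComparisonResult) -> str:
    """Map comparison results to a substitution verdict."""
    if r.moe_dense_mismatch:
        return "NO_SUBSTITUTE"
    all_phases = r.phase1 and r.phase2 and r.phase3
    if all_phases and r.shape_ratios_uniform and r.layer_type_coverage >= 0.95:
        return "FULL_SUBSTITUTE"
    if (r.phase1 and r.phase2) or r.op_coverage >= 0.80:
        return "PARTIAL_SUBSTITUTE"
    return "NO_SUBSTITUTE"


same_family = ComparisonResult(True, True, True, True, 1.0, 1.0)
scale_differs = ComparisonResult(True, True, False, False, 0.90, 0.85)
mixed = ComparisonResult(True, True, True, True, 1.0, 1.0, moe_dense_mismatch=True)
```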

### Quantization Transferability Estimate

When comparing two models, `modelsig` also computes a structural quantization transferability score:

```
struct_sim_score  — 1.0 (ISOMORPHIC) / 0.80 (SCALE_ONLY) / 0.20 (DIFFERENT_ARCH)
op_hist_sim       — cosine similarity of operator frequency vectors
layer_type_hist   — Jaccard similarity of layer type sets
shape_uniform     — whether common weight shapes scale uniformly
moe_correction    — ~5% penalty for mixed MoE/Dense pairs
arch_risk_factors — hidden_size ratio, GQA mismatch, FFN expansion, RoPE theta diff
```

Output: `estimated_transferability` score (0–1) with `confidence` (HIGH/MEDIUM/LOW),
`recommended_methods` (GPTQ/AWQ/mixed-precision/expert-aware), and `caveats`.

> This is a **structural pre-filter only**. SensCorr and RepAlign require actual calibration
> data and are the strongest transfer predictors. Use this score to decide whether to attempt
> transfer at all, not as a final guarantee.
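
Two of the components above are standard similarity measures and can be shown directly. This is a minimal sketch of cosine similarity over operator counts and Jaccard similarity over layer type sets; the function names and the example counts are made up for illustration, `MoERouter` is a hypothetical layer type name, and how `modelsig` combines the components into the final score is not shown here.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two operator frequency vectors."""
    ops = set(a) | set(b)
    dot = sum(a[op] * b[op] for op in ops)  # Counter returns 0 for missing ops
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two layer type sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative counts: the larger model repeats the same block twice as often.
ops_small = Counter({"aten/mm": 7, "rms_norm": 2, "rope": 1})
ops_large = Counter({"aten/mm": 14, "rms_norm": 4, "rope": 2})
op_hist_sim = cosine_similarity(ops_small, ops_large)

dense_types = {"AttentionLayer", "FFN_SwiGLU", "RMSNorm"}
moe_types = {"AttentionLayer", "MoERouter", "RMSNorm"}
layer_type_sim = jaccard(dense_types, moe_types)
```

Cosine similarity ignores overall magnitude, so a deeper model with the same operator mix still scores close to 1, while Jaccard drops as soon as one model has a layer type the other lacks.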

### Multi-Fidelity Test Plan (4 levels)

```
L1 Structure — cheapest: model loading, tensor shapes, dtype validation
L2 Numerical — cosine similarity, perplexity on calibration set
L3 Runtime   — prefill latency, decode throughput, KV cache eviction
L4 Canary    — large/MoE model: peak memory, TP/PP correctness
```

---

## Usage

### Basic — analyze a single model

```bash
modelsig -m Qwen/Qwen3-7B --output table
```

```
==============================================================================
modelsig v2.0 | 2026-03-17T10:00:00Z
==============================================================================

Model: Qwen/Qwen3-7B
  type                 qwen3
  hidden_size          3584
  num_hidden_layers    28
  num_attention_heads  28 (kv: 8)
  intermediate_size    18944
  head_dim             128
  is_moe               False
  ffn_expansion        5.285714
  gqa_ratio            3.5
  kv_cache_pattern     [batch, 8, seq_len, 128]
  op_types             aten/mm, attention, embedding, rms_norm, rope, silu, swiglu
  layer_types          AttentionLayer, EmbeddingLayer, FFN_SwiGLU, LMHead, RMSNorm
  abstract_keys        14
  source               safetensors
```
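
The derived fields in this output can be reproduced from the raw config values. The formulas below are inferred from the numbers shown (they are not quoted from the tool's source): `ffn_expansion` as `intermediate_size / hidden_size` and `gqa_ratio` as query heads per KV head.

```python
# Values copied from the sample output above.
config = {
    "hidden_size": 3584,
    "intermediate_size": 18944,
    "num_attention_heads": 28,
    "num_key_value_heads": 8,
}

# FFN expansion: how much wider the feed-forward layer is than the hidden state.
ffn_expansion = round(config["intermediate_size"] / config["hidden_size"], 6)

# GQA ratio: number of query heads sharing each key/value head.
gqa_ratio = config["num_attention_heads"] / config["num_key_value_heads"]

print(ffn_expansion, gqa_ratio)
```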

### Compare models (proxy-testing decision)

```bash
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table
```

### Full analysis with multi-fidelity plan

```bash
modelsig \
  -m Qwen/Qwen3-7B -m Qwen/Qwen3-30B-A3B -m Qwen/Qwen3-235B-A22B \
  --compare --multi-fidelity --output markdown --save report.md
```

### ONNX model

```bash
modelsig -m onnx-community/Qwen3-4B-ONNX --output json
```

### Config-only fast mode (no safetensors/ONNX fetch, instantaneous)

```bash
modelsig -m Qwen/Qwen3-235B-A22B --fast --output table
```

### Local model directory

```bash
modelsig local:/path/to/model --output json
modelsig local:/path/to/7b local:/path/to/72b --compare
```

### Private / gated models

```bash
modelsig -m org/private-model --token hf_xxx
# or: export HF_TOKEN=hf_xxx
```

### Save report

```bash
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B \
  --compare --output markdown --save report.md
```

### Models with custom code

```bash
# Only use --trust-remote-code for models you trust.
# This allows execution of arbitrary Python code from the model repository.
modelsig -m org/custom-model --trust-remote-code
```

---

## Scenario Examples

### Scenario 1 — Inference Engine Regression Testing

**Problem:** You want to validate a new vLLM kernel for Qwen3-72B but CI is limited to A10G GPUs (24 GB VRAM).

```bash
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table
```

**Expected result:** `ISOMORPHIC / FULL_SUBSTITUTE` — same GQA pattern, same op set, uniform scaling. You can run full functional tests on 7B and gate the 72B behind a nightly canary run.

---

### Scenario 2 — MoE-to-MoE Compatibility Check

**Problem:** Does Qwen3-30B-A3B (MoE) behave like a drop-in proxy for Qwen3-235B-A22B?

```bash
modelsig -m Qwen/Qwen3-30B-A3B -m Qwen/Qwen3-235B-A22B \
  --compare --multi-fidelity --output markdown
```

Both are MoE models from the same family → `ISOMORPHIC`. The multi-fidelity plan shows:
- L1: use 30B-A3B for structure/conversion tests
- L2: numerical validation on 30B
- L4: 235B-A22B as canary for routing correctness and peak memory

---

### Scenario 3 — Cross-Family Sanity Check

**Problem:** Can Llama-3.1-8B proxy-test a Mistral-7B?

```bash
modelsig -m meta-llama/Llama-3.1-8B-Instruct -m mistralai/Mistral-7B-v0.1 \
  --compare --output json
```

Both are dense GQA decoders with the same op set → `ISOMORPHIC / FULL_SUBSTITUTE`. Despite different model_type labels, the structural fingerprint matches.

---

### Scenario 4 — ONNX Runtime Compatibility

**Problem:** You converted GPT-2 to ONNX and want to verify the ONNX version matches the torch version structurally.

```bash
modelsig -m openai-community/gpt2 -m onnx-community/gpt2 --compare --output table
```

The ONNX version is parsed from the `.onnx` graph file. The safetensors version is parsed from the header. Both share the same abstract key set → `ISOMORPHIC`.

---

### Scenario 5 — Quantized Model Compatibility

**Problem:** Will `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` (quantized to FP4) behave the same as the BF16 variant?

```bash
modelsig \
  -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
  -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  --compare --fast --output table
```

Both share the same architecture (120B MoE). `--fast` uses config-only mode to avoid downloading large safetensors headers. Result: `ISOMORPHIC` — same layer topology, only dtype differs.

---

### Scenario 6 — Quantization Method Transfer

**Problem:** You quantized Qwen3-7B with AWQ. Can that config transfer to Qwen3-72B?

```bash
modelsig \
  -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B \
  --compare --output json --save qwen3_quant_transfer.json
```

The `quant_transfer` block in `coverage_matrix` gives:
- `estimated_transferability` — composite score (0–1) based on structural similarity
- `confidence` — HIGH/MEDIUM/LOW
- `recommended_methods` — e.g. `GPTQ (W4A16)`, `AWQ (W4A16)`, `Mixed-precision`
- `arch_risk_factors` — e.g. large hidden_size ratio, RoPE theta mismatch
- `caveats` — whether activation-aware recalibration is needed
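
To consume the saved report in a script, one option is to search the JSON for `quant_transfer` blocks without assuming where they are nested. The sketch below does that with a recursive walk; the helper name `find_blocks`, the report layout, and the values in the demo are placeholders for illustration, not real tool output. Only the field names come from the list above.

```python
import json


def find_blocks(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            else:
                yield from find_blocks(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_blocks(item, key)


# Placeholder report; in practice load it with json.load(open("qwen3_quant_transfer.json")).
report = json.loads(
    """
    {"coverage_matrix": [
        {"quant_transfer": {
            "estimated_transferability": 0.9,
            "confidence": "HIGH",
            "caveats": []
        }}
    ]}
    """
)

blocks = list(find_blocks(report, "quant_transfer"))
# Hypothetical gate: only attempt transfer for confident, high-scoring pairs.
worth_trying = [
    b for b in blocks
    if b["confidence"] == "HIGH" and b["estimated_transferability"] >= 0.8
]
```

The 0.8 cut-off is arbitrary; pick a threshold that matches how costly a failed transfer attempt is in your pipeline.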

---

## CLI Reference

```
modelsig [-m MODEL_ID ...] [MODEL_ID ...] [OPTIONS]

Arguments:
  -m / --model MODEL_ID   HF model ID or local:PATH (repeatable, preferred)
  MODEL_ID                positional alternative — same as -m

Options:
  --output              json | table | markdown (default: json)
  --compare             Compute pairwise coverage for all model pairs
  --save FILE           Save output to file
  --fast                Config-only mode — no safetensors/ONNX download
  --multi-fidelity      Include 4-level multi-fidelity test plan
  --no-layer-sig        Skip per-module I/O dtype+shape capture (faster)
  --token TOKEN         HF Hub token for private/gated models
  --timeout SEC         HTTP timeout (default: 30)
  --no-color            Disable ANSI colors in table output
  --trust-remote-code   Allow trust_remote_code=True for custom model code
                        ⚠ enables arbitrary code execution — use only for trusted models
```
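
For CI scripts, the CLI can be driven from Python through `subprocess`. This is a minimal sketch using only the flags listed above; the helper names are hypothetical, and it assumes the JSON report is written to standard output when `--save` is not given.

```python
import json
import subprocess


def build_command(models, compare=False, fast=False, output="json"):
    """Assemble a modelsig argument list from the documented flags."""
    cmd = ["modelsig"]
    for model in models:
        cmd += ["-m", model]
    if compare:
        cmd.append("--compare")
    if fast:
        cmd.append("--fast")
    cmd += ["--output", output]
    return cmd


def run_modelsig(models, **flags):
    """Run modelsig and parse its JSON report (assumes JSON on stdout)."""
    proc = subprocess.run(
        build_command(models, **flags),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


command = build_command(["Qwen/Qwen3-7B", "Qwen/Qwen3-72B"], compare=True)
```

Passing the arguments as a list avoids shell quoting problems with model IDs and local paths.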

---

## Module Structure

```
modelsig/
├── analyze.py              CLI entry point (~190 lines)
├── constants.py            Shared constants: TOOL_NAME, _OP_RULES, _LAYER_TYPE_RULES, …
│
├── hf/
│   └── client.py           HF Hub client: token management, HTTP GET + backoff,
│                           model_info().siblings, hf_hub_download
│
├── parsers/
│   ├── safetensors.py      HTTP Range header fetch + local shard discovery
│   └── config.py           AutoConfig.from_pretrained() + _flatten_config() aliases
│
├── onnx/
│   ├── ops.py              _ONNX_DTYPE map, _ONNX_OP_MAP, canonical op mapping
│   ├── parser.py           onnx.load(load_external_data=False) + protobuf fallback
│   ├── selector.py         Primary .onnx file selection heuristics
│   └── collector.py        Orchestrates HF download → parse pipeline
│
├── torch/
│   └── layer_sig.py        L5: per-module input/output dtype+shape via forward hooks
│
├── signature/
│   ├── static.py           L1: build_static_weight_signature, norm_key, norm_dtype
│   ├── arch.py             L2: build_arch_fingerprint, KV cache pattern, dim ratios
│   ├── template.py         Per-layer canonical submodule template (for phase-2)
│   └── fingerprint.py      ModelFingerprint dataclass + build_fingerprint orchestrator
│
├── comparison/
│   ├── phases.py           Phase 1/2/3 isomorphism tests
│   ├── ratios.py           Shape ratio uniformity analysis
│   ├── quant_transfer.py   Structural quantization transferability estimator
│   ├── coverage.py         Unified compute_coverage + test strategy + quant_transfer
│   └── multifidelity.py    4-level multi-fidelity test plan builder
│
└── output/
    ├── colors.py           ANSI color helpers
    ├── json_fmt.py         JSON formatter + fp_to_dict
    ├── table_fmt.py        ANSI table formatter
    └── markdown_fmt.py     Markdown report formatter
```

---

## Security

- **No arbitrary code execution by default.** `trust_remote_code` is `False` unless explicitly set via `--trust-remote-code`.
- **Token safety.** The HF token is passed via HTTP headers only — never embedded in URLs or logged to stderr.
- **No weight download.** Only metadata (safetensors header, ONNX graph, config.json) is fetched.

---

## Design Principles

| Principle | Implementation |
|-----------|---------------|
| **Zero weight download** | HTTP Range (safetensors), graph-only .onnx, config-only fast path |
| **Framework-driven parsing** | `AutoConfig.from_pretrained()` for config normalization; `onnx.load()` for graph parsing |
| **Graceful degradation** | Every heavy dependency is optional — falls back to built-in parsers |
| **Architecture-agnostic** | Works on dense decoders, GQA models, MoE, vision-language, speech, classification |
| **Single CLI, composable API** | Import any module independently or use the unified CLI |
| **Safe by default** | `trust_remote_code=False`; token in headers not URLs |

---

## Supported Model Families

Validated weekly against **57 models** (29 safetensors + 28 ONNX):

**Safetensors (full header fetch):**
Qwen3.5-{0.8B,4B,9B,27B,35B-A3B,397B-A17B}, Qwen2.5-7B-Instruct, Qwen3-Coder-Next,
DeepSeek-V3.2, Kimi-K2.5, MiniMax-M2.5, GLM-5,
Nemotron-3-{Nano-4B, Super-120B}-{BF16,NVFP4,FP8},
Granite-4.0-1b-speech, BitNet-b1.58-2B-4T, MiroThinker-{1.7,1.7-mini},
Sarvam-{30b,105b}, Reka-edge-2603, LocoTrainer-4B, OmniCoder-9B,
Nanbeige4.1-3B, Param2-17B-A2.4B, gpt-oss-20b, all-MiniLM-L6-v2

**ONNX (graph-only, no weight download):**
Qwen3.5-{0.8B,2B,4B}-ONNX, Qwen3-{4B-VL,VL-2B,Reranker-0.6B}-ONNX,
Qwen2.5-{0.5B,VL-3B}-ONNX, LFM2-24B-A2B, Olmo-Hybrid-{SFT,DPO,Think}-7B,
Voxtral-Mini-4B, Granite-4.0-1b-speech, Nemotron-Nano-4B,
BERT-multilingual-NER, chinese-RoBERTa, multilingual-MiniLMv2, CodeT5,
Jan-code-4b, Josiefied-Qwen3.5-0.8B, IndoBERT-news-classification,
ai-image-detection × 4, vehicle-classification, tmr-text-detector

---

## Contributing

All logic lives in the `modelsig/` package, and each subdirectory has a single responsibility. Tests live in `tests/` and cover 130+ unit and integration scenarios.

```bash
git clone https://github.com/joe0731/modelsig
cd modelsig
uv sync --extra dev # installs all deps + dev tools
uv run pytest tests/ -v
```

Weekly validation against the full model zoo runs via GitHub Actions (`.github/workflows/weekly-validation.yml`).

---

## Related Projects

- [huggingface_hub](https://github.com/huggingface/huggingface_hub) — HF Hub Python client
- [safetensors](https://github.com/huggingface/safetensors) — safe, zero-copy tensor serialization
- [vLLM](https://github.com/vllm-project/vllm) — high-throughput LLM inference
- [ONNX Runtime](https://github.com/microsoft/onnxruntime) — cross-platform inference accelerator

---

## License

Apache 2.0 — see [LICENSE](LICENSE).