https://github.com/joe0731/modelsig

Compare LLM architectures without downloading weights — structural fingerprint & proxy-test advisor for vLLM, TensorRT-LLM, SGLang, ONNX Runtime
Host: GitHub
URL: https://github.com/joe0731/modelsig
Owner: joe0731
Created: 2026-03-17T08:07:42.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-18T06:42:34.000Z (4 months ago)
Last Synced: 2026-03-18T06:44:39.934Z (4 months ago)
Topics: architecture, fingerprint, gqa, huggingface, inference, llama, llm, mistral, model-analysis, moe, onnx, onnxruntime, proxy-testing, qwen, safetensors, tensorrt-llm, vllm
Language: Python
Homepage: https://github.com/joe0731/modelsig
Size: 75.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# modelsig

**Compare LLM architectures without downloading weights.**

`modelsig` extracts a multi-layer structural fingerprint from any HuggingFace model and tells you whether two models are architecturally equivalent — so the smaller one can act as a valid proxy for testing the larger one.

[![Weekly Validation](https://github.com/joe0731/modelsig/actions/workflows/weekly-validation.yml/badge.svg)](https://github.com/joe0731/modelsig/actions/workflows/weekly-validation.yml)

[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)

---

## What problem does it solve?

Testing inference engines (vLLM, TensorRT-LLM, SGLang, llama.cpp, ONNX Runtime, etc.) against every large model is prohibitively expensive. `modelsig` answers:

> *"Can I test Qwen3-72B correctness using Qwen3-7B instead?"*

> *"Is Nemotron-120B-FP4 architecturally equivalent to the BF16 variant?"*

> *"Does this ONNX export match the original safetensors model?"*

It compares structural fingerprints — shape ratios, operator sets, KV cache patterns, layer topology — without ever downloading a single weight tensor.

---

## Key Features

- **Zero weight download** — safetensors header via HTTP Range (~20 bytes), ONNX graph-only (no `.onnx_data`), or config-only fast mode

- **5-layer fingerprint** — static weights, arch config, op types, KV cache pattern, layer-level I/O signatures

- **3-phase isomorphism comparison** — key overlap, substructure, algebraic scaling

- **Substitution verdicts** — `FULL_SUBSTITUTE / PARTIAL_SUBSTITUTE / NO_SUBSTITUTE`

- **4-level multi-fidelity test plan** — maps models to test coverage levels L1–L4

- **Wide model support** — dense decoder, GQA, MoE, vision-language, speech, ONNX classification

- **Both HF and local models** — supports `local:/path/to/model`

- **JSON / table / markdown output** — CI-friendly JSON, human-readable table, shareable markdown

---

## Installation

### From PyPI (recommended)

```bash
uv add modelsig           # add to a uv project
# or
uv tool install modelsig  # install as a standalone CLI tool
```

### From source

```bash
git clone https://github.com/joe0731/modelsig
cd modelsig
uv sync                   # install all deps + editable package
uv run modelsig --help
```

### Still using pip?

```bash
pip install modelsig
```

**Dependencies (all installed by default):**

| Package | Purpose |
|---------|---------|
| `requests` | HTTP Range fetching for safetensors headers |
| `huggingface_hub` | Model file listing, downloads, auth |
| `onnx` | ONNX graph parsing (falls back to built-in protobuf if unavailable) |
| `transformers` | AutoConfig normalization, layer signature capture |
| `torch` | Meta-device forward pass for layer I/O shape collection |
| `safetensors` | Local safetensors file parsing |

---

## Quick Start

```bash
# Analyze a single model (-m / --model flag)
modelsig -m Qwen/Qwen3-7B --output table

# Compare two models (proxy-test decision)
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table

# Fast mode for large models (config only, no download)
modelsig -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 --fast --output table

# ONNX model
modelsig -m onnx-community/Qwen3.5-0.8B-ONNX --output json

# Skip layer-level I/O signature capture (faster, no torch needed)
modelsig -m Qwen/Qwen3-7B --no-layer-sig --output json

# Private/gated model
modelsig -m org/private-model --token hf_xxx
# or: export HF_TOKEN=hf_xxx
```

---

## How It Works

### Zero-Weight-Download

For **safetensors** models, only the file header is fetched via HTTP Range requests (~20 bytes per shard). No weights are transferred.

For **ONNX** models, only the `.onnx` graph file is downloaded (typically 1–5 MB). The paired `.onnx_data` weight file (which can be GBs) is never touched.

For **fast mode** (`--fast`), only `config.json` is fetched (a few KB). No tensors at all.
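The header-only read can be sketched as follows. The safetensors format starts with an 8-byte little-endian length, followed by a JSON header of that length, so a reader needs only the length prefix and then the header bytes. This is a minimal sketch on in-memory bytes, not modelsig's actual code; `header_byte_range` and `parse_header` are hypothetical names, and the tensor entry is made up for illustration.

```python
import json
import struct


def header_byte_range(length_prefix: bytes) -> tuple:
    """Return the inclusive byte range of the JSON header, given the first 8 bytes."""
    (header_len,) = struct.unpack("<Q", length_prefix)
    return 8, 8 + header_len - 1


def parse_header(header_json: bytes) -> dict:
    """Map tensor name -> (dtype, shape); the optional __metadata__ entry is skipped."""
    raw = json.loads(header_json)
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in raw.items()
        if name != "__metadata__"
    }


# Build a tiny fake file in memory so the sketch runs without network access.
fake_header = json.dumps({
    "model.layers.0.self_attn.q_proj.weight": {
        "dtype": "BF16",
        "shape": [3584, 3584],
        "data_offsets": [0, 25690112],
    }
}).encode("utf-8")
fake_file = struct.pack("<Q", len(fake_header)) + fake_header + b"\x00" * 16

# Over HTTP, the first read would be "Range: bytes=0-7" and the second
# f"Range: bytes={start}-{end}"; here both are plain slices.
start, end = header_byte_range(fake_file[:8])
tensors = parse_header(fake_file[start:end + 1])
print(tensors)
```

The weight bytes after the header (the 16 zero bytes here) are never read, which is the point of the technique.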

### 5-Layer Signature System

| Layer | What it captures | Source |
|-------|-----------------|--------|
| **L1** Static weight signature | Per-tensor `{abstract_key → shape, dtype, layer_type}` — layer indices normalized to `.N.` | safetensors header / ONNX initializers |
| **L2** Architecture fingerprint | `hidden_size`, `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `intermediate_size`, `head_dim`, MoE config | `config.json` via AutoConfig |
| **L3** Op type set | Canonical operator vocabulary: `aten/mm`, `attention`, `rms_norm`, `rope`, `silu`, `topk/router` … | tensor key patterns / ONNX opset |
| **L4** KV cache shape pattern | `[batch, num_kv_heads, seq_len, head_dim]` | derived from L2 |
| **L5** Layer I/O signatures | Per-module `{input: [{dtype, shape}], output: [{dtype, shape}]}` on meta device | torch forward hooks (`--no-layer-sig` to skip) |
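Two of the rows above are easy to illustrate: the L1 key normalization and the L4 pattern derived from L2 config values. The sketch below is one plausible reading, not the package's implementation; `normalize_key` and `derived_fields` are hypothetical names, and the fallbacks for a missing `num_key_value_heads` or `head_dim` are assumptions. The config values are the ones shown in the sample table output in this README.

```python
import re


def normalize_key(tensor_name: str) -> str:
    """Replace numeric path components (layer or expert indices) with the placeholder N."""
    return re.sub(r"\.\d+(?=\.|$)", ".N", tensor_name)


def derived_fields(cfg: dict) -> dict:
    """Compute head_dim, GQA ratio, FFN expansion and the KV cache pattern from config values."""
    heads = cfg["num_attention_heads"]
    # Assumption: a missing key means plain multi-head attention (kv heads == heads).
    kv_heads = cfg.get("num_key_value_heads", heads)
    # Assumption: when head_dim is not given, it is hidden_size / num_attention_heads.
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // heads
    return {
        "head_dim": head_dim,
        "gqa_ratio": heads / kv_heads,
        "ffn_expansion": round(cfg["intermediate_size"] / cfg["hidden_size"], 6),
        "kv_cache_pattern": ["batch", kv_heads, "seq_len", head_dim],
    }


key = normalize_key("model.layers.12.self_attn.q_proj.weight")
fields = derived_fields({
    "hidden_size": 3584,
    "num_attention_heads": 28,
    "num_key_value_heads": 8,
    "intermediate_size": 18944,
})
print(key)
print(fields)
```

Because every layer index collapses to `N`, a 28-layer and an 80-layer model of the same family produce the same small set of abstract keys.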

### 3-Phase Isomorphism Comparison

```
Phase 1 — Key coverage    : normalized key set overlap ≥ 80%
Phase 2 — Substructure    : attention / FFN / norm submodules match
Phase 3 — Algebraic scale : hidden_size / intermediate_size / head_dim ratios uniform within 20%
```

Result: `ISOMORPHIC` / `SCALE_ONLY` / `DIFFERENT_ARCH`
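Phases 1 and 3 reduce to simple arithmetic, sketched below. The README does not define "overlap" or "uniform" precisely, so two assumptions are made here: overlap is taken as Jaccard similarity of the normalized key sets, and uniformity means the largest and smallest per-dimension scale factors differ by at most 20%. The function names, key names and dimension values are illustrative only.

```python
def key_overlap(keys_a: set, keys_b: set) -> float:
    """Jaccard overlap of two normalized key sets (an assumed reading of 'overlap')."""
    if not keys_a and not keys_b:
        return 1.0
    return len(keys_a & keys_b) / len(keys_a | keys_b)


def ratios_uniform(dims_small: dict, dims_large: dict, tolerance: float = 0.20) -> bool:
    """True when every shared dimension scales by roughly the same factor."""
    ratios = [dims_large[k] / dims_small[k] for k in dims_small if k in dims_large]
    if not ratios:
        return False
    return max(ratios) / min(ratios) - 1.0 <= tolerance


# Phase 1: ten abstract keys on one side, nine of them on the other.
keys_a = {"embed_tokens", "q_proj", "k_proj", "v_proj", "o_proj",
          "gate_proj", "up_proj", "down_proj", "input_layernorm", "norm"}
keys_b = keys_a - {"embed_tokens"}
phase1 = key_overlap(keys_a, keys_b) >= 0.80

# Phase 3: both dimensions double, so the scaling is uniform.
small = {"hidden_size": 4096, "intermediate_size": 14336}
large = {"hidden_size": 8192, "intermediate_size": 28672}
skewed = {"hidden_size": 8192, "intermediate_size": 43008}  # 2x vs 3x
print(phase1, ratios_uniform(small, large), ratios_uniform(small, skewed))
```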

### Substitution Verdict

| Verdict | Meaning |
|---------|---------|
| `FULL_SUBSTITUTE` | All 3 phases pass + shape ratios uniform + layer_type_coverage ≥ 95% |
| `PARTIAL_SUBSTITUTE` | Phase 1+2 pass or op coverage ≥ 80% |
| `NO_SUBSTITUTE` | Different arch, MoE vs Dense mismatch, or key divergence |
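Read as a decision procedure, the table could look like the sketch below. The thresholds come from the table; the order in which the rules are checked (an MoE/Dense mismatch ruling out substitution before anything else) is an assumption, and `ComparisonFacts` and `verdict` are hypothetical names rather than the package's API.

```python
from dataclasses import dataclass


@dataclass
class ComparisonFacts:
    phase1: bool
    phase2: bool
    phase3: bool
    shape_ratios_uniform: bool
    layer_type_coverage: float  # fraction in [0, 1]
    op_coverage: float          # fraction in [0, 1]
    moe_dense_mismatch: bool = False


def verdict(f: ComparisonFacts) -> str:
    """Map comparison facts to a substitution verdict, strictest rule first."""
    # Assumption: an MoE vs Dense pair is rejected regardless of the other signals.
    if f.moe_dense_mismatch:
        return "NO_SUBSTITUTE"
    all_phases = f.phase1 and f.phase2 and f.phase3
    if all_phases and f.shape_ratios_uniform and f.layer_type_coverage >= 0.95:
        return "FULL_SUBSTITUTE"
    if (f.phase1 and f.phase2) or f.op_coverage >= 0.80:
        return "PARTIAL_SUBSTITUTE"
    return "NO_SUBSTITUTE"


same_family = ComparisonFacts(True, True, True, True, 1.0, 1.0)
scaled_oddly = ComparisonFacts(True, True, False, False, 0.9, 0.85)
mixed = ComparisonFacts(True, True, True, True, 1.0, 1.0, moe_dense_mismatch=True)
print(verdict(same_family), verdict(scaled_oddly), verdict(mixed))
```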

### Quantization Transferability Estimate

When comparing two models, `modelsig` also computes a structural quantization transferability score:

```
struct_sim_score   — 1.0 (ISOMORPHIC) / 0.80 (SCALE_ONLY) / 0.20 (DIFFERENT_ARCH)
op_hist_sim        — cosine similarity of operator frequency vectors
layer_type_hist    — Jaccard similarity of layer type sets
shape_uniform      — whether common weight shapes scale uniformly
moe_correction     — ~5% penalty for mixed MoE/Dense pairs
arch_risk_factors  — hidden_size ratio, GQA mismatch, FFN expansion, RoPE theta diff
```

Output: `estimated_transferability` score (0–1) with `confidence` (HIGH/MEDIUM/LOW), `recommended_methods` (GPTQ/AWQ/mixed-precision/expert-aware), and `caveats`.

> This is a **structural pre-filter only**. SensCorr and RepAlign require actual calibration
> data and are the strongest transfer predictors. Use this score to decide whether to attempt
> transfer at all, not as a final guarantee.
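Two of the components listed above, `op_hist_sim` and `layer_type_hist`, are standard similarity measures and can be sketched directly. How modelsig weights them into the final score is not stated in this README, so the sketch stops at the individual measures. The operator counts are invented for illustration, and `MoERouter` is a hypothetical layer type name.

```python
import math
from collections import Counter


def cosine_similarity(hist_a: Counter, hist_b: Counter) -> float:
    """Cosine similarity of two operator frequency histograms."""
    dot = sum(hist_a[op] * hist_b[op] for op in hist_a.keys() & hist_b.keys())
    norm_a = math.sqrt(sum(v * v for v in hist_a.values()))
    norm_b = math.sqrt(sum(v * v for v in hist_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Illustrative per-layer operator counts for a dense and an MoE decoder.
dense_ops = Counter({"aten/mm": 7, "rms_norm": 2, "rope": 1, "silu": 1})
moe_ops = dense_ops + Counter({"topk/router": 1})

dense_layers = {"AttentionLayer", "FFN_SwiGLU", "RMSNorm"}
moe_layers = dense_layers | {"MoERouter"}

op_hist_sim = cosine_similarity(dense_ops, moe_ops)
layer_type_sim = jaccard(dense_layers, moe_layers)
print(round(op_hist_sim, 4), layer_type_sim)
```

Cosine similarity is insensitive to depth (a histogram scaled by the layer count gives the same score), while the Jaccard term reacts to any layer type that exists on only one side.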

### Multi-Fidelity Test Plan (4 levels)

```
L1 Structure    — cheapest: model loading, tensor shapes, dtype validation
L2 Numerical    — cosine similarity, perplexity on calibration set
L3 Runtime      — prefill latency, decode throughput, KV cache eviction
L4 Canary       — large/MoE model: peak memory, TP/PP correctness
```

---

## Usage

### Basic — analyze a single model

```bash
modelsig -m Qwen/Qwen3-7B --output table
```

```
==============================================================================
  modelsig v2.0  |  2026-03-17T10:00:00Z
==============================================================================
   Model: Qwen/Qwen3-7B
  type                   qwen3
  hidden_size            3584
  num_hidden_layers      28
  num_attention_heads    28  (kv: 8)
  intermediate_size      18944
  head_dim               128
  is_moe                 False
  ffn_expansion          5.285714
  gqa_ratio              3.5
  kv_cache_pattern       [batch, 8, seq_len, 128]
  op_types               aten/mm, attention, embedding, rms_norm, rope, silu, swiglu
  layer_types            AttentionLayer, EmbeddingLayer, FFN_SwiGLU, LMHead, RMSNorm
  abstract_keys          14
  source                 safetensors
```

### Compare models (proxy-testing decision)

```bash
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table
```

### Full analysis with multi-fidelity plan

```bash
modelsig \
    -m Qwen/Qwen3-7B -m Qwen/Qwen3-30B-A3B -m Qwen/Qwen3-235B-A22B \
    --compare --multi-fidelity --output markdown --save report.md
```

### ONNX model

```bash
modelsig -m onnx-community/Qwen3-4B-ONNX --output json
```

### Config-only fast mode (no safetensors/ONNX fetch, instantaneous)

```bash
modelsig -m Qwen/Qwen3-235B-A22B --fast --output table
```

### Local model directory

```bash
modelsig local:/path/to/model --output json
modelsig local:/path/to/7b local:/path/to/72b --compare
```

### Private / gated models

```bash
modelsig -m org/private-model --token hf_xxx
# or: export HF_TOKEN=hf_xxx
```

### Save report

```bash
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B \
    --compare --output markdown --save report.md
```

### Models with custom code

```bash
# Only use --trust-remote-code for models you trust.
# This allows execution of arbitrary Python code from the model repository.
modelsig -m org/custom-model --trust-remote-code
```

---

## Scenario Examples

### Scenario 1 — Inference Engine Regression Testing

**Problem:** You want to validate a new vLLM kernel for Qwen3-72B but CI is limited to A10G GPUs (24 GB VRAM).

```bash
modelsig -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B --compare --output table
```

**Expected result:** `ISOMORPHIC / FULL_SUBSTITUTE` — same GQA pattern, same op set, uniform scaling. You can run full functional tests on 7B and gate the 72B behind a nightly canary run.

---

### Scenario 2 — MoE Proxy Compatibility Check

**Problem:** Does Qwen3-30B-A3B (MoE) behave like a drop-in proxy for Qwen3-235B-A22B?

```bash
modelsig -m Qwen/Qwen3-30B-A3B -m Qwen/Qwen3-235B-A22B \
    --compare --multi-fidelity --output markdown
```

Both are MoE models from the same family → `ISOMORPHIC`. The multi-fidelity plan shows:

- L1: use 30B-A3B for structure/conversion tests

- L2: numerical validation on 30B-A3B

- L4: 235B-A22B as canary for routing correctness and peak memory

---

### Scenario 3 — Cross-Family Sanity Check

**Problem:** Can Llama-3.1-8B proxy-test a Mistral-7B?

```bash
modelsig -m meta-llama/Llama-3.1-8B-Instruct -m mistralai/Mistral-7B-v0.1 \
    --compare --output json
```

Both are dense GQA decoders with the same op set → `ISOMORPHIC / FULL_SUBSTITUTE`. The `model_type` labels differ, but the structural fingerprints match.

---

### Scenario 4 — ONNX Runtime Compatibility

**Problem:** You converted GPT-2 to ONNX and want to verify that the export structurally matches the original safetensors checkpoint.

```bash
modelsig -m openai-community/gpt2 -m onnx-community/gpt2 --compare --output table
```

The ONNX version is parsed from the `.onnx` graph file. The safetensors version is parsed from the header. Both share the same abstract key set → `ISOMORPHIC`.

---

### Scenario 5 — Quantized Model Compatibility

**Problem:** Will `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` (quantized to FP4) behave the same as the BF16 variant?

```bash
modelsig \
    -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 \
    -m nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
    --compare --fast --output table
```

Both share the same architecture (120B MoE). `--fast` uses config-only mode to avoid downloading large safetensors headers. Result: `ISOMORPHIC` — same layer topology, only dtype differs.

---

### Scenario 6 — Quantization Method Transfer

**Problem:** You quantized Qwen3-7B with AWQ. Can that config transfer to Qwen3-72B?

```bash
modelsig \
    -m Qwen/Qwen3-7B -m Qwen/Qwen3-72B \
    --compare --output json --save qwen3_quant_transfer.json
```

The `quant_transfer` block in `coverage_matrix` gives:

- `estimated_transferability` — composite score (0–1) based on structural similarity

- `confidence` — HIGH/MEDIUM/LOW

- `recommended_methods` — e.g. `GPTQ (W4A16)`, `AWQ (W4A16)`, `Mixed-precision`

- `arch_risk_factors` — e.g. large hidden_size ratio, RoPE theta mismatch

- `caveats` — whether activation-aware recalibration is needed
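A saved JSON report can be consumed from a script, for example to decide whether a quantization transfer is worth attempting. The exact nesting of the report is not documented here, so this sketch searches the whole tree for `quant_transfer` blocks instead of assuming a path. The sample fragment, the `find_blocks` helper and the 0.8 cut-off are all hypothetical.

```python
import json
from typing import Any, Iterator


def find_blocks(node: Any, key: str = "quant_transfer") -> Iterator[dict]:
    """Yield every dict stored under `key`, wherever it sits in the report."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, dict):
                yield v
            else:
                yield from find_blocks(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_blocks(item, key)


# Hypothetical report fragment; real output may nest these fields differently.
sample = json.loads("""
{"coverage_matrix": [{"quant_transfer": {
    "estimated_transferability": 0.9,
    "confidence": "HIGH",
    "recommended_methods": ["AWQ (W4A16)"],
    "caveats": []}}]}
""")

decisions = []
for block in find_blocks(sample):
    # The 0.8 threshold is an arbitrary example, not a value recommended by modelsig.
    ok = block["confidence"] == "HIGH" and block["estimated_transferability"] >= 0.8
    decisions.append("attempt transfer" if ok else "recalibrate from scratch")
print(decisions)
```

To run it against a real report, replace `sample` with `json.load(open("qwen3_quant_transfer.json"))`.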

---

## CLI Reference

```
modelsig [-m MODEL_ID ...] [MODEL_ID ...] [OPTIONS]

Arguments:
  -m / --model MODEL_ID  HF model ID or local:PATH (repeatable, preferred)
  MODEL_ID               positional alternative — same as -m

Options:
  --output              json | table | markdown  (default: json)
  --compare             Compute pairwise coverage for all model pairs
  --save FILE           Save output to file
  --fast                Config-only mode — no safetensors/ONNX download
  --multi-fidelity      Include 4-level multi-fidelity test plan
  --no-layer-sig        Skip per-module I/O dtype+shape capture (faster)
  --token TOKEN         HF Hub token for private/gated models
  --timeout SEC         HTTP timeout (default: 30)
  --no-color            Disable ANSI colors in table output
  --trust-remote-code   Allow trust_remote_code=True for custom model code
                        ⚠ enables arbitrary code execution — use only for trusted models
```

---

## Module Structure

```
modelsig/
├── analyze.py              # CLI entry point (~190 lines)
├── constants.py            # Shared constants: TOOL_NAME, _OP_RULES, _LAYER_TYPE_RULES, …
│
├── hf/
│   └── client.py           # HF Hub client: token management, HTTP GET + backoff,
│                           # model_info().siblings, hf_hub_download
│
├── parsers/
│   ├── safetensors.py      # HTTP Range header fetch + local shard discovery
│   └── config.py           # AutoConfig.from_pretrained() + _flatten_config() aliases
│
├── onnx/
│   ├── ops.py              # _ONNX_DTYPE map, _ONNX_OP_MAP, canonical op mapping
│   ├── parser.py           # onnx.load(load_external_data=False) + protobuf fallback
│   ├── selector.py         # Primary .onnx file selection heuristics
│   └── collector.py        # Orchestrates HF download → parse pipeline
│
├── torch/
│   └── layer_sig.py        # L5: per-module input/output dtype+shape via forward hooks
│
├── signature/
│   ├── static.py           # L1: build_static_weight_signature, norm_key, norm_dtype
│   ├── arch.py             # L2: build_arch_fingerprint, KV cache pattern, dim ratios
│   ├── template.py         # Per-layer canonical submodule template (for phase-2)
│   └── fingerprint.py      # ModelFingerprint dataclass + build_fingerprint orchestrator
│
├── comparison/
│   ├── phases.py           # Phase 1/2/3 isomorphism tests
│   ├── ratios.py           # Shape ratio uniformity analysis
│   ├── quant_transfer.py   # Structural quantization transferability estimator
│   ├── coverage.py         # Unified compute_coverage + test strategy + quant_transfer
│   └── multifidelity.py    # 4-level multi-fidelity test plan builder
│
└── output/
    ├── colors.py           # ANSI color helpers
    ├── json_fmt.py         # JSON formatter + fp_to_dict
    ├── table_fmt.py        # ANSI table formatter
    └── markdown_fmt.py     # Markdown report formatter
```

---

## Security

- **No arbitrary code execution by default.** `trust_remote_code` is `False` unless explicitly set via `--trust-remote-code`.

- **Token safety.** The HF token is passed via HTTP headers only — never embedded in URLs or logged to stderr.

- **No weight download.** Only metadata (safetensors header, ONNX graph, config.json) is fetched.
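The token handling described above amounts to building request headers and leaving the URL untouched. A minimal sketch, assuming the standard `Authorization: Bearer` scheme used by the Hugging Face Hub and the `HF_TOKEN` environment variable mentioned earlier; `build_headers` is a hypothetical helper, not part of the package.

```python
import os
from typing import Optional, Tuple


def build_headers(token: Optional[str] = None,
                  byte_range: Optional[Tuple[int, int]] = None) -> dict:
    """Build request headers; the token never appears in the URL."""
    token = token or os.environ.get("HF_TOKEN")
    headers = {}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    if byte_range is not None:
        start, end = byte_range  # inclusive on both ends, as in HTTP Range
        headers["Range"] = f"bytes={start}-{end}"
    return headers


headers = build_headers("hf_xxx", byte_range=(0, 7))
# Log only the header names so the token value never reaches stderr.
print(sorted(headers))
```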

---

## Design Principles

| Principle | Implementation |
|-----------|---------------|
| **Zero weight download** | HTTP Range (safetensors), graph-only .onnx, config-only fast path |
| **Framework-driven parsing** | `AutoConfig.from_pretrained()` for config normalization; `onnx.load()` for graph parsing |
| **Graceful degradation** | Every heavy dependency is optional — falls back to built-in parsers |
| **Architecture-agnostic** | Works on dense decoders, GQA models, MoE, vision-language, speech, classification |
| **Single CLI, composable API** | Import any module independently or use the unified CLI |
| **Safe by default** | `trust_remote_code=False`; token in headers not URLs |

---

## Supported Model Families

Validated weekly against **57 models** (29 safetensors + 28 ONNX):

**Safetensors (full header fetch):**

Qwen3.5-{0.8B,4B,9B,27B,35B-A3B,397B-A17B}, Qwen2.5-7B-Instruct, Qwen3-Coder-Next,
DeepSeek-V3.2, Kimi-K2.5, MiniMax-M2.5, GLM-5,
Nemotron-3-{Nano-4B, Super-120B}-{BF16,NVFP4,FP8},
Granite-4.0-1b-speech, BitNet-b1.58-2B-4T, MiroThinker-{1.7,1.7-mini},
Sarvam-{30b,105b}, Reka-edge-2603, LocoTrainer-4B, OmniCoder-9B,
Nanbeige4.1-3B, Param2-17B-A2.4B, gpt-oss-20b, all-MiniLM-L6-v2

**ONNX (graph-only, no weight download):**

Qwen3.5-{0.8B,2B,4B}-ONNX, Qwen3-{4B-VL,VL-2B,Reranker-0.6B}-ONNX,
Qwen2.5-{0.5B,VL-3B}-ONNX, LFM2-24B-A2B, Olmo-Hybrid-{SFT,DPO,Think}-7B,
Voxtral-Mini-4B, Granite-4.0-1b-speech, Nemotron-Nano-4B,
BERT-multilingual-NER, chinese-RoBERTa, multilingual-MiniLMv2, CodeT5,
Jan-code-4b, Josiefied-Qwen3.5-0.8B, IndoBERT-news-classification,
ai-image-detection × 4, vehicle-classification, tmr-text-detector

---

## Contributing

All logic lives in the `modelsig/` package, and each subdirectory has a single responsibility. Tests live in `tests/` and cover 130+ unit and integration scenarios.

```bash
git clone https://github.com/joe0731/modelsig
cd modelsig
uv sync --extra dev       # installs all deps + dev tools
uv run pytest tests/ -v
```

Weekly validation against the full model zoo runs via GitHub Actions (`.github/workflows/weekly-validation.yml`).

---

## Related Projects

- [huggingface_hub](https://github.com/huggingface/huggingface_hub) — HF Hub Python client

- [safetensors](https://github.com/huggingface/safetensors) — safe, zero-copy tensor serialization

- [vLLM](https://github.com/vllm-project/vllm) — high-throughput LLM inference

- [ONNX Runtime](https://github.com/microsoft/onnxruntime) — cross-platform inference accelerator

---

## License

Apache 2.0 — see [LICENSE](LICENSE).