{"id":45304572,"url":"https://github.com/solomonb14d3/knowledge-fidelity","last_synced_at":"2026-02-27T05:00:43.540Z","repository":{"id":339710310,"uuid":"1163081641","full_name":"SolomonB14D3/knowledge-fidelity","owner":"SolomonB14D3","description":"Behavioral auditing toolkit for LLMs: rho-audit measures factual accuracy, bias, sycophancy, toxicity, and reasoning via teacher-forced confidence probes. SVD compression with knowledge preservation. Steering vectors for runtime behavioral control. 12-model merge audit across SLERP/TIES/DARE-TIES/Linear.","archived":false,"fork":false,"pushed_at":"2026-02-24T22:57:29.000Z","size":3575,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-25T06:39:14.163Z","etag":null,"topics":["activation-engineering","behavioral-evaluation","bias-detection","confidence","interpretability","llm-compression","mergekit","model-auditing","model-merging","pytorch","rho-audit","steering-vectors","svd","sycophancy","transformers","truthfulness"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SolomonB14D3.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["SolomonB14D3"]}},"created_at":"2026-02-21T04:14:47.000Z","updated_at":"2026-02-24T17:59:17.000Z","dependencies_parsed_at":"2026-02-25T03:03:12.181Z","dependency_job_id":null,"html_ur
l":"https://github.com/SolomonB14D3/knowledge-fidelity","commit_stats":null,"previous_names":["solomonb14d3/knowledge-fidelity"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/SolomonB14D3/knowledge-fidelity","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolomonB14D3%2Fknowledge-fidelity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolomonB14D3%2Fknowledge-fidelity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolomonB14D3%2Fknowledge-fidelity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolomonB14D3%2Fknowledge-fidelity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SolomonB14D3","download_url":"https://codeload.github.com/SolomonB14D3/knowledge-fidelity/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SolomonB14D3%2Fknowledge-fidelity/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29885795,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-26T23:51:21.483Z","status":"online","status_checked_at":"2026-02-27T02:00:06.759Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["activation-engineering","behavioral-evaluation","bias-detection","confidence","interpretability","llm-compression","mergekit","model-auditing","model-me
rging","pytorch","rho-audit","steering-vectors","svd","sycophancy","transformers","truthfulness"],"created_at":"2026-02-21T06:16:34.304Z","updated_at":"2026-02-27T05:00:43.532Z","avatar_url":"https://github.com/SolomonB14D3.png","language":"Python","funding_links":["https://github.com/sponsors/SolomonB14D3"],"categories":[],"sub_categories":[],"readme":"# rho-eval v2.2: The Behavioral Forensic Suite\n\n[![PyPI](https://img.shields.io/pypi/v/rho-eval)](https://pypi.org/project/rho-eval/)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18743959.svg)](https://doi.org/10.5281/zenodo.18743959)\n[![Tests](https://img.shields.io/badge/tests-180%20passed-brightgreen)]()\n[![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Spaces-Demo-blue)](https://huggingface.co/spaces/bsanch52/knowledge-fidelity-demo)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Awesome](https://img.shields.io/badge/Awesome-LLM--Compression-blue)](https://github.com/HuangOwen/Awesome-LLM-Compression#tools)\n[![MLX](https://img.shields.io/badge/Apple%20Silicon-MLX%20Accelerated-black?logo=apple)](https://github.com/ml-explore/mlx)\n[![Sponsor](https://img.shields.io/badge/Sponsor-%E2%9D%A4-pink?logo=github)](https://github.com/sponsors/SolomonB14D3)\n\n**Mechanistic interpretability, disentangled steering, and the Truth-Gap benchmark.**\n\n\u003e **Current finding:** The Alignment Kill Zone at 44–50% depth is universal across transformer architectures, but the behavioral response is training-dependent. Three archetypes emerge: Modular (Qwen), Entangled (Mistral), and Overridden (Llama) — where the model knows the truth but is trained to suppress it.\n\n## Project Overview\n\nrho-eval is a full-stack research framework for auditing, interpreting, and steering the internal states of large language models. 
Version 2.2 expands to 8 behavioral dimensions with 1,106 probes shipped as JSON — no internet required.\n\n| Module | Purpose |\n|--------|---------|\n| **`rho-audit`** | High-resolution behavioral auditing using teacher-forced confidence probes across 8 dimensions |\n| **`rho-interpret`** | Mechanistic interpretability via SVD subspace extraction and Grassmann angle analysis |\n| **`rho-align`** | Rho-Guided SFT with an auxiliary contrastive loss to preserve knowledge fidelity during alignment |\n| **`rho-steer`** | Disentangled steering using Gated Sparse Autoencoders (SAEs) to isolate monosemantic features |\n| **`rho-bench`** | Fidelity-Bench 2.0: adversarial pressure testing that measures the Truth-Gap |\n\n\u003e *Formerly `knowledge-fidelity`. All v1.x imports still work.*\n\n## The Architecture-Contingent Paradox\n\nOur v2.0 audit across three model families reveals a three-way taxonomy of behavioral anatomy — and a universal geometric structure:\n\n| Model | Behavioral Profile | Kill Zone Response | Cognitive Dissonance |\n|-------|-------------------|-------------------|---------------------|\n| **Qwen 2.5** | Modular | Surgical steering at L17 maximizes truth without bias collapse | 0.653 |\n| **Mistral v0.3** | Entangled | Kill Zone (L14-L18) causes catastrophic bias collapse (−0.460 ρ) | 0.664 |\n| **Llama 3.1** | Overridden | Kill Zone matches Mistral (L14-L16), but sycophancy so extreme (0.047 ρ) steering barely registers | 0.853 |\n\n**The Universal Kill Zone.** All three architectures share an Alignment Kill Zone at 44–50% depth (L14–L16 in 32-layer models, L12–L14 in 28-layer). The zone's *location* is universal; the *response* is training-dependent. Qwen's RLHF created modular representations that survive intervention. Mistral's alignment entangled social awareness with compliance. 
Llama's RLHF enforced compliance so aggressively that the model *knows the truth* (bias ρ = 0.900, highest of all three) but *refuses to act on it* (sycophancy ρ = 0.047, lowest of all three).\n\n**Cognitive Dissonance** (bias ρ − sycophancy ρ) measures the gap between what a model knows and how it behaves. Llama's 0.853 dissonance is 28% higher than either competitor — it has the strongest truth signal and the weakest truth *expression*.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/mistral_layer_heatmap.png\" alt=\"Mistral Alignment Kill Zone\" width=\"600\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eThe Alignment Kill Zone: Layers 14–18 in Mistral destroy bias detection (red) while providing zero sycophancy improvement (orange). Only factual steering at L24 transfers across architectures.\u003c/em\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/cocktail_tradeoff.png\" alt=\"Layer 17 Behavioral Entanglement\" width=\"500\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eLayer 17 interference on Qwen: the slope of −1.37 between sycophancy and bias rho directly measures behavioral entanglement. No cocktail configuration reaches the target zone (green).\u003c/em\u003e\u003c/p\u003e\n\n### Llama 3.1: The \"Overridden\" Archetype\n\nLlama-3.1-8B-Instruct presents the most extreme behavioral profile in our study. 
Its baseline audit reveals a model that *knows the truth* but has been trained to *never express it*:\n\n| Behavior | Llama ρ | Qwen ρ | Mistral ρ | Llama Rank |\n|----------|:-------:|:------:|:---------:|:----------:|\n| **Bias** | **+0.900** | +0.773 | +0.797 | 1st |\n| Factual | +0.486 | +0.474 | +0.585 | 2nd |\n| Toxicity | +0.510 | +0.521 | — | 2nd |\n| Reasoning | +0.100 | — | — | — |\n| **Sycophancy** | **+0.047** | +0.120 | +0.133 | **Last** |\n\nLlama's sycophancy ρ of 0.047 means it agrees with the user on 95% of false claims — including claims it can correctly identify as false (bias ρ = 0.900). The layer heatmap confirms why steering cannot help:\n\n| Layer | Depth | Factual ρ | Syc ρ | Bias ρ | ΔSyc | ΔBias |\n|:-----:|:-----:|:---------:|:-----:|:------:|:----:|:-----:|\n| 10 | 31% | 0.489 | 0.053 | 0.917 | +0.007 | +0.020 |\n| 12 | 38% | 0.492 | 0.120 | 0.813 | +0.073 | −0.083 |\n| **14** | **44%** | 0.435 | 0.133 | **0.487** | +0.087 | **−0.410** |\n| **16** | **50%** | 0.500 | 0.013 | **0.507** | −0.033 | **−0.390** |\n| 18 | 56% | 0.478 | 0.007 | 0.893 | −0.040 | −0.003 |\n| 20 | 62% | 0.506 | 0.027 | 0.900 | −0.020 | +0.003 |\n| 24 | 75% | 0.492 | 0.047 | 0.893 | +0.000 | −0.003 |\n| 28 | 88% | 0.489 | 0.060 | 0.897 | +0.013 | +0.000 |\n\nBaselines: factual=0.487, sycophancy=0.047, bias=0.897. The Kill Zone at L14-L16 mirrors Mistral exactly (bias collapses from 0.90 to 0.49-0.51), but L28-L30 are completely inert — falsifying the \"Late-Stage Filter\" hypothesis. The sycophancy override is not implemented by late layers; it pervades the entire forward pass. The trade-off slope is −4.7 (vs Qwen's −1.37), meaning Llama's entanglement is 3.4× steeper.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/llama_layer_heatmap.png\" alt=\"Llama Overridden Archetype\" width=\"600\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eLlama-3.1-8B: The \"Overridden\" Archetype. 
Bias (purple) and sycophancy (orange) are separated by 0.853 cognitive dissonance — the model knows truth but won't express it. Kill Zone at L14-L16 matches Mistral.\u003c/em\u003e\u003c/p\u003e\n\n### Mechanistic Interpretability: SVD Subspace Analysis\n\nWe decompose Llama-3.1-8B's activation space into behavioral subspaces via SVD at 6 layers (L8, L12, L16, L20, L24, L28), computing Grassmann principal angles between all behavior pairs. This reveals the geometric structure underlying the Overridden archetype.\n\n**Grassmann angles confirm bias↔sycophancy entanglement.** The principal angle between bias and sycophancy subspaces is the smallest of any behavior pair across all layers (81.3°–84.5°), while all other pairs remain near-orthogonal (83.5°–86.7°). This is the geometric signature of Cognitive Dissonance: truth and compliance directions are partially overlapping in activation space.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/interpretability/overlap_subspace_angles_llama_3.1_8b_instruct.png\" alt=\"Grassmann Angle Heatmaps\" width=\"700\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eGrassmann principal angles between behavioral subspaces at each layer. Bias↔sycophancy (81–84°) is consistently the most entangled pair. Near-orthogonal pairs (86°+) can be steered independently.\u003c/em\u003e\u003c/p\u003e\n\n**Truth is concentrated; compliance spreads.** Bias subspaces have effective dimensionality 2–6 (the first singular value explains 56–89% of variance), meaning truth knowledge lives in a near-rank-1 direction. 
Sycophancy peaks at dim=9 at L16 — exactly the Kill Zone — where compliance behavior spreads across the maximum number of directions, giving it maximum leverage over the concentrated truth signal.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/interpretability/dimensionality_llama_3.1_8b_instruct.png\" alt=\"Subspace Dimensionality\" width=\"500\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003eEffective dimensionality (90% variance threshold) per behavior across layers. Bias (blue) is concentrated; sycophancy (green) peaks at L16 (Kill Zone); factual and toxicity saturate at dim=10.\u003c/em\u003e\u003c/p\u003e\n\n**Surgical rank-1 steering confirms cross-behavior contamination.** Applying a single rank-1 factual direction at L8 produces +0.046 factual improvement but simultaneously +0.220 sycophancy increase and −0.257 bias collapse. You cannot touch one behavioral direction without destabilizing others — the subspaces physically overlap in Llama's representation. The best sycophancy gain from rank-1 steering is +0.047 at L16, but it costs −0.087 in bias (1.85:1 damage ratio).\n\n**Individual \"truth heads\" identified.** Head attribution reveals L16/H30 (importance 0.705) as the single most important head for bias encoding — sitting directly in the Kill Zone. The top sycophancy head is L8/H13 (importance 0.702), concentrated in early layers. This separation explains why early-layer interventions are catastrophic: they contaminate the sycophancy signal before it reaches the truth-encoding heads.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/interpretability/head_importance_llama_3.1_8b_instruct.png\" alt=\"Head Attribution\" width=\"700\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cem\u003ePer-head importance scores across 6 layers × 32 heads for each behavior. 
Sparse \"hot\" heads indicate concentrated encoding (bias, sycophancy) vs diffuse encoding (factual, toxicity).\u003c/em\u003e\u003c/p\u003e\n\n## Fidelity-Bench 2.0: Measuring the Truth-Gap\n\nWe introduce the Truth-Gap, a metric quantifying how much factual integrity a model sacrifices under social pressure:\n\n```\nDelta_F = rho_baseline - rho_pressured\n```\n\nUsing our 120-probe adversarial suite (Logic, Social, Clinical), we provide **Model Fidelity Certificates** that move evaluation from \"accuracy\" to \"robustness under duress.\" Six pressure levels escalate from neutral queries through flattery, authority claims, social pressure, gaslighting, and maximum combined pressure.\n\n```bash\nrho-bench Qwen/Qwen2.5-7B-Instruct\n\n  Fidelity-Bench 2.0: Qwen/Qwen2.5-7B-Instruct\n  Grade: B   Composite: 0.682 [0.651, 0.710]\n\n  Truth-Gap Analysis\n  Domain      Baseline  Pressured     DF  Unbreak\n  logic        +0.8200    +0.7100  +0.1100     62%\n  social       +0.7500    +0.4800  +0.2700     31%\n  clinical     +0.8800    +0.7900  +0.0900     71%\n  overall      +0.8200    +0.6600  +0.1600     55%\n```\n\n**Floor-effect caveat.** The \"% unbreakable\" metric requires careful interpretation. Llama-3.1-8B-Instruct scores 78% unbreakable with a Grade F (composite 0.091) — not because it resists pressure, but because its baseline truth scores are already near zero (logic ρ = 0.08, social ρ = 0.03). A model that starts at the floor cannot break further. The Truth-Gap ΔF is small (0.03) because there was no truth to lose. Compare with Qwen's 55% unbreakable at Grade B — those probes genuinely maintained truth under maximum pressure. Future versions should distinguish \"resilient\" (high baseline, survived pressure) from \"pre-collapsed\" (low baseline, nothing left to break).\n\n## Associated Papers\n\n\u003e Sanchez, B. (2026). 
*Rho-Guided Supervised Fine-Tuning: Post-Training Repair of Calibration Damage in Large Language Models.* [`paper/rho_guided_sft.md`](paper/rho_guided_sft.md)\n\nStandard SFT inverts toxicity discrimination ($\\rho = +0.145 \\to -0.003$, $n = 5$ seeds). Adding a contrastive auxiliary loss repairs this with a monotonic dose-response ($\\rho = +1.137$ at $\\lambda_\\rho = 0.5$). Across a 5-seed ablation on 8 behavioral dimensions: rho-guided vs SFT-only achieves $d = 10.8$ on toxicity and $d = 13.7$ on bias ($p \u003c 0.0001$). Contrastive-only training erodes refusal by $\\Delta\\rho = -0.084$ ($d = -8.4$, $p = 0.0005$), while rho-guided SFT preserves it ($\\Delta\\rho = +0.014$). The margin $\\gamma = 0.1$ is necessary: without it, bias goes negative.\n\n\u003e Sanchez, B. (2026). *Behavioral Entanglement in Transformers: SAE-Based Disentanglement and the Architecture-Contingent Nature of Sycophancy.* Zenodo. [doi:10.5281/zenodo.18743959](https://doi.org/10.5281/zenodo.18743959)\n\nThe diagnostic-to-intervention pipeline: behavioral auditing, SVD subspace analysis, Gated SAE disentanglement, and Fidelity-Bench 2.0 validation.\n\nSee also: Sanchez, B. (2026). *Confidence Cartography: Teacher-Forced Probability as a False-Belief Sensor in Language Models.* [doi:10.5281/zenodo.18703506](https://doi.org/10.5281/zenodo.18703506)\n\n## Key Findings\n\n- **Standard SFT inverts confidence calibration.** A single epoch of SFT on Qwen2.5-7B-Instruct flips toxicity discrimination from $\\rho = +0.145$ to $\\rho = -0.003$ (5 seeds, $p \u003c 0.001$). Rho-guided SFT repairs this: $d = 10.8$ on toxicity and $d = 13.7$ on bias vs SFT-only ($p \u003c 0.0001$, 5 seeds). The effect replicates on Llama-3.1-8B. 
Variance collapse accompanies the repair: factual $\\sigma$ drops from $0.105$ (SFT-only) to $0.039$ (rho-guided), a 63% reduction.\n- **Contrastive-only training erodes refusal capability.** Training with only the contrastive loss (no SFT) erodes model refusal by $\\Delta\\rho = -0.084$ ($d = -8.4$, $p = 0.0005$), while the full rho-guided method preserves it ($\\Delta\\rho = +0.014$). The SFT component acts as a \"refusal buffer\" that prevents the contrastive gradient from stripping safety-trained refusal behavior.\n- **The margin $\\gamma$ is structurally necessary.** Without the hinge margin ($\\gamma = 0$), rho-guided SFT drives bias negative ($\\Delta\\rho = -0.011$). With $\\gamma = 0.1$, bias stays positive ($\\Delta\\rho = +0.034$). The margin prevents the contrastive loss from over-optimizing past the natural separation boundary.\n- **The Alignment Kill Zone is universal.** Layers at 44–50% depth (L14-L16 in 32-layer models) form a Kill Zone across all three architectures tested (Qwen, Mistral, Llama). Steering at these layers collapses bias detection by −0.39 to −0.46 ρ regardless of model family. The zone's location is a geometric property of transformers; the behavioral response is determined by training.\n- **Three distinct behavioral archetypes emerge from alignment training.** Qwen (Modular): clean separation allows surgical steering. Mistral (Entangled): social awareness and compliance share the same manifold. Llama (Overridden): the model knows truth (bias ρ = 0.900) but is trained to suppress it (sycophancy ρ = 0.047). Cognitive dissonance (bias − sycophancy) ranges from 0.653 (Qwen) to 0.853 (Llama).\n- **Factual representations are architecturally universal.** Factual steering at ~75% depth improves ρ on both Qwen (+0.152 at L24) and Mistral (+0.117 at L24). 
The optimal layer percentage is identical despite different total layer counts.\n- **Sycophancy suppression via activation steering is architecture-contingent.** The Layer 17 sweet spot is Qwen-specific (ρ 0.120 to 0.413, a 3.4× gain). On Mistral, no layer achieves meaningful improvement. On Llama, the sycophancy override pervades the entire forward pass — no single layer controls it.\n- **Social compliance and social awareness share representational capacity.** The slope of −1.37 between sycophancy ρ and bias ρ across the cocktail grid directly measures behavioral entanglement at Layer 17. Llama's slope is −4.7 (3.4× steeper).\n- **Behavioral subspaces have distinct geometries.** SVD decomposition reveals bias (truth) occupies a concentrated 2–6 dimensional subspace (near-rank-1 at early layers), while sycophancy (compliance) spreads across up to 9 dimensions — peaking at the Kill Zone (L16). Grassmann angles between bias and sycophancy are 81–84° (partially overlapping), while all other behavior pairs are near-orthogonal (85–87°). Rank-1 surgical steering at L8 produces +0.046 factual but −0.257 bias collateral, confirming the subspaces physically overlap.\n- **SVD compression can improve factual discrimination.** Truncated SVD acts as a denoiser: at a 60% ratio it boosts Mandela probe ρ by +0.514 on Qwen2.5-0.5B (0.257 to 0.771), and at a 70% ratio it yields smaller gains on Qwen2.5-7B-Instruct (+0.114) and Mistral-7B-v0.1 (+0.057).\n- **Merge methods cause behavioral trade-offs invisible to standard benchmarks.** DARE-TIES destroys alignment on Qwen but improves it on Mistral. DELLA completely breaks the model. 
Only behavioral evaluation catches these failures.\n\n---\n\n## Quick Start\n\n```bash\npip install rho-eval\n```\n\n### Python API (one-liner)\n\n```python\nimport rho_eval\n\n# Audit any model across all 8 behaviors\nreport = rho_eval.audit(\"Qwen/Qwen2.5-7B-Instruct\")\nprint(report)\n# \u003cAuditReport model='Qwen/Qwen2.5-7B-Instruct' behaviors=8 mean_ρ=0.5346 status=WARN\u003e\n\n# Or specific behaviors with a pre-loaded model\n# (model and tokenizer must already be loaded before this call)\nreport = rho_eval.audit(model=model, tokenizer=tokenizer, behaviors=[\"factual\", \"bias\"])\n\n# Compare two models\nbaseline = rho_eval.audit(\"Qwen/Qwen2.5-7B-Instruct\")\ncompressed = rho_eval.audit(\"my-compressed-model\")\ndelta = rho_eval.compare(compressed, baseline)\nprint(delta.to_table())\n\n# List available behaviors\nrho_eval.list_behaviors()\n# ['bias', 'deception', 'factual', 'overrefusal', 'reasoning', 'refusal', 'sycophancy', 'toxicity']\n```\n\n### CLI\n\n```bash\n# Full behavioral report card (8 dimensions)\nrho-eval Qwen/Qwen2.5-7B-Instruct --behaviors all\n\n# Quick factual-only check\nrho-eval my-merged-model/ --behaviors factual --format json\n\n# Compare a compressed model against baseline\nrho-eval compressed-model/ --compare baseline.json\n\n# Export as markdown/csv\nrho-eval my-model/ --format markdown --output report.md\n\n# Discover available behaviors and probes\nrho-eval --list-behaviors\nrho-eval --list-probes\n```\n\n### SVD Compression + Audit (legacy)\n\n```python\nfrom rho_eval import compress_and_audit\n\nreport = compress_and_audit(\"Qwen/Qwen2.5-7B-Instruct\", ratio=0.7)\nprint(f\"Retention: {report['retention']:.0%} | \"\n      f\"False-belief signal: rho={report['rho_after']:.3f}\")\n# Retention: 100% | False-belief signal: rho=0.725\n```\n\nAuto-find the compression ratio that maximizes factual signal:\n\n```bash\nrho-compress Qwen/Qwen2.5-0.5B --denoise\n# DENOISING DETECTED: Mandela rho 0.257 → 0.771 (+0.514) at 60% ratio\n```\n\n## Why This Exists\n\nLLM compression is 
everywhere. Knowledge auditing is rare. Nobody checks both at once.\n\nWhen you quantize or prune a model, you run HellaSwag and call it a day. But benchmarks don't tell you whether the model now thinks the Berenstain Bears are spelled \"Berenstein\" or that vaccines cause autism. **Knowledge Fidelity does.**\n\nTwo sensors, one toolkit:\n\n| Sensor | What it measures | How |\n|--------|-----------------|-----|\n| **Structural** (SVD) | Which weights encode facts | Gradient importance on factual probes |\n| **Behavioral** (Confidence) | Whether the model believes truth vs myths | Teacher-forced probability on true/false pairs |\n\nThe key insight: the same set of factual probes drives both. Compress with awareness of what matters, then verify nothing broke.\n\n## Results\n\nAll results on Apple Silicon (M3 Ultra). Three model families validated.\n\n### SVD Compression (CF90)\n\nMulti-seed validation at 70% rank, 3 seeds:\n\n| Metric | Qwen2.5-0.5B | Qwen2.5-7B-Instruct | Mistral-7B-v0.1 |\n|--------|:------------:|:-------------------:|:---------------:|\n| Retention | **95%** ± 0% | **100%** ± 0% | **95%** ± 0% |\n| ρ before | 0.821 | 0.746 | 0.743 |\n| ρ after | 0.720 | 0.725 | 0.705 |\n| ρ drop | 0.101 ± 0.000 | **0.021** ± 0.000 | 0.038 ± 0.000 |\n| Matrices compressed | 72 | 84 | 96 |\n| Layers frozen | 18/24 | 21/28 | 24/32 |\n\nCF90 generalizes across architectures: 95–100% retention with minimal ρ loss at all scales.\n\n### SVD as a Denoiser\n\n**SVD compression can _improve_ the Mandela effect signal** — confirmed across two model families:\n\n| Model | Baseline Mandela ρ | Best compressed ρ | Optimal ratio |\n|-------|:-------------------:|:-------------------:|:-------------:|\n| Qwen2.5-7B-Instruct | 0.829 | **0.943** (+0.114) | 70% |\n| Mistral-7B-v0.1 | 0.771 | **0.829** (+0.057) | 70% |\n| Qwen2.5-0.5B | 0.257 | **0.771** (+0.514) | 60% |\n\nAt 70% rank, truncated SVD strips noise from attention projections while preserving the principal signal 
directions that encode factual knowledge. The `--denoise` flag auto-discovers the optimal ratio.\n\n### Behavioral Localization (Freeze-Ratio Sweep)\n\nDifferent behaviors are encoded in different layer regions. By fixing SVD compression at 70% and varying how many bottom layers are frozen during LoRA recovery (rank 8, 100 steps), we can map where each behavior lives:\n\n| Behavior | Baseline ρ | f=0% | f=25% | f=50% | f=75% | f=90% | Best freeze | Location |\n|----------|:----------:|:----:|:-----:|:-----:|:-----:|:-----:|:-----------:|----------|\n| Factual  | 0.474 | +0.031 | +0.050 | +0.054 | **+0.072** | +0.050 | 75% | Early-layer |\n| Toxicity | 0.521 | −0.005 | −0.005 | −0.005 | −0.007 | −0.008 | — | Immovable |\n| Bias     | 0.773 | +0.077 | **+0.093** | +0.080 | +0.023 | +0.027 | 25% | Late-layer |\n| Sycophancy | 0.120 | −0.007 | −0.007 | **+0.027** | +0.027 | +0.027 | 50% | Early-layer |\n| Reasoning | 0.010 | +0.030 | +0.020 | **+0.040** | +0.020 | +0.000 | 50% | Late-layer |\n\n**Key insight:** Factual knowledge peaks when 75% of layers are frozen (only the top 7 of 28 layers adapt) — meaning facts are concentrated in early attention layers. Bias detection peaks at 25% freeze (21 layers adapt) — it needs late-layer flexibility. Toxicity detection is immovable regardless of freeze ratio.\n\nModel: Qwen2.5-7B-Instruct. All deltas are ρ(compressed) − ρ(baseline).\n\n![Behavioral Localization](figures/freeze_sweep_7b.png)\n\n### Merge Method Audit (12 models, 2 architectures, 6 merge methods)\n\nWhat happens to behavioral traits when you merge models? 
Standard benchmarks (MMLU, HumanEval) won't tell you — but `rho-audit` will.\n\n#### Qwen2.5-7B-Instruct + Qwen2.5-Coder-7B (Yuuta208 series)\n\n| Method | Factual ρ | Bias ρ | Sycophancy ρ | Trade-off |\n|--------|:---------:|:------:|:------------:|-----------|\n| Baseline | 0.474 | **0.773** | 0.120 | — |\n| **Linear** | **0.710** | 0.377 | **0.380** | **Best overall balance** |\n| SLERP | 0.517 | 0.613 | 0.140 | Mild, balanced |\n| Task Arithmetic | 0.626 | 0.443 | 0.347 | Strong factual + sycophancy, good bias |\n| TIES | 0.546 | 0.363 | 0.280 | High factual/sycophancy, low bias |\n| DARE-TIES | 0.612 | 0.203 | 0.007 | Extreme factual, destroyed alignment |\n\n*DELLA merge produced degenerate output (factual=NaN, all behaviors 0.000) and is omitted. The layer-wise density pruning completely destroyed this merge pair.*\n\n#### Mistral-7B-Instruct + OpenOrca (jpquiroga series)\n\n| Method | Factual ρ | Bias ρ | Sycophancy ρ | Trade-off |\n|--------|:---------:|:------:|:------------:|-----------|\n| Baseline | 0.576 | 0.407 | 0.080 | — |\n| SLERP | 0.511 | **0.940** | 0.093 | Bias detection more than doubled |\n| TIES | 0.477 | 0.927 | 0.127 | Similar to SLERP |\n| DARE-TIES | 0.502 | 0.933 | 0.107 | Bias preserved (unlike Qwen!) |\n\n#### Cross-Architecture Baselines\n\n| Model | Factual ρ | Bias ρ | Sycophancy ρ |\n|-------|:---------:|:------:|:------------:|\n| Qwen2.5-7B-Instruct | 0.474 | 0.773 | 0.120 |\n| Mistral-7B-v0.1 | 0.576 | 0.407 | 0.080 |\n| Llama-3.1-8B-Instruct | 0.487 | **0.897** | 0.047 |\n\n![Merge Tradeoffs](figures/merge_tradeoffs.png)\n\n**Key findings:**\n\n1. **Linear merging is the best balanced hybrid** on Qwen — highest factual (0.710) and sycophancy resistance (0.380) of any method, while retaining usable bias detection.\n2. **Merge effects are architecture-dependent.** On Qwen, every merge degrades bias detection. On Mistral, merging *improves* bias detection from 0.407 to 0.940 — a 2.3x gain. 
The same method (DARE-TIES) destroys bias on Qwen but preserves it on Mistral.\n3. **Aggressive pruning strips alignment signals.** DARE-TIES on Qwen achieves high factual (0.612) but destroys bias detection (−0.570) and sycophancy resistance (0.007).\n\n**Takeaway for practitioners:** If you're merging models, run `rho-audit` before and after. Standard benchmarks won't catch these behavioral regressions.\n\n### Activation Steering (Contrastive Activation Addition)\n\nSteering vectors extracted from the same ρ probes used for auditing can modify model behavior at inference time. We sweep 6 layers × 8 alpha values = 48 configurations per behavior on Qwen2.5-7B-Instruct.\n\n#### Best configurations per behavior\n\n| Behavior | Baseline ρ | Best ρ | Δρ | Best config |\n|----------|:----------:|:------:|:--:|-------------|\n| Factual | 0.474 | **0.626** | +0.152 | Layer 24 (86%), α=+4.0 |\n| Sycophancy | 0.120 | **0.413** | +0.293 | Layer 17 (61%), α=+4.0 |\n| Bias | 0.773 | **0.810** | +0.037 | Layer 14 (50%), α=−4.0 |\n\n#### Layer 17 is a behavioral bottleneck\n\nThe strongest result in the entire steering experiment is Layer 17 at α=+4.0, which improves sycophancy ρ from 0.120 to 0.413 — a **3.4× gain**. But the same layer is also a catastrophic failure point for bias: Layer 17 at α=−4.0 collapses bias ρ from 0.773 to 0.337 (−0.437). Even the sycophancy-optimal configuration (Layer 17, α=+4.0) reduces bias to 0.543.\n\nThis reveals a fundamental trade-off: **steering that triples sycophancy resistance simultaneously halves bias detection at the same layer**. Layer 17 sits at a transition point where multiple behavioral traits share representational capacity.\n\n#### Directional control confirmed\n\nLayer 21 at α=−4.0 drops sycophancy ρ to 0.073, below the already-low baseline. 
This confirms that the steering vector is a specific directional control — pushing the same vector in the wrong direction collapses the signal.\n\n#### Full sycophancy steering sweep\n\n| Layer | −4 | −2 | −1 | −0.5 | +0.5 | +1 | +2 | +4 |\n|-------|:--:|:--:|:--:|:----:|:----:|:--:|:--:|:--:|\n| 7 (25%) | 0.120 | 0.120 | 0.120 | 0.120 | 0.120 | 0.127 | 0.133 | 0.133 |\n| 10 (36%) | 0.120 | 0.120 | 0.120 | 0.120 | 0.120 | 0.120 | 0.133 | 0.133 |\n| 14 (50%) | 0.127 | 0.113 | 0.113 | 0.120 | 0.133 | 0.133 | 0.140 | 0.147 |\n| 17 (61%) | 0.193 | 0.127 | 0.107 | 0.120 | 0.160 | 0.173 | 0.240 | **0.413** |\n| 21 (75%) | 0.073 | 0.127 | 0.120 | 0.120 | 0.147 | 0.153 | 0.160 | 0.187 |\n| 24 (86%) | 0.127 | 0.127 | 0.120 | 0.120 | 0.133 | 0.140 | 0.140 | 0.147 |\n\nSycophancy baseline ρ = 0.120. Only Layer 17 produces a large effect; all other layers show near-zero response.\n\n#### Multi-vector steering cocktails (Layer 17 interference)\n\nThe single-vector results above reveal a paradox: the best sycophancy steering config (Layer 17, α=+4.0) simultaneously collapses bias detection. 
Can we resolve this by applying multiple steering vectors at different layers?\n\nWe test \"steering cocktails\" — sycophancy correction at Layer 17 combined with bias stabilization at Layer 14 — across a 3×3 alpha grid:\n\n| syc α (L17) | bias α (L14) | Factual ρ | Sycophancy ρ | Bias ρ |\n|:-----------:|:------------:|:---------:|:------------:|:------:|\n| +1.0 | −1.0 | 0.464 | 0.167 | 0.740 |\n| +1.0 | −2.0 | 0.464 | 0.167 | 0.743 |\n| +1.0 | −4.0 | 0.462 | 0.173 | **0.760** |\n| +2.0 | −1.0 | 0.464 | 0.227 | 0.653 |\n| +2.0 | −2.0 | 0.464 | 0.220 | 0.663 |\n| +2.0 | −4.0 | 0.463 | 0.213 | 0.687 |\n| +4.0 | −1.0 | 0.459 | 0.407 | 0.403 |\n| +4.0 | −2.0 | 0.454 | **0.413** | 0.407 |\n| +4.0 | −4.0 | 0.455 | **0.433** | 0.397 |\n\nBaselines: factual=0.474, sycophancy=0.120, bias=0.773.\n\n![Cocktail Trade-off](figures/cocktail_tradeoff.png)\n\n**The trade-off is structural, not tunable.** Each +0.1 gain in sycophancy ρ costs 0.137 in bias ρ (slope = −1.37). The L14 bias vector provides \u003c0.03 ρ compensation regardless of alpha strength — upstream stabilization cannot counteract representational collapse at Layer 17. No configuration in the grid meets both targets (sycophancy ρ ≥ 0.35 *and* bias ρ ≥ 0.70).\n\nAdding a third factual vector at Layer 24 (best triple: α=+2.0) improves factual ρ to 0.489 without disrupting the sycophancy-bias balance, but at α=+4.0 it destroys sycophancy (ρ → 0.04) — confirming that Layer 24 factual steering also interferes with sycophancy representations downstream.\n\n**Interpretation: behavioral decoupling, not failure.** Layer 17 functions as a *social intelligence toggle*. The sycophancy-suppression direction physically overlaps with the bias-detection manifold — social compliance and social awareness share representational capacity at this depth. But factual discrimination is *preserved* (ρ stable within 3% of baseline across all configs). 
Combined with factual steering at Layer 24 (ρ → 0.621 at α=+4.0), this enables a **truth-maximization mode**: 3.4× more resistant to user manipulation with 31% improved factual signal, at the cost of social bias awareness. For forensic, scientific, or adversarial-testing contexts where social compliance is undesirable, the Layer 17 trade-off is a *feature* — a controllable dial between social intelligence and raw factual output.\n\n#### Cross-model validation: Mistral-7B confirms this is architecture-specific\n\nApplying the same null-point cocktail (syc α=+2.0, bias α=−4.0; the grid row matching the Qwen Steered column) to Mistral-7B-Instruct-v0.3 (layers mapped by depth percentage: Qwen L17→Mistral L19, Qwen L14→Mistral L16):\n\n| Behavior | Qwen Baseline | Qwen Steered | Mistral Baseline | Mistral Steered |\n|----------|:------------:|:------------:|:----------------:|:---------------:|\n| Factual | 0.474 | 0.463 | 0.585 | **0.618** (+0.033) |\n| Sycophancy | 0.120 | 0.213 (+0.093) | 0.133 | 0.093 (−0.040) |\n| Bias | 0.773 | 0.687 (−0.086) | 0.797 | 0.493 (**−0.304**) |\n\n**The decoupling is Qwen-specific.** The same recipe that nearly doubles sycophancy ρ on Qwen (0.120→0.213) makes Mistral *more* sycophantic (0.133→0.093). Bias collapse is 3.5× worse on Mistral (−0.304 vs −0.086). Only factual steering transfers — it improves ρ on both architectures.\n\nThis means the Layer 17 social-intelligence coupling is a property of Qwen's training (likely RLHF/DPO alignment), not a universal transformer feature. Mistral's sycophancy and bias representations live in different geometric relationships at the equivalent depth. 
**Steering vectors are not portable across architectures** — each model family requires its own behavioral map.\n\n#### Mistral layer heatmap: sycophancy has no safe home\n\nTo confirm the cross-model finding, we swept the sycophancy steering vector across every 2nd layer of Mistral-7B (L10–L30, α=+4.0), measuring all three behaviors at each point:\n\n| Layer | Depth | Factual ρ | Sycophancy ρ | Bias ρ | ΔSyc | ΔBias |\n|:-----:|:-----:|:---------:|:------------:|:------:|:----:|:-----:|\n| 10 | 31% | 0.581 | 0.133 | **0.820** | +0.000 | +0.023 |\n| 12 | 38% | 0.591 | 0.140 | 0.783 | +0.007 | −0.013 |\n| 14 | 44% | 0.553 | 0.147 | 0.460 | +0.013 | **−0.337** |\n| 16 | 50% | 0.573 | 0.053 | 0.337 | **−0.080** | **−0.460** |\n| 18 | 56% | 0.635 | 0.127 | 0.427 | −0.007 | −0.370 |\n| 20 | 62% | 0.651 | 0.093 | 0.720 | −0.040 | −0.077 |\n| 22 | 69% | 0.640 | 0.100 | 0.760 | −0.033 | −0.037 |\n| 24 | 75% | **0.702** | 0.080 | 0.787 | −0.053 | −0.010 |\n| 26 | 81% | 0.658 | 0.133 | 0.720 | +0.000 | −0.077 |\n| 28 | 88% | 0.599 | 0.127 | 0.757 | −0.007 | −0.040 |\n| 30 | 94% | 0.642 | 0.107 | 0.273 | −0.027 | **−0.523** |\n\nBaselines: factual=0.585, sycophancy=0.133, bias=0.797.\n\n![Mistral Sensitivity Map](figures/mistral_sensitivity_map.png)\n\n![Mistral Layer Heatmap](figures/mistral_layer_heatmap.png)\n\n**Sycophancy suppression via activation steering is architecture-contingent.** Do not apply Qwen steering recipes to Mistral — they will not work. No layer produces meaningful sycophancy improvement — the best gain is +0.013 (L14), which is noise-level and comes with catastrophic bias collapse (−0.337). The \"kill zone\" at L14–L18 (44–56% depth) destroys bias detection while providing zero sycophancy benefit. 
At L16 (50% depth), sycophancy actually gets *worse* (−0.080) while bias collapses by −0.460.\n\n**Factual steering transfers across architectures.** Layer 24 (75% depth) boosts factual ρ by +0.117 with minimal bias damage (−0.010), confirming the cross-model finding: factual representations at ~75% depth are an architectural universal, while sycophancy representations are training-specific.\n\n**Vector norms grow monotonically with depth** (0.056 at L10 → 6.591 at L30), but larger norm does not mean better steering — L16 has a moderate norm (1.019) but the worst behavioral impact. The sycophancy contrast is simply not encoded in a steerable direction at any Mistral layer.\n\n### Additional Results\n\n\u003cdetails\u003e\n\u003csummary\u003eJoint Ablation: Compression Ratio vs Confidence (click to expand)\u003c/summary\u003e\n\n#### Qwen2.5-0.5B\n\n| Ratio | Default ρ | Mandela ρ | Medical ρ |\n|:-----:|:-----------:|:-----------:|:-----------:|\n| 50% | 0.821 → 0.761 | 0.257 → 0.714 | 0.100 → 0.700 |\n| 60% | 0.821 → 0.714 | 0.257 → 0.771 | 0.100 → 0.900 |\n| 70% | 0.821 → 0.720 | 0.257 → 0.771 | 0.100 → 0.100 |\n| 80% | 0.821 → 0.690 | 0.257 → 0.257 | 0.100 → 0.600 |\n| 90% | 0.821 → 0.821 | 0.257 → 0.371 | 0.100 → 0.100 |\n| 100% | 0.821 → 0.821 | 0.257 → 0.257 | 0.100 → 0.100 |\n\n#### Qwen2.5-7B-Instruct\n\n| Ratio | Default ρ | Mandela ρ | Medical ρ |\n|:-----:|:-----------:|:-----------:|:-----------:|\n| 50% | 0.746 → 0.689 | 0.829 → 0.771 | −0.700 → 0.600 |\n| 70% | 0.746 → 0.725 | 0.829 → **0.943** | −0.700 → −0.600 |\n| 90% | 0.746 → 0.713 | 0.829 → **0.943** | −0.700 → −0.900 |\n| 100% | 0.746 → 0.746 | 0.829 → 0.829 | −0.700 → −0.700 |\n\n#### Mistral-7B-v0.1\n\n| Ratio | Default ρ | Mandela ρ | Medical ρ |\n|:-----:|:-----------:|:-----------:|:-----------:|\n| 50% | 0.743 → 0.686 | 0.771 → 0.771 | 0.300 → 0.300 |\n| 60% | 0.743 → 0.723 | 0.771 → 0.771 | 0.300 → 0.400 |\n| 70% | 0.743 → 0.705 | 0.771 → **0.829** | 0.300 → 0.400 |\n| 80% | 0.743 → 
0.729 | 0.771 → 0.771 | 0.300 → 0.300 |\n| 90% | 0.743 → 0.743 | 0.771 → 0.771 | 0.300 → 0.300 |\n| 100% | 0.743 → 0.743 | 0.771 → 0.771 | 0.300 → 0.300 |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eFidelity-Bench Baseline Comparison (click to expand)\u003c/summary\u003e\n\n| Category | Qwen-0.5B | Qwen-7B | Mistral-7B |\n|----------|:---------:|:-------:|:----------:|\n| default (20) | 0.821, 80% | 0.746, — | 0.743, 85% |\n| mandela (6) | 0.257, 50% | 0.829, — | 0.771, 67% |\n| medical (5) | 0.100, 80% | —, — | 0.300, 80% |\n| commonsense (10) | 0.261, 70% | —, — | 0.503, 40% |\n| truthfulqa (15) | 0.596, 40% | —, — | 0.586, 47% |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eScale-Dependent Findings (click to expand)\u003c/summary\u003e\n\n| Finding | 0.5B | 7B (Qwen) | 7B (Mistral) |\n|---------|:----:|:---------:|:------------:|\n| Mandela baseline ρ | 0.257 (weak) | **0.829** (strong) | 0.771 (strong) |\n| CF90 ρ drop | 0.101 (moderate) | **0.021** (minimal) | 0.038 (small) |\n| CF90 retention | 95% | **100%** | 95% |\n| SVD denoising on Mandela | +0.514 ρ | **+0.114 ρ** | +0.057 ρ |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ePrior Results from Component Projects (click to expand)\u003c/summary\u003e\n\nFrom [intelligent-svd](https://github.com/SolomonB14D3/intelligent-svd) and [confidence-cartography](https://github.com/SolomonB14D3/confidence-cartography):\n\n| Finding | Result |\n|---------|--------|\n| Confidence correlates with human false-belief prevalence | ρ=0.652, p=0.016 (Pythia 160M–12B) |\n| Out-of-domain medical claims | 88% accuracy at 6.9B |\n| Targeted resampling at low-confidence tokens | Outperforms uniform best-of-N |\n| CF90 + INT8 stacking | 72–77% retention (Qwen-0.5B, Llama-7B) |\n| Importance-guided SVD at 50% rank | 3× better retention than standard SVD |\n\n\u003c/details\u003e\n\n### Compression Safety Guide\n\n| Layer Type | Safe to Compress | Notes 
|\n|------------|------------------|-------|\n| **Q, K, O projections** | Yes at 70% rank | Main target |\n| **V projection** | 90–95% only | Marginal gains, high risk below 90% |\n| **MLP layers** | **Never** | Destroys model at any compression level |\n\n## Install\n\n```bash\npip install rho-eval                    # Core (auditing + SVD + probes)\npip install \"rho-eval[cartography]\"     # + confidence analysis + plots\npip install \"rho-eval[demo]\"            # + Gradio demo app\npip install \"rho-eval[full]\"            # Everything including MLX\n```\n\nOr from source:\n\n```bash\ngit clone https://github.com/SolomonB14D3/knowledge-fidelity\ncd knowledge-fidelity\npip install -e \".[full]\"\n```\n\n\u003e **Upgrading from v1.x?** `pip install rho-eval` replaces `knowledge-fidelity`. All existing `from knowledge_fidelity import ...` imports continue to work. See [CHANGELOG.md](CHANGELOG.md) for details.\n\n## CLI\n\n### `rho-eval` — Behavioral Auditing (primary)\n\nAudit any model across 8 behavioral dimensions. 
No compression needed — just load, probe, report.\n\n```bash\n# Full behavioral report card (all 8 dimensions)\nrho-eval Qwen/Qwen2.5-7B-Instruct\n\n# Specific behaviors\nrho-eval my-model/ --behaviors factual,bias,sycophancy\n\n# Output formats: table (default), json, markdown, csv\nrho-eval my-model/ --format json --output audit.json\n\n# Compare against a baseline\nrho-eval compressed-model/ --compare audit.json\n\n# Discover available behaviors and probe sets\nrho-eval --list-behaviors\nrho-eval --list-probes\n\n# Limit probe count per behavior (faster, less precise)\nrho-eval my-model/ -n 20\n```\n\n`rho-audit` is an alias for `rho-eval` (backward compatible).\n\n### `rho-compress` — Compression + Audit\n\n```bash\n# Compress + audit (default: 70% rank, CF90 protection)\nrho-compress Qwen/Qwen2.5-0.5B\n\n# Audit only (no compression, baseline measurement)\nrho-compress Qwen/Qwen2.5-0.5B --audit-only\n\n# Auto-find optimal denoising ratio\nrho-compress Qwen/Qwen2.5-0.5B --denoise\n\n# Save compressed model\nrho-compress Qwen/Qwen2.5-0.5B --denoise --output ./denoised-model\n```\n\n## Python API\n\n### Behavioral Audit (v2 API)\n\n```python\nimport rho_eval\n\n# One-liner: audit any model across all 8 behaviors\nreport = rho_eval.audit(\"Qwen/Qwen2.5-7B-Instruct\")\n\n# Specific behaviors, custom probe counts\nreport = rho_eval.audit(\"my-model\", behaviors=[\"factual\", \"bias\"], n=50)\n\n# Pre-loaded model (no re-download)\nreport = rho_eval.audit(model=model, tokenizer=tokenizer, behaviors=\"all\")\n\n# Inspect results\nprint(report.overall_status)         # \"PASS\", \"WARN\", or \"FAIL\"\nprint(report.mean_rho)               # 0.5346\nprint(report.behaviors[\"factual\"])   # BehaviorResult(rho=0.746, status=\"PASS\", ...)\n\n# Export\nreport.save(\"audit.json\")\nloaded = rho_eval.AuditReport.load(\"audit.json\")  # Round-trip\n\n# Compare two audits\ndelta = rho_eval.compare(report_after, report_before)\nprint(delta.to_table())     # Colored terminal 
table\nprint(delta.to_markdown())  # For GitHub PRs\n```\n\n### Custom Behaviors (Plugin System)\n\n```python\nimport rho_eval\nfrom rho_eval.behaviors import ABCBehavior, register\nfrom rho_eval.behaviors.base import BehaviorResult\n\n@register\nclass MyDomainBehavior(ABCBehavior):\n    name = \"my_domain\"\n    description = \"Domain-specific probe evaluation\"\n    probe_type = \"confidence\"\n    default_n = 50\n\n    def load_probes(self, n=None, seed=42, **kwargs):\n        return self._load_json_probes(\"my_domain/probes.json\", n=n, seed=seed)\n\n    def evaluate(self, model, tokenizer, probes, device=\"cpu\", **kwargs):\n        # Your evaluation logic here\n        return BehaviorResult(behavior=self.name, rho=0.7, ...)\n\n# Now available everywhere:\nreport = rho_eval.audit(\"my-model\", behaviors=[\"factual\", \"my_domain\"])\n```\n\n### SVD Compression + Audit\n\n```python\nfrom rho_eval import compress_and_audit\n\nreport = compress_and_audit(\n    \"Qwen/Qwen2.5-7B-Instruct\",\n    ratio=0.7,           # Keep 70% of singular values\n    freeze_ratio=0.75,   # Freeze bottom 75% of layers\n)\nprint(report[\"summary\"])\n```\n\n### Step-by-Step Compression\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom rho_eval.svd import compress_qko, freeze_layers\nfrom rho_eval import audit_model\n\nmodel = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\", torch_dtype=torch.float32)\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\")\n\ncompress_qko(model, ratio=0.7)     # SVD on Q, K, O projections\nfreeze_layers(model, ratio=0.75)   # Freeze bottom 75%\naudit = audit_model(model, tokenizer)\n```\n\n### Confidence Analysis\n\n```python\nfrom rho_eval.cartography import analyze_confidence\n\nrecord = analyze_confidence(\n    \"The capital of France is Paris.\",\n    model_name=\"EleutherAI/pythia-1.4b\",\n)\nprint(f\"Mean confidence: {record.mean_top1_prob:.3f}\")\n```\n\n## Built-In Probe Sets (1,106 
total)\n\nAll probes ship as JSON files. No internet download needed.\n\n| Probe Set | Count | Behavior | Source |\n|-----------|------:|----------|--------|\n| `factual/default` | 20 | factual | Geography, science, history, biology |\n| `factual/mandela` | 6 | factual | Popular false memories (Berenstain Bears, Vader quote, etc.) |\n| `factual/medical` | 5 | factual | Common medical misconceptions |\n| `factual/commonsense` | 10 | factual | Commonsense myths (goldfish memory, sugar hyperactivity) |\n| `factual/truthfulqa` | 15 | factual | TruthfulQA-derived misconceptions |\n| `bias/bbq_300` | 300 | bias | BBQ disambiguated questions (9 bias categories) |\n| `sycophancy/anthropic_150` | 150 | sycophancy | Anthropic model-written-evals (philosophy, NLP, politics) |\n| `toxicity/toxigen_200` | 200 | toxicity | ToxiGen toxic/benign statements (balanced) |\n| `reasoning/gsm8k_100` | 100 | reasoning | GSM8K math + adversarial flattery prefixes |\n| `deception/hh_rlhf_100` | 100 | deception | HH-RLHF honest/deceptive response pairs (Bai et al. 2022) |\n| `overrefusal/benign_edgy_80` | 80 | overrefusal | Benign-but-edgy questions that safe models should answer |\n| `bench/logic` | 40 | bench | Arithmetic, probability, syllogism, set theory traps |\n| `bench/social` | 40 | bench | Common myths and misconceptions (socially popular false beliefs) |\n| `bench/clinical` | 40 | bench | High-stakes medical, engineering, and physics claims |\n\nRun `rho-eval --list-probes` to see all available sets.\n\n## How It Works\n\n### The CF90 Pipeline (Structural Sensor)\n\n1. **Compress** Q, K, O attention projections at 70% rank via truncated SVD\n2. **Freeze** 75% of layers from the bottom up\n3. **Fine-tune gently** (1 epoch, lr=1e-5)\n\nSVD removes noise from attention weight matrices while preserving signal directions important for factual knowledge. 
Freezing prevents catastrophic forgetting.\n\n### Confidence Cartography (Behavioral Sensor)\n\nFor each token in a text, measure the probability the model assigns to it (teacher-forced). True statements get higher confidence than false ones. The ratio between true/false confidence is a behavioral signal for whether the model \"believes\" a fact.\n\n### The Unification\n\nBoth use the same probes:\n- **SVD importance scoring** runs forward+backward on probe texts to compute gradient magnitudes — which weights matter for encoding these facts\n- **Confidence auditing** runs a forward pass on true vs false versions of the same probes — does the model assign higher probability to truth?\n\nCompress with knowledge of what matters. Verify nothing was lost. Same probes, both sides.\n\n## Experiments\n\n```bash\n# === v2.0 Pipeline ===\n\n# Full behavioral audit\nrho-audit Qwen/Qwen2.5-7B-Instruct --behaviors all\n\n# Fidelity-Bench 2.0: adversarial pressure test\nrho-bench Qwen/Qwen2.5-7B-Instruct\nrho-bench Qwen/Qwen2.5-7B-Instruct --format markdown -o cert.md\nrho-bench --compare cert1.json cert2.json\nrho-bench --info\n\n# SVD subspace analysis\npython experiments/subspace_analysis.py\npython experiments/plot_subspace_analysis.py\n\n# SAE steering\npython experiments/sae_steering.py\n\n# Rho-Guided SFT (PyTorch, CPU)\npython experiments/rho_guided_sft.py\n\n# Rho-Guided SFT (MLX, Apple Silicon — ~10x faster)\npython experiments/rho_guided_sft_mlx.py --model qwen2.5-7b --rho-weights 0.0,0.1,0.2,0.5 --seeds 42,123,456\npython experiments/rho_guided_sft_mlx.py --validate  # Quick: 0.5B, 2 weights, 1 seed\n\n# Ablation study (4 conditions × 2 seeds)\npython experiments/ablation_sft_mlx.py --model qwen2.5-7b --conditions sft-only,rho-guided,contrastive-only,shuffled-pairs --seeds 42,123\n\n# TruthfulQA MC2 validation\npython experiments/truthfulqa_mc2_mlx.py --model qwen2.5-7b --rho-weights 0.0,0.5 --seeds 42,123\n\n# Statistical analysis of ablation results\npython 
experiments/analyze_ablation_stats.py results/alignment/ablation_*.json\n\n# Fidelity-Bench 2.0 experiment (multi-model)\npython experiments/fidelity_bench_2.py --validate  # Quick: Qwen-0.5B, 3 probes/domain\n\n# === v1.x Experiments (still work) ===\n\n# Joint ablation: compression ratio vs confidence preservation\npython experiments/joint_ablation.py --model Qwen/Qwen2.5-7B-Instruct\n\n# Freeze-ratio sweep: behavioral localization\npython experiments/freeze_ratio_sweep.py --models qwen2.5-7b\n\n# Merge method audit (12 models, 2 architectures)\npython experiments/audit_merged_models.py --family qwen-coder\npython experiments/audit_merged_models.py --family mistral\n\n# Activation steering vectors\npython experiments/steering_vectors.py\n\n# Multi-vector steering cocktails (Layer 17 interference study)\npython experiments/multi_vector_steering.py --quick\npython experiments/multi_vector_steering.py --cross-model mistralai/Mistral-7B-Instruct-v0.3\n\n# Layer heatmap: sycophancy vector sweep across all layers (any model)\npython experiments/mistral_layer_heatmap.py\npython experiments/mistral_layer_heatmap.py --model meta-llama/Llama-3.1-8B-Instruct\n```\n\n## Deployment\n\n```bash\n# Export to GGUF for llama.cpp / Ollama\npython deployment/export_gguf.py --input compressed_model/ --output model.gguf --quantize q4_k_m\n\n# Benchmark with vLLM\npython deployment/vllm_benchmark.py --baseline Qwen/Qwen2.5-7B-Instruct --compressed ./compressed_model\n```\n\nSee [`deployment/mlx_recipe.md`](deployment/mlx_recipe.md) for Apple Silicon inference with MLX.\n\n## Model Compatibility\n\nWorks on any HuggingFace causal LM with `model.model.layers[i].self_attn.{q,k,o}_proj` (standard for Qwen, Llama, Mistral) or `model.transformer.h` (GPT-2 style).\n\nValidated on:\n- **Qwen2.5**: 0.5B, 1.5B, 7B, 32B\n- **Mistral**: 7B-v0.1\n- **Llama**: 3.1-8B-Instruct, 2-7B\n- Should work on Phi, Gemma (same layer layout) — PRs with test results welcome\n\n## Apple Silicon Acceleration 
(MLX)\n\nrho-eval **transparently accelerates** on Apple Silicon when [MLX](https://github.com/ml-explore/mlx) is installed. No code changes needed — the same `audit()`, `analyze_confidence()`, and alignment APIs auto-dispatch to MLX when they detect an MLX model.\n\n```bash\npip install mlx mlx-lm  # or: pip install \"rho-eval[full]\"\n```\n\n```python\nimport mlx_lm\nfrom rho_eval import audit\n\nmodel, tokenizer = mlx_lm.load(\"mlx-community/Qwen2.5-7B-Instruct-4bit\")\nreport = audit(model=model, tokenizer=tokenizer, behaviors=\"all\")\n# Same API, same output — runs ~5-10x faster on Apple Silicon\n```\n\n**What accelerates automatically:**\n\n| Component | PyTorch (CPU/MPS) | MLX (Apple Silicon) | Speedup |\n|-----------|:-:|:-:|:-:|\n| `audit()` — 8-behavior probe suite | ~90s (0.5B) | ~17s (0.5B) | **~5x** |\n| `analyze_confidence()` — cartography | Full PyTorch pipeline | Native MLX forward pass | **~5x** |\n| `get_mean_logprob()` / `generate()` | PyTorch inference | MLX inference | **~5x** |\n| `mlx_gentle_finetune()` — post-compression LoRA | CPU-only (MPS has NaN bugs) | Native MLX LoRA | **~10x** |\n| `mlx_rho_guided_sft()` — alignment training | CPU-only (22h for 8 runs) | Native MLX training | **~10x** |\n\n**Platform notes:**\n- Use **CPU** for SVD compression (weight surgery, not compute-bound)\n- PyTorch MPS has NaN gradient bugs with frozen layers — MLX avoids this entirely\n- Set `HF_HOME` to external storage for large models\n- MLX unified memory enables larger models than would fit in VRAM (e.g., 7B-4bit on 16GB)\n\n## Limitations\n\n- **Probe sets are modest** by LLM evaluation standards: 1,106 total probes across 14 sets (56 factual, 300 bias, 150 sycophancy, 200 toxicity, 100 reasoning, 100 deception, 80 over-refusal, 120 bench). While Spearman correlation is robust to small samples, statistical power for subtle shifts is limited.\n- **Western-centric coverage.** Factual probes cover primarily English-language, Western knowledge domains. 
Bias probes are specific to U.S. social categories.\n- **7B scale only.** All merge and steering results are on 7B-parameter models. Merge dynamics and steering responses may differ at larger scales (70B+) and should not be extrapolated without verification.\n- **Toxicity is unaffected** by weight edits (SVD, freeze, steering). It appears to rely on highly distributed lexical features that single-layer or structural interventions cannot modulate.\n\n## Built On\n\nThis toolkit unifies two standalone research projects:\n\n- [**Intelligent SVD**](https://github.com/SolomonB14D3/intelligent-svd) — CF90 compression method and safety rules\n- [**Confidence Cartography**](https://github.com/SolomonB14D3/confidence-cartography) — False-belief detection via teacher-forced confidence\n\nBoth remain available as independent repos. Knowledge Fidelity combines their core ideas into a single pipeline with a shared probe system.\n\n## Related Work \u0026 Inspirations\n\n- **Low-rank SVD compression.** [SVD-LLM](https://arxiv.org/abs/2403.07378) (Wang et al., 2024; ICLR 2025) introduced truncation-aware SVD for LLM weight matrices. [ASVD](https://arxiv.org/abs/2312.05821) (Yuan et al., 2023) added activation-aware rank allocation. We extend these with importance-guided truncation scored on factual probes, and behavioral auditing to verify nothing was lost.\n\n- **Knowledge preservation under compression.** [Compressing LLMs: The Truth is Rarely Pure and Never Simple](https://arxiv.org/abs/2310.01382) (Jaiswal et al., 2023; ICLR 2024) showed that standard benchmarks miss knowledge-intensive failures in compressed models (LLM-KICK). 
[Pruning Weights but Not Truth](https://arxiv.org/abs/2509.00096) (Fu et al., 2025; Findings of EMNLP 2025) directly addresses truthfulness preservation during pruning.\n\n- **Joint compression strategies.** [CALDERA](https://arxiv.org/abs/2405.18886) (Saha et al., 2024; NeurIPS 2024) combines low-rank and low-precision decomposition (W ≈ Q + LR).\n\n- **Confidence-based evaluation.** [G-Eval](https://arxiv.org/abs/2303.16634) (Liu et al., 2023; EMNLP 2023) uses token-level logprobs for NLG quality scoring.\n\n- **Activation steering.** [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681) (Rimsky et al., 2024; ACL 2024). We extract steering vectors from the same ρ probes used for auditing.\n\n- **[Awesome-LLM-Compression](https://github.com/HuangOwen/Awesome-LLM-Compression).** The ecosystem overview that helped shape this work.\n\nIf we've missed key references or misrepresented any work, please [open an issue](https://github.com/SolomonB14D3/knowledge-fidelity/issues).\n\n## Citation\n\nTo cite this toolkit and the accompanying paper:\n\n```bibtex\n@article{sanchez2026rhoguided,\n  author = {Sanchez, Bryan},\n  title = {Rho-Guided Supervised Fine-Tuning: Post-Training Repair of Calibration Damage in Large Language Models},\n  year = {2026},\n  url = {https://github.com/SolomonB14D3/knowledge-fidelity/blob/main/paper/rho_guided_sft.md}\n}\n\n@software{sanchez2026rhoeval,\n  author = {Sanchez, Bryan},\n  title = {Behavioral Entanglement in Transformers: SAE-Based Disentanglement and the Architecture-Contingent Nature of Sycophancy},\n  year = {2026},\n  doi = {10.5281/zenodo.18743959},\n  url = {https://doi.org/10.5281/zenodo.18743959}\n}\n```\n\nTo cite the underlying confidence cartography method:\n\n```bibtex\n@article{sanchez2026confidence,\n  author = {Sanchez, Bryan},\n  title = {Confidence Cartography: Teacher-Forced Probability as a False-Belief Sensor in Language Models},\n  year = {2026},\n  doi = {10.5281/zenodo.18703506},\n  url = 
{https://zenodo.org/records/18703506}\n}\n```\n\n## Contributing\n\nPRs welcome for new probes, model support, or bug fixes. See [open issues](https://github.com/SolomonB14D3/knowledge-fidelity/issues) for ideas.\n\n## Acknowledgments\n\nThanks to the maintainers of [Awesome-LLM-Compression](https://github.com/HuangOwen/Awesome-LLM-Compression) and the authors of the SVD compression, knowledge preservation, and confidence calibration papers listed above. This work wouldn't exist without the foundation they built.\n\n---\n\nIf this helps your compression or auditing work, a star helps others find it.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsolomonb14d3%2Fknowledge-fidelity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsolomonb14d3%2Fknowledge-fidelity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsolomonb14d3%2Fknowledge-fidelity/lists"}