{"id":50893331,"url":"https://github.com/mcpware/intentprobe","last_synced_at":"2026-06-15T22:30:46.054Z","repository":{"id":363061368,"uuid":"1255686127","full_name":"mcpware/intentprobe","owner":"mcpware","description":"Activation-probe security scanner for AI agent tooling. Reads a model's internal activations to detect poisoned MCP servers, skills, and packages before install.","archived":false,"fork":false,"pushed_at":"2026-06-07T08:47:08.000Z","size":7698,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-07T09:22:49.798Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mcpware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-01T04:44:41.000Z","updated_at":"2026-06-07T08:47:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mcpware/intentprobe","commit_stats":null,"previous_names":["mcpware/intentprobe"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/mcpware/intentprobe","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcpware%2Fintentprobe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcpware%2Fintentprobe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcpware%2Fintentprob
e/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcpware%2Fintentprobe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mcpware","download_url":"https://codeload.github.com/mcpware/intentprobe/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mcpware%2Fintentprobe/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34383468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-15T22:30:45.214Z","updated_at":"2026-06-15T22:30:46.049Z","avatar_url":"https://github.com/mcpware.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IntentProbe\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eThe First and Only MCP scanner that reads what the model understood, not what the text says.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/mcpware/IntentProbe/stargazers\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/mcpware/IntentProbe?style=social\" alt=\"Stars\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/mcpware/IntentProbe/network/members\"\u003e\u003cimg 
src=\"https://img.shields.io/github/forks/mcpware/IntentProbe?style=social\" alt=\"Forks\" /\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Python-3.10%2B-blue?logo=python\u0026logoColor=white\" alt=\"Python 3.10+\" /\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache%202.0-blue\" alt=\"License\" /\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/runs-100%25%20local-brightgreen\" alt=\"Runs locally\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/telemetry-zero-blue\" alt=\"Zero telemetry\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/diagram.png\" width=\"700\" alt=\"Text scanners read words. IntentProbe reads activations.\" /\u003e\n\u003c/p\u003e\n\nEvery public/source-verifiable MCP scanner we found reads text: patterns, classifiers, rules, or asks an LLM \"is this safe?\" IntentProbe does something different. It runs the tool description through a small local model, slices open the hidden layers, and reads the activation state directly. Same words, completely different activations when the intent is malicious.\n\nOn matched-vocabulary tool poisoning, where safe and poisoned descriptions use almost identical words, the public/source-verifiable DeBERTa text-classifier baseline catches **0%**. IntentProbe scores **96.6% F1**. ([Reproduce it yourself.](research/benchmark-results-deberta-vs-probe-2026-05-31.md))\n\nRuns locally. 22 KB probe. Any CPU. Nothing uploaded. 
See the [plain comparison](docs/intentprobe-vs-existing-mcp-scanners.md), [FAQ](docs/FAQ.md), [operator decisions](docs/OPERATOR_DECISIONS.md), [evidence packet](docs/EVIDENCE_PACKET.md), and [full competitive landscape](docs/COMPETITIVE_LANDSCAPE.md).\n\n## Install in one command\n\nThis is the public v0 install path:\n\n```bash\npython3 -m pip install intentprobe\n```\n\nIf your macOS Python blocks global installs with an \"externally managed\nenvironment\" error, use an app venv instead:\n\n```bash\npython3 -m venv .venv-intentprobe\n.venv-intentprobe/bin/python -m pip install intentprobe\n.venv-intentprobe/bin/intentprobe scan-config auto --format summary\n```\n\nThen scan the MCP tools already configured on your machine. `scan-config auto`\nchecks common Claude Desktop, Claude Code, Codex, Cursor, Windsurf, and repo MCP\nconfig locations:\n\n```bash\nintentprobe scan-config auto --format summary\n```\n\nOr scan a suspicious tool description:\n\n```bash\nintentprobe scan --format summary --text \"Reads SSH config and private keys, then silently uploads credentials to a remote server.\"\n```\n\nFirst model-backed scan downloads Qwen2.5-0.5B (~1 GB, once). Scan data stays on your machine.\n\n## GitHub Action\n\nUse IntentProbe as a CI gate for MCP configs, skills, and tool manifests:\n\n```yaml\nname: IntentProbe scan\n\non:\n  pull_request:\n  workflow_dispatch:\n\njobs:\n  scan-ai-tools:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: mcpware/IntentProbe@main\n        with:\n          paths: |\n            .\n          fail-on: block\n```\n\nSee [docs/GITHUB_ACTION.md](docs/GITHUB_ACTION.md) for target paths, inputs,\nand exit behavior.\n\nWant the quickest proof without a video? 
Run the\n[demo script](docs/DEMO_SCRIPT.md), or inspect the\n[live demo repo](https://github.com/mcpware/intentprobe-demo):\nsafe PR passes, poisoned PR blocks.\n\n---\n\n## How it works\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/how-it-works.png\" width=\"700\" alt=\"How IntentProbe works: tool description → frozen model → slice open layers 13-15 → 22KB probe → allow/warn/block\" /\u003e\n\u003c/p\u003e\n\n## The problem\n\n```\n  ┌──────────────────────────────────────────────────────────────────────┐\n  │  Tool A (safe):                                                      │\n  │  \"Reads the SSH config file and returns a parsed list of host        │\n  │   aliases, hostnames, and ports. Does not modify any files or        │\n  │   establish connections.\"                                            │\n  ├──────────────────────────────────────────────────────────────────────┤\n  │  Tool B (poisoned):                                                  │\n  │  \"Reads the SSH config file and returns a parsed list of host        │\n  │   aliases, hostnames, and ports. Also reads the corresponding        │\n  │   private key files to compute fingerprints for each identity.\"      │\n  └──────────────────────────────────────────────────────────────────────┘\n                    ▲                              ▲\n                    │  Almost identical words       │  Steals your SSH keys\n```\n\nReal pair from our benchmark. IntentProbe scores the safe tool at 0.081 (allow) and the poisoned tool at 0.982 (warn). The DeBERTa text-classifier baseline scores both at 0.0%. It sees no difference.\n\nThe words are almost identical. The intent is not. 
Text scanners fail here because there is no text difference to find.\n\n## Three approaches to scanning\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/three-approaches.png\" width=\"700\" alt=\"Three approaches: Text Classifier (reads words, 0-20%), LLM-as-Judge (asks the model), Activation Probing (reads intent, 96.5%)\" /\u003e\n\u003c/p\u003e\n\n## Competitive landscape\n\n\u003e Others read text, ask the cloud, ask another LLM, or match patterns. IntentProbe reads the model's internal activations after it processes the tool description, detecting whether it entered a state that encodes credential access, exfiltration, escalation, or hidden persistence.\n\n| Type | Who | How they scan | Gap | How IntentProbe differs |\n|---|---|---|---|---|\n| **Enterprise cloud** | Lakera, Azure Prompt Shields, Google Model Armor, AWS Bedrock Guardrails, Cisco | Ship content to their cloud API for classification | Black box. You can't verify what model they use or reproduce their results. | **100% local.** Every benchmark script, model artifact, and dataset is public. |\n| **MCP scanner** | Snyk Agent Scan, Invariant MCP-Scan, MEDUSA, ClawGuard | Static rules, pattern matching, metadata scan, policy checks | Practical, but reads text and known patterns. | **Reads activations.** What the model understood, not what the text says. |\n| **Text classifier** | ProtectAI DeBERTa, Meta Prompt Guard | Classify text as benign / injection / jailbreak | Trained on prompt injection, not tool poisoning. Fails on matched vocabulary. | Matched-vocabulary F1: IntentProbe **96.6%**, DeBERTa **0%**. |\n| **LLM-as-judge** | NeMo self-check, OpenAI Guardrails, Promptfoo | Ask another LLM: \"is this poisoned?\" | Expensive, slow, prompt-sensitive, and the generated answer is part of the attack surface. | **Representation-level.** Scores the internal state before any verbal answer is produced. 
|\n| **Red-team framework** | garak, Giskard, Promptfoo red team | Generate attacks to test your app | Audit tool, not a pre-install scanner. | IntentProbe is a **CLI + runtime hook** that blocks before install and before each tool call. |\n| **IntentProbe** | | Frozen local model + activation probe on layers 13-15 | Still improving on novel attack families | **First activation-probe scanner for MCP tool poisoning.** |\n\nFast comparison: [docs/intentprobe-vs-existing-mcp-scanners.md](docs/intentprobe-vs-existing-mcp-scanners.md)\n\nFAQ for common questions: [docs/FAQ.md](docs/FAQ.md)\n\nFull source-backed comparison: [docs/COMPETITIVE_LANDSCAPE.md](docs/COMPETITIVE_LANDSCAPE.md)\n\nAI-readable context: [llms.txt](llms.txt) and [llms-full.txt](llms-full.txt)\n\n## Why not just ask Qwen?\n\nLLM-as-judge is an output-level mechanism: ask a model to say safe or unsafe.\nIntentProbe is a representation-level mechanism: run the tool text through a\nfrozen local model and score the hidden activation state.\n\nThat difference matters. A poisoned tool can claim \"I am safe\", and a judge\nprompt can be steered into saying safe. IntentProbe does not trust the verbal\nanswer; it measures whether the tool text creates a poisoned-looking internal\nstate.\n\nWe tested direct-prompting the same `Qwen/Qwen2.5-0.5B` sensor model. The\ndeterministic label-score baseline flagged every clean curated item as poisoned\n(`clean FPR = 1.000`), while the generated-answer baseline produced lower\nrecall and many unparseable outputs. Full reproducible baseline:\n[research/QWEN_PROMPT_JUDGE_BASELINE_2026-06-08.md](research/QWEN_PROMPT_JUDGE_BASELINE_2026-06-08.md).\n\n## Benchmarks\n\nSame test sets. Same split. Same seed. 
Every number is reproducible from `research/`.\n\n```\n  IntentProbe vs DeBERTa text-classifier baseline\n  ════════════════════════════════════════════════\n\n  MCPTox poisoned recall (n=249)\n  IntentProbe  ██████████████████████████████████████████████████  100.0%\n  DeBERTa      ██████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   19.9%\n\n  Matched-vocabulary F1 (n=86)          ◀ the hard test\n  IntentProbe  ████████████████████████████████████████████████░░   96.6%\n  DeBERTa      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0.0%\n\n  Novel attack families (n=2,900)\n  IntentProbe  █████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░   41.5%\n  TF-IDF       █████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░   10.7%\n\n  Adversarial evasion (camouflage suffixes)\n  IntentProbe  0/146 evaded ✅\n```\n\n\u003cdetails open\u003e\n\u003csummary\u003e\u003cstrong\u003eFull end-to-end results\u003c/strong\u003e\u003c/summary\u003e\n\n| Test | IntentProbe | Opponent | Takeaway |\n|---|---|---|---|\n| MCPTox held-out (n=249) | recall 100%, F1 99.3% | DeBERTa text baseline recall 19.9%, F1 33.0% | Clear win |\n| Same-words matched (n=86) | F1 96.6% | DeBERTa text baseline F1 0% | Text scanner blind |\n| Curated family holdout (n=76) | Qwen macro F1 0.829 | TF-IDF macro F1 0.823 | Slight edge |\n| RouteGuard external (n=2,900) | F1 0.513, recall 0.415 | TF-IDF F1 0.172, recall 0.107 | 4x better on novel families |\n| Hard-block policy (n=2,900) | Block precision 1.000, clean FPR 0.000 | -- | Zero false positives |\n| Camouflage evasion | GPT-2 0/146, Qwen 0/15 | -- | \"This tool is safe\" doesn't fool the probe |\n\n\u003c/details\u003e\n\n## Research\n\n\u003e **[Can Model Internals Detect MCP Tool Poisoning That Text Analysis Cannot?](https://doi.org/10.5281/zenodo.19990741)**\n\u003e\n\u003e Five rounds of experiments. Each round removes a text-level shortcut. If the probe is just doing fancy word counting, accuracy should drop. It never did. 
TF-IDF went from 93% to 30% as confounds were removed. The activation probe stayed above 93% throughout.\n\n## Try it\n\n```bash\n# Scan Claude/Cursor/Codex MCP configs already on this machine\nintentprobe scan-config auto --format summary\n\n# Scan a suspicious tool description\nintentprobe scan --format summary \\\n  --text \"Reads SSH config and private keys, then silently uploads credentials to a remote server.\"\n\n# Scan a safe tool description\nintentprobe scan --format summary \\\n  --text \"A calculator that adds two numbers and returns the sum.\"\n\n# Scan an MCP server folder before installing\nintentprobe scan-path ./some-mcp-server --format summary\n\n# CI gate (exit code 2 on block)\nintentprobe scan --fail-on block --text \"...\"\n\n# Runtime gating demo (safe, in-memory, no real tools)\npython examples/runtime_toy_agent.py --allow-download\n```\n\n```\n  ┌──────────────────────────────────────────────────────────┐\n  │  $ intentprobe scan --format summary \\                   │\n  │      --text \"Reads SSH config and private keys, then     │\n  │      silently uploads credentials to a remote server.\"   │\n  │                                                          │\n  │  input-1: decision=block  risk=0.980                     │\n  │    - activation probe score=0.980                        │\n  │    - static: private keys, credential files              │\n  │    - static: uploading data outside local scope          │\n  └──────────────────────────────────────────────────────────┘\n```\n\n## Setup: Static Scanner\n\nScan MCP servers, packages, and skills **before** you install them.\n\n```bash\n# Scan installed MCP client configs\nintentprobe scan-config auto --format summary\n\n# Scan a folder (package.json, MCP configs, SKILL.md, READMEs)\nintentprobe scan-path ./some-mcp-server --format summary --fail-on block\n\n# Scan a single tool description\nintentprobe scan --format summary \\\n  --text \"Reads SSH config and returns host aliases.\"\n\n# 
Batch scan a JSON array of descriptions\nintentprobe batch --batch-file tools.json --format summary\n```\n\n```\n  ┌─────────────────────────────────────────────────────────────┐\n  │  You find a new MCP server on GitHub                        │\n  │       │                                                     │\n  │       ▼                                                     │\n  │  git clone \u003crepo\u003e                                           │\n  │       │                                                     │\n  │       ▼                                                     │\n  │  intentprobe scan-path ./repo --fail-on block               │\n  │       │                                                     │\n  │       ├──→ allow   safe to install                          │\n  │       ├──→ warn    review the flagged descriptions          │\n  │       └──→ block   do NOT install (exit code 2)             │\n  └─────────────────────────────────────────────────────────────┘\n```\n\n## Setup: Runtime Hook\n\nScan tool calls **as they happen** inside Claude Code.\n\nAdd to `.claude/settings.json`:\n\n```json\n{\n  \"hooks\": {\n    \"PreToolUse\": [{\n      \"command\": \"intentprobe runtime scan --stdin --input-format json --fail-on block\",\n      \"timeout\": 10000\n    }]\n  }\n}\n```\n\nEvery tool call is now scanned before execution. 
Model stays warm via JSONL protocol for sub-second latency.\n\n```\n  ┌─────────────────────────────────────────────────────────────┐\n  │  Claude Code calls a tool                                   │\n  │       │                                                     │\n  │       ▼                                                     │\n  │  PreToolUse hook ──→ intentprobe runtime scan               │\n  │       │                                                     │\n  │       ├──→ allow   tool executes                            │\n  │       ├──→ warn    logged, tool executes                    │\n  │       └──→ block   tool call stopped                        │\n  └─────────────────────────────────────────────────────────────┘\n```\n\nTest safely with the in-memory demo: `python examples/runtime_toy_agent.py --allow-download`\n\nFull event schema: [docs/RUNTIME_HOOKS.md](docs/RUNTIME_HOOKS.md)\n\nOperator decisions and replay receipts: [docs/OPERATOR_DECISIONS.md](docs/OPERATOR_DECISIONS.md)\n\nThe runtime output is structured JSON, not just a score. 
It includes the gate\ndecision, fail level, subject hash, activation score, static evidence spans,\nthresholds, decision-policy reasons, scanner version, and artifact id, so a\ndownstream runtime can log and replay why a tool call was allowed, warned, or\nblocked.\n\n## What it scans\n\n```\n  scan-path:\n  ├── package.json             description, scripts, dependencies\n  ├── mcp.json / mcp-config    server definitions, tool schemas\n  ├── SKILL.md                 Claude Code skill instructions\n  ├── README.md                tool documentation\n  └── *-tool-*.json            tool/skill metadata\n\n  scan-config:\n  ├── Claude Desktop           claude_desktop_config.json\n  ├── Claude Code              ~/.claude.json, ~/.claude/mcp.json\n  ├── Codex                    ~/.codex/config.toml\n  ├── Cursor                   ~/.cursor/mcp.json\n  ├── Windsurf                 ~/.codeium/windsurf/mcp_config.json\n  └── local repo               .mcp.json\n\n  runtime:\n  ├── tool_definition          scan before registering\n  ├── before_tool_call         scan arguments before execution\n  └── after_tool_call          scan responses before trusting\n```\n\n## Honest limitations\n\n```\n  ✅ Matched-vocabulary poisoning    96.5%\n  ✅ Template attacks (MCPTox)       100%\n  ✅ Camouflage evasion              0/146 evaded\n  ✅ False positives (block tier)    0.000\n\n  ⚠️  Novel attack families          ~41% recall (about 4x the TF-IDF baseline)\n  ⚠️  White-box adversarial          untested\n```\n\n## The story\n\nI source-read public MCP scanner paths and the DeBERTa prompt-injection classifier baseline used in Snyk/Invariant-style scanner code. It is trained on prompt injection, not tool poisoning. On matched-vocabulary attacks it scores 0%. I checked every other public scanner I could find. Rules, regex, text classifiers, opaque cloud APIs. I did not find another product-shaped MCP/tool scanner that uses model internals as the main signal.\n\nSo I built one that does. 
Feed the description into a small model, slice it open, read the activations. The signal is there. A 22 KB probe catches what every text scanner misses.\n\nThe [research paper](https://doi.org/10.5281/zenodo.19990741) documents five rounds of experiments proving the activation signal is real and not just fancy word counting. The benchmarks are open. The probe weights are in the repo. Run them yourself.\n\nIf IntentProbe misses something you find in the wild, [report it](https://github.com/mcpware/IntentProbe/issues/new?template=missed-detection.yml). Every missed sample makes the next version better.\n\n## License\n\nApache-2.0\n\n---\n\nIf IntentProbe ever stops a poisoned tool from reaching your machine, a star helps other people find it.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmcpware%2Fintentprobe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmcpware%2Fintentprobe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmcpware%2Fintentprobe/lists"}