An open API service indexing awesome lists of open source software.

https://github.com/stackonehq/stackone-defender


https://github.com/stackonehq/stackone-defender

Last synced: 2 months ago
JSON representation

Awesome Lists containing this project

README

          



Defender by StackOne — Indirect prompt injection protection for MCP tool calls


PyPI version
latest GitHub release
GitHub stars
License
Python 3.11+



Model size: 22MB
Latency: ~10ms
CPU only
F1 Score: 90.8%

---

Indirect prompt injection defense for AI agents using tool calls (MCP, CLI, or direct APIs). Detects and neutralizes attacks hidden in tool results (emails, documents, PRs, etc.) before they reach your LLM.

**Python package:** [`stackone-defender`](https://pypi.org/project/stackone-defender/) — aligned with [`@stackone/defender`](https://www.npmjs.com/package/@stackone/defender) on npm.

## Installation

**pip**

```bash
pip install stackone-defender
```

**uv**

```bash
uv add stackone-defender
```

**Tier 2 (ONNX)** — add extras:

```bash
pip install stackone-defender[onnx]
# or: uv add "stackone-defender[onnx]"
```

The ONNX model (~22MB) is bundled in the wheel — no extra downloads at runtime.

## Quick start

```python
from stackone_defender import create_prompt_defense

# Tier 1 + Tier 2 are on by default. block_high_risk=True enables allow/block.
defense = create_prompt_defense(block_high_risk=True)

# Optional: preload ONNX to avoid first-call latency (requires [onnx] extra)
defense.warmup_tier2()

result = defense.defend_tool_result(tool_output, "gmail_get_message")

if not result.allowed:
print(f"Blocked: risk={result.risk_level}, score={result.tier2_score}")
print(f"Detections: {', '.join(result.detections)}")
else:
send_to_llm(result.sanitized)
```

## How it works


Defender flow: poisoned tool output is sanitized and evaluated; high-risk content can be blocked before the LLM

`defend_tool_result()` runs two tiers:

### Tier 1 — Pattern detection (sync, ~1 ms)

- **Unicode normalization** — homoglyph resistance (e.g. Cyrillic `а` → ASCII `a`)
- **Role stripping** — `SYSTEM:`, `ASSISTANT:`, ``, `[INST]`, etc.
- **Pattern removal** — phrases like “ignore previous instructions”
- **Encoding detection** — suspicious Base64/URL-shaped payloads
- **Boundary annotation** — `[UD-{id}]…[/UD-{id}]` wrappers around untrusted spans

### Tier 2 — ML classification (ONNX)

Sentence-level MiniLM classifier (int8 ONNX ~22 MB, bundled):

- Split text into sentences, score each (0.0 = benign, 1.0 = injection-like), take the max
- Catches paraphrased or novel injections missed by regex
- Roughly ~10 ms per batch after warmup (CPU)

**Benchmarks** (F1 @ threshold 0.5):

| Benchmark | F1 | Samples |
|-----------|-----|--------|
| Qualifire (in-distribution) | 0.8686 | ~1.5k |
| xxz224 (out-of-distribution) | 0.8834 | ~22.5k |
| jayavibhav (adversarial) | 0.9717 | ~1k |
| **Average** | **0.9079** | ~25k |

### `allowed` vs `risk_level`

- Use **`allowed`** for gating when `block_high_risk=True`: `False` means do not pass `sanitized` to the model as-is.
- **`risk_level`** is diagnostic: it starts at `default_risk_level` (default `"medium"`) and is **escalated** by Tier 1 / Tier 2 signals — not reduced. Use it for logging, not as the sole block signal unless you implement your own policy.

| Level | Typical trigger |
|-------|------------------|
| `low` | No strong signals |
| `medium` | Lighter pattern / sanitization signals |
| `high` / `critical` | Strong injection patterns, encoding signals, or high Tier 2 score |

## API

### `create_prompt_defense(**kwargs)`

```python
defense = create_prompt_defense(
enable_tier1=True,
enable_tier2=True,
block_high_risk=False,
default_risk_level="medium",
tier2_fields=["subject", "body", "snippet"], # optional: scope Tier 2 to these JSON keys
config={
"tier2": {
"high_risk_threshold": 0.8,
"tier2_fields": None, # or list[str]; constructor tier2_fields wins if set
},
},
)
```

### `defense.defend_tool_result(value, tool_name)`

Runs Tier 1 sanitization on risky fields, then Tier 2 on extracted text (with optional field scoping). **Synchronous** — no `await`.

```python
@dataclass
class DefenseResult:
allowed: bool
risk_level: RiskLevel
sanitized: Any
detections: list[str]
fields_sanitized: list[str]
patterns_by_field: dict[str, list[str]]
tier2_score: float | None = None
tier2_skip_reason: str | None = None
max_sentence: str | None = None
latency_ms: float = 0.0
```

### `defense.defend_tool_results(items)`

```python
results = defense.defend_tool_results([
{"value": email_data, "tool_name": "gmail_get_message"},
{"value": doc_data, "tool_name": "documents_get"},
{"value": pr_data, "tool_name": "github_get_pull_request"},
])
for r in results:
if not r.allowed:
print("Blocked:", ", ".join(r.fields_sanitized))
```

### `defense.analyze(text)`

Tier 1 only — useful for debugging pattern hits without full tool-result traversal.

### Tier 2 warmup

```python
defense = create_prompt_defense()
defense.warmup_tier2() # no-op if enable_tier2=False or ONNX extra missing
```

## Integration example

```python
from stackone_defender import create_prompt_defense

defense = create_prompt_defense(block_high_risk=True)
defense.warmup_tier2()

def run_tool_and_defend(raw_result: dict, tool_name: str):
outcome = defense.defend_tool_result(raw_result, tool_name)
if not outcome.allowed:
return {"error": "Content blocked by safety filter", "risk_level": outcome.risk_level}
return outcome.sanitized

# Example agent loop
sanitized = run_tool_and_defend(gmail_api.get_message(msg_id), "gmail_get_message")
```

## Risky field detection

Only **string** values under configured “risky” keys are scanned and sanitized. [`RiskyFieldConfig`](https://github.com/StackOneHQ/stackone-defender/blob/main/src/stackone_defender/types.py) provides global names/patterns plus **`tool_overrides`** (wildcard tool names → field list), same idea as the npm package.

| Tool pattern | Scanned fields |
|--------------|----------------|
| `gmail_*`, `email_*` | subject, body, snippet, content |
| `documents_*` | name, description, content, title |
| `github_*` | name, title, body, description, message |
| `hris_*` | name, notes, bio, description |
| `ats_*` | name, notes, description, summary |
| `crm_*` | name, description, notes, content |

Otherwise the default list applies: `name`, `description`, `content`, `title`, `notes`, `summary`, `bio`, `body`, `text`, `message`, `comment`, `subject`, plus suffix patterns like `*_body`, `*_description`, etc. Structural keys such as `id`, `url`, `created_at` are not treated as risky by default.

## Development

```bash
uv sync --group dev
uv run pytest
```

## License

Apache-2.0 — see [LICENSE](./LICENSE).