https://github.com/stackonehq/stackone-defender
https://github.com/stackonehq/stackone-defender
Last synced: 2 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/stackonehq/stackone-defender
- Owner: StackOneHQ
- Created: 2026-03-17T14:51:25.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-04-14T03:09:58.000Z (3 months ago)
- Last Synced: 2026-04-19T09:37:12.687Z (3 months ago)
- Language: Python
- Size: 43.2 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
Awesome Lists containing this project
README
---
Indirect prompt injection defense for AI agents using tool calls (MCP, CLI, or direct APIs). Detects and neutralizes attacks hidden in tool results (emails, documents, PRs, etc.) before they reach your LLM.
**Python package:** [`stackone-defender`](https://pypi.org/project/stackone-defender/) — aligned with [`@stackone/defender`](https://www.npmjs.com/package/@stackone/defender) on npm.
## Installation
**pip**
```bash
pip install stackone-defender
```
**uv**
```bash
uv add stackone-defender
```
**Tier 2 (ONNX)** — add extras:
```bash
pip install stackone-defender[onnx]
# or: uv add "stackone-defender[onnx]"
```
The ONNX model (~22MB) is bundled in the wheel — no extra downloads at runtime.
## Quick start
```python
from stackone_defender import create_prompt_defense
# Tier 1 + Tier 2 are on by default. block_high_risk=True enables allow/block.
defense = create_prompt_defense(block_high_risk=True)
# Optional: preload ONNX to avoid first-call latency (requires [onnx] extra)
defense.warmup_tier2()
result = defense.defend_tool_result(tool_output, "gmail_get_message")
if not result.allowed:
print(f"Blocked: risk={result.risk_level}, score={result.tier2_score}")
print(f"Detections: {', '.join(result.detections)}")
else:
send_to_llm(result.sanitized)
```
## How it works

`defend_tool_result()` runs two tiers:
### Tier 1 — Pattern detection (sync, ~1 ms)
- **Unicode normalization** — homoglyph resistance (e.g. Cyrillic `а` → ASCII `a`)
- **Role stripping** — `SYSTEM:`, `ASSISTANT:`, ``, `[INST]`, etc.
- **Pattern removal** — phrases like “ignore previous instructions”
- **Encoding detection** — suspicious Base64/URL-shaped payloads
- **Boundary annotation** — `[UD-{id}]…[/UD-{id}]` wrappers around untrusted spans
### Tier 2 — ML classification (ONNX)
Sentence-level MiniLM classifier (int8 ONNX ~22 MB, bundled):
- Split text into sentences, score each (0.0 = benign, 1.0 = injection-like), take the max
- Catches paraphrased or novel injections missed by regex
- Roughly ~10 ms per batch after warmup (CPU)
**Benchmarks** (F1 @ threshold 0.5):
| Benchmark | F1 | Samples |
|-----------|-----|--------|
| Qualifire (in-distribution) | 0.8686 | ~1.5k |
| xxz224 (out-of-distribution) | 0.8834 | ~22.5k |
| jayavibhav (adversarial) | 0.9717 | ~1k |
| **Average** | **0.9079** | ~25k |
### `allowed` vs `risk_level`
- Use **`allowed`** for gating when `block_high_risk=True`: `False` means do not pass `sanitized` to the model as-is.
- **`risk_level`** is diagnostic: it starts at `default_risk_level` (default `"medium"`) and is **escalated** by Tier 1 / Tier 2 signals — not reduced. Use it for logging, not as the sole block signal unless you implement your own policy.
| Level | Typical trigger |
|-------|------------------|
| `low` | No strong signals |
| `medium` | Lighter pattern / sanitization signals |
| `high` / `critical` | Strong injection patterns, encoding signals, or high Tier 2 score |
## API
### `create_prompt_defense(**kwargs)`
```python
defense = create_prompt_defense(
enable_tier1=True,
enable_tier2=True,
block_high_risk=False,
default_risk_level="medium",
tier2_fields=["subject", "body", "snippet"], # optional: scope Tier 2 to these JSON keys
config={
"tier2": {
"high_risk_threshold": 0.8,
"tier2_fields": None, # or list[str]; constructor tier2_fields wins if set
},
},
)
```
### `defense.defend_tool_result(value, tool_name)`
Runs Tier 1 sanitization on risky fields, then Tier 2 on extracted text (with optional field scoping). **Synchronous** — no `await`.
```python
@dataclass
class DefenseResult:
allowed: bool
risk_level: RiskLevel
sanitized: Any
detections: list[str]
fields_sanitized: list[str]
patterns_by_field: dict[str, list[str]]
tier2_score: float | None = None
tier2_skip_reason: str | None = None
max_sentence: str | None = None
latency_ms: float = 0.0
```
### `defense.defend_tool_results(items)`
```python
results = defense.defend_tool_results([
{"value": email_data, "tool_name": "gmail_get_message"},
{"value": doc_data, "tool_name": "documents_get"},
{"value": pr_data, "tool_name": "github_get_pull_request"},
])
for r in results:
if not r.allowed:
print("Blocked:", ", ".join(r.fields_sanitized))
```
### `defense.analyze(text)`
Tier 1 only — useful for debugging pattern hits without full tool-result traversal.
### Tier 2 warmup
```python
defense = create_prompt_defense()
defense.warmup_tier2() # no-op if enable_tier2=False or ONNX extra missing
```
## Integration example
```python
from stackone_defender import create_prompt_defense
defense = create_prompt_defense(block_high_risk=True)
defense.warmup_tier2()
def run_tool_and_defend(raw_result: dict, tool_name: str):
outcome = defense.defend_tool_result(raw_result, tool_name)
if not outcome.allowed:
return {"error": "Content blocked by safety filter", "risk_level": outcome.risk_level}
return outcome.sanitized
# Example agent loop
sanitized = run_tool_and_defend(gmail_api.get_message(msg_id), "gmail_get_message")
```
## Risky field detection
Only **string** values under configured “risky” keys are scanned and sanitized. [`RiskyFieldConfig`](https://github.com/StackOneHQ/stackone-defender/blob/main/src/stackone_defender/types.py) provides global names/patterns plus **`tool_overrides`** (wildcard tool names → field list), same idea as the npm package.
| Tool pattern | Scanned fields |
|--------------|----------------|
| `gmail_*`, `email_*` | subject, body, snippet, content |
| `documents_*` | name, description, content, title |
| `github_*` | name, title, body, description, message |
| `hris_*` | name, notes, bio, description |
| `ats_*` | name, notes, description, summary |
| `crm_*` | name, description, notes, content |
Otherwise the default list applies: `name`, `description`, `content`, `title`, `notes`, `summary`, `bio`, `body`, `text`, `message`, `comment`, `subject`, plus suffix patterns like `*_body`, `*_description`, etc. Structural keys such as `id`, `url`, `created_at` are not treated as risky by default.
## Development
```bash
uv sync --group dev
uv run pytest
```
## License
Apache-2.0 — see [LICENSE](./LICENSE).