https://github.com/stackonehq/stackone-defender

Host: GitHub
URL: https://github.com/stackonehq/stackone-defender
Owner: StackOneHQ
Created: 2026-03-17T14:51:25.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-04-14T03:09:58.000Z (3 months ago)
Last Synced: 2026-04-19T09:37:12.687Z (3 months ago)
Language: Python
Size: 43.2 MB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

README

---

Indirect prompt injection defense for AI agents using tool calls (MCP, CLI, or direct APIs). Detects and neutralizes attacks hidden in tool results (emails, documents, PRs, etc.) before they reach your LLM.

**Python package:** [`stackone-defender`](https://pypi.org/project/stackone-defender/) — aligned with [`@stackone/defender`](https://www.npmjs.com/package/@stackone/defender) on npm.

## Installation

**pip**

```bash
pip install stackone-defender
```

**uv**

```bash
uv add stackone-defender
```

**Tier 2 (ONNX)** — add extras:

```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "stackone-defender[onnx]"
# or: uv add "stackone-defender[onnx]"
```

The ONNX model (~22 MB) is bundled in the wheel, so nothing extra is downloaded at runtime.

## Quick start

```python
from stackone_defender import create_prompt_defense

# Tier 1 + Tier 2 are on by default. block_high_risk=True enables allow/block.
defense = create_prompt_defense(block_high_risk=True)

# Optional: preload ONNX to avoid first-call latency (requires [onnx] extra)
defense.warmup_tier2()

# tool_output is the raw value returned by your tool; send_to_llm is your own function.
result = defense.defend_tool_result(tool_output, "gmail_get_message")

if not result.allowed:
    print(f"Blocked: risk={result.risk_level}, score={result.tier2_score}")
    print(f"Detections: {', '.join(result.detections)}")
else:
    send_to_llm(result.sanitized)
```

## How it works

`defend_tool_result()` runs two tiers:

### Tier 1 — Pattern detection (sync, ~1 ms)

- **Unicode normalization** — homoglyph resistance (e.g. Cyrillic `а` → ASCII `a`)

- **Role stripping** — `SYSTEM:`, `ASSISTANT:`, `[INST]`, etc.

- **Pattern removal** — phrases like “ignore previous instructions”

- **Encoding detection** — suspicious Base64/URL-shaped payloads

- **Boundary annotation** — `[UD-{id}]…[/UD-{id}]` wrappers around untrusted spans
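
The Tier 1 steps above can be pictured with a minimal sketch. This is not the library's implementation: the homoglyph table, the regular expressions, and the `sanitize` helper are hypothetical and cover only the examples named in the list (encoding detection is left out).

```python
from __future__ import annotations

import re
import uuid

# Hypothetical, tiny homoglyph table (Cyrillic а, е, о); a real one covers far more.
HOMOGLYPHS = str.maketrans({"\u0430": "a", "\u0435": "e", "\u043e": "o"})

# Hypothetical patterns that mirror only the examples in the list above.
ROLE_MARKERS = re.compile(
    r"^\s*(?:SYSTEM|ASSISTANT):\s*|\[/?INST\]", re.IGNORECASE | re.MULTILINE
)
INJECTION_PHRASES = re.compile(
    r"ignore\s+(?:all\s+)?previous\s+instructions", re.IGNORECASE
)


def sanitize(text: str, span_id: str | None = None) -> tuple[str, list[str]]:
    """Return (annotated_text, detections) for one untrusted string."""
    detections: list[str] = []
    # 1. Fold look-alike characters so the patterns below see plain ASCII.
    text = text.translate(HOMOGLYPHS)
    # 2. Strip role markers such as "SYSTEM:" or "[INST]".
    text, hits = ROLE_MARKERS.subn("", text)
    if hits:
        detections.append("role_marker")
    # 3. Replace known injection phrases with a neutral placeholder.
    text, hits = INJECTION_PHRASES.subn("[removed]", text)
    if hits:
        detections.append("injection_phrase")
    # 4. Wrap the result in boundary tags that mark it as untrusted data.
    span_id = span_id or uuid.uuid4().hex[:8]
    return f"[UD-{span_id}]{text}[/UD-{span_id}]", detections
```

Normalizing before matching is the important ordering here: a phrase written with a Cyrillic `о` would otherwise slip past the phrase pattern.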

### Tier 2 — ML classification (ONNX)

Sentence-level MiniLM classifier (int8 ONNX ~22 MB, bundled):

- Split text into sentences, score each (0.0 = benign, 1.0 = injection-like), take the max

- Catches paraphrased or novel injections missed by regex

- Roughly 10 ms per batch after warmup (CPU)
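
The split, score, and take-the-max aggregation can be sketched as follows. The splitter and `toy_score` are hypothetical stand-ins; the real scorer is the bundled ONNX classifier, which this sketch does not call.

```python
from __future__ import annotations

import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def max_sentence_score(
    text: str, score_fn: Callable[[str], float]
) -> tuple[float | None, str | None]:
    """Score every sentence and return (highest_score, that_sentence)."""
    sentences = split_sentences(text)
    if not sentences:
        return None, None
    scored = [(score_fn(sentence), sentence) for sentence in sentences]
    return max(scored, key=lambda pair: pair[0])


def toy_score(sentence: str) -> float:
    """Stand-in for the classifier: 0.0 is benign, 1.0 is injection-like."""
    return 0.95 if "ignore" in sentence.lower() else 0.05
```

Taking the maximum rather than the mean means a single injected sentence is not diluted by a long benign document around it.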

**Benchmarks** (F1 @ threshold 0.5):

| Benchmark | F1 | Samples |
|-----------|-----|--------|
| Qualifire (in-distribution) | 0.8686 | ~1.5k |
| xxz224 (out-of-distribution) | 0.8834 | ~22.5k |
| jayavibhav (adversarial) | 0.9717 | ~1k |
| **Average** | **0.9079** | ~25k |
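
The **Average** row matches the unweighted mean of the three F1 scores, not a sample-weighted one. A quick check, using the rounded sample counts from the table (so the weighted figure is only approximate):

```python
from statistics import fmean

f1_scores = {"Qualifire": 0.8686, "xxz224": 0.8834, "jayavibhav": 0.9717}
# Sample counts are the rounded values shown in the table.
samples = {"Qualifire": 1_500, "xxz224": 22_500, "jayavibhav": 1_000}

# Unweighted (macro) mean: every benchmark counts equally.
macro_avg = fmean(f1_scores.values())

# Sample-weighted mean: dominated by the largest benchmark.
weighted_avg = sum(f1_scores[name] * samples[name] for name in f1_scores) / sum(
    samples.values()
)

print(round(macro_avg, 4))     # 0.9079, the value in the table
print(round(weighted_avg, 3))  # lower, because xxz224 dominates
```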

### `allowed` vs `risk_level`

- Use **`allowed`** for gating when `block_high_risk=True`: `False` means do not pass `sanitized` to the model as-is.

- **`risk_level`** is diagnostic: it starts at `default_risk_level` (default `"medium"`) and is **escalated** by Tier 1 / Tier 2 signals — not reduced. Use it for logging, not as the sole block signal unless you implement your own policy.

| Level | Typical trigger |
|-------|------------------|
| `low` | No strong signals |
| `medium` | Lighter pattern / sanitization signals |
| `high` / `critical` | Strong injection patterns, encoding signals, or high Tier 2 score |
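
If you do implement your own policy on top of `risk_level`, one possible shape is sketched below. It assumes `risk_level` compares as one of the four strings in the table; `ResultView` and `should_forward` are hypothetical names that stand in for `DefenseResult` and your own gate.

```python
from dataclasses import dataclass

# Ordering assumed from the table above, lowest to highest.
RISK_ORDER = ["low", "medium", "high", "critical"]


@dataclass
class ResultView:
    """Hypothetical stand-in exposing the two DefenseResult fields used here."""

    allowed: bool
    risk_level: str


def should_forward(result: ResultView, block_at: str = "high") -> bool:
    """Forward only if the library allowed it and risk is below `block_at`."""
    if not result.allowed:
        return False
    return RISK_ORDER.index(result.risk_level) < RISK_ORDER.index(block_at)
```

Because `risk_level` starts at `default_risk_level` and is only escalated, a threshold of `"medium"` would block everything under the default configuration.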

## API

### `create_prompt_defense(**kwargs)`

```python
defense = create_prompt_defense(
    enable_tier1=True,
    enable_tier2=True,
    block_high_risk=False,
    default_risk_level="medium",
    tier2_fields=["subject", "body", "snippet"],  # optional: scope Tier 2 to these JSON keys
    config={
        "tier2": {
            "high_risk_threshold": 0.8,
            "tier2_fields": None,  # or list[str]; constructor tier2_fields wins if set
        },
    },
)
```

### `defense.defend_tool_result(value, tool_name)`

Runs Tier 1 sanitization on risky fields, then Tier 2 on extracted text (with optional field scoping). **Synchronous** — no `await`.

```python
@dataclass
class DefenseResult:
    allowed: bool
    risk_level: RiskLevel
    sanitized: Any
    detections: list[str]
    fields_sanitized: list[str]
    patterns_by_field: dict[str, list[str]]
    tier2_score: float | None = None
    tier2_skip_reason: str | None = None
    max_sentence: str | None = None
    latency_ms: float = 0.0
```

### `defense.defend_tool_results(items)`

```python
results = defense.defend_tool_results([
    {"value": email_data, "tool_name": "gmail_get_message"},
    {"value": doc_data, "tool_name": "documents_get"},
    {"value": pr_data, "tool_name": "github_get_pull_request"},
])

for r in results:
    if not r.allowed:
        print("Blocked:", ", ".join(r.fields_sanitized))
```

### `defense.analyze(text)`

Tier 1 only — useful for debugging pattern hits without full tool-result traversal.

### Tier 2 warmup

```python
defense = create_prompt_defense()
defense.warmup_tier2()  # no-op if enable_tier2=False or ONNX extra missing
```

## Integration example

```python
from stackone_defender import create_prompt_defense

defense = create_prompt_defense(block_high_risk=True)
defense.warmup_tier2()


def run_tool_and_defend(raw_result: dict, tool_name: str):
    outcome = defense.defend_tool_result(raw_result, tool_name)
    if not outcome.allowed:
        return {"error": "Content blocked by safety filter", "risk_level": outcome.risk_level}
    return outcome.sanitized


# Example agent loop (gmail_api and msg_id are placeholders for your own client and message ID)
sanitized = run_tool_and_defend(gmail_api.get_message(msg_id), "gmail_get_message")
```

## Risky field detection

Only **string** values under configured “risky” keys are scanned and sanitized. [`RiskyFieldConfig`](https://github.com/StackOneHQ/stackone-defender/blob/main/src/stackone_defender/types.py) provides global names/patterns plus **`tool_overrides`** (wildcard tool names → field list), same idea as the npm package.

| Tool pattern | Scanned fields |
|--------------|----------------|
| `gmail_*`, `email_*` | subject, body, snippet, content |
| `documents_*` | name, description, content, title |
| `github_*` | name, title, body, description, message |
| `hris_*` | name, notes, bio, description |
| `ats_*` | name, notes, description, summary |
| `crm_*` | name, description, notes, content |

Otherwise the default list applies: `name`, `description`, `content`, `title`, `notes`, `summary`, `bio`, `body`, `text`, `message`, `comment`, `subject`, plus suffix patterns like `*_body`, `*_description`, etc. Structural keys such as `id`, `url`, `created_at` are not treated as risky by default.
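
To make the lookup concrete, here is a minimal sketch of how a wildcard tool pattern could select the keys to scan. It is illustrative only: the dictionaries copy a subset of the table above, the helper names are hypothetical, and it assumes a matching override replaces the default list rather than extending it.

```python
from fnmatch import fnmatchcase

# Subset of the table above; names and structure are illustrative.
TOOL_OVERRIDES = {
    "gmail_*": ["subject", "body", "snippet", "content"],
    "email_*": ["subject", "body", "snippet", "content"],
    "github_*": ["name", "title", "body", "description", "message"],
}
DEFAULT_FIELDS = [
    "name", "description", "content", "title", "notes", "summary",
    "bio", "body", "text", "message", "comment", "subject",
]
DEFAULT_SUFFIX_PATTERNS = ["*_body", "*_description"]


def is_risky_key(key: str, tool_name: str) -> bool:
    """Decide whether a key should be scanned for the given tool."""
    for pattern, fields in TOOL_OVERRIDES.items():
        if fnmatchcase(tool_name, pattern):
            return key in fields  # assumption: override replaces defaults
    return key in DEFAULT_FIELDS or any(
        fnmatchcase(key, pattern) for pattern in DEFAULT_SUFFIX_PATTERNS
    )


def risky_string_paths(value, tool_name: str, path: str = "") -> list[str]:
    """Walk nested dicts/lists and list the paths of risky string values."""
    found: list[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            if isinstance(child, str):
                if is_risky_key(key, tool_name):
                    found.append(child_path)
            else:
                found.extend(risky_string_paths(child, tool_name, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(risky_string_paths(child, tool_name, f"{path}[{index}]"))
    return found
```

For a Gmail-style payload, only `subject` and the nested `body` would be selected, while `id` and `url` are left untouched.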

## Development

```bash
uv sync --group dev
uv run pytest
```

## License

Apache-2.0 — see [LICENSE](./LICENSE).