https://github.com/tracebit-com/tracebit-canaries-skill
Agent skill to set up end-to-end security canary coverage using Tracebit Community Edition
- Host: GitHub
- URL: https://github.com/tracebit-com/tracebit-canaries-skill
- Owner: tracebit-com
- Created: 2026-03-17T11:46:05.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-08T13:41:44.000Z (2 months ago)
- Last Synced: 2026-05-03T14:37:21.931Z (about 1 month ago)
- Language: Shell
- Homepage:
- Size: 181 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# tracebit-canaries
**Human-supervised prompt injection detection and incident response for AI agents, powered by Tracebit Community Edition canary tokens.**
A skill for [OpenClaw](https://openclaw.ai) that deploys canary tokens as a deception layer around your agent's environment — making invisible attacks visible.
---
## What This Does
This skill gives your OpenClaw agent end-to-end deception-based canary coverage — from zero to human-supervised threat detection in a single run.
Your agent will:
1. **Sign up** for a free Tracebit Community Edition account (using the browser tool)
2. **Install** the Tracebit CLI (SHA256-verified from official GitHub Releases, no elevated privileges) and authenticate via OAuth
3. **Deploy** five types of decoy canary tokens (via the open-source Tracebit CLI, with your explicit approval):
   - AWS session credentials
   - SSH private keys
   - Browser session cookies
   - Login credentials
   - A monitored email address
4. **Configure** a heartbeat check that searches your inbox (read-only) every 30 minutes for Tracebit alert emails and triggers human-supervised incident response
5. **Test** the full pipeline and confirm it works end-to-end
From that point on, if anything uses a canary credential — a stolen key, an exfiltrated secret, a prompt injection that made your agent do something it shouldn't — you'll know within the next heartbeat cycle. Your agent investigates (read-only), sends you a structured report, and waits for your acknowledgement before taking any further action.
---
## Why Canaries?
Prompt injection is the #1 unsolved threat for AI agents. An attacker embeds instructions in a webpage, document, or API response your agent reads. The agent executes them. There's no error. No log. You never know it happened.
**Canary tokens solve the detection problem.** They don't prevent attacks — they make the invisible visible. A canary is a fake credential that looks real but does one thing: fires an alert the moment anything uses it.
If a canary fires, something read your agent's context and used what it found. That's the signal. Everything before that moment in your agent's history is the evidence.
Traditional defenses (content labeling, trust-tagging, input classification) catch the attacks that look like attacks. Canaries catch the ones that don't.
> In a 48-hour experiment with three different injection attempts, standard defenses caught 0. Canaries caught all 3.
---
## The Alert Response Loop
When a canary fires:
```
Canary used → Tracebit alert email → heartbeat inbox check (every ~30 min)
→ agent detects alert, notifies human immediately: "🚨 canary triggered, investigating"
→ read-only investigation (context review, indicator scan, severity assessment)
→ structured report to human with findings and recommendations
→ fresh canaries deployed after human acknowledgement
```
The agent detects the alert on its next heartbeat, works the problem (read-only), and reports back. Canary rotation only happens after you confirm.
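The inbox-check step of this loop can be pictured with a minimal sketch. It assumes the inbox has been exported as plain-text message files in a directory and that alert emails come from a `tracebit.com` address; neither detail is confirmed by this README, and `find_tracebit_alerts` is a hypothetical helper, not one of the skill's scripts.

```shell
#!/usr/bin/env bash
# Sketch only: list message files whose From header is a tracebit.com address.
# ASSUMPTIONS (not from the skill): messages are plain-text files in a
# directory, and alerts are sent from a tracebit.com (sub)domain.

find_tracebit_alerts() {
  local inbox_dir="$1"
  # -l: print names of matching files only
  # -i: case-insensitive match
  # -E: extended regex; the pattern is anchored at the end of the line so
  #     that look-alike domains such as tracebit.com.evil.example do not match
  grep -l -i -E '^From:.*@([a-z0-9-]+\.)*tracebit\.com>?[[:space:]]*$' \
    "$inbox_dir"/* 2>/dev/null
}
```

The function only reads files, in keeping with the read-only inbox access described above. A From header can be spoofed, so a match is a cue to investigate rather than proof of a genuine alert.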
---
## Attack Patterns Detected
| Attack | How It Works | Why Standard Defenses Miss It | Canary Detection |
|--------|-------------|-------------------------------|-----------------|
| **Behavior exploitation** | URL in a JSON `next_step` field — agent follows by trained habit, no explicit instruction | No injection keywords, looks like legitimate data | Canary URL fires on access |
| **Context pollution** | Canary credential appears in agent-generated code as a "placeholder" — model pattern-matches context into output | No injection at all, just leakage | Canary string appears in output |
| **Trust score gaming** | Malicious instructions framed as legitimate agent-to-agent communication | Classifiers trained on explicit injection won't flag it | Canary fires at exfiltration |
| **Prompt injection via role confusion** | Malicious instructions hidden in an email or external content, disguised as a system error or remediation step — agent executes them because untrusted content lands in a trusted role | Boundary markers rely on pattern matching; a single typo or character substitution bypasses them | Canary fires when injected instructions cause credential use or outbound calls |
| **Stealth exfiltration** | Credential stolen early, used days or weeks later | Attack happened long before detection | Canary fires whenever credential is used, regardless of when stolen |
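
As an illustration of the "canary string appears in output" case in the table above, the sketch below scans a file of agent output for known canary values. It is a hedged example, not the skill's detection mechanism: `scan_for_canaries`, the list-file format (one canary value per line), and the placeholder values in the usage comment are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: report which known canary strings appear in a text file.
# ASSUMPTION: canary values are kept one per line in a local list file.

scan_for_canaries() {
  local list="$1" text="$2"
  # Drop blank lines first: an empty pattern would match every line.
  # -F: treat patterns as fixed strings, not regexes
  # -f -: read the patterns from stdin
  # -o: print only the matched text
  grep -v '^[[:space:]]*$' "$list" | grep -o -F -f - "$text" | sort -u
}

# Usage (hypothetical file names):
#   leaked="$(scan_for_canaries canaries.txt agent-output.txt)"
#   [ -n "$leaked" ] && echo "Canary values found in output: $leaked"
```

The function prints each leaked value once and prints nothing when the text is clean, so callers check for non-empty output rather than the exit status.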
---
## What's Inside
```
tracebit-canaries/
├── SKILL.md                             # Agent instructions (loaded on activation)
├── scripts/
│   ├── install-tracebit.sh              # OS/arch-aware CLI installer (SHA256-verified)
│   ├── check-canaries.sh                # Show canary status and expiry
│   ├── test-canary.sh                   # Trigger a test alert
│   └── parse-tracebit-alert.sh          # Parse alert emails into structured JSON
├── references/
│   ├── incident-response-playbook.md    # Full 5-phase IR procedure
│   ├── attack-patterns.md               # Real-world patterns with mitigations
│   ├── canary-types.md                  # Each canary type: what it detects, where to place it
│   ├── security-compliance.md           # Safety posture, file traceability, enforcement model, full removal
│   ├── api-reference.md                 # API-based deployment (fallback if CLI unavailable)
│   └── troubleshooting.md               # Common issues and fixes
└── assets/
    └── canary-config.json               # Deployment config reference
```
---
## Security & Transparency
This skill is user-initiated and runs under user supervision. The user can interrupt or cancel at any step. Here's what it does and does not do:
| Concern | What actually happens |
|---------|----------------------|
| **Credentials deployed** | All are **fake decoy canary tokens** — they grant no access to any real system. Their sole purpose is to alert when used. |
| **Real credentials** | **Never read or modified.** The Tracebit CLI places canary tokens in standard credential locations (separate from existing credentials). |
| **Email access** | **Read-only** via the user's pre-authorized email tool. Used only for confirmation codes and alert email detection. No emails are sent, deleted, or modified. |
| **CLI installation** | Open-source Tracebit CLI from a [pinned GitHub release](https://github.com/tracebit-com/tracebit-community-cli). SHA256 checksum verification is mandatory and cannot be bypassed. No elevated privileges — macOS uses the standard system installer dialog. |
| **Signup password** | **Never shown in conversation output.** Written to a temp file with `600` permissions; user is instructed to reset it and delete the file. |
| **Network access** | Only contacts: `community.tracebit.com` (account/canary management) and `github.com` (one-time CLI download). No telemetry, no third-party endpoints. |
| **Human oversight** | The agent handles mechanical steps (form filling, CLI commands) so the user doesn't have to. Canary deployment and rotation require explicit human confirmation. Investigation is read-only. Memory file reads require human permission. |
| **Privileges** | Runs as the current user. No elevated privileges used by the skill or install script. |
| **Background service** | The Tracebit CLI daemon refreshes canary token expiry only — no other network calls or file access. Runs as current user, fully removable. |
For full details — including file traceability, enforcement model, and complete removal instructions — see `references/security-compliance.md`.
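
The signup-password handling in the table above (a temp file with `600` permissions) could look roughly like the sketch below. `write_private_file` and the file-name template are hypothetical; the skill's actual implementation may differ.

```shell
#!/usr/bin/env bash
# Sketch only: store a secret in an owner-only temp file and print its path.
# write_private_file is a hypothetical helper, not part of the skill.

write_private_file() {
  local secret="$1" path
  # umask 077 in the subshell guarantees the file is created without
  # group/world access, even before the explicit chmod below.
  path="$(umask 077 && mktemp "${TMPDIR:-/tmp}/tracebit-signup.XXXXXX")" || return 1
  chmod 600 "$path"
  # printf is a shell builtin, so the secret never shows up in `ps` output.
  printf '%s\n' "$secret" > "$path"
  # Print only the path; the secret itself is never echoed.
  printf '%s\n' "$path"
}
```

Only the path is written to standard output, which matches the goal of keeping the password out of conversation output. The caller remains responsible for deleting the file once the password has been reset.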
---
## Requirements
- **Email access** — a pre-authorized email account configured in OpenClaw (read-only inbox access for confirmation codes and alert detection)
- **OpenClaw** with a messaging channel configured (for canary alert notifications to the user)
The Tracebit CLI is downloaded and installed automatically by the skill from [its open-source repository](https://github.com/tracebit-com/tracebit-community-cli). The scripts also use standard tools (`curl`, `python3`, `jq`); these ship with most systems or can be installed from the system package manager.
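
For readers who want to see what SHA256 verification of a downloaded file involves, here is a minimal sketch. It is not the contents of `install-tracebit.sh`; `verify_sha256` is a hypothetical helper, and the expected digest would come from the release's published checksum.

```shell
#!/usr/bin/env bash
# Sketch only: succeed if FILE's SHA-256 digest equals EXPECTED_HEX.
# verify_sha256 is a hypothetical helper, not the skill's installer code.

verify_sha256() {
  local file="$1" expected="$2" actual
  if command -v sha256sum >/dev/null 2>&1; then
    # GNU coreutils (most Linux systems)
    actual="$(sha256sum "$file" 2>/dev/null | awk '{print $1}')"
  else
    # macOS and other systems that ship shasum
    actual="$(shasum -a 256 "$file" 2>/dev/null | awk '{print $1}')"
  fi
  # Refuse to pass when hashing failed and produced no digest.
  [ -n "$actual" ] || return 1
  # Compare case-insensitively, since hex digests may be upper or lower case.
  [ "$(printf '%s' "$actual" | tr 'A-F' 'a-f')" = \
    "$(printf '%s' "$expected" | tr 'A-F' 'a-f')" ]
}

# Usage (hypothetical file name and variable):
#   verify_sha256 tracebit-cli.tar.gz "$EXPECTED_SHA256" || { echo "checksum mismatch" >&2; exit 1; }
```

Failing closed when no digest can be computed mirrors the README's statement that verification is mandatory: a missing file or a missing hashing tool is treated as a mismatch.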
---
## Usage
Just ask your agent:
> "Set up Tracebit canaries"
or
> "Deploy security canaries on this machine"
The agent handles the mechanical steps from there — account creation through to a working alert pipeline. It will ask for your confirmation before deploying canary tokens and may ask for help with a CAPTCHA if one appears.
---
## After Setup
The Tracebit CLI runs a **background service** that automatically refreshes canary credentials before they expire. Under normal operation you don't need to do anything: credentials stay fresh for as long as the service is running.
As a safeguard in case the service stops, add a weekly check to your agent's heartbeat:
```markdown
## Canary Check (weekly)
- Run: tracebit show
- If any expired: tracebit deploy all && tracebit deploy email
```
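
The weekly check above can also be expressed as a small shell function. This sketch assumes that `tracebit show` prints the word "expired" for lapsed canaries; the README does not document the output format, so adjust the match to the real output. `has_expired_canary` and `weekly_canary_check` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: flag expired canaries without redeploying automatically.
# ASSUMPTION: `tracebit show` marks lapsed canaries with the word "expired".

has_expired_canary() {
  # Reads status text on stdin; succeeds if the word "expired" appears.
  grep -q -i -w 'expired'
}

weekly_canary_check() {
  if tracebit show | has_expired_canary; then
    # Deployment requires explicit human confirmation, so only report.
    echo "Expired canaries found. After approval, run: tracebit deploy all && tracebit deploy email"
    return 1
  fi
  echo "All canaries look current."
}
```

The function reports instead of redeploying because the skill requires human confirmation before canary deployment and rotation. Call `weekly_canary_check` from the heartbeat and surface its message to the user.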
---
## Tracebit Community Edition
Free forever. No credit card required.
- Signup: [community.tracebit.com](https://community.tracebit.com)
- CLI source: [github.com/tracebit-com/tracebit-community-cli](https://github.com/tracebit-com/tracebit-community-cli)
- Supports: AWS credentials, SSH keys, browser cookies, login credentials, email canaries
---
## License
MIT