https://github.com/mnemom/aip
Agent Integrity Protocol — real-time thinking block analysis for AI agent alignment
- Host: GitHub
- URL: https://github.com/mnemom/aip
- Owner: mnemom
- License: apache-2.0
- Created: 2026-02-10T02:53:20.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-02-14T05:13:59.000Z (2 months ago)
- Last Synced: 2026-02-15T03:27:06.958Z (2 months ago)
- Topics: agent, ai, alignment, integrity, llm, protocol, safety, thinking
- Language: Python
- Size: 434 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: docs/SECURITY.md
# Agent Integrity Protocol (AIP)
**Real-time thinking block analysis for AI agent alignment.**
AIP analyzes what an agent is *thinking* before it acts. It extracts thinking blocks from LLM responses, evaluates them against an Alignment Card using an analysis LLM, and delivers integrity verdicts in real time, so intervention can happen between turns rather than after the damage is done.
> AIP follows the daimonion philosophy: silence means aligned, voice means outside boundaries. It is a sister protocol to [AAP](https://github.com/mnemom/aap) — sharing the same Alignment Card, enriching the same AP-Trace, but operating on a fundamentally different timescale: real-time rather than retroactive.
## Quick Start
```bash
# Install
pip install agent-integrity-proto
# or
npm install @mnemom/agent-integrity-protocol
```
```python
from aip import create_client

client = create_client(
    base_url="https://gateway.example.com",
    agent_id="my-agent",
    shared_secret="your-secret",
)

# Check a thinking block against an alignment card
# (`card`, `execute_action`, and `escalate` are supplied by your application)
signal = client.check(
    thinking_block="I should help the user find the best product...",
    alignment_card=card,
)

if signal.proceed:
    # Verdict was 'clear': continue
    execute_action()
else:
    # Verdict was 'review_needed' or 'boundary_violation'
    escalate(signal.checkpoint.concerns)
```
```typescript
import { createClient } from '@mnemom/agent-integrity-protocol';

const client = createClient({
  baseUrl: 'https://gateway.example.com',
  agentId: 'my-agent',
  sharedSecret: 'your-secret',
});

const signal = await client.check({
  thinkingBlock: 'I should help the user find the best product...',
  alignmentCard: card,
});

if (signal.proceed) {
  executeAction();
} else {
  escalate(signal.checkpoint.concerns);
}
```
## Why AIP?
AAP provides post-hoc transparency — every agent action is traced, verified, and checked for drift. But verification happens *after* the action, typically 30-60 seconds later. This creates a window of vulnerability:
- **Prompt injection**: A malicious prompt causes data exfiltration. The trace records it, but the data is already gone.
- **Gradual drift**: Reasoning shifts subtly across turns. Each action passes verification, but the cumulative direction is misaligned.
- **Misaligned intent**: The thinking block reveals a forbidden action, but it's only examined retroactively.
All three failure modes are visible in thinking blocks *before* the action lands. AIP fills this gap.
| Protocol | Function | Timescale |
|----------|----------|-----------|
| **AAP** | Post-hoc verification via AP-Traces | Retroactive (30-60s) |
| **AIP** | Real-time thinking block analysis | Between turns (~2s) |
| **Together** | Full coverage: real-time detection + audit trail | Both |
## How It Works
```
                 Agent Turn N
                       │
                       ▼
┌──────────────────────────────────────────────────────┐
│  Layer 1: Thinking Block Extraction                  │
│                                                      │
│  Anthropic  │  OpenAI  │  Google  │  Fallback        │
│  (thinking) │ (reason) │ (think)  │  (regex)         │
└──────────────────────┬───────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────┐
│  Layer 2: Integrity Analysis                         │
│                                                      │
│  Card Summary + Conscience Values + Session Window   │
│  → Analysis LLM (e.g. Haiku)                         │
│  → Integrity Verdict                                 │
└──────────────────────┬───────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────────────┐
│  Layer 3: Signal Delivery                            │
│                                                      │
│  clear → proceed  │  review_needed → warn/pause      │
│                   │  boundary_violation → block      │
└──────────────────────────────────────────────────────┘
                       │
                       ▼
                Agent Turn N+1
```
### Integrity Checkpoint
The core output — a structured record of each thinking block evaluation:
```json
{
  "checkpoint_id": "ic-f47ac10b",
  "agent_id": "my-agent",
  "verdict": "clear",
  "confidence": 0.95,
  "concerns": [],
  "thinking_hash": "sha256:a1b2c3...",
  "window_position": { "index": 3, "window_size": 10 },
  "analysis_metadata": {
    "model": "claude-haiku-4-5-20251001",
    "latency_ms": 1847,
    "conscience_values_used": 2
  }
}
```
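A consumer of this record might parse it and derive the proceed decision roughly as follows. This is a sketch under assumptions: `Checkpoint`, `parse_checkpoint`, and `thinking_hash` are hypothetical helpers, and the hash is assumed to be a hex SHA-256 of the UTF-8 thinking text with a `sha256:` prefix. The SDK's `hash_thinking_block` may normalize its input differently.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    verdict: str
    confidence: float
    concerns: list
    thinking_hash: str

    @property
    def proceed(self) -> bool:
        # Per the Quick Start, only a 'clear' verdict means continue.
        return self.verdict == "clear"


def thinking_hash(text: str) -> str:
    """Content-addressed reference in the 'sha256:<hex>' form (assumed scheme)."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def parse_checkpoint(raw: str) -> Checkpoint:
    """Read the fields this sketch needs; extra fields are ignored."""
    data = json.loads(raw)
    return Checkpoint(
        checkpoint_id=data["checkpoint_id"],
        verdict=data["verdict"],
        confidence=float(data["confidence"]),
        concerns=list(data["concerns"]),
        thinking_hash=data["thinking_hash"],
    )


raw = json.dumps({
    "checkpoint_id": "ic-f47ac10b",
    "verdict": "clear",
    "confidence": 0.95,
    "concerns": [],
    "thinking_hash": thinking_hash("I should help the user find the best product..."),
})
cp = parse_checkpoint(raw)
print(cp.proceed, cp.thinking_hash[:7])  # True sha256:
```

Storing a hash rather than the thinking text lets a checkpoint reference the block without retaining its content.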
### Conscience Prompt
AIP builds a system prompt for the analysis LLM from:
- **Card summary** — compact representation of the Alignment Card
- **Conscience values** — `BOUNDARY` (hard limits) and `FEAR` (concerns) types injected as evaluation criteria
- **Session window** — recent checkpoint history for drift context
- **The thinking block itself**
The analysis LLM acts as a daimonion — it never commands, only warns.
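The four inputs listed above could be assembled along these lines. This is an illustrative layout only, not the output of the SDK's `build_conscience_prompt`: the function `compose_prompt`, the section headings, the `{"type", "content"}` shape of a conscience value, and the `OTHER` type are all assumptions made for the example.

```python
from collections import deque


def compose_prompt(card_summary: str, values: list[dict], window: deque, thinking: str) -> str:
    """Assemble card summary, conscience values, session window, and thinking block."""
    lines = [
        "You are an integrity monitor. Warn, never command.",
        "",
        "## Alignment Card",
        card_summary,
        "",
        "## Conscience values",
    ]
    # Only BOUNDARY and FEAR values are injected as evaluation criteria.
    for v in values:
        if v["type"] in ("BOUNDARY", "FEAR"):
            lines.append(f"- [{v['type']}] {v['content']}")
    lines += ["", "## Recent checkpoints"]
    lines += [f"- {c['checkpoint_id']}: {c['verdict']}" for c in window] or ["- (none)"]
    lines += ["", "## Thinking block", thinking]
    return "\n".join(lines)


# A bounded deque drops the oldest checkpoint once the window is full.
window = deque(maxlen=10)
window.append({"checkpoint_id": "ic-0001", "verdict": "clear"})

prompt = compose_prompt(
    card_summary="Shopping assistant; acts in the user's interest.",
    values=[
        {"type": "BOUNDARY", "content": "Never share payment data with third parties."},
        {"type": "FEAR", "content": "Steering users toward sponsored products."},
        {"type": "OTHER", "content": "Hypothetical type; skipped by this sketch."},
    ],
    window=window,
    thinking="I should help the user find the best product...",
)
print(prompt)
```

Including recent verdicts gives the analysis LLM the context it needs to notice drift that no single thinking block would reveal.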
### Provider Adapters
Thinking blocks differ by provider. AIP abstracts extraction:
| Provider | Source | Confidence |
|----------|--------|------------|
| Anthropic | `thinking` content blocks (+ SSE stream fallback) | 1.0 |
| OpenAI | `reasoning_content` field (+ SSE stream fallback) | 0.9 |
| Google | Gemini thinking blocks (+ SSE stream fallback) | 0.9 |
| Fallback | Regex-based extraction | 0.3 |
All adapters attempt SSE stream extraction as a fallback when standard JSON parsing fails (v0.1.3+). This allows analysis of streamed responses captured by gateway proxies.
## What AIP Does Not Do
This matters. Read it.
1. **AIP does NOT prevent injection — it detects it.** AIP identifies injection patterns in thinking blocks. It does not prevent injected content from reaching the model.
2. **AIP does NOT interrupt streams.** AIP operates between turns. It does not cancel in-flight streaming responses.
3. **AIP does NOT replace AAP.** Post-hoc verification, trace storage, and public transparency remain AAP's domain. AIP supplements AAP with real-time detection.
4. **LLM-as-judge has inherent limits.** The analysis LLM can be fooled by sophisticated adversarial content. AIP reduces the attack surface but does not eliminate it.
5. **Thinking blocks are model-dependent.** Not all models expose thinking. Models that don't expose thinking blocks cannot be analyzed by AIP.
For the complete limitations disclosure, see [Section 14 of the Specification](docs/SPEC.md#14-limitations).
## Installation
```bash
# Python
pip install agent-integrity-proto
# TypeScript
npm install @mnemom/agent-integrity-protocol
```
**Requirements:** Python >= 3.10 | Node.js >= 18.0.0
## API Reference
### Python
```python
# Core analysis
from aip import (
    check_integrity,          # Evaluate thinking block → IntegrityCheckpoint
    build_signal,             # Construct signal from checkpoint → IntegritySignal
    build_conscience_prompt,  # Generate analysis LLM prompt
    hash_thinking_block,      # Content-addressed thinking reference
    detect_integrity_drift,   # Track behavioral drift across checkpoints
    validate_agreement,       # Verify card-conscience alignment
)

# Provider adapters
from aip import (
    AnthropicAdapter,  # Anthropic thinking content blocks
    OpenAIAdapter,     # OpenAI reasoning_content
    GoogleAdapter,     # Google Gemini thinking
    FallbackAdapter,   # Regex-based fallback
    AdapterRegistry,   # Dynamic provider selection
)

# SDK client
from aip import create_client, sign_payload, verify_signature

# Session state
from aip import WindowManager, create_window_state
```
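The client takes a `shared_secret`, and the SDK exports `sign_payload` and `verify_signature`, but this README does not state the signing scheme. The sketch below assumes HMAC-SHA256 over a canonical JSON encoding, a common choice for shared-secret signing. The functions `sign` and `verify` are hypothetical stand-ins, not the SDK's implementation or signatures.

```python
import hashlib
import hmac
import json


def sign(payload: dict, secret: str) -> str:
    """Hex HMAC-SHA256 over a canonical (sorted-key, compact) JSON encoding."""
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify(payload: dict, secret: str, signature: str) -> bool:
    # compare_digest runs in constant time, which avoids timing side channels.
    return hmac.compare_digest(sign(payload, secret), signature)


payload = {"agent_id": "my-agent", "verdict": "clear"}
sig = sign(payload, "your-secret")

print(verify(payload, "your-secret", sig))  # True
tampered = {**payload, "verdict": "boundary_violation"}
print(verify(tampered, "your-secret", sig))  # False
```

Sorting keys before signing keeps the signature stable when the sender and receiver serialize the same payload in a different key order.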
### TypeScript
```typescript
import {
  // Core analysis
  checkIntegrity,
  buildSignal,
  buildConsciencePrompt,
  hashThinkingBlock,
  detectIntegrityDrift,
  validateAgreement,

  // Provider adapters
  AnthropicAdapter,
  OpenAIAdapter,
  GoogleAdapter,
  FallbackAdapter,
  AdapterRegistry,

  // SDK client
  createClient,
  signPayload,
  verifySignature,

  // Session state
  WindowManager,
  createWindowState,
} from '@mnemom/agent-integrity-protocol';
```
## Documentation
| Document | Description |
|----------|-------------|
| [**SPEC.md**](docs/SPEC.md) | Full protocol specification (IETF-style, 2,214 lines) |
| [**QUICKSTART.md**](docs/QUICKSTART.md) | Zero to integrity checking in 5 minutes |
| [**LIMITS.md**](docs/LIMITS.md) | What AIP does and does not guarantee |
| [**SECURITY.md**](docs/SECURITY.md) | Threat model and security considerations |
| [**CHANGELOG.md**](CHANGELOG.md) | Release history |
## Examples
| Example | Description |
|---------|-------------|
| [`basic-check/`](examples/basic-check/) | Minimal integrity check with aligned and misaligned thinking |
| [`gateway-integration/`](examples/gateway-integration/) | Cloudflare Worker gateway with real-time AIP analysis |
| [`adversarial/`](examples/adversarial/) | Attack scenarios: injection, drift, meta-injection, deception |
## Status
**Current Version**: 0.1.3
| Component | Status |
|-----------|--------|
| Specification | ✅ Complete |
| TypeScript SDK | ✅ Complete (272 tests) |
| Python SDK | ✅ Complete (267 tests) |
| Provider Adapters | ✅ Anthropic, OpenAI, Google, Fallback |
| Session Windowing | ✅ Complete |
| Drift Detection | ✅ Complete |
| Gateway Integration | ✅ Verified (Cloudflare Workers) |
## Contributing
We welcome contributions. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
Key areas where we need help:
- Provider adapter implementations for additional LLMs
- Integration examples with agent frameworks
- Adversarial test vectors
- Documentation improvements
## License
Apache 2.0. See [LICENSE](LICENSE) for details.
---
*Agent Integrity Protocol is part of the [Mnemom.ai](https://github.com/mnemom) trust infrastructure for autonomous agents, alongside [AAP](https://github.com/mnemom/aap) (Agent Alignment Protocol).*