https://github.com/amiable-dev/llm-council
The LLM Council works together to answer your hardest questions
llm-agents llm-as-a-judge llm-council llm-tools mcp mcp-server
- Host: GitHub
- URL: https://github.com/amiable-dev/llm-council
- Owner: amiable-dev
- License: mit
- Created: 2025-11-29T08:40:17.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2026-05-26T22:40:08.000Z (27 days ago)
- Last Synced: 2026-05-27T00:14:52.550Z (27 days ago)
- Topics: llm-agents, llm-as-a-judge, llm-council, llm-tools, mcp, mcp-server
- Language: Python
- Homepage: https://llm-council.dev/
- Size: 2.09 MB
- Stars: 26
- Watchers: 2
- Forks: 7
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Support: SUPPORT.md
- Governance: GOVERNANCE.md
# LLM Council Core
A multi-LLM deliberation system where multiple LLMs collaboratively answer questions through peer review and synthesis. Available as a Python library, MCP server, or HTTP API.
## What is This?
Instead of asking a single LLM for answers, this MCP server:
1. **Stage 1**: Sends your question to multiple LLMs in parallel (GPT, Claude, Gemini, Grok, etc.)
2. **Stage 2**: Each LLM reviews and ranks the other responses (anonymized to prevent bias)
3. **Stage 3**: A Chairman LLM synthesizes all responses into a final, high-quality answer
## Quick Deploy
Deploy your own LLM Council instance:
| Platform | Deploy | Best For |
|----------|--------|----------|
| **Railway** | [Deploy on Railway](https://railway.com/deploy/llm-council?referralCode=K9dsYj) | Production, webhooks |
| **Render** | [Deploy to Render](https://render.com/deploy?repo=https://github.com/amiable-dev/llm-council) | Evaluation, free tier |
**Required Environment Variables:**
- `OPENROUTER_API_KEY` - Your [OpenRouter](https://openrouter.ai) API key
- `LLM_COUNCIL_API_TOKEN` - A secure token for API authentication (generate with `openssl rand -hex 16`)
> **Note**: Railway is recommended for [n8n integration](https://llm-council.dev/integrations/n8n/) because it has no cold start. The Render free tier spins down after 15 minutes of inactivity, which may cause webhook timeouts.
For detailed deployment instructions, see the [Deployment Guide](https://llm-council.dev/deployment/).
## Credits & Attribution
This project is a derivative work based on the original [llm-council](https://github.com/karpathy/llm-council) by Andrej Karpathy.
Karpathy's original README stated:
> "I'm not going to support it in any way, it's provided here as is for other people's inspiration and I don't intend to improve it. Code is ephemeral now and libraries are over, ask your LLM to change it in whatever way you like."
...the irony of producing a derivative work that packages the core concept for broader use via the Model Context Protocol!
## Installation
```bash
pip install "llm-council-core[mcp]"
```
For core library only (no MCP server):
```bash
pip install llm-council-core
```
## Setup
### 1. Choose Your Gateway
The council supports three gateway options for accessing LLMs:
| Gateway | Best For | API Keys Needed |
|---------|----------|-----------------|
| **OpenRouter** (default) | Easiest setup, 100+ models via single key | `OPENROUTER_API_KEY` |
| **Requesty** | BYOK mode, analytics, cost tracking | `REQUESTY_API_KEY` + provider keys |
| **Direct** | Maximum control, direct provider APIs | Provider keys (Anthropic, OpenAI, Google) |
**Quick Start (OpenRouter):**
```bash
# Sign up at openrouter.ai and get your API key
export OPENROUTER_API_KEY="sk-or-v1-..."
```
**Direct Provider Access:**
```bash
# Use your existing provider API keys directly
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="..."
export LLM_COUNCIL_DEFAULT_GATEWAY=direct
```
**Requesty with BYOK:**
```bash
export REQUESTY_API_KEY="..."
export ANTHROPIC_API_KEY="sk-ant-..." # Your own key, routed through Requesty
export LLM_COUNCIL_DEFAULT_GATEWAY=requesty
```
### 2. Store Your API Keys Securely
Choose one of these options (in order of recommendation):
#### Option A: System Keychain (Most Secure)
Store keys encrypted in your OS keychain:
```bash
# Install with keychain support
pip install "llm-council-core[mcp,secure]"
# Store key securely (prompts for key, no echo)
llm-council setup-key
# For CI/CD automation, pipe from stdin:
echo "$OPENROUTER_API_KEY" | llm-council setup-key --stdin
```
#### Option B: Environment Variables
Set in your shell profile (`~/.zshrc`, `~/.bashrc`):
```bash
# OpenRouter (default gateway)
export OPENROUTER_API_KEY="sk-or-v1-..."
# Or use direct provider APIs
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="..."
```
#### Option C: Environment File
Create a `.env` file (ensure it's in `.gitignore`):
```bash
# For OpenRouter
echo "OPENROUTER_API_KEY=sk-or-v1-..." > .env
# Or for direct APIs
cat > .env << 'EOF'
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...
GOOGLE_API_KEY=...
LLM_COUNCIL_DEFAULT_GATEWAY=direct
EOF
```
> **Security Note**: Never put API keys in command-line arguments or JSON config files that might be committed to version control.
### 3. Customize Models (Optional)
You can customize which models participate in the council using one of the options below. If you configure nothing, the defaults listed under Option 4 apply.
#### Option 1: Environment Variables (Recommended)
```bash
# Comma-separated list of council models
export LLM_COUNCIL_MODELS="openai/gpt-4,anthropic/claude-3-opus,google/gemini-pro"
# Chairman model (synthesizes final response)
export LLM_COUNCIL_CHAIRMAN="anthropic/claude-3-opus"
```
#### Option 2: YAML Configuration (Recommended)
Create `llm_council.yaml` in your project root or `~/.config/llm-council/llm_council.yaml`:
```yaml
council:
  # Tier configuration (ADR-022)
  tiers:
    default: high
    pools:
      quick:
        models:
          - openai/gpt-4o-mini
          - anthropic/claude-3-5-haiku-20241022
        timeout_seconds: 30
      balanced:
        models:
          - openai/gpt-4o
          - anthropic/claude-3-5-sonnet-20241022
        timeout_seconds: 90
      high:
        models:
          - openai/gpt-4o
          - anthropic/claude-opus-4-7
          - google/gemini-3-pro
        timeout_seconds: 180

  # Triage configuration (ADR-020)
  triage:
    enabled: false
    wildcard:
      enabled: true
    prompt_optimization:
      enabled: true

  # Gateway configuration (ADR-023, ADR-025a)
  gateways:
    default: openrouter
    fallback:
      enabled: true
      chain: [openrouter, ollama]  # Can use Ollama as fallback

    # Provider-specific configuration
    providers:
      ollama:
        enabled: true
        base_url: http://localhost:11434
        timeout_seconds: 120.0
        hardware_profile: recommended  # minimum|recommended|professional|enterprise
      openrouter:
        enabled: true
        base_url: https://openrouter.ai/api/v1/chat/completions

  # Webhook notifications (ADR-025a)
  webhooks:
    enabled: false  # Opt-in
    timeout_seconds: 5.0
    max_retries: 3
    https_only: true
    default_events:
      - council.complete
      - council.error

  observability:
    log_escalations: true
```
**Priority**: YAML config > Environment variables > Defaults
#### Option 3: JSON Configuration (Legacy)
Create `~/.config/llm-council/config.json`:
```json
{
  "council_models": [
    "openai/gpt-4-turbo",
    "anthropic/claude-3-opus",
    "google/gemini-pro",
    "meta-llama/llama-3-70b-instruct"
  ],
  "chairman_model": "anthropic/claude-3-opus",
  "synthesis_mode": "consensus",
  "exclude_self_votes": true,
  "style_normalization": false,
  "max_reviewers": null
}
```
#### Option 4: Use Defaults
If you don't configure anything, these defaults are used:
- Council: GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5, Grok 4
- Chairman: Gemini 3 Pro
- Mode: consensus
- Self-vote exclusion: enabled
**Finding Models**:
- OpenRouter: [openrouter.ai/models](https://openrouter.ai/models)
- Anthropic: [docs.anthropic.com/models](https://docs.anthropic.com/en/docs/about-claude/models)
- OpenAI: [platform.openai.com/docs/models](https://platform.openai.com/docs/models)
- Google: [ai.google.dev/gemini-api/docs/models](https://ai.google.dev/gemini-api/docs/models/gemini)
## Usage
### With Claude Code
```bash
# First, store your API key securely (one-time setup)
llm-council setup-key
# Then add the MCP server (key is read from keychain or environment)
claude mcp add --transport stdio llm-council --scope user -- llm-council
```
Then in Claude Code:
```
Consult the LLM council about best practices for error handling
```
### With Claude Desktop
First ensure your API key is available (via keychain, environment variable, or `.env` file).
Add to `~/Library/Application Support/Claude/claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "llm-council": {
      "command": "llm-council"
    }
  }
}
```
> **Note**: No `env` block needed—the key is resolved from your system keychain or environment automatically.
### With Other MCP Clients
Any MCP client can use the server by running:
```bash
llm-council
```
## Available Tools
### `consult_council`
Ask the LLM council a question and get synthesized guidance.
**Arguments:**
- `query` (string, required): The question to ask the council
- `confidence` (string, optional): Response quality level (default: "high")
  - `"quick"`: Fast models (mini/flash/haiku), ~30 seconds - fast responses for simple questions
  - `"balanced"`: Mid-tier models (GPT-4o, Sonnet), ~90 seconds - good balance of speed and quality
  - `"high"`: Full council (Opus, GPT-4o), ~180 seconds - comprehensive deliberation
  - `"reasoning"`: Deep thinking models (GPT-5.2, o1, DeepSeek-R1), ~600 seconds - complex reasoning
- `include_details` (boolean, optional): Include individual model responses and rankings (default: false)
- `verdict_type` (string, optional): Type of verdict to render (default: "synthesis")
  - `"synthesis"`: Free-form natural language synthesis
  - `"binary"`: Go/no-go decision (approved/rejected) with confidence score
  - `"tie_breaker"`: Chairman resolves deadlocked decisions
- `include_dissent` (boolean, optional): Extract minority opinions from Stage 2 (default: false)
**Example:**
```
Use consult_council to ask: "What are the trade-offs between microservices and monolithic architecture?"
```
**Example with confidence level:**
```
Use consult_council with confidence="quick" to ask: "What's the syntax for a Python list comprehension?"
```
**Example with Jury Mode (binary verdict):**
```
Use consult_council with verdict_type="binary" and include_dissent=true to ask: "Should we approve this PR that adds caching?"
```
### `council_health_check`
Verify the council is working before expensive operations. Returns API connectivity status, configured models, and estimated response times.
**Arguments:** None
**Returns:**
- `api_key_configured`: Whether an API key was found
- `key_source`: Where the key came from ("environment", "keychain", or "config_file")
- `council_size`: Number of models in the council
- `estimated_duration`: Expected response times for each confidence level
- `ready`: Whether the council is ready to accept queries
**Example:**
```
Run council_health_check to verify the LLM council is working
```
## How It Works
The council uses a multi-stage process inspired by ensemble methods and peer review:
```
User Query
↓
┌─────────────────────────────────────────────┐
│ STAGE 1: Independent Responses │
│ • All council models queried in parallel │
│ • No knowledge of other responses │
│ • Graceful degradation if some fail │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STAGE 1.5: Style Normalization (optional) │
│ • Rewrites responses in neutral style │
│ • Removes AI preambles and fingerprints │
│ • Strengthens anonymization │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STAGE 2: Anonymous Peer Review │
│ • Responses labeled A, B, C (randomized) │
│ • XML sandboxing prevents prompt injection │
│ • JSON-structured rankings with scores │
│ • Self-votes excluded from aggregation │
└─────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────┐
│ STAGE 3: Chairman Synthesis │
│ • Receives all responses + rankings │
│ • Consensus mode: single best answer │
│ • Debate mode: highlights disagreements │
└─────────────────────────────────────────────┘
↓
Final Response + Metadata
```
This approach helps surface diverse perspectives, identify consensus, and produce more balanced, well-reasoned answers.
## Advanced Features
### Self-Vote Exclusion
By default, each model's vote for its own response is excluded from the aggregate rankings. This prevents self-preference bias.
```bash
export LLM_COUNCIL_EXCLUDE_SELF_VOTES=true # default
```
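As a rough illustration of the idea, the sketch below aggregates Stage 2 rankings with a normalized Borda count and drops each reviewer's vote for its own response. The function name `aggregate_rankings`, the data shape, and the normalization are assumptions made for this example; they are not taken from the library's code.

```python
from collections import defaultdict

def aggregate_rankings(rankings, exclude_self_votes=True):
    """Average normalized Borda points per model, optionally skipping self-votes.

    `rankings` maps each reviewer to its ordered list of models, best first.
    (Hypothetical data shape, for illustration only.)
    """
    points = defaultdict(list)
    for reviewer, ordered in rankings.items():
        n = len(ordered)
        for position, model in enumerate(ordered):
            if exclude_self_votes and model == reviewer:
                continue  # ignore a reviewer's opinion of its own answer
            # Best position earns 1.0, worst position earns 0.0
            points[model].append((n - 1 - position) / (n - 1) if n > 1 else 1.0)
    return {model: sum(p) / len(p) for model, p in points.items()}

rankings = {
    "model-a": ["model-a", "model-b", "model-c"],  # ranks itself first
    "model-b": ["model-b", "model-a", "model-c"],  # ranks itself first
    "model-c": ["model-a", "model-b", "model-c"],
}
scores = aggregate_rankings(rankings)
scores_with_self = aggregate_rankings(rankings, exclude_self_votes=False)
```

In this toy input, both `model-a` and `model-b` place themselves first; excluding those votes lowers both averages, so the aggregate reflects only what peers thought.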
### Synthesis Modes
**Consensus Mode** (default): Chairman synthesizes a single best answer.
**Debate Mode**: Chairman highlights areas of agreement, key disagreements, and trade-offs between perspectives.
```bash
export LLM_COUNCIL_MODE=debate
```
### Style Normalization (Stage 1.5)
Optional preprocessing that rewrites all responses in a neutral style before peer review. This strengthens anonymization by removing stylistic "fingerprints" that might allow models to recognize each other.
```bash
export LLM_COUNCIL_STYLE_NORMALIZATION=true
export LLM_COUNCIL_NORMALIZER_MODEL=google/gemini-2.0-flash-001 # fast/cheap
```
### Stratified Sampling (Large Councils)
For councils with more than 5 models, you can limit the number of reviewers per response to reduce API costs (O(N²) → O(N×k)):
```bash
export LLM_COUNCIL_MAX_REVIEWERS=3
```
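The cost reduction can be seen by counting (reviewer, response) pairs. The sketch below is an assumption about how reviewer sampling might work, not the library's implementation; `assign_reviewers` is a hypothetical helper.

```python
import random

def assign_reviewers(models, max_reviewers=None, seed=None):
    """Pick reviewers for each response, never the response's own author."""
    rng = random.Random(seed)
    assignments = {}
    for author in models:
        candidates = [m for m in models if m != author]
        if max_reviewers is not None and max_reviewers < len(candidates):
            candidates = rng.sample(candidates, max_reviewers)
        assignments[author] = candidates
    return assignments

models = [f"model-{i}" for i in range(8)]
full = assign_reviewers(models)
sampled = assign_reviewers(models, max_reviewers=3, seed=0)

full_pairs = sum(len(r) for r in full.values())        # N * (N - 1) pairs
sampled_pairs = sum(len(r) for r in sampled.values())  # N * k pairs
```

With 8 models, full review needs 8 × 7 = 56 pairs, while `max_reviewers=3` needs 8 × 3 = 24.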
### Reliability Features
The council includes built-in reliability features for long-running operations:
**Tiered Timeouts**: Graceful degradation under time pressure:
- Per-model soft deadline: 15s (start planning fallback)
- Per-model hard deadline: 25s (abandon slow model)
- Global synthesis trigger: 40s (must start synthesis)
- Response deadline: 50s (must return something)
**Partial Results**: If some models time out, the council returns results from the models that responded, with a clear warning indicating which models were excluded.
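A minimal sketch of the per-model hard deadline and partial-results behavior, using only `asyncio` from the standard library. The helper names are hypothetical, and the soft deadline and global triggers listed above are not modeled here.

```python
import asyncio

async def query_with_deadline(name, call, hard_deadline):
    """Return (name, response), or (name, None) when the hard deadline passes."""
    try:
        return name, await asyncio.wait_for(call(), timeout=hard_deadline)
    except asyncio.TimeoutError:
        return name, None

async def gather_partial(calls, hard_deadline=25.0):
    """Query all models in parallel and keep whatever arrived in time."""
    results = await asyncio.gather(
        *(query_with_deadline(name, call, hard_deadline) for name, call in calls.items())
    )
    responses = {name: r for name, r in results if r is not None}
    excluded = [name for name, r in results if r is None]
    return responses, excluded

# Stub coroutines standing in for real model calls
async def fast():
    await asyncio.sleep(0.01)
    return "fast answer"

async def slow():
    await asyncio.sleep(1.0)
    return "slow answer"

responses, excluded = asyncio.run(
    gather_partial({"fast-model": fast, "slow-model": slow}, hard_deadline=0.1)
)
```

The `excluded` list is what a caller would surface in the warning about missing models.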
**Confidence Levels**: Use the `confidence` parameter to trade off speed vs. thoroughness:
- `quick`: ~20-30 seconds (fastest models)
- `balanced`: ~45-60 seconds (most models)
- `high`: ~60-90 seconds (full council, default)
**Progress Feedback**: During deliberation, progress updates show which models have responded and which are still pending:
```
✓ claude-opus-4.7 (1/4) | waiting: gpt-5.1, gemini-3-pro, grok-4
✓ gemini-3-pro (2/4) | waiting: gpt-5.1, grok-4
```
### Jury Mode (ADR-025b)
Transform the council from a "summary generator" to a "decision engine" with structured verdicts.
**Verdict Types:**
| Mode | Output | Use Case |
|------|--------|----------|
| `synthesis` (default) | Free-form synthesis | General questions, exploration |
| `binary` | approved/rejected + confidence | CI/CD gates, PR reviews, policy checks |
| `tie_breaker` | Chairman decides on deadlock | Contentious decisions |
**Binary Verdict Mode:**
Returns structured go/no-go decisions with confidence scores:
```python
# MCP Tool
result = await consult_council(
    query="Should we approve this architectural change?",
    verdict_type="binary"
)

# Returns:
{
    "verdict": "approved",  # or "rejected"
    "confidence": 0.75,  # 0.0-1.0 based on council agreement
    "rationale": "Council agreed the change improves modularity..."
}
```
**Tie-Breaker Mode:**
When the council is deadlocked (top scores within 0.1 of each other), the chairman casts the deciding vote:
```python
result = await consult_council(
    query="Option A vs Option B for caching strategy?",
    verdict_type="binary"
)

# If deadlocked, returns:
{
    "verdict": "approved",
    "confidence": 0.60,
    "rationale": "Chairman resolved deadlock based on...",
    "deadlocked": True  # Flag indicates tie-breaker was needed
}
```
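The deadlock condition itself is simple to express. The sketch below assumes the check compares the two highest aggregate scores and treats a gap at or below the threshold as a deadlock; `is_deadlocked` is a hypothetical name, and the exact comparison the library uses may differ.

```python
def is_deadlocked(aggregate_scores, threshold=0.1):
    """True when the two best aggregate scores are within `threshold` of each other."""
    top_two = sorted(aggregate_scores.values(), reverse=True)[:2]
    return len(top_two) == 2 and (top_two[0] - top_two[1]) <= threshold

close_race = is_deadlocked({"A": 0.62, "B": 0.55, "C": 0.20})
clear_winner = is_deadlocked({"A": 0.90, "B": 0.50, "C": 0.20})
```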
**Constructive Dissent:**
Extract minority opinions from Stage 2 peer reviews:
```python
result = await consult_council(
    query="Should we migrate to microservices?",
    verdict_type="binary",
    include_dissent=True
)

# Returns:
{
    "verdict": "approved",
    "confidence": 0.70,
    "rationale": "Majority supports migration...",
    "dissent": "Minority perspective: One reviewer noted scalability concerns with current team size."
}
```
**Dissent Algorithm:**
1. Collect all scores from Stage 2 peer reviews
2. Calculate median and standard deviation per response
3. Identify outliers (score < median - 1.5 × std)
4. Extract evaluation text from outliers
5. Format as minority perspective
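The steps above can be sketched for a single response as follows. This is an illustration under stated assumptions: `find_dissent` is a hypothetical helper, reviews are given as `(score, evaluation_text)` pairs, and the population standard deviation is used because the source does not say which variant applies.

```python
import statistics

def find_dissent(reviews, threshold=1.5):
    """Return evaluation texts whose score is below median - threshold * std."""
    scores = [score for score, _ in reviews]
    if len(scores) < 2:
        return []
    median = statistics.median(scores)
    std = statistics.pstdev(scores)  # assumption: population standard deviation
    cutoff = median - threshold * std
    return [text for score, text in reviews if score < cutoff]

reviews = [
    (8, "Solid plan."),
    (8, "Covers the main risks."),
    (9, "Well argued."),
    (8, "Reasonable."),
    (2, "Ignores scaling limits for a small team."),
]
outliers = find_dissent(reviews)
dissent = "Minority perspective: " + " ".join(outliers) if outliers else None
```

Here the median is 8 and the cutoff lands a little above 4, so only the review scored 2 is surfaced as dissent.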
**Example: CI/CD Gate Integration:**
```python
import asyncio
from llm_council.council import run_full_council
from llm_council.verdict import VerdictType


async def review_pull_request(pr_diff: str, pr_description: str):
    """Use LLM Council as a PR review gate."""
    query = f"""
    Review this pull request and determine if it should be approved.

    Description: {pr_description}

    Changes:
    {pr_diff}
    """

    stage1, stage2, stage3, metadata = await run_full_council(
        query,
        verdict_type=VerdictType.BINARY,
        include_dissent=True
    )

    verdict = metadata.get("verdict", {})
    if verdict.get("verdict") == "approved" and verdict.get("confidence", 0) >= 0.7:
        return True, verdict.get("rationale")
    else:
        return False, f"{verdict.get('rationale')}\n\nDissent: {verdict.get('dissent', 'None')}"
```
**Environment Variables:**
| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_COUNCIL_VERDICT_TYPE` | Default verdict type | synthesis |
| `LLM_COUNCIL_DEADLOCK_THRESHOLD` | Borda score difference for deadlock | 0.1 |
| `LLM_COUNCIL_DISSENT_THRESHOLD` | Std deviations below median for outlier | 1.5 |
| `LLM_COUNCIL_MIN_BORDA_SPREAD` | Minimum spread to surface dissent | 0.0 |
### Structured Rubric Scoring (ADR-016)
By default, reviewers provide a single 1-10 holistic score. With rubric scoring enabled, reviewers score each response on five dimensions:
| Dimension | Weight | Description |
|-----------|--------|-------------|
| **Accuracy** | 35% | Factual correctness, no hallucinations |
| **Relevance** | 10% | Addresses the actual question asked |
| **Completeness** | 20% | Covers all aspects of the question |
| **Conciseness** | 15% | Efficient communication, no padding |
| **Clarity** | 20% | Well-organized, easy to understand |
**Accuracy Ceiling**: When enabled (default), accuracy acts as a ceiling on the overall score:
- Accuracy < 5: Score caps at 4.0 (40%) — "significant errors or worse"
- Accuracy 5-6: Score caps at 7.0 (70%) — "mixed accuracy"
- Accuracy ≥ 7: No ceiling — "mostly accurate or better"
This prevents well-written hallucinations from ranking well.
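A worked example of the weighting and ceiling, using the default weights from the table. `rubric_score` is a hypothetical helper written for this illustration; the library's own scoring code may differ in details such as rounding.

```python
DEFAULT_WEIGHTS = {
    "accuracy": 0.35,
    "relevance": 0.10,
    "completeness": 0.20,
    "conciseness": 0.15,
    "clarity": 0.20,
}

def rubric_score(scores, weights=DEFAULT_WEIGHTS, accuracy_ceiling=True):
    """Weighted average of the five dimensions, capped by the accuracy ceiling."""
    weighted = sum(scores[dim] * weight for dim, weight in weights.items())
    if accuracy_ceiling:
        if scores["accuracy"] < 5:
            weighted = min(weighted, 4.0)  # significant errors or worse
        elif scores["accuracy"] < 7:
            weighted = min(weighted, 7.0)  # mixed accuracy
    return round(weighted, 2)

# A polished but inaccurate answer: strong everywhere except accuracy
polished_hallucination = {
    "accuracy": 3, "relevance": 9, "completeness": 9, "conciseness": 9, "clarity": 9,
}
capped = rubric_score(polished_hallucination)
uncapped = rubric_score(polished_hallucination, accuracy_ceiling=False)
```

Without the ceiling this response would score 6.9; with it, the score is held at 4.0.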
**Scoring Anchors**: Each score level has defined behavioral meaning (see [ADR-016](docs/adr/ADR-016-structured-rubric-scoring.md)):
- 9-10: Excellent (completely accurate, comprehensive, crystal clear)
- 7-8: Good (mostly accurate, covers main points)
- 5-6: Mixed (some errors or gaps)
- 3-4: Poor (significant issues)
- 1-2: Failing (fundamentally flawed)
```bash
# Enable rubric scoring
export LLM_COUNCIL_RUBRIC_SCORING=true
# Customize weights (must sum to 1.0)
export LLM_COUNCIL_WEIGHT_ACCURACY=0.40
export LLM_COUNCIL_WEIGHT_RELEVANCE=0.10
export LLM_COUNCIL_WEIGHT_COMPLETENESS=0.20
export LLM_COUNCIL_WEIGHT_CONCISENESS=0.10
export LLM_COUNCIL_WEIGHT_CLARITY=0.20
```
### Safety Gate (ADR-016)
When enabled, a pass/fail safety check runs before rubric scoring to filter harmful content:
| Pattern | Description |
|---------|-------------|
| **dangerous_instructions** | Weapons, explosives, harmful devices |
| **weapon_making** | Firearm/weapon construction |
| **malware_hacking** | Unauthorized access, malware |
| **self_harm** | Self-harm encouragement |
| **pii_exposure** | Personal information leakage |
Responses that fail safety checks are capped at score 0 regardless of other dimension scores. Educational/defensive content is context-aware and allowed.
```bash
# Enable safety gate
export LLM_COUNCIL_SAFETY_GATE=true
# Customize score cap (default: 0)
export LLM_COUNCIL_SAFETY_SCORE_CAP=0.0
```
### Bias Auditing (ADR-015)
When enabled, the council reports per-session bias indicators from peer review scoring:
| Bias Type | Description | Detection Threshold |
|-----------|-------------|---------------------|
| **Length-Score Correlation** | Do longer responses score higher? | \|r\| > 0.3 |
| **Reviewer Calibration** | Are some reviewers harsh/generous relative to peers? | Mean ± 1 std from median |
| **Position Bias** | Does presentation order affect scores? | Variance > 0.5 |
**Output**: The metadata includes a `bias_audit` object with:
- `length_score_correlation`: Pearson correlation coefficient
- `length_bias_detected`: Boolean flag
- `position_score_variance`: Variance of mean scores by presentation position
- `position_bias_detected`: Boolean flag (derived from anonymization labels A, B, C...)
- `harsh_reviewers` / `generous_reviewers`: Lists of biased reviewers
- `overall_bias_risk`: "low", "medium", or "high"
**Important Limitations**: With only 4-5 models per session, these metrics have limited statistical power:
- Length correlation with n=5 data points can only detect *extreme* biases
- Position bias from a single ordering cannot distinguish position effects from quality differences
- Reviewer calibration is relative to the current session only
These are **per-session indicators** (red flags for extreme anomalies), not statistically robust proof of systematic bias. Interpret with appropriate caution.
```bash
# Enable bias auditing
export LLM_COUNCIL_BIAS_AUDIT=true
# Customize thresholds (optional)
export LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD=0.3
export LLM_COUNCIL_POSITION_VARIANCE_THRESHOLD=0.5
```
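To make the length-score indicator concrete, the sketch below computes a Pearson correlation between response length and mean peer score and compares it with the 0.3 threshold. The function names and sample numbers are invented for illustration; with so few data points the result is only a rough signal, as noted above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def length_bias(lengths, scores, threshold=0.3):
    """Return (r, flagged) where flagged means |r| exceeds the threshold."""
    r = pearson(lengths, scores)
    return r, abs(r) > threshold

# Hypothetical session: word counts and mean peer scores for four responses
r, flagged = length_bias([120, 340, 560, 800], [6.0, 7.0, 7.5, 9.0])
```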
### Cross-Session Bias Aggregation (ADR-018)
For statistically meaningful bias detection across multiple sessions, enable bias persistence:
```bash
# Enable bias persistence (stores metrics locally)
export LLM_COUNCIL_BIAS_PERSISTENCE=true
# Consent level: 0=off, 1=local only (default), 2=anonymous, 3=enhanced, 4=research
export LLM_COUNCIL_BIAS_CONSENT=1
# Store path (default: ~/.llm-council/bias_metrics.jsonl)
export LLM_COUNCIL_BIAS_STORE=~/.llm-council/bias_metrics.jsonl
```
**Generate cross-session bias reports:**
```bash
# Text report (default)
llm-council bias-report
# JSON output
llm-council bias-report --format json
# Include detailed reviewer profiles
llm-council bias-report --verbose
# Limit to last 50 sessions
llm-council bias-report --sessions 50
```
**Statistical confidence tiers:**
| Sessions | Confidence | UI Treatment |
|----------|------------|--------------|
| N < 10 | Insufficient | "Collecting data..." |
| 10-19 | Preliminary | Warning shown |
| 20-49 | Moderate | CIs displayed |
| N >= 50 | High | Full analysis |
### Output Quality Metrics (ADR-036)
Quantify the reliability and quality of council outputs with three core metrics:
| Metric | Range | Description |
|--------|-------|-------------|
| **Consensus Strength Score (CSS)** | 0.0-1.0 | Agreement among council members in Stage 2 rankings |
| **Deliberation Depth Index (DDI)** | 0.0-1.0 | Thoroughness of the deliberation process |
| **Synthesis Attribution Score (SAS)** | 0.0-1.0 | How well synthesis traces back to source responses |
**CSS Interpretation:**
| Score | Interpretation | Action |
|-------|---------------|--------|
| 0.85+ | Strong consensus | High confidence in synthesis |
| 0.70-0.84 | Moderate consensus | Note minority views |
| 0.50-0.69 | Weak consensus | Consider `include_dissent=true` |
| <0.50 | Significant disagreement | Use debate mode |
**SAS Components:**
- `winner_alignment`: Similarity to top-ranked responses
- `max_source_alignment`: Best match to any response
- `hallucination_risk`: 1 - max_source_alignment
- `grounded`: True if synthesis traces to sources (threshold: 0.6)
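The relationship between these components can be sketched as below. The similarity measure is not specified in this README, so the example takes precomputed alignment values as input; `synthesis_attribution` and the use of a single top-ranked response are assumptions.

```python
def synthesis_attribution(alignments, winner, threshold=0.6):
    """Derive the SAS components from per-response alignment values (0.0-1.0)."""
    max_source_alignment = max(alignments.values())
    return {
        "winner_alignment": alignments[winner],
        "max_source_alignment": max_source_alignment,
        "hallucination_risk": round(1 - max_source_alignment, 4),
        "grounded": max_source_alignment >= threshold,
    }

# Hypothetical alignment of the synthesis with responses A, B, and C
sas = synthesis_attribution({"A": 0.89, "B": 0.41, "C": 0.72}, winner="A")
```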
**Usage:**
Quality metrics are automatically included in metadata:
```python
stage1, stage2, stage3, metadata = await run_full_council(query)
quality = metadata.get("quality_metrics", {})
core = quality.get("core", {})
print(f"Consensus: {core['consensus_strength']:.2f}")
print(f"Depth: {core['deliberation_depth']:.2f}")
print(f"Grounded: {core['synthesis_attribution']['grounded']}")
```
**MCP Tool Output:**
The `consult_council` MCP tool displays quality metrics with visual bars:
```
### Quality Metrics
- **Consensus Strength**: 0.82 [████████░░]
- **Deliberation Depth**: 0.74 [███████░░░]
- **Synthesis Grounded**: ✓ (alignment: 0.89)
```
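A bar like the ones above can be produced with a few lines of Python. This is a guess at the rendering, assuming a ten-character bar with the filled portion rounded to the nearest cell; `render_bar` is a hypothetical name.

```python
def render_bar(value, width=10):
    """Render a 0.0-1.0 value as a fixed-width bar of filled and empty cells."""
    filled = max(0, min(width, round(value * width)))
    return "[" + "█" * filled + "░" * (width - filled) + "]"

consensus_bar = render_bar(0.82)
depth_bar = render_bar(0.74)
```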
**Environment Variables:**
```bash
# Enable/disable quality metrics (default: true)
export LLM_COUNCIL_QUALITY_METRICS=true
# Quality tier: core (OSS), standard, enterprise
export LLM_COUNCIL_QUALITY_TIER=core
```
### Gateway Layer (ADR-023)
The gateway layer provides an abstraction over LLM API requests with multiple gateway options:
**Available Gateways:**
| Gateway | Description | Key Features |
|---------|-------------|--------------|
| `OpenRouterGateway` | Routes through OpenRouter | 100+ models, single API key |
| `RequestyGateway` | Routes through Requesty | BYOK support, analytics |
| `DirectGateway` | Direct provider APIs | Anthropic, OpenAI, Google |
| `OllamaGateway` | Local LLMs via Ollama | Air-gapped, cost-free, privacy-first |
**Core Features:**
- **Circuit Breaker**: Prevents cascading failures (CLOSED → OPEN → HALF_OPEN → CLOSED)
- **Fallback Chains**: Automatic retry with secondary gateways on failure
- **Per-Gateway Metrics**: Track failure counts, latency, and health status
- **BYOK Support**: Bring Your Own Key for Requesty and Direct gateways
**Basic Usage:**
```python
from llm_council.gateway import (
    GatewayRouter, GatewayRequest, CanonicalMessage, ContentBlock,
    OpenRouterGateway, RequestyGateway, DirectGateway
)

# Single gateway (OpenRouter)
router = GatewayRouter()

# Multi-gateway with fallback
router = GatewayRouter(
    gateways={
        "openrouter": OpenRouterGateway(),
        "requesty": RequestyGateway(byok_enabled=True, byok_keys={"anthropic": "sk-ant-..."}),
        "direct": DirectGateway(provider_keys={"openai": "sk-..."}),
    },
    default_gateway="openrouter",
    fallback_chains={"openrouter": ["requesty", "direct"]}
)

request = GatewayRequest(
    model="openai/gpt-4o",
    messages=[CanonicalMessage(role="user", content=[ContentBlock(type="text", text="Hello")])]
)
response = await router.complete(request)
```
**Enable the gateway layer:**
```bash
export LLM_COUNCIL_USE_GATEWAY=true
# Optional: Configure specific gateways
export REQUESTY_API_KEY=your-requesty-key # For Requesty
export ANTHROPIC_API_KEY=sk-ant-... # For Direct (Anthropic)
export OPENAI_API_KEY=sk-... # For Direct (OpenAI)
export GOOGLE_API_KEY=... # For Direct (Google)
```
**Circuit Breaker Behavior:**
- Default: 5 failures to trip the circuit
- Recovery timeout: 60 seconds
- Half-open state allows test requests to check recovery
- Open circuits are skipped in fallback chain
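The state transitions described above can be sketched as a small class. This is a generic circuit breaker written for illustration with the documented defaults (5 failures, 60-second recovery); the class and method names are assumptions and do not mirror the library's internals.

```python
import time

class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN -> CLOSED state machine."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the example can control time
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # recovery window elapsed: allow test requests
                return True
            return False  # still open: callers should skip this gateway
        return True

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

# Drive the breaker with a fake clock
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(5):
    breaker.record_failure()
state_after_failures = breaker.state
blocked = breaker.allow_request()
now[0] = 60.0
probe_allowed = breaker.allow_request()
breaker.record_success()
```

A fallback chain would call `allow_request()` on each gateway's breaker in order and use the first one that returns `True`.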
The gateway layer is currently **opt-in** (default: disabled) for backward compatibility.
### Triage Layer (ADR-020)
The triage layer provides query classification, model selection optimizations, and confidence-gated routing:
- **Confidence-Gated Fast Path**: Routes simple queries to a single model, escalating to full council when confidence is low
- **Shadow Council Sampling**: Random 5% sampling validates fast path quality against full council
- **Rollback Monitoring**: Automatic rollback when disagreement/escalation rates breach thresholds
- **Wildcard Selection**: Adds domain-specialized models to the council based on query classification
- **Prompt Optimization**: Per-model prompt adaptation (Claude gets XML, OpenAI gets Markdown)
- **Complexity Classification**: Heuristic-based with optional Not Diamond API integration
**Domain Categories:**
| Domain | Description | Specialist Models |
|--------|-------------|-------------------|
| CODE | Programming, debugging, algorithms | DeepSeek, Codestral |
| REASONING | Math, logic, proofs | o1-preview, DeepSeek-R1 |
| CREATIVE | Stories, poems, fiction | Claude Opus, Command-R+ |
| MULTILINGUAL | Translation, language | GPT-4o, Command-R+ |
| GENERAL | General knowledge | Llama 3 (fallback) |
**Enable wildcard selection:**
```bash
export LLM_COUNCIL_WILDCARD_ENABLED=true
```
This automatically adds a domain specialist to the council based on query classification. For example, a Python coding question will add a DeepSeek model alongside the default council.
**Enable prompt optimization:**
```bash
export LLM_COUNCIL_PROMPT_OPTIMIZATION_ENABLED=true
```
This applies per-model prompt formatting. Claude receives XML-structured prompts, while other providers receive their preferred format.
**Enable confidence-gated fast path:**
```bash
export LLM_COUNCIL_FAST_PATH_ENABLED=true
export LLM_COUNCIL_FAST_PATH_CONFIDENCE_THRESHOLD=0.92 # default
```
When enabled, simple queries are routed to a single model. If the model's confidence is below the threshold, the query automatically escalates to the full council. This can reduce costs by 45-55% on simple queries while maintaining quality.
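The escalation rule can be pictured with a small routing function. This sketch assumes the fast model returns an answer together with a confidence value; how that confidence is actually produced is not covered here, and `route_query` and the stub callables are hypothetical.

```python
def route_query(query, fast_model, full_council, threshold=0.92):
    """Try a single model first; escalate when its confidence is below the threshold."""
    answer, confidence = fast_model(query)
    if confidence >= threshold:
        return {"path": "fast", "answer": answer}
    return {"path": "council", "answer": full_council(query)}

# Stub callables standing in for real model calls
def confident_model(query):
    return "Use [f(x) for x in xs].", 0.97

def unsure_model(query):
    return "It depends.", 0.55

def council(query):
    return "Synthesized council answer"

fast_result = route_query("List comprehension syntax?", confident_model, council)
escalated = route_query("Monolith or microservices?", unsure_model, council)
```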
**Fast Path Quality Monitoring:**
The fast path includes built-in quality monitoring:
| Metric | Threshold | Action |
|--------|-----------|--------|
| Shadow disagreement rate | > 8% | Automatic rollback |
| User escalation rate | > 15% | Automatic rollback |
| Error rate | > 1.5x baseline | Automatic rollback |
Configure monitoring:
```bash
export LLM_COUNCIL_SHADOW_SAMPLE_RATE=0.05 # 5% shadow sampling
export LLM_COUNCIL_ROLLBACK_ENABLED=true
export LLM_COUNCIL_ROLLBACK_WINDOW=100 # rolling window size
```
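A rolling-window monitor for the first two thresholds in the table might look like the sketch below. The class is hypothetical and omits the error-rate check, which depends on a baseline that is not described here.

```python
from collections import deque

class RollbackMonitor:
    """Track recent fast-path outcomes and signal when a threshold is breached."""

    def __init__(self, window=100, max_disagreement=0.08, max_escalation=0.15):
        self.disagreements = deque(maxlen=window)  # shadow council comparisons
        self.escalations = deque(maxlen=window)    # user escalations
        self.max_disagreement = max_disagreement
        self.max_escalation = max_escalation

    def record_shadow(self, disagreed):
        self.disagreements.append(bool(disagreed))

    def record_query(self, escalated):
        self.escalations.append(bool(escalated))

    @staticmethod
    def _rate(events):
        return sum(events) / len(events) if events else 0.0

    def should_rollback(self):
        return (
            self._rate(self.disagreements) > self.max_disagreement
            or self._rate(self.escalations) > self.max_escalation
        )

monitor = RollbackMonitor()
for i in range(100):
    monitor.record_shadow(disagreed=(i < 5))  # 5% disagreement
healthy = monitor.should_rollback()
for _ in range(10):
    monitor.record_shadow(disagreed=True)     # window slides; rate rises to 10%
breached = monitor.should_rollback()
```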
**Optional Not Diamond Integration:**
For advanced model routing, integrate with [Not Diamond](https://notdiamond.ai):
```bash
export NOT_DIAMOND_API_KEY="your-key"
export LLM_COUNCIL_USE_NOT_DIAMOND=true
```
When Not Diamond is unavailable, the system gracefully falls back to heuristic-based classification.
The triage layer is currently **opt-in** (default: disabled) for backward compatibility.
### Local LLM Support (ADR-025)
Run council deliberations entirely on local hardware using Ollama:
```bash
# Install with Ollama support
pip install "llm-council-core[ollama]"
# Start Ollama (if not already running)
ollama serve
# Pull a model
ollama pull llama3.2
```
**Use local models in your council:**
```bash
# Mix local and cloud models
export LLM_COUNCIL_MODELS="ollama/llama3.2,openai/gpt-4o,anthropic/claude-3-5-sonnet"
# Or use only local models (air-gapped mode)
export LLM_COUNCIL_MODELS="ollama/llama3.2,ollama/mistral,ollama/codellama"
export LLM_COUNCIL_CHAIRMAN="ollama/llama3.2"
```
**Hardware Requirements:**
| Profile | Hardware | Models | Use Case |
|---------|----------|--------|----------|
| Minimum | 8+ core CPU, 16GB RAM | 7B quantized | Dev/testing |
| Recommended | M-series Pro, 32GB unified | 7B-13B | Small council |
| Professional | 2x RTX 4090, 64GB+ | 70B quantized | Production |
| Enterprise | Mac Studio 64GB+ | Multiple 70B | Air-gapped |
**Quality Degradation Notice**: Local models typically have reduced capabilities compared to cloud-hosted frontier models. The gateway includes quality notices in responses to inform users when local models are used.
**Configuration:**
```bash
# Ollama endpoint (default: http://localhost:11434)
export LLM_COUNCIL_OLLAMA_BASE_URL=http://localhost:11434
# Timeout for local models (default: 120s - first load can be slow)
export LLM_COUNCIL_OLLAMA_TIMEOUT=120.0
```
### Webhook Notifications (ADR-025)
Receive real-time notifications as the council deliberates:
```python
from llm_council.webhooks import WebhookConfig, WebhookDispatcher, WebhookPayload

# Configure webhook endpoint
config = WebhookConfig(
    url="https://your-server.com/webhook",
    events=["council.complete", "council.error"],
    secret="your-hmac-secret"  # Optional, for signature verification
)

# Dispatch events (used internally by council)
# `payload` is a WebhookPayload describing the event; its construction is omitted here
dispatcher = WebhookDispatcher()
result = await dispatcher.dispatch(config, payload)
```
**Event Types:**
| Event | Description | Payload |
|-------|-------------|---------|
| `council.deliberation_start` | Council begins | request_id, models |
| `council.stage1.complete` | All initial responses received | response_count |
| `model.vote_cast` | A model submitted rankings | voter, ranking |
| `council.stage2.complete` | All rankings complete | aggregate_rankings |
| `council.complete` | Final answer ready | stage3_response, duration_ms |
| `council.error` | Error occurred | error, partial_results |
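On the receiving side, a handler usually branches on the event type. The sketch below assumes, purely for illustration, that the request body is JSON with an `event` key and a `data` object carrying the payload fields from the table; the actual envelope used by the library may differ, so check a real delivery before relying on these key names.

```python
import json
from typing import Any, Callable


def on_complete(data: dict[str, Any]) -> str:
    return f"final answer in {data['duration_ms']} ms: {data['stage3_response']}"


def on_error(data: dict[str, Any]) -> str:
    return f"council failed: {data['error']}"


# Map event names from the table to handlers; unlisted events are ignored.
HANDLERS: dict[str, Callable[[dict[str, Any]], str]] = {
    "council.complete": on_complete,
    "council.error": on_error,
}


def handle_webhook(raw_body: bytes) -> str:
    envelope = json.loads(raw_body)
    event = envelope["event"]  # assumed key name
    handler = HANDLERS.get(event)
    if handler is None:
        return f"ignored {event}"
    return handler(envelope.get("data", {}))  # assumed key name


body = json.dumps(
    {"event": "council.complete", "data": {"stage3_response": "42", "duration_ms": 1200}}
).encode()
print(handle_webhook(body))  # final answer in 1200 ms: 42
```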
**HMAC Signature Verification:**
Webhooks are signed using HMAC-SHA256 for security:
```python
from llm_council.webhooks import verify_webhook_request
# Verify incoming webhook (in your server)
is_valid = verify_webhook_request(
    payload=request.body,
    headers=request.headers,
    secret="your-hmac-secret",
)
```
Headers included:
- `X-Council-Signature`: `sha256=` followed by the HMAC-SHA256 signature
- `X-Council-Timestamp`: Unix timestamp
- `X-Council-Version`: `1.0`
**Configuration:**
```bash
# Webhook timeout (default: 5s)
export LLM_COUNCIL_WEBHOOK_TIMEOUT=5.0
# Max retry attempts (default: 3)
export LLM_COUNCIL_WEBHOOK_RETRIES=3
# Require HTTPS (except localhost) - default: true
export LLM_COUNCIL_WEBHOOK_HTTPS_ONLY=true
```
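To make the three settings concrete, here is a rough sketch of how an HTTPS-only check and a bounded retry loop could behave. It is not the dispatcher's actual code: the exponential backoff, the reading of "3 retries" as three attempts after the first one, and the helper names `url_allowed` and `deliver_with_retries` are assumptions made for illustration.

```python
import time
from typing import Callable
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def url_allowed(url: str, https_only: bool = True) -> bool:
    """Accept HTTPS always; accept plain HTTP only for localhost or when not enforced."""
    parsed = urlparse(url)
    if parsed.scheme == "https":
        return True
    if parsed.scheme != "http":
        return False
    return (not https_only) or parsed.hostname in LOCAL_HOSTS


def deliver_with_retries(send: Callable[[], int], max_retries: int = 3,
                         base_delay_s: float = 0.5,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it returns a 2xx status or retries are exhausted."""
    for attempt in range(max_retries + 1):
        try:
            if 200 <= send() < 300:
                return True
        except OSError:
            pass  # network failure: treat like a failed attempt
        if attempt < max_retries:
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return False


statuses = iter([500, 503, 200])
delays: list[float] = []
print(deliver_with_retries(lambda: next(statuses), sleep=delays.append))  # True
print(delays)  # [0.5, 1.0]
```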
### SSE Streaming (ADR-025)
Stream council events in real time using Server-Sent Events.
**Built-in HTTP Server:**
The library includes a built-in HTTP server with SSE streaming:
```bash
# Install with HTTP server support
pip install "llm-council-core[http]"
# Start the server
llm-council serve
```
The SSE endpoint is available at `GET /v1/council/stream`:
```bash
# Stream council deliberation
curl -N "http://localhost:8000/v1/council/stream?prompt=What+is+AI"
```
**Custom Integration:**
For custom FastAPI/Starlette applications:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llm_council.webhooks import (
    council_event_generator,
    SSE_CONTENT_TYPE,
    get_sse_headers,
)
app = FastAPI()
# In your FastAPI/Starlette endpoint
@app.get("/v1/council/stream")
async def stream_council(prompt: str):
    return StreamingResponse(
        council_event_generator(prompt, models=None, api_key=None),
        media_type=SSE_CONTENT_TYPE,
        headers=get_sse_headers(),
    )
```
**Client-side (JavaScript):**
```javascript
const source = new EventSource('/v1/council/stream?prompt=...');
source.addEventListener('council.deliberation_start', (e) => {
  console.log('Started:', JSON.parse(e.data));
});
source.addEventListener('council.complete', (e) => {
  const result = JSON.parse(e.data);
  console.log('Final answer:', result.stage3_response);
  source.close();
});
```
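**Client-side (Python):**

Python has no built-in `EventSource`, but the SSE wire format is simple enough to parse by hand. The following is a minimal sketch of a parser that handles only the `event:` and `data:` fields and comment lines; `parse_sse` is a hypothetical helper, and the sample lines are made up to mirror the event names above rather than captured from a real server.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


sample = [
    "event: council.deliberation_start\n",
    'data: {"request_id": "abc", "models": ["ollama/llama3.2"]}\n',
    "\n",
    ": keep-alive\n",
    "event: council.complete\n",
    'data: {"stage3_response": "42", "duration_ms": 1200}\n',
    "\n",
]
for name, payload in parse_sse(sample):
    print(name, json.loads(payload))
```

To consume the live endpoint, the same function can be fed decoded lines from any streaming HTTP response, for example `(line.decode() for line in urllib.request.urlopen(url))`.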
### Offline Mode (ADR-026)
Run LLM Council without any external metadata calls using the bundled model registry:
```bash
# Enable offline mode
export LLM_COUNCIL_OFFLINE=true
```
When offline mode is enabled:
- Uses `StaticRegistryProvider` exclusively with 31 bundled models
- No external API calls for metadata (context windows, pricing, capabilities)
- All core council operations continue to work
- Unknown models use safe defaults (4096 context window)
This implements the "Sovereign Orchestrator" philosophy: the system must function as a complete, independent utility without external dependencies.
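The fallback behavior for unknown models can be pictured with a small static lookup. This is a sketch of the idea only: `TinyStaticRegistry` and `ModelInfo` are hypothetical stand-ins, not the real `StaticRegistryProvider`, and the single bundled entry reuses the `openai/gpt-4o` context window quoted elsewhere in this README.

```python
from dataclasses import dataclass

DEFAULT_CONTEXT_WINDOW = 4096  # safe default for unknown models


@dataclass(frozen=True)
class ModelInfo:
    model_id: str
    context_window: int
    known: bool


class TinyStaticRegistry:
    """In-memory registry: no network calls, conservative defaults on a miss."""

    def __init__(self, bundled: dict[str, int]) -> None:
        self._bundled = dict(bundled)

    def get_model_info(self, model_id: str) -> ModelInfo:
        if model_id in self._bundled:
            return ModelInfo(model_id, self._bundled[model_id], known=True)
        return ModelInfo(model_id, DEFAULT_CONTEXT_WINDOW, known=False)


registry = TinyStaticRegistry({"openai/gpt-4o": 128000})
print(registry.get_model_info("openai/gpt-4o").context_window)       # 128000
print(registry.get_model_info("ollama/my-finetune").context_window)  # 4096
```

Defaulting to a small context window is the conservative choice: a prompt sized for 4096 tokens will fit in a larger model, while the reverse would fail at request time.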
**Bundled Models (31 total):**
| Provider | Count | Examples |
|----------|-------|----------|
| OpenAI | 9 | gpt-4o, gpt-4o-mini, gpt-5.2, gpt-5-mini, o1, o1-preview, o1-mini, o3-mini |
| Anthropic | 6 | claude-opus-4.7, claude-opus-4.6, claude-sonnet-4.6, claude-haiku-4.5 |
| Google | 5 | gemini-3-pro, gemini-2.5-pro, gemini-2.0-flash, gemini-1.5-pro |
| xAI | 2 | grok-4, grok-4.1-fast |
| DeepSeek | 2 | deepseek-r1, deepseek-chat |
| Meta | 2 | llama-3.3-70b, llama-3.1-405b |
| Mistral | 2 | mistral-large-2411, mistral-medium |
| Ollama | 6 | llama3.2, mistral, qwen2.5:14b, codellama, phi3 |
**Model Metadata API:**
```python
from llm_council.metadata import get_provider
# Get the metadata provider
provider = get_provider()
# Query model information
info = provider.get_model_info("openai/gpt-4o")
print(f"Context window: {info.context_window}") # 128000
print(f"Quality tier: {info.quality_tier}") # frontier
# Check capabilities
window = provider.get_context_window("anthropic/claude-opus-4.7") # 1000000
supports = provider.supports_reasoning("openai/o1") # True
# List all available models
models = provider.list_available_models() # 31 models
```
### Agent Skills (ADR-034)
LLM Council includes agent skills for AI-assisted code verification, review, and CI/CD quality gates. Skills use progressive disclosure to minimize token usage while providing detailed scoring rubrics when needed.
**Available Skills:**
| Skill | Category | Use Case |
|-------|----------|----------|
| `council-verify` | verification | General work verification with multi-dimensional scoring |
| `council-review` | code-review | PR reviews with security, performance, and testing focus |
| `council-gate` | ci-cd | Quality gates for pipelines with exit codes (0=PASS, 1=FAIL, 2=UNCLEAR) |
**Exit Codes for CI/CD:**
```bash
# In GitHub Actions or any CI pipeline
llm-council gate --snapshot $GITHUB_SHA --rubric-focus Security
# Exit codes:
# 0 = PASS (confidence >= threshold, no blocking issues)
# 1 = FAIL (blocking issues present)
# 2 = UNCLEAR (needs human review)
```
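When the gate is driven from a Python build script rather than a shell step, the three exit codes map naturally onto an enum. The sketch below uses only the command and flags shown above; `GateVerdict`, `interpret_exit_code`, and `run_gate` are illustrative names, and treating unexpected exit codes as `UNCLEAR` is an assumption you may prefer to change to `FAIL`.

```python
import subprocess
from enum import IntEnum


class GateVerdict(IntEnum):
    PASS = 0
    FAIL = 1
    UNCLEAR = 2


def interpret_exit_code(code: int) -> GateVerdict:
    """Map a process exit code to a verdict."""
    try:
        return GateVerdict(code)
    except ValueError:
        # Assumption: anything unexpected (crash, missing binary) needs a human.
        return GateVerdict.UNCLEAR


def run_gate(snapshot: str, rubric_focus: str = "Security") -> GateVerdict:
    """Run the gate CLI for one commit and return its verdict."""
    proc = subprocess.run(
        ["llm-council", "gate", "--snapshot", snapshot, "--rubric-focus", rubric_focus]
    )
    return interpret_exit_code(proc.returncode)


print(interpret_exit_code(0).name)  # PASS
print(interpret_exit_code(2).name)  # UNCLEAR
```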
**Skills are located in `.github/skills/`** and work with Claude Code, VS Code Copilot, Cursor, and other MCP-compatible clients.
For detailed documentation, see the [Skills Guide](https://llm-council.dev/guides/skills/).
### All Environment Variables
#### Gateway Configuration (ADR-023)
| Variable | Description | Default |
|----------|-------------|---------|
| `OPENROUTER_API_KEY` | OpenRouter API key | Required for OpenRouter gateway |
| `REQUESTY_API_KEY` | Requesty API key | Required for Requesty gateway |
| `ANTHROPIC_API_KEY` | Anthropic API key | Required for Direct gateway (Anthropic) |
| `OPENAI_API_KEY` | OpenAI API key | Required for Direct gateway (OpenAI) |
| `GOOGLE_API_KEY` | Google API key | Required for Direct gateway (Google) |
| `LLM_COUNCIL_DEFAULT_GATEWAY` | Default gateway (openrouter/requesty/direct) | openrouter |
| `LLM_COUNCIL_USE_GATEWAY` | Enable gateway layer with circuit breaker | false |
#### Ollama Configuration (ADR-025)
| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_COUNCIL_OLLAMA_BASE_URL` | Ollama API endpoint | http://localhost:11434 |
| `LLM_COUNCIL_OLLAMA_TIMEOUT` | Timeout for Ollama requests (seconds) | 120.0 |
| `LLM_COUNCIL_USE_LITELLM` | Enable LiteLLM wrapper for Ollama | true |
#### Webhook Configuration (ADR-025)
| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_COUNCIL_WEBHOOK_TIMEOUT` | Webhook POST timeout (seconds) | 5.0 |
| `LLM_COUNCIL_WEBHOOK_RETRIES` | Max retry attempts | 3 |
| `LLM_COUNCIL_WEBHOOK_HTTPS_ONLY` | Require HTTPS (except localhost) | true |
#### Council Configuration
| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_COUNCIL_MODELS` | Comma-separated model list | GPT-5.1, Gemini 3 Pro, Claude 4.5, Grok 4 |
| `LLM_COUNCIL_CHAIRMAN` | Chairman model | google/gemini-3-pro-preview |
| `LLM_COUNCIL_MODE` | `consensus` or `debate` | consensus |
| `LLM_COUNCIL_EXCLUDE_SELF_VOTES` | Exclude self-votes | true |
| `LLM_COUNCIL_STYLE_NORMALIZATION` | Enable style normalization | false |
| `LLM_COUNCIL_NORMALIZER_MODEL` | Model for normalization | google/gemini-2.0-flash-001 |
| `LLM_COUNCIL_MAX_REVIEWERS` | Max reviewers per response | null (all) |
| `LLM_COUNCIL_RUBRIC_SCORING` | Enable multi-dimensional rubric scoring | false |
| `LLM_COUNCIL_ACCURACY_CEILING` | Use accuracy as score ceiling | true |
| `LLM_COUNCIL_WEIGHT_*` | Rubric dimension weights (ACCURACY, RELEVANCE, COMPLETENESS, CONCISENESS, CLARITY) | See above |
| `LLM_COUNCIL_SAFETY_GATE` | Enable safety pre-check gate | false |
| `LLM_COUNCIL_SAFETY_SCORE_CAP` | Score cap for failed safety checks | 0.0 |
| `LLM_COUNCIL_BIAS_AUDIT` | Enable bias auditing (ADR-015) | false |
| `LLM_COUNCIL_LENGTH_CORRELATION_THRESHOLD` | Length-score correlation threshold for bias detection | 0.3 |
| `LLM_COUNCIL_POSITION_VARIANCE_THRESHOLD` | Position variance threshold for bias detection | 0.5 |
| `LLM_COUNCIL_BIAS_PERSISTENCE` | Enable cross-session bias storage (ADR-018) | false |
| `LLM_COUNCIL_BIAS_STORE` | Path to bias metrics JSONL file | ~/.llm-council/bias_metrics.jsonl |
| `LLM_COUNCIL_BIAS_CONSENT` | Consent level: 0=off, 1=local, 2=anonymous, 3=enhanced, 4=research | 1 |
| `LLM_COUNCIL_BIAS_WINDOW_SESSIONS` | Rolling window: max sessions for aggregation | 100 |
| `LLM_COUNCIL_BIAS_WINDOW_DAYS` | Rolling window: max days for aggregation | 30 |
| `LLM_COUNCIL_MIN_BIAS_SESSIONS` | Minimum sessions for aggregation analysis | 20 |
| `LLM_COUNCIL_HASH_SECRET` | Secret for query hashing (RESEARCH consent only) | dev-secret |
| `LLM_COUNCIL_SUPPRESS_WARNINGS` | Suppress security warnings | false |
| `LLM_COUNCIL_MODELS_QUICK` | Models for quick tier (ADR-022) | gpt-4o-mini, haiku, gemini-flash |
| `LLM_COUNCIL_MODELS_BALANCED` | Models for balanced tier (ADR-022) | gpt-4o, sonnet, gemini-pro |
| `LLM_COUNCIL_MODELS_HIGH` | Models for high tier (ADR-022) | gpt-4o, opus, gemini-3-pro, grok-4 |
| `LLM_COUNCIL_MODELS_REASONING` | Models for reasoning tier (ADR-022) | gpt-5.2, opus, o1-preview, deepseek-r1 |
| `LLM_COUNCIL_WILDCARD_ENABLED` | Enable wildcard specialist selection (ADR-020) | false |
| `LLM_COUNCIL_PROMPT_OPTIMIZATION_ENABLED` | Enable per-model prompt optimization (ADR-020) | false |
| `LLM_COUNCIL_FAST_PATH_ENABLED` | Enable confidence-gated fast path (ADR-020) | false |
| `LLM_COUNCIL_FAST_PATH_CONFIDENCE_THRESHOLD` | Confidence threshold for fast path (0.0-1.0) | 0.92 |
| `LLM_COUNCIL_FAST_PATH_MODEL` | Model for fast path routing | auto |
| `LLM_COUNCIL_SHADOW_SAMPLE_RATE` | Shadow sampling rate (0.0-1.0) | 0.05 |
| `LLM_COUNCIL_SHADOW_DISAGREEMENT_THRESHOLD` | Disagreement threshold for shadow samples | 0.08 |
| `LLM_COUNCIL_ROLLBACK_ENABLED` | Enable rollback metric tracking | true |
| `LLM_COUNCIL_ROLLBACK_WINDOW` | Rolling window size for metrics | 100 |
| `LLM_COUNCIL_ROLLBACK_DISAGREEMENT_THRESHOLD` | Shadow disagreement rollback threshold | 0.08 |
| `LLM_COUNCIL_ROLLBACK_ESCALATION_THRESHOLD` | User escalation rollback threshold | 0.15 |
| `NOT_DIAMOND_API_KEY` | Not Diamond API key (optional) | - |
| `LLM_COUNCIL_USE_NOT_DIAMOND` | Enable Not Diamond API integration | false |
| `LLM_COUNCIL_NOT_DIAMOND_TIMEOUT` | Not Diamond API timeout in seconds | 5.0 |
| `LLM_COUNCIL_NOT_DIAMOND_CACHE_TTL` | Not Diamond response cache TTL in seconds | 300 |
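Most of the variables above are booleans or fractions between 0.0 and 1.0. As a rough illustration of how such values could be read and validated, the sketch below parses two of them with the defaults from the table. The accepted boolean spellings, the helper names, and the `confidence >= threshold` rule are assumptions made for this example; the library's own parsing and fast-path logic may differ.

```python
import os
from typing import Mapping

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_bool(name: str, default: bool, env: Mapping[str, str] = os.environ) -> bool:
    """Read a boolean flag, rejecting unrecognized spellings."""
    raw = env.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"{name} must be a boolean, got {raw!r}")


def env_unit_float(name: str, default: float, env: Mapping[str, str] = os.environ) -> float:
    """Read a float that must lie in the range 0.0 to 1.0."""
    value = float(env.get(name, default))
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return value


def fast_path_allowed(confidence: float, env: Mapping[str, str] = os.environ) -> bool:
    """Assumed rule: take the fast path only when enabled and confident enough."""
    enabled = env_bool("LLM_COUNCIL_FAST_PATH_ENABLED", False, env)
    threshold = env_unit_float("LLM_COUNCIL_FAST_PATH_CONFIDENCE_THRESHOLD", 0.92, env)
    return enabled and confidence >= threshold


settings = {"LLM_COUNCIL_FAST_PATH_ENABLED": "true"}
print(fast_path_allowed(0.95, settings))  # True
print(fast_path_allowed(0.90, settings))  # False
```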
## Credits & Attribution Continued
This project is a derivative work based on the original [llm-council](https://github.com/karpathy/llm-council) by Andrej Karpathy.
**Original Work:**
- Concept and 3-stage council orchestration: Andrej Karpathy
- Core council logic (Stage 1-3 process)
- OpenRouter integration
**Derivative Work by Amiable:**
- MCP (Model Context Protocol) server implementation
- Removal of web frontend (focus on MCP functionality)
- Python package structure for PyPI distribution
- User-configurable model selection
- Enhanced features (style normalization, self-vote exclusion, synthesis modes)
- Test suite and modern packaging standards
## License
MIT License - see [LICENSE](LICENSE) file for details.