# Shepherd

**Advanced Multi-Backend LLM System with Intelligent Memory Management**

Shepherd is a production-grade C++ LLM inference system supporting both local models (llama.cpp, TensorRT-LLM) and cloud APIs (OpenAI, Anthropic, Gemini, Ollama). It features sophisticated KV cache eviction policies, retrieval-augmented generation (RAG), hierarchical memory management, and comprehensive tool/function calling capabilities.

---

## Key Features

### 🔄 Multi-Backend Architecture
- **7 Backend Implementations**: Seamlessly switch between local inference and cloud APIs
  - **Local**: llama.cpp (CPU/GPU), TensorRT-LLM (NVIDIA optimized)
  - **Cloud**: OpenAI, Anthropic Claude, Google Gemini, Ollama
- **Unified Interface**: Single API across all backends with automatic fallback
- **Dynamic Model Selection**: Runtime backend selection based on model availability

### 🧠 Intelligent KV Cache Eviction
- **Callback-Driven Eviction** (llama.cpp): Interrupt-based eviction triggered by memory pressure
- **Event-Driven Eviction** (TensorRT): Asynchronous monitoring with `KVCacheEventManager`
- **Smart Message Preservation**: Protects system prompts and current context
- **Automatic Archival**: Evicted messages stored in RAG database for retrieval
- **Position Shift Management**: Maintains KV cache consistency during eviction

### 📚 RAG System (Retrieval-Augmented Generation)
- **Memory-Mapped SQLite Storage**: Persistent conversation memory with mmap I/O optimization
- **FTS5 Full-Text Search**: Fast semantic search with BM25 ranking algorithm
- **Fact Storage**: Persistent key-value store for long-term knowledge
- **Automatic Archival**: Seamless integration with eviction system
- **Configurable Retention**: Database size limits and auto-pruning (default 10GB)

### 🛠️ Comprehensive Tools System
- **10+ Built-in Tools** across 6 categories:
  - **Filesystem**: read_file, write_file, list_directory, delete_file, file_exists
  - **Command Execution**: execute_command with timeout and signal handling
  - **HTTP Client**: GET, POST, PUT, DELETE with custom headers
  - **JSON Processing**: parse, validate, pretty-print, JSONPath extraction
  - **Memory Management**: search_memory, set_fact, get_fact, store_memory
  - **MCP Integration**: list_resources, read_resource, call_mcp_tool
- **Schema-Driven**: Structured parameter definitions with JSON schema support
- **Extensible**: Easy registration of custom tools via ToolRegistry

### 🔌 Model Context Protocol (MCP)
- **Full MCP Client/Server**: JSON-RPC 2.0 over stdio (see the framing sketch after this list)
- **External Tool Delegation**: Seamlessly integrate external MCP servers
- **Resource Access**: Read resources from MCP servers (files, databases, APIs)
- **Prompt Templates**: Reusable prompt templates from MCP servers
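
A minimal sketch of the JSON-RPC 2.0 framing used on the wire. It assumes nlohmann::json and a pipe already connected to the MCP server's stdin; the tool name and arguments are illustrative, not part of Shepherd's actual client code:

```cpp
#include <cstdio>
#include <string>
#include <nlohmann/json.hpp>

// Send a tools/call request to an MCP server over its stdio transport.
void call_mcp_tool(std::FILE* server_stdin, int request_id) {
    nlohmann::json request = {
        {"jsonrpc", "2.0"},
        {"id", request_id},
        {"method", "tools/call"},
        {"params", {
            {"name", "list_files"},                 // illustrative tool name
            {"arguments", {{"path", "/tmp"}}}       // illustrative arguments
        }}
    };
    // The stdio transport exchanges newline-delimited JSON messages
    std::string line = request.dump() + "\n";
    std::fwrite(line.data(), 1, line.size(), server_stdin);
    std::fflush(server_stdin);
}
```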

### 🤝 Multi-Model Collaboration (API Tools)
- **Cross-Model Consultation**: Any backend can call other AI models as tools
- **Second Opinions**: Get Claude's analysis while using GPT, or vice versa
- **Model-Specific Strengths**: Leverage each model's expertise (Claude for code review, GPT for creative tasks, Gemini for reasoning)
- **Simple Configuration**: Add AI backends as tools via CLI (`shepherd api add`)
- **Automatic Discovery**: API tools registered alongside native tools at startup

### 🌐 Server Mode (HTTP REST API)
- **OpenAI-Compatible Endpoints**: Remote access to Shepherd via REST API
  - `POST /v1/chat/completions` - Chat completions with streaming support
  - `GET /v1/models` - List available models
  - `GET /health` - Health check endpoint
- **Single-Session Architecture**: Stateful conversation with one client at a time
- **KV Cache Persistence**: Full conversation history maintained in KV cache
- **Tool Integration**: Client-side tool execution with OpenAI protocol
- **Remote Access**: Access local Shepherd instance from any OpenAI-compatible client

### 💾 Hierarchical Memory Management
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│    GPU VRAM     │     │    System RAM    │     │     Storage     │
│    (Tier 1)     │  ⟶  │    (Tier 2)      │  ⟶  │    (Tier 3)     │
├─────────────────┤     ├──────────────────┤     ├─────────────────┤
│ • KV Cache      │     │ • Message Buffer │     │ • SQLite DB     │
│ • Active Tokens │     │ • Token Pools    │     │ • Conversation  │
│ • Model Weights │     │ • Spill Cache    │     │ • Vector Store  │
└─────────────────┘     └──────────────────┘     └─────────────────┘
     Hot (~ms)               Warm (~μs)              Cold (~10ms)
```

---

## Supported Backends

### Local Inference (Stateful)

#### **llama.cpp**
- **GPU Support**: CUDA with automatic layer offloading
- **Model Families**: Llama 3.x, Qwen 2.x, GLM-4, Mistral, Gemma, Phi, DeepSeek
- **Chat Templates**: Jinja2-based template rendering
- **Context Windows**: 8K - 256K+ tokens (model dependent)
- **KV Cache**: Callback-driven eviction with position tracking
- **Tokenization**: Accurate token counting via model vocabulary

#### **TensorRT-LLM**
- **Optimization**: NVIDIA TensorRT for maximum GPU performance
- **Multi-GPU**: MPI-based layer distribution across GPUs
- **Plugin Architecture**: Extensible with custom kernels
- **Context Windows**: 2K - 256K+ tokens
- **KV Cache**: Event-driven monitoring with asynchronous eviction
- **Production Ready**: Enterprise-grade inference with monitoring

### Cloud APIs (Stateless)

| Backend | Models | Context | Tools | Notes |
|---------|--------|---------|-------|-------|
| **OpenAI** | GPT-4, GPT-3.5-Turbo | 128K-200K | ✓ | Function calling |
| **Anthropic** | Claude 3/3.5 (Opus, Sonnet, Haiku) | 200K | ✓ | Separate system field |
| **Gemini** | Gemini Pro, 1.5 Pro/Flash, 2.0 | 32K-2M | ✓ | SentencePiece tokens |
| **Ollama** | Any Ollama model | 8K-128K | ✓ | Local/containerized |

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│                      Application Layer                       │
│  • main.cpp: Interactive loop, tool execution                │
│  • HTTP Server: REST API for external integrations           │
└───────────────────────┬──────────────────────────────────────┘
                        │
┌───────────────────────v──────────────────────────────────────┐
│                       Backend Manager                        │
│  • Unified interface across all backends                     │
│  • Automatic backend selection and fallback                  │
│  • Message formatting and tool conversion                    │
└───────────────────────┬──────────────────────────────────────┘
                        │
        ┌───────────────┼────────────────┐
        │               │                │
  ┌─────v──────┐  ┌─────v──────┐  ┌──────v───────┐
  │  LlamaCpp  │  │  TensorRT  │  │ API Backends │
  │  Backend   │  │  Backend   │  │  (4 types)   │
  └─────┬──────┘  └─────┬──────┘  └──────┬───────┘
        │               │                │
  ┌─────v───────────────v────────────────v───────┐
  │                   Session                    │
  │  • Message storage and token tracking        │
  │  • Eviction calculation and archival         │
  │  • Context utilization monitoring            │
  └──────────────────────┬───────────────────────┘
                         │
             ┌───────────┼────────────┐
             │           │            │
       ┌─────v─────┐ ┌───v─────┐ ┌────v─────────┐
       │    RAG    │ │  Tools  │ │  MCP Client  │
       │   System  │ │  System │ │              │
       └───────────┘ └─────────┘ └──────────────┘
```

---

## KV Cache Eviction System

### Why Eviction Matters

Modern LLMs have limited GPU memory for storing conversation context (KV cache). A 70B model with 128K context window requires ~240GB of KV cache memory. Shepherd's eviction system enables **indefinite conversations** by intelligently managing this constraint.
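
The exact footprint depends on layer count, number of KV heads (MHA vs. GQA), head dimension, precision, and context length. A back-of-the-envelope sizing helper (illustrative only, not Shepherd code):

```cpp
#include <cstddef>

// Rough KV cache size: a K and a V tensor are stored per layer for every cached token.
// All parameters are illustrative; plug in your model's actual values.
constexpr std::size_t kv_cache_bytes(std::size_t n_layers, std::size_t n_kv_heads,
                                     std::size_t head_dim, std::size_t bytes_per_elem,
                                     std::size_t n_ctx_tokens) {
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx_tokens;
}
```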

### Eviction Strategies

#### **llama.cpp: Callback-Driven**
```cpp
// Triggered by llama.cpp when KV cache fills
ctx_params.kv_need_space_callback = [](uint32_t tokens_needed, void* user_data) {
    return backend->evict_to_free_space(tokens_needed);
};
```

**Process** (a sketch follows the list):
1. Calculate oldest evictable messages (preserving system prompt and current message)
2. Map message indices to KV cache token positions
3. Remove token range from KV cache: `llama_memory_seq_rm()`
4. Shift remaining tokens to eliminate gaps: `llama_memory_seq_add()`
5. Archive evicted messages to RAG database
6. Insert eviction notice into context
7. Return new KV cache head position
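
A minimal sketch of steps 3 and 4, using the llama.cpp memory calls named above (signatures vary between llama.cpp versions, so treat this as illustrative rather than the actual backend code):

```cpp
// p0/p1 are the first and one-past-last KV positions of the evicted messages.
static void evict_token_range(llama_context* ctx, llama_seq_id seq,
                              llama_pos p0, llama_pos p1) {
    llama_memory_t mem = llama_get_memory(ctx);

    // Step 3: remove the evicted token range from the KV cache
    llama_memory_seq_rm(mem, seq, p0, p1);

    // Step 4: shift everything after the hole back so positions stay contiguous
    // (a negative end position means "to the end of the sequence")
    llama_memory_seq_add(mem, seq, p1, -1, -(p1 - p0));
}
```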

#### **TensorRT: Event-Driven**
```cpp
// Background thread monitors KV cache events
void monitor_kv_events() {
    auto events = event_mgr->getLatestEvents(100ms);
    for (auto& event : events) {
        if (event.type == KVCacheRemovedData) {
            handle_eviction(event.block_hashes);
        }
    }
}
```

**Process**:
1. Asynchronous monitoring thread detects `KVCacheRemovedData` events
2. Estimate tokens removed (~64-128 tokens per block)
3. Calculate corresponding message range
4. Archive messages to RAG database
5. Update context state

### Protection Rules
- **System Message**: Never evicted (contains tools and instructions)
- **Current User Message**: Never evicted (needed for generation)
- **Recent Tool Calls**: Protected until response generated
- **Eviction Order**: Oldest-first (LRU) within the evictable range (see the sketch below)
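
A minimal sketch of how these rules might be applied when selecting eviction candidates (illustrative types and field names, not the actual Session code):

```cpp
#include <cstddef>
#include <vector>

struct Msg {
    bool is_system;          // system prompt (tools + instructions)
    bool is_current_user;    // the message currently being answered
    bool is_protected_tool;  // recent tool call awaiting its response
};

// Returns indices of messages that may be evicted, oldest first (LRU).
std::vector<std::size_t> evictable_messages(const std::vector<Msg>& msgs) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < msgs.size(); i++) {
        const Msg& m = msgs[i];
        if (m.is_system || m.is_current_user || m.is_protected_tool)
            continue;            // protected: never evicted
        out.push_back(i);        // oldest messages come first
    }
    return out;
}
```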

---

## Building Shepherd

For detailed build instructions, see [BUILD.md](BUILD.md).

---

## Configuration

### Backend Selection

Shepherd automatically detects available backends at runtime. Priority order (sketched in code after this list):
1. **TensorRT-LLM** (if compiled and NVIDIA GPU detected)
2. **llama.cpp** (if compiled)
3. **API backends** (if API keys configured)
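
A minimal sketch of that priority order (the availability flags and returned names are illustrative, not the actual BackendManager logic):

```cpp
#include <cstdlib>
#include <string>

std::string pick_backend(bool tensorrt_available, bool llamacpp_available) {
    // Explicit override always wins (see SHEPHERD_BACKEND below)
    if (const char* forced = std::getenv("SHEPHERD_BACKEND"))
        return forced;
    if (tensorrt_available)  return "tensorrt";   // 1. compiled in + NVIDIA GPU detected
    if (llamacpp_available)  return "llamacpp";   // 2. compiled in
    if (std::getenv("OPENAI_API_KEY"))    return "openai";     // 3. API keys configured
    if (std::getenv("ANTHROPIC_API_KEY")) return "anthropic";
    if (std::getenv("GOOGLE_API_KEY"))    return "gemini";
    return {};                                    // nothing usable found
}
```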

### Environment Variables

```bash
# API Keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="..."
export GROK_API_KEY="xai-..."

# Local Model Path (for llama.cpp/TensorRT)
export MODEL_PATH="/path/to/model.gguf"

# RAG Database
export RAG_DB_PATH="./shepherd_memory.db"
export RAG_MAX_SIZE="10737418240" # 10GB

# Backend Override
export SHEPHERD_BACKEND="llamacpp" # or "tensorrt", "openai", etc.
```

### Configuration File (shepherd.json)

```json
{
  "backend": "llamacpp",
  "model_path": "/models/llama-3.1-70b-instruct.gguf",
  "context_size": 131072,
  "gpu_layers": 48,
  "api_keys": {
    "openai": "sk-...",
    "anthropic": "sk-ant-..."
  },
  "rag": {
    "database_path": "./shepherd_memory.db",
    "max_size_bytes": 10737418240,
    "enable_archival": true
  },
  "tools": {
    "enabled": ["filesystem", "http", "memory", "command"],
    "disabled": ["command_execution"]
  },
  "mcp_servers": [
    {
      "name": "filesystem",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user"]
    }
  ],
  "api_tools": [
    {
      "name": "ask_claude",
      "backend": "anthropic",
      "model": "claude-sonnet-4",
      "api_key": "sk-ant-...",
      "context_size": 200000
    },
    {
      "name": "ask_gpt",
      "backend": "openai",
      "model": "gpt-4",
      "api_key": "sk-..."
    }
  ]
}
```

---

## Usage Examples

### Interactive Mode

```bash
$ ./shepherd /models/llama-3.1-70b-instruct.gguf

Shepherd LLM System v1.0
Backend: llama.cpp (CUDA)
Model: llama-3.1-70b-instruct
Context: 131072 tokens
Tools: 12 available

> What files are in the current directory?
[Tool Call: list_directory(path=".")]
[Tool Result: main.cpp, README.md, CMakeLists.txt, ...]

The current directory contains:
- main.cpp: Main application entry point
- README.md: Project documentation
- CMakeLists.txt: Build configuration
...

> Read the contents of README.md and summarize it
[Tool Call: read_file(path="README.md")]
[Tool Result: # Shepherd\n\nAdvanced Multi-Backend LLM System...]

This README describes Shepherd, a production-grade LLM inference system...
```

### Python API

```python
import requests

# Start conversation
response = requests.post("http://localhost:8000/v1/chat/completions", json={
    "model": "llama-3.1-70b-instruct",
    "messages": [
        {"role": "user", "content": "What's the weather like?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "http_get",
                "description": "Make HTTP GET request",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "url": {"type": "string"}
                    }
                }
            }
        }
    ]
})

print(response.json())
```

### Tool Execution

```bash
# Enable specific tools
./shepherd --enable-tools filesystem,http,memory

# Disable dangerous tools
./shepherd --disable-tools command_execution

# List available tools
./shepherd --list-tools
```

### API Tools (Multi-Model Collaboration)

API tools enable any backend to call other AI models as tools, allowing for cross-model consultation and collaboration.

#### Configuration

```bash
# Add Claude as a tool
shepherd api add ask_claude anthropic \
    --model claude-sonnet-4 \
    --api-key sk-ant-...

# Add GPT as a tool
shepherd api add ask_gpt openai \
    --model gpt-4 \
    --api-key sk-... \
    --max-tokens 8000

# Add local Ollama model
shepherd api add local_expert ollama \
    --model llama3 \
    --api-base http://localhost:11434

# List configured API tools
shepherd api list

# Remove a tool
shepherd api remove ask_claude
```

#### Usage Example

```bash
# Start with Gemini as primary backend
shepherd --backend gemini --model gemini-2.0-flash

> Read logger.h and ask the sonnet tool to review it

# Gemini reads the file
* read(file_path=logger.h)

# Gemini calls Claude (via sonnet tool) for code review
* sonnet(prompt=Please review the following C++ header file for...)

> # Logger Class Review
>
> **Critical Issues:**
>
> - **Singleton Static Initialization Order Fiasco**: The singleton pattern...
> - **Format Function Inefficiency**: The format_helper creates multiple string copies...
> - **Missing Error Handling**: No exception handling for file operations...
>
> **Design Improvements:**
>
> - **Replace Singleton with Dependency Injection**: Singletons make testing difficult...
> - **Use RAII for File Management**: Use std::ofstream directly...
> - **Structured Logging Support**: Support structured data in logs...
```

**Real-world benefits:**

- **Code Review**: Get Claude's detailed analysis while using a local model
- **Cross-Validation**: Compare opinions from multiple models
- **Model-Specific Strengths**: Use Claude for code, GPT for creative writing, Gemini for reasoning
- **Second Opinions**: Quick consultations without switching contexts

---

## Server Mode (HTTP REST API)

Shepherd can run as an HTTP server providing an OpenAI-compatible REST API for **remote access** to a local Shepherd instance. This enables you to access your local models from any OpenAI-compatible client, library, or tool.

**Important**: Server mode is **single-session** - designed for one user accessing their Shepherd instance remotely, not for multi-tenant production serving. For multi-user inference servers, consider vLLM or TGI.

**Example Use Cases**:
- Access your home server's GPU from your laptop
- Use OpenAI-compatible tools (like cursor.ai) with your local models
- Remote access to your Shepherd instance from anywhere
- Integration testing with OpenAI client libraries

**Not For**:
- Multi-user production deployments
- Serving multiple concurrent clients
- Building a SaaS product

### Starting the Server

```bash
# Start server with llamacpp backend
./shepherd --server --port 8000 --backend llamacpp --model /path/to/model.gguf

# Start with specific configuration
./shepherd --server --config server_config.json

# Server output:
Shepherd API Server starting...
Backend: llamacpp
Model: /models/qwen-3-30b.gguf
Listening on: http://0.0.0.0:8000

Endpoints:
POST http://0.0.0.0:8000/v1/chat/completions
GET http://0.0.0.0:8000/v1/models
GET http://0.0.0.0:8000/health
```

### API Endpoints

#### POST `/v1/chat/completions`

OpenAI-compatible chat completions endpoint.

**Request**:
```json
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List files in current directory"}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_directory",
        "description": "List files in a directory",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string"}
          }
        }
      }
    }
  ],
  "temperature": 0.7,
  "max_tokens": 150
}
```

**Response**:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
          {
            "id": "call_123",
            "type": "function",
            "function": {
              "name": "list_directory",
              "arguments": "{\"path\":\".\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}
```

#### GET `/v1/models`

List available models.

**Response**:
```json
{
  "object": "list",
  "data": [
    {
      "id": "gpt-4",
      "object": "model",
      "created": 1234567890,
      "owned_by": "shepherd"
    }
  ]
}
```

#### GET `/health`

Health check endpoint.

**Response**:
```json
{
  "status": "healthy",
  "backend": "llamacpp",
  "model": "/models/qwen-3-30b.gguf"
}
```

### Session Management

Server mode implements a **single persistent session** that maintains the conversation state in the KV cache. This is designed for one user to access their Shepherd instance remotely.

#### How the Session Works

1. **Single Session**: The server maintains one active conversation session
2. **Message Accumulation**: Each request sends the **full conversation history** (OpenAI protocol)
3. **KV Cache Reuse**: Server maintains KV cache between requests for fast responses
4. **Prefix Matching**: Only new messages are processed; cached messages are skipped

**Note**: Unlike multi-tenant servers (vLLM, TGI), Shepherd's server mode is designed for **personal use** - think of it as remote access to your interactive Shepherd session, not as a production inference server.

**Example Session Flow**:

```python
import openai

client = openai.OpenAI(
    api_key="dummy",  # Not used
    base_url="http://localhost:8000/v1"
)

# Request 1: Send initial message
response1 = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "What is 2+2?"}
    ]
)
# Server: Creates session, caches system + user message
# KV Cache: [system_msg, user_msg, assistant_msg]

# Request 2: Send full history + new message
response2 = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."},
        {"role": "user", "content": "What about 3+3?"}
    ]
)
# Server: Detects prefix match, only processes new user message
# KV Cache: [system_msg, user_msg, assistant_msg, user_msg, assistant_msg]
```

#### Session Context Structure

The server maintains a single session with:

```cpp
struct SessionContext {
    std::vector<Message> messages;   // Full conversation history
    std::vector<Tool> tools;         // Available tools from client
    size_t cached_message_count;     // Messages currently in KV cache
};
```

**Key Implementation Details**:

1. **Message Replacement** (not append): Server replaces `session.messages` with each request
   ```cpp
   session.messages.clear();  // Clear before adding new messages
   for (const auto& msg : request["messages"]) {
       session.messages.push_back(msg);
   }
   ```

2. **Prefix Matching**: Compares incoming messages with cached messages
   ```cpp
   // Find longest matching prefix
   size_t prefix_len = 0;
   for (size_t i = 0; i < std::min(cached_msgs, new_msgs); i++) {
       if (cached_messages[i] == new_messages[i])
           prefix_len++;
       else break;
   }
   ```

3. **Divergence Handling**: If conversation diverges, KV cache is cleared
   ```cpp
   if (prefix_len < cached_message_count) {
       LOG_WARN("Conversation diverged - clearing cache");
       clear_kv_cache();
   }
   ```

### Client Examples

#### Python (OpenAI Library)

```python
import openai

client = openai.OpenAI(
    api_key="dummy",
    base_url="http://localhost:8000/v1"
)

# Multi-turn conversation
messages = []

while True:
    user_input = input("You: ")
    if user_input.lower() == 'quit':
        break

    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages
    )

    assistant_msg = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_msg})

    print(f"Assistant: {assistant_msg}")
```

#### curl

```bash
# Single request
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

# With tools
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "List files in /tmp"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "list_directory",
          "description": "List files",
          "parameters": {
            "type": "object",
            "properties": {
              "path": {"type": "string"}
            }
          }
        }
      }
    ]
  }'
```

### Configuration

Add to `shepherd.json`:

```json
{
  "server_mode": {
    "enabled": true,
    "host": "0.0.0.0",
    "port": 8000
  },
  "backend": "llamacpp",
  "model_path": "/models/model.gguf"
}
```

Or use command-line options:

```bash
./shepherd --server --host 0.0.0.0 --port 8000 --backend llamacpp --model /path/to/model.gguf
```

### Important Notes

- **Single-User Architecture**: Server mode is designed for **one user** accessing their Shepherd instance remotely
  - **Not multi-tenant**: Only one conversation at a time
  - **Not production-ready for multiple users**: Use vLLM, TGI, or similar for multi-user serving
  - **Use case**: Remote access to your local Shepherd (e.g., from laptop to home server)

- **Tools**: In server mode, tools are provided by the **client** in each request
  - Server does **NOT** execute tools (client-side execution only)
  - Server returns tool calls to client for execution
  - Client executes tools and sends results back in next request

- **KV Cache Persistence**: The session's KV cache persists across requests
  - Full conversation history maintained in memory
  - Prefix matching provides vLLM-like performance
  - Clear cache by restarting the server

- **Performance**: Server mode leverages KV cache for fast multi-turn conversations
  - First request: ~1.5s (full prompt processing)
  - Subsequent requests: ~200ms (only new tokens processed, 8-14x faster)

### KV Cache Persistence & Prefix Matching (vLLM-style)

Shepherd implements **automatic prefix caching** similar to vLLM's PagedAttention. When a new request arrives, the server:

1. **Compares** incoming messages with cached messages
2. **Detects** the longest matching prefix
3. **Reuses** KV cache for matching messages
4. **Only processes** new tokens after the prefix

**Example**:

```
Request 1: [system, user_1]
  → Server processes: system (4141 tokens) + user_1 (10 tokens)
  → KV cache: 4151 tokens
  → Time: ~1.5s (full processing)

Request 2: [system, user_1, assistant_1, user_2]
  → Server detects: system + user_1 match cached (4151 tokens)
  → Server processes ONLY: assistant_1 (27 tokens) + user_2 (8 tokens)
  → KV cache: 4151 + 35 = 4186 tokens
  → Time: ~200ms (90% faster!)

Request 3: [system, user_1, assistant_1, user_2, assistant_2, user_3]
  → Server detects: 4186 tokens already cached
  → Server processes ONLY: assistant_2 (105 tokens) + user_3 (12 tokens)
  → KV cache: 4186 + 117 = 4303 tokens
  → Time: ~180ms
```

**Logs showing prefix caching in action**:

```
[DEBUG] LlamaCpp generate_from_session called with 4 messages
[DEBUG] KV cache contains 2 messages, session has 4 messages
[DEBUG] Prefix match: 2 messages already cached
[DEBUG] Adding 2 new messages to KV cache
[DEBUG] Prompt already cached, skipping tokenization/decoding
[DEBUG] Generation limits: 150 max tokens (protected: system=4141 + user=7 + buffer=200)
[INFO ] Prompt processing: 13 tokens in 0.020000s (650 tokens/s) ← Only new tokens!
```

**Performance Comparison**:

| Scenario | Without Prefix Caching | With Prefix Caching | Speedup |
|----------|------------------------|---------------------|---------|
| Turn 1 (4151 tokens) | 1.5s | 1.5s | 1x |
| Turn 2 (35 new tokens) | 1.6s | 0.2s | **8x faster** |
| Turn 3 (117 new tokens) | 1.7s | 0.18s | **9x faster** |
| Turn 10 (50 new tokens) | 2.1s | 0.15s | **14x faster** |

**Why This Matters**:

- **Multi-turn conversations** become nearly instant after the first turn
- **System prompts** (often 4K+ tokens) are processed once and cached
- **Long context** doesn't slow down subsequent requests
- **Scales** to very long conversations without performance degradation

**Implementation Details**:

The prefix matching algorithm in `generate_from_session()`:

```cpp
// Compare cached messages with incoming messages
size_t prefix_match_count = 0;
for (size_t i = 0; i < std::min(kv_cached_message_count_, session.messages.size()); i++) {
    if (kv_cached_messages_[i].role == session.messages[i].role &&
        kv_cached_messages_[i].content == session.messages[i].content) {
        prefix_match_count++;
    } else {
        // Conversation diverged - clear cache and start fresh
        LOG_WARN("Conversation diverged at message " + std::to_string(i));
        llama_kv_cache_clear(ctx);
        prefix_match_count = 0;
        break;
    }
}

LOG_DEBUG("Prefix match: " + std::to_string(prefix_match_count) + " messages already cached");

// Only add new messages after the prefix
for (size_t i = prefix_match_count; i < session.messages.size(); i++) {
    add_message_to_kv_cache(session.messages[i]);
}
```

This is functionally similar to vLLM's automatic prefix caching and provides the same kind of performance benefit for multi-turn conversations.

---

## RAG System Usage

### Automatic Archival

When messages are evicted from KV cache, they're automatically archived:

```
[User]: What's the capital of France?
[Assistant]: Paris is the capital of France.

... 100K tokens later ...

[NOTICE: 2 messages evicted from context and archived to memory]
```

### Semantic Search

Retrieve relevant past conversations:

```python
# Search archived conversations
results = rag_manager.search("capital of France", max_results=5)

for result in results:
    print(f"Relevance: {result.score}")
    print(f"User: {result.user_message}")
    print(f"Assistant: {result.assistant_response}")
```
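
Under the hood, the archive search is an SQLite FTS5 query ranked with BM25. A minimal sketch of that kind of query (table and column names are illustrative, not Shepherd's actual schema):

```cpp
#include <cstdio>
#include <sqlite3.h>

void search_archive(sqlite3* db, const char* query) {
    // bm25() returns smaller values for better matches, so sort ascending.
    const char* sql =
        "SELECT user_message, assistant_response, bm25(conversations_fts) AS score "
        "FROM conversations_fts WHERE conversations_fts MATCH ?1 "
        "ORDER BY score LIMIT 5;";

    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return;
    sqlite3_bind_text(stmt, 1, query, -1, SQLITE_TRANSIENT);

    while (sqlite3_step(stmt) == SQLITE_ROW) {
        std::printf("score=%.3f user=%s\n",
                    sqlite3_column_double(stmt, 2),
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));
    }
    sqlite3_finalize(stmt);
}
```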

### Fact Storage

Store persistent knowledge across sessions:

```bash
> Remember that my favorite color is blue
[Tool Call: set_fact(key="favorite_color", value="blue")]
[Tool Result: {"success": true}]

I've stored that your favorite color is blue.

... new session ...

> What's my favorite color?
[Tool Call: get_fact(key="favorite_color")]
[Tool Result: {"value": "blue"}]

Your favorite color is blue!
```
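
A minimal sketch of the kind of key-value table that can back `set_fact`/`get_fact` (schema and names are illustrative, not Shepherd's actual implementation):

```cpp
#include <sqlite3.h>
#include <string>

// Upsert a fact; assumes a table created as:
//   CREATE TABLE facts(key TEXT PRIMARY KEY, value TEXT);
bool set_fact(sqlite3* db, const std::string& key, const std::string& value) {
    const char* sql =
        "INSERT INTO facts(key, value) VALUES(?1, ?2) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value;";

    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return false;
    sqlite3_bind_text(stmt, 1, key.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, value.c_str(), -1, SQLITE_TRANSIENT);

    bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}
```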

### Direct Memory Storage

Store question/answer pairs directly to long-term memory:

```bash
> Store this information: "What is the project deadline?" Answer: "March 15, 2024"
[Tool Call: store_memory(question="What is the project deadline?", answer="March 15, 2024")]
[Tool Result: {"success": true}]

Stored to long-term memory.

... later in conversation or new session ...

> Search for information about the deadline
[Tool Call: search_memory(query="project deadline", max_results=3)]
[Tool Result: "Found 1 archived conversation(s):
Result 1 [Relevance: 0.95]:
User: What is the project deadline?
Assistant: March 15, 2024"]

The project deadline is March 15, 2024.
```

---

## Hardware Recommendations

### Optimal Configuration
- **GPUs**: 2x NVIDIA RTX 3090 (48GB VRAM) or better
- **RAM**: 128GB system RAM
- **Storage**: NVMe SSD (1TB+)
- **CPU**: 16+ cores for parallel processing

### Minimum Configuration
- **GPU**: NVIDIA GTX 1080 Ti (11GB VRAM) or better
- **RAM**: 32GB system RAM
- **Storage**: SATA SSD (500GB)
- **CPU**: 8 cores

### Cloud Deployment
- **AWS**: g5.12xlarge (4x A10G, 192GB RAM)
- **GCP**: a2-highgpu-4g (4x A100, 340GB RAM)
- **Azure**: Standard_NC24ads_A100_v4 (A100 80GB)

---

## Performance Characteristics

### Throughput (70B model, batch_size=1)

| Backend | Prompt Speed | Generation Speed | Latency |
|---------|--------------|------------------|---------|
| TensorRT-LLM | 8000 tok/s | 45 tok/s | ~50ms |
| llama.cpp (CUDA) | 1200 tok/s | 25 tok/s | ~80ms |
| llama.cpp (CPU) | 150 tok/s | 8 tok/s | ~200ms |

### Memory Usage (70B model)

| Configuration | VRAM | System RAM | Context Size |
|---------------|------|------------|--------------|
| Q4_K_M + 64K ctx | 38GB | 8GB | 65536 tokens |
| Q4_K_M + 128K ctx | 42GB | 12GB | 131072 tokens |
| Q8_0 + 64K ctx | 72GB | 16GB | 65536 tokens |

### Eviction Performance

- **Eviction latency**: <10ms (llama.cpp), <5ms (TensorRT)
- **Archive write**: ~2ms per conversation turn (SQLite)
- **Search latency**: ~15ms (FTS5 full-text search)
- **Context shift**: O(n) where n = remaining tokens

---

## Development

### Project Structure

```
shepherd/
├── backends/                   # Backend implementations
│   ├── api_backend.{cpp,h}     # Base for API backends
│   ├── llamacpp.{cpp,h}        # llama.cpp backend
│   ├── tensorrt.{cpp,h}        # TensorRT-LLM backend
│   ├── openai.{cpp,h}          # OpenAI API
│   ├── anthropic.{cpp,h}       # Anthropic Claude
│   ├── gemini.{cpp,h}          # Google Gemini
│   ├── grok.{cpp,h}            # xAI Grok
│   └── ollama.{cpp,h}          # Ollama
├── tools/                      # Tool implementations
│   ├── tool.{cpp,h}            # Base tool interface
│   ├── filesystem_tools.{cpp,h}
│   ├── http_tools.{cpp,h}
│   ├── command_tools.{cpp,h}
│   ├── json_tools.{cpp,h}
│   └── memory_tools.{cpp,h}
├── mcp/                        # Model Context Protocol
│   ├── mcp_client.{cpp,h}
│   ├── mcp_server.{cpp,h}
│   └── mcp_manager.{cpp,h}
├── server/                     # HTTP REST API Server
│   ├── api_server.py           # FastAPI wrapper (spawned by C++)
│   ├── requirements.txt        # Python dependencies
│   └── README.md               # Server documentation
├── server.{cpp,h}              # C++ HTTP server implementation
├── session_context.h           # Session state management
├── rag_system.{cpp,h}          # RAG implementation
├── memory_manager.{cpp,h}      # Memory management
├── context_manager.{cpp,h}     # Context handling
├── backend_manager.{cpp,h}     # Backend orchestration
├── model_config.h              # Model-specific configuration
├── tokenizer.{cpp,h}           # Tokenization
├── main.cpp                    # Application entry (interactive & server modes)
└── CMakeLists.txt              # Build configuration
```

### Adding Custom Tools

```cpp
#include "tools/tool.h"

class MyCustomTool : public Tool {
public:
std::string unsanitized_name() const override {
return "my_custom_tool";
}

std::string description() const override {
return "Does something amazing";
}

std::string parameters() const override {
return R"({
"type": "object",
"properties": {
"input": {"type": "string", "description": "Input text"}
},
"required": ["input"]
})";
}

std::map execute(
const std::map& args) override
{
std::string input = std::any_cast(args.at("input"));
// Do something with input
return {{"result", "success"}};
}
};

// Register tool
ToolRegistry::instance().register_tool(std::make_unique());
```

### Adding Custom Backends

```cpp
#include "backend_interface.h"

class MyBackend : public Backend {
public:
bool initialize(const std::string& model_path) override {
// Initialize your backend
return true;
}

std::string generate(const std::string& prompt,
const GenerationParams& params) override {
// Generate response
return response;
}

void add_message(const Message& message) override {
// Add to context
}

// Implement other virtual methods...
};

// Register backend
BackendManager::register_backend("mybackend",
std::make_unique());
```

---

## Troubleshooting

### Out of Memory During Inference

**Solution**: Reduce context size or enable eviction
```bash
./shepherd --context-size 65536 --enable-eviction
```

### Slow Generation Speed

**Solution**: Increase GPU layers or use TensorRT
```bash
./shepherd --gpu-layers 48 --backend tensorrt
```

### Tool Execution Failures

**Solution**: Check tool permissions and enable verbose logging
```bash
./shepherd --verbose --log-level debug
```

### KV Cache Corruption

**Symptoms**: Repetitive or nonsensical output after eviction

**Solution**: Verify position shifting is working correctly. Check logs for:
```
[DEBUG] Evicting messages 2-5 (tokens 1024-4096)
[DEBUG] Shifting remaining tokens by -3072
[DEBUG] New KV cache size: 8192 tokens
```

### Infinite Tool Call Loops

**Symptoms**: Model repeatedly executes the same tool call in an infinite loop

**Root Cause**: Missing closing tags in KV cache after generation

This was a critical bug where assistant message closing tags (`<|im_end|>` for Qwen, `<|eot_id|>` for Llama) were never added to the KV cache after generation. This caused malformed context on the next generation cycle.

**Example of Bug**:
```
Iteration 1:
<|im_start|>assistant
{"id":"call_123","name":"list_directory"...} ← MISSING <|im_end|>!
...<|im_end|>

Iteration 2:
<|im_start|>assistant ← DUPLICATE! Previous message was never closed
{"id":"call_123","name":"list_directory"...} ← SAME TOOL CALL AGAIN!
```

**Fix Applied** (backends/llamacpp.cpp:1054-1076):

The backend now adds the closing tag to the KV cache after generation completes:

```cpp
// CRITICAL FIX: After generation, add the closing tag to KV cache
if (!model_config_.assistant_end_tag.empty()) {
    // Tokenize and add the closing tag to KV cache
    const llama_vocab* vocab = llama_model_get_vocab(static_cast<llama_model*>(model_));
    std::vector<llama_token> closing_tokens(16);
    int n_closing = llama_tokenize(vocab, model_config_.assistant_end_tag.c_str(),
                                   model_config_.assistant_end_tag.length(),
                                   closing_tokens.data(), closing_tokens.size(), false, true);

    if (n_closing > 0) {
        closing_tokens.resize(n_closing);
        llama_batch closing_batch = llama_batch_get_one(closing_tokens.data(), n_closing);
        if (llama_decode(ctx, closing_batch) != 0) {
            LOG_WARN("Failed to decode closing tag into KV cache");
        } else {
            LOG_DEBUG("Added closing tag to KV cache: " + model_config_.assistant_end_tag);
        }
    }
}
```

**Verification**: Check logs for confirmation:
```
[DEBUG] Added closing tag to KV cache: <|im_end|>
```

### Duplicate Messages in Server Mode

**Symptoms**: Same messages appear multiple times in KV cache during server mode conversations

**Root Cause**: Server was appending messages instead of replacing them

The OpenAI protocol sends the **full conversation history** with each request. The server must **replace** the session messages, not append them.

**Example of Bug**:
```
Request 1: [system, user] → session.messages has 2 messages
Request 2: [system, user, assistant, tool] → session.messages has 2 + 4 = 6 messages (duplicates!)
```

**Fix Applied** (server.cpp:160):

```cpp
// OpenAI protocol sends FULL conversation history each time, so REPLACE not append
session.messages.clear();  // Clear before adding new messages
for (const auto& msg : request["messages"]) {
    session.messages.push_back(msg);
}
```

**Verification**: Check server logs show no duplicates:
```
[DEBUG] === MESSAGES IN KV CACHE ===
[DEBUG] [system] <|im_start|>system You are a helpful assistant...
[DEBUG] [user] list the files
[DEBUG] [assistant] {"id":"call_123"...
[DEBUG] [tool] Here are the files...
[DEBUG] === END KV CACHE ===
```

---

## Architecture Improvements

### Model-Agnostic Backend Design

Previously, backends contained hardcoded model family checks:

```cpp
// OLD CODE (BAD):
if (model_config_.family == ModelFamily::QWEN_2_X) {
    generation_prompt = "<|im_start|>assistant\n";
    closing_tag = "<|im_end|>\n";
} else if (model_config_.family == ModelFamily::LLAMA_3_X) {
    generation_prompt = "<|start_header_id|>assistant<|end_header_id|>\n\n";
    closing_tag = "<|eot_id|>";
}
```

This violated the separation of concerns - backends shouldn't know about specific model formats.

**New Design**: Model-specific tags are defined in `ModelConfig`:

```cpp
// model_config.h
struct ModelConfig {
    std::string assistant_start_tag;  // e.g., "<|im_start|>assistant\n"
    std::string assistant_end_tag;    // e.g., "<|im_end|>\n"
    // ...
};

// Populated by factory methods:
static ModelConfig create_qwen() {
    return ModelConfig{
        // ...
        .assistant_start_tag = "<|im_start|>assistant\n",
        .assistant_end_tag = "<|im_end|>\n"
    };
}
```

**Backend Usage** (now model-agnostic):

```cpp
// NEW CODE (GOOD):
std::string generation_prompt = model_config_.assistant_start_tag;
// ... generate ...
if (!model_config_.assistant_end_tag.empty()) {
    // Add closing tag
}
```

**Benefits**:
- Backends are model-agnostic
- Easy to add new model families without changing backend code
- Configuration-driven behavior
- Better separation of concerns

---

## Documentation

- **API Reference**: [docs/api.md](docs/api.md)
- **Architecture Guide**: [docs/architecture.md](docs/architecture.md)
- **Tool Development**: [docs/tools.md](docs/tools.md)
- **Backend Development**: [docs/backends.md](docs/backends.md)
- **MCP Integration**: [docs/mcp.md](docs/mcp.md)

---

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

### Areas for Contribution
- Additional backend integrations
- New tool implementations
- Performance optimizations
- Documentation improvements
- Test coverage

---

## License

**PolyForm Shield License 1.0.0**

Copyright (C) 2024-2025 Stephen P. Shoecraft

This project is licensed under the PolyForm Shield License 1.0.0. You are free to:
- βœ… Use Shepherd for any purpose (personal, commercial, internal business use)
- βœ… Modify and create derivative works
- βœ… Distribute copies

**Restrictions**:
- ❌ Cannot sell Shepherd or derivatives as a standalone product
- ❌ Cannot offer Shepherd as a paid service (SaaS)
- ❌ Cannot create competing products using Shepherd

For the full license text, see [LICENSE](LICENSE) or visit:
https://polyformproject.org/licenses/shield/1.0.0/

**Commercial Licensing**: For use cases not covered by the PolyForm Shield License, please contact the author for alternative licensing options.

---

## Acknowledgments

- **llama.cpp**: Georgi Gerganov and contributors
- **TensorRT-LLM**: NVIDIA Corporation
- **Model Context Protocol**: Anthropic
- **SQLite**: D. Richard Hipp

---

## Contact

- **Issues**: https://github.com/sshoecraft/shepherd/issues
- **Discussions**: https://github.com/sshoecraft/shepherd/discussions