{"id":47621310,"url":"https://github.com/legionio/legion-llm","last_synced_at":"2026-05-31T03:03:29.583Z","repository":{"id":344218639,"uuid":"1180552296","full_name":"LegionIO/legion-llm","owner":"LegionIO","description":"LLM integration for LegionIO - chat, embeddings, tool use, and agents via ruby_llm","archived":false,"fork":false,"pushed_at":"2026-04-22T07:25:26.000Z","size":1726,"stargazers_count":1,"open_issues_count":2,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-22T09:37:46.532Z","etag":null,"topics":["ai","legion-core","legionio","llm","ruby"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LegionIO.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-03-13T06:44:16.000Z","updated_at":"2026-04-22T07:24:09.000Z","dependencies_parsed_at":"2026-04-02T03:08:00.146Z","dependency_job_id":null,"html_url":"https://github.com/LegionIO/legion-llm","commit_stats":null,"previous_names":["legionio/legion-llm"],"tags_count":99,"template":false,"template_full_name":null,"purl":"pkg:github/LegionIO/legion-llm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LegionIO%2Flegion-llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LegionIO%2Flegion-llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LegionIO%2Flegion-llm
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LegionIO%2Flegion-llm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LegionIO","download_url":"https://codeload.github.com/LegionIO/legion-llm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LegionIO%2Flegion-llm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32352406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-27T17:12:42.749Z","status":"ssl_error","status_checked_at":"2026-04-27T17:12:41.658Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","legion-core","legionio","llm","ruby"],"created_at":"2026-04-01T22:13:56.471Z","updated_at":"2026-05-16T01:02:09.147Z","avatar_url":"https://github.com/LegionIO.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Legion LLM\n\nLLM routing and provider orchestration for the [LegionIO](https://github.com/LegionIO/LegionIO) framework. 
Routes chat, embeddings, tool use, fleet dispatch, auditing, and provider metadata through Legion-native `lex-llm-*` provider extensions.\n\n**Version**: 0.9.0\n\n## Installation\n\n```ruby\ngem 'legion-llm'\n```\n\nOr add to your Gemfile and `bundle install`.\n\n## OpenAI / Anthropic API Compatibility\n\nAny tool built for the OpenAI or Anthropic API can talk to Legion by changing its `base_url`. No custom headers, no special config — path determines the response format.\n\n### Routes\n\n| Method | Path | Description |\n|--------|------|-------------|\n| `POST` | `/v1/chat/completions` | OpenAI-compatible chat (streaming via `data: [DONE]`) |\n| `GET`  | `/v1/models` | Unified model catalog across all providers |\n| `GET`  | `/v1/models/:id` | Single model detail |\n| `POST` | `/v1/embeddings` | OpenAI-compatible embedding generation |\n| `POST` | `/v1/messages` | Anthropic-compatible messages (streaming via typed events) |\n\n### Usage\n\nPoint any OpenAI SDK at Legion:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:4567/v1\", api_key=\"your-legion-key\")\nresponse = client.chat.completions.create(\n    model=\"us.anthropic.claude-sonnet-4-6-v1\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello\"}]\n)\n```\n\nOr any Anthropic SDK:\n\n```python\nfrom anthropic import Anthropic\n\nclient = Anthropic(base_url=\"http://localhost:4567/v1\", api_key=\"your-legion-key\")\nresponse = client.messages.create(\n    model=\"us.anthropic.claude-sonnet-4-6-v1\",\n    max_tokens=1024,\n    messages=[{\"role\": \"user\", \"content\": \"Hello\"}]\n)\n```\n\nRequests flow through the full Inference pipeline — routing, metering, audit, quality checks — then the response is translated back into the caller's expected format.\n\n### Streaming\n\nBoth formats supported with correct SSE shapes:\n- **OpenAI**: `data: {\"choices\":[{\"delta\":{\"content\":\"...\"}}]}` chunks, terminated by `data: [DONE]`\n- **Anthropic**: Typed events — 
`message_start`, `content_block_start`, `content_block_delta`, `content_block_stop`, `message_delta`, `message_stop`\n- **Native**: `/api/llm/inference` streams `text-delta`, optional `thinking-delta` events when `include_thinking: true`, tool lifecycle events, and a final `done` event. Structured provider content blocks are flattened to plain text in both streaming and non-streaming native responses so `content` remains a string for daemon clients.\n\n### API Authentication\n\nConfig-driven via `settings[:llm][:api][:auth]`. Disabled by default for local dev and lite mode.\n\n```json\n{\n  \"llm\": {\n    \"api\": {\n      \"auth\": {\n        \"enabled\": false,\n        \"api_keys\": [\"key-1\", \"key-2\"],\n        \"pass_through\": false\n      }\n    }\n  }\n}\n```\n\n| Field | Type | Default | Description |\n|-------|------|---------|-------------|\n| `enabled` | Boolean | `false` | Enable auth for `/v1/` routes |\n| `api_keys` | Array | `[]` | Accepted Bearer tokens and `x-api-key` values |\n| `pass_through` | Boolean | `false` | Forward client token to the upstream provider instead of using Legion's own credentials |\n\nWhen enabled, validates `Authorization: Bearer \u003ctoken\u003e` or `x-api-key` headers against the configured `api_keys` list. The client authenticates to Legion; Legion authenticates to providers separately.\n\n## Configuration\n\nProvider defaults now live in each `lex-llm-*` provider extension. `legion-llm` ships an empty `providers: {}` hash; settings files and extension registrations populate it at runtime.\n\nAdd to your LegionIO settings directory (e.g. 
`~/.legionio/settings/llm.json`):\n\n```json\n{\n  \"llm\": {\n    \"default_model\": \"us.anthropic.claude-sonnet-4-6-v1\",\n    \"default_provider\": \"bedrock\",\n    \"providers\": {\n      \"bedrock\": {\n        \"enabled\": true,\n        \"region\": \"us-east-2\",\n        \"bearer_token\": [\"vault://secret/data/llm/bedrock#bearer_token\", \"env://AWS_BEARER_TOKEN\"]\n      },\n      \"ollama\": {\n        \"enabled\": true,\n        \"base_url\": \"http://localhost:11434\",\n        \"instances\": {\n          \"default\": { \"base_url\": \"http://localhost:11434\" },\n          \"gpu_server\": { \"base_url\": \"http://gpu-server:11434\" }\n        }\n      }\n    }\n  }\n}\n```\n\nCredentials are resolved automatically by the universal secret resolver in `legion-settings` (v1.3.0+). Use `vault://` URIs for Vault secrets, `env://` for environment variables, or plain strings for static values. Array values act as fallback chains -- the first non-nil result wins.\n\n### Provider Extensions (lex-llm-*)\n\nEach provider is a standalone `lex-llm-*` gem that ships its own `default_settings`, model catalog, capability declarations, and optional provider-owned fleet worker actor. When a provider gem is loaded, `legion-llm` discovers it through the shared `lex-llm` provider contract and registers provider instances for routing. 
Provider gems implement:\n\n- **`default_settings`** -- Connection defaults (base_url, region, API key env vars)\n- **`model_allowed?(model_name)`** -- Provider-level model filtering\n- **`Model::Info`** -- Real capabilities, context lengths, and parameter counts for each model\n\nThe routing layer only sees models the provider has already filtered and annotated.\n\n### Multi-Instance Providers\n\nLocal and fleet providers (Ollama, vLLM, MLX) support multiple named instances:\n\n```json\n{\n  \"ollama\": {\n    \"enabled\": true,\n    \"instances\": {\n      \"macbook\":    { \"base_url\": \"http://localhost:11434\" },\n      \"gpu_server\": { \"base_url\": \"http://gpu-server:11434\" }\n    }\n  }\n}\n```\n\nDiscovery scans all instances in parallel, enriches models with `/api/show` metadata, and generates per-instance routing rules. Each instance appears independently in the routing table so the router can target the exact hardware.\n\n### Capability-Aware Routing\n\nRouting rules and auto-generated rules carry `model_capabilities`, `context_length`, and `parameter_count` from provider-supplied `Model::Info`. The router uses these to match capability requirements (e.g., `thinking`, `vision`, `tools`) without a static lookup table.\n\n### Generic Dispatch\n\n`Call::Dispatch.call` accepts a `capability:` parameter (`:chat`, `:stream`, `:embed`) and routes to the registered `lex-llm-*` adapter. This replaces the old provider-specific dispatch paths.\n\n### Memory Gate\n\nDiscovery checks available system memory (macOS `vm_stat`/`sysctl`, Linux `/proc/meminfo`) before routing to local models. Models that exceed available RAM minus `discovery.memory_floor_mb` are silently skipped.\n\n### Credential Resolution\n\nAll credential fields support the universal `vault://` and `env://` URI schemes provided by `legion-settings`. 
Use array values for fallback chains:\n\n```json\n{\n  \"bedrock\": {\n    \"enabled\": true,\n    \"api_key\": [\"vault://secret/data/llm/bedrock#access_key\", \"env://AWS_ACCESS_KEY_ID\"],\n    \"secret_key\": [\"vault://secret/data/llm/bedrock#secret_key\", \"env://AWS_SECRET_ACCESS_KEY\"],\n    \"bearer_token\": [\"vault://secret/data/llm/bedrock#bearer_token\", \"env://AWS_BEARER_TOKEN\"],\n    \"region\": \"us-east-2\"\n  }\n}\n```\n\nBy the time `Legion::LLM.start` runs, all `vault://` and `env://` references have already been resolved to plain strings by `Legion::Settings.resolve_secrets!` (called in the boot sequence after `Legion::Crypt.start`). The `env://` scheme works even when Vault is not connected.\n\n### Auto-Detection\n\nIf no `default_model` or `default_provider` is set, legion-llm auto-detects from the first enabled provider. The detection order and default models are defined by each `lex-llm-*` provider extension's `default_settings`.\n\n## Core API\n\n### Lifecycle\n\n```ruby\nLegion::LLM.start       # Configure providers, warm discovery caches, set defaults, ping provider\nLegion::LLM.shutdown     # Mark disconnected, clean up\nLegion::LLM.started?     # -\u003e Boolean\nLegion::LLM.settings     # -\u003e Hash (current LLM settings)\n```\n\n### One-Shot Ask\n\n`Legion::LLM.ask` is a convenience method for single-turn requests. It routes daemon-first via the LegionIO REST API when configured, otherwise it uses the native provider router:\n\n```ruby\n# Synchronous response\nresult = Legion::LLM.ask(message: \"What is the capital of France?\")\nputs(result[:response] || result[:content])\n\n# Daemon immediate/created responses return the daemon body hash.\n# Native direct routing and async poll completion return:\n#   { status: :done, response: \"...\", meta: { ... 
} }\n# HTTP 403 raises DaemonDeniedError; HTTP 429 raises DaemonRateLimitedError.\n```\n\nConfigure daemon routing under `llm.daemon`:\n\n```json\n{\n  \"llm\": {\n    \"daemon\": {\n      \"enabled\": true,\n      \"url\": \"http://127.0.0.1:4567\"\n    }\n  }\n}\n```\n\nLarge async responses that overflow the cache spool to disk under\n`llm.prompt_caching.response_cache.spool_dir` (default:\n`~/.legionio/data/spool/llm_responses`).\n\n### Chat\n\n`Legion::LLM.chat` executes request-shaped calls through native provider dispatch. Provide `message:` or `messages:` so the request can be routed through the Inference pipeline.\n\n```ruby\n# Immediate execution through the request path\nresult = Legion::LLM.chat(message: \"What is the capital of France?\")\n\n# Explicit multi-message request\nresult = Legion::LLM.chat(\n  messages: [\n    { role: :user, content: \"Summarize the meeting notes\" },\n    { role: :assistant, content: \"Notes received.\" },\n    { role: :user, content: \"Now produce the summary\" }\n  ]\n)\n\n```\n\n### Embeddings\n\n```ruby\nembedding = Legion::LLM.embed(\"some text to embed\")\nembedding.vectors  # -\u003e Array of floats\n\n# Specific model\nembedding = Legion::LLM.embed(\"text\", model: \"text-embedding-3-small\")\n```\n\n### Tool Use\n\nDefine tools as native tool definitions or registered Legion tool classes. 
The inference executor forwards tool definitions to native providers and dispatches tool calls through `Inference::ToolDispatcher`:\n\n```ruby\nresponse = Legion::LLM.chat(\n  message: \"What's the weather in Minneapolis?\",\n  tools: [\n    {\n      name: \"weather_lookup\",\n      description: \"Look up current weather for a location\",\n      parameters: {\n        type: \"object\",\n        properties: {\n          location: { type: \"string\" },\n          units: { type: \"string\", enum: %w[celsius fahrenheit] }\n        },\n        required: [\"location\"]\n      }\n    }\n  ]\n)\n```\n\n### Structured Output\n\nUse structured output with a JSON schema:\n\n```ruby\nresult = Legion::LLM.structured(\n  messages: [{ role: :user, content: \"Analyze: 'I love this product!'\" }],\n  schema: {\n    type: \"object\",\n    properties: {\n      sentiment: { type: \"string\", enum: %w[positive negative neutral] },\n      confidence: { type: \"number\" },\n      reasoning: { type: \"string\" }\n    },\n    required: %w[sentiment confidence reasoning]\n  }\n)\n```\n\n## Types\n\nv0.8.0 introduces first-class immutable value types implemented as `Data.define` structs. These replace plain hashes throughout the pipeline, API translators, audit, and metering flows.\n\n| Type | Module | Purpose |\n|------|--------|---------|\n| `Message` | `Legion::LLM::Types::Message` | Conversation message with role, content, tool_calls, token counts |\n| `ToolCall` | `Legion::LLM::Types::ToolCall` | Tool invocation with name, arguments, status, duration, result |\n| `ContentBlock` | `Legion::LLM::Types::ContentBlock` | Typed content block (text, thinking, tool_use, tool_result) with cache_control |\n| `Chunk` | `Legion::LLM::Types::Chunk` | Streaming delta (content, thinking, tool_call, done) |\n\nEach type provides factory methods (`build`, `from_hash`, `text`, `tool_use`, etc.) and serialization helpers (`to_provider_hash`, `to_audit_hash`). 
All types are immutable after construction.\n\n## Module Structure\n\n```\nLegion::LLM (lib/legion/llm.rb)          # Thin facade — delegates to Inference, Call, Discovery\n├── Errors                               # Typed error hierarchy (LLMError base + subtypes, retryable?)\n├── Types                                # Immutable Data.define structs\n│   ├── Message      # role, content, tool_calls, tokens, conversation_id, task_id\n│   ├── ToolCall     # name, arguments, source, status, duration_ms, result\n│   ├── ContentBlock # type, text, data, tool_use/result fields, cache_control\n│   └── Chunk        # Streaming delta: content_delta / thinking_delta / tool_call_delta / done\n├── Config                               # Settings and defaults\n│   └── Settings     # Default config, provider settings, routing defaults, API auth defaults\n├── Call                                 # Native provider call layer\n│   ├── Providers        # Provider configuration, auto-detect, verify\n│   ├── Registry         # Thread-safe lex-* provider extension registry\n│   ├── Dispatch         # Native provider dispatch to registered lex-* extensions\n│   ├── Embeddings       # generate, generate_batch, default_model, fallback chain\n│   ├── StructuredOutput # JSON schema enforcement with native response_format and prompt fallback\n│   ├── DaemonClient     # HTTP routing to LegionIO daemon with 30s health cache\n│   ├── ClaudeConfigLoader # Import Claude CLI config from ~/.claude/settings.json\n│   └── CodexConfigLoader  # Import OpenAI bearer token from ~/.codex/auth.json\n├── Context                              # Prompt and conversation context management\n│   ├── Compressor   # Deterministic prompt compression (3 levels, code-block-aware)\n│   └── Curator      # Async conversation curation: strip thinking, distill tools, fold resolved exchanges\n├── Discovery                            # Runtime introspection\n│   ├── Ollama       # Queries Ollama /api/tags for pulled models 
(TTL-cached)\n│   ├── Vllm         # Queries vLLM /v1/models and /health for model/context discovery\n│   └── System       # Queries OS memory: macOS (vm_stat/sysctl), Linux (/proc/meminfo)\n├── Quality                              # Response quality evaluation\n│   ├── Checker      # Quality heuristics (empty, too_short, repetition, json_parse) + pluggable\n│   ├── ShadowEval   # Parallel shadow evaluation on cheaper models with sampling\n│   └── Confidence/\n│       ├── Score    # Immutable ConfidenceScore value object (score, band, source, signals)\n│       └── Scorer   # Computes ConfidenceScore from logprobs, heuristics, or caller-provided value\n├── Metering                             # Unified token/cost accounting and AMQP event emission\n│   ├── Usage        # Immutable Usage struct (input_tokens, output_tokens, cache tokens)\n│   ├── Pricing      # Model cost estimation with fuzzy matching\n│   ├── Recorder     # Per-request in-memory cost accumulator\n│   └── Tokens       # Thread-safe per-session token budget accumulator\n├── Inference                            # 18-step request/response pipeline\n│   ├── Request      # Data.define struct for unified request representation\n│   ├── Response     # Data.define struct for unified response representation\n│   ├── Profile      # Caller-derived profiles (external/gaia/system) for step skipping\n│   ├── Tracing      # Distributed trace_id, span_id, exchange_id generation\n│   ├── Timeline     # Ordered event recording with participant tracking\n│   ├── Executor     # 18-step skeleton with profile-aware execution and call_stream\n│   ├── Conversation # In-memory LRU (256 slots) + optional Sequel DB persistence\n│   ├── Prompt       # Clean dispatch API: dispatch, request, summarize, extract, decide\n│   ├── ToolDispatcher # Routes tool calls: MCP client / LEX runner / native tool execution\n│   ├── AuditPublisher # Publishes audit events to llm.audit exchange\n│   ├── EnrichmentInjector # Converts RAG/GAIA 
enrichments into system prompt\n│   └── Steps/       # All 18+ pipeline step modules\n├── Router                               # Dynamic weighted routing engine\n│   ├── Resolution   # Value object: tier, provider, model, rule name, metadata, compress_level\n│   ├── Rule         # Routing rule: intent matching, schedule windows, constraints\n│   ├── HealthTracker # Circuit breaker, latency rolling window, pluggable signal handlers\n│   ├── EscalationChain # Ordered fallback resolution chain with max_attempts cap\n│   ├── Arbitrage    # Cost-aware model selection when no rules match\n│   └── Escalation/\n│       └── History  # EscalationHistory mixin\n├── Fleet                                # Fleet dispatch over AMQP; provider responders live in lex-llm-* gems\n│   ├── Dispatcher   # Fleet RPC dispatch with routing key building, per-type timeouts\n│   ├── TokenIssuer  # Request-side JWT minting for provider-owned responders\n│   └── ReplyDispatcher # Correlation-based reply routing\n├── API                                  # All external HTTP interfaces\n│   ├── Auth         # Config-driven Bearer/x-api-key auth for /v1/ routes\n│   ├── Native/\n│   │   ├── Inference  # POST /api/llm/inference\n│   │   ├── Chat       # POST /api/llm/chat\n│   │   ├── Providers  # GET /api/llm/providers, GET /api/llm/providers/:name\n│   │   ├── Models     # GET /api/llm/models, GET /api/llm/models/:id, GET /api/llm/providers/:name/models\n│   │   └── Helpers    # Shared: parse_request_body, json_response, emit_sse_event\n│   ├── OpenAI/\n│   │   ├── ChatCompletions # POST /v1/chat/completions (streaming via data: [DONE])\n│   │   ├── Models          # GET /v1/models, GET /v1/models/:id\n│   │   └── Embeddings      # POST /v1/embeddings\n│   ├── Anthropic/\n│   │   └── Messages        # POST /v1/messages (streaming via message_start/stop events)\n│   └── Translators/\n│       ├── OpenAIRequest / OpenAIResponse\n│       └── AnthropicRequest / AnthropicResponse\n├── Audit              
                  # Prompt, tool, and skill audit event emission\n├── Transport                            # Centralized AMQP exchange and non-fleet message definitions\n│   ├── Message      # LLM base message: context propagation, LLM headers\n│   ├── Exchanges/   # Fleet, Metering, Audit, Escalation\n│   └── Messages/    # MeteringEvent, prompt/tool audit, escalation, and compatibility wrappers\n├── Scheduling                           # Deferred execution\n│   ├── Batch        # Non-urgent request batching with priority queue and auto-flush\n│   └── OffPeak      # Peak-hour deferral\n├── Tools                                # Tool call layer\n│   ├── Confidence   # 4-tier degrading confidence storage\n│   ├── Dispatcher   # Routes tool calls to MCP/LEX/native execution\n│   └── Interceptor  # Extensible pre-dispatch intercept registry\n├── Hooks                                # Before/after chat interceptor registry\n│   └── RagGuard, ResponseGuard, BudgetGuard, Reflection\n├── Cache                                # Application-level response caching\n│   └── Response     # Async delivery via memcached with spool overflow at 8MB\n├── Skills                               # Daemon-side skill execution subsystem\n│   └── Base, Registry, Steps::SkillInjector\n└── Helper           # Extension helper mixin (llm_chat, llm_embed, llm_session, compress:)\n```\n\n## Usage in Extensions\n\nAny LEX extension can use LLM capabilities. The gem provides helper methods that are auto-loaded when legion-llm is present.\n\n### Basic Extension Usage\n\n```ruby\nmodule Legion::Extensions::MyLex::Runners\n  module Analyzer\n    def analyze(text:, **_opts)\n      chat = Legion::LLM.chat\n      response = chat.ask(\"Analyze this: #{text}\")\n      { analysis: response.content }\n    end\n  end\nend\n```\n\n### Declaring LLM as Required\n\nExtensions that cannot function without LLM should declare the dependency. 
Legion will skip loading the extension if LLM is not available:\n\n```ruby\nmodule Legion::Extensions::MyLex\n  def self.llm_required?\n    true\n  end\nend\n```\n\n### Helper Methods\n\nInclude the LLM helper for convenience methods in any runner:\n\n```ruby\n# One-shot chat\nresult = llm_chat(\"Summarize this text\", instructions: \"Be concise\")\n\n# Chat with tools\nresult = llm_chat(\"Check the weather\", tools: [WeatherLookup])\n\n# With prompt compression (reduces input tokens for cost/speed)\nresult = llm_chat(\"Summarize the data\", instructions: \"Be concise\", compress: 2)\n\n# Embeddings\nembedding = llm_embed(\"some text to embed\")\n\n```\n\n### Inference Pipeline\n\n`Legion::LLM.chat` calls that include `message:` or `messages:` flow through `Legion::LLM::Inference`, an 18-step request/response pipeline. The pipeline handles RBAC, classification, RAG context retrieval, MCP tool discovery, metering, billing, audit, and GAIA advisory in a consistent sequence. Steps are skipped based on the caller profile (`:external`, `:gaia`, `:system`).\n\n```ruby\n# Request-shaped calls enter the pipeline\nresult = Legion::LLM.chat(message: \"hello\")\n\n# Session creation does not\nsession = Legion::LLM.chat(model: \"gpt-4o\")\n```\n\nThe pipeline accepts a `caller:` hash describing the request origin:\n\n```ruby\nLegion::LLM.chat(\n  message: \"hello\",\n  caller: { requested_by: { identity: \"user@example.com\", type: :human, credential: :jwt } }\n)\n```\n\nSystem callers (type: `:system`) derive the `:system` profile, which skips governance steps to prevent recursion.\n\n### Routing\n\nlegion-llm includes a dynamic weighted routing engine that dispatches requests across local, fleet, OpenAI-compatible, cloud, and frontier tiers based on caller intent, priority rules, time schedules, cost multipliers, and real-time provider health. 
Routing is enabled by default; set `routing.enabled: false` to bypass routing and call the configured provider directly.\n\n#### Routing Tiers\n\n```\n┌─────────────────────────────────────────────────────────┐\n│              Legion::LLM Router (per-node)               │\n│                                                          │\n│  Tier 1: LOCAL  → Ollama on this machine (direct HTTP)   │\n│          Zero network overhead, no Transport              │\n│                                                          │\n│  Tier 2: FLEET  → provider-owned lex-llm-* responders     │\n│          Shared lex-llm fleet envelopes over AMQP         │\n│                                                          │\n│  Tier 3: CLOUD  → Bedrock / Azure / Gemini               │\n│  Tier 4: FRONTIER → Anthropic / OpenAI                   │\n│          Existing provider API calls                     │\n└─────────────────────────────────────────────────────────┘\n```\n\n| Tier | Target | Use Case |\n|------|--------|----------|\n| `local` | Ollama on localhost | Privacy-sensitive, offline, or low-latency workloads |\n| `fleet` | Shared hardware via provider-owned lex-llm responders over AMQP | Larger vLLM/Ollama models on dedicated GPU servers |\n| `openai_compat` | OpenAI-compatible provider instances | Self-hosted or proxy endpoints with OpenAI-compatible APIs |\n| `cloud` | API providers (Bedrock, Azure, Gemini) | Managed cloud inference |\n| `frontier` | API providers (Anthropic, OpenAI) | Frontier models, full-capability inference |\n\nFleet dispatch is built into `legion-llm`, but fleet consumption is provider-owned. 
`Fleet::Dispatcher` publishes shared `lex-llm` protocol-v2 `FleetRequest` envelopes to keys such as `llm.fleet.inference.qwen3-6-27b.ctx32000` or `llm.fleet.embed.nomic-embed-text`; the enabled provider gem actor consumes the request, validates the signed token and idempotency key through `Legion::Extensions::Llm::Fleet::ProviderResponder`, calls its local provider instance through the canonical `lex-llm` provider methods, and replies with shared `FleetResponse` or `FleetError` envelopes. Keep `routing.tiers.fleet.routing_style` set to `shared_lane` for the default pooled model lanes, or set it to `offering_lane` for exact provider-instance lanes such as `llm.fleet.offering.vllm-gpu-01.qwen3-6.inference`.\n\n#### Intent-Based Dispatch\n\nPass an `intent:` hash to route based on privacy, capability, or cost requirements:\n\n```ruby\n# Route to local tier for strict privacy\nresult = llm_chat(\"Summarize this PII data\", intent: { privacy: :strict })\n\n# Route to cloud for reasoning tasks\nresult = llm_chat(\"Solve this proof\", intent: { capability: :reasoning })\n\n# Minimize cost — prefers local/fleet over cloud\nresult = llm_chat(\"Translate this\", intent: { cost: :minimize })\n\n# Explicit tier override (bypasses rules)\nresult = llm_chat(\"Translate this\", tier: :cloud, model: \"claude-sonnet-4-6\")\n```\n\nSame parameters work on `Legion::LLM.chat` and `llm_session`:\n\n```ruby\nchat = Legion::LLM.chat(intent: { privacy: :strict, capability: :basic })\nsession = llm_session(tier: :local)\n```\n\n#### Intent Dimensions\n\n| Dimension | Values | Default | Effect |\n|-----------|--------|---------|--------|\n| `privacy` | `:strict`, `:normal` | `:normal` | `:strict` -\u003e never cloud (via constraint rules) |\n| `capability` | `:basic`, `:moderate`, `:reasoning` | `:moderate` | Higher prefers larger/cloud models |\n| `cost` | `:minimize`, `:normal` | `:normal` | `:minimize` prefers local/fleet |\n\n#### Routing Resolution\n\n```\n1. 
Caller passes intent: { privacy: :strict, capability: :basic }\n2. Router merges with default_intent (fills missing dimensions)\n3. Load rules from settings, filter by:\n   a. Intent match (all `when` conditions must match)\n   b. Schedule window (valid_from/valid_until, hours, days)\n   c. Constraints (e.g., never_cloud strips cloud-tier rules)\n   d. Discovery (Ollama model pulled? Model fits in available RAM?)\n   e. Tier availability (is Ollama running? is Transport loaded?)\n4. Score remaining candidates:\n   effective_priority = rule.priority\n                      + health_tracker.adjustment(provider)\n                      + (1.0 - cost_multiplier) * 10\n5. Return Resolution for highest-scoring candidate\n```\n\n#### Settings\n\nAdd routing configuration under the `llm` key:\n\n```json\n{\n  \"llm\": {\n    \"routing\": {\n      \"enabled\": true,\n      \"default_intent\": { \"privacy\": \"normal\", \"capability\": \"moderate\", \"cost\": \"normal\" },\n      \"tiers\": {\n        \"local\": { \"provider\": \"ollama\" },\n        \"fleet\": {\n          \"queue\": \"llm.fleet\",\n          \"routing_style\": \"shared_lane\",\n          \"timeout_seconds\": 30,\n          \"timeouts\": { \"embed\": 10, \"chat\": 30, \"generate\": 30, \"default\": 30 }\n        },\n        \"openai_compat\": { \"providers\": [\"openai\"] },\n        \"cloud\": { \"providers\": [\"bedrock\", \"azure\", \"gemini\"] },\n        \"frontier\": { \"providers\": [\"anthropic\", \"openai\"] }\n      },\n      \"health\": {\n        \"window_seconds\": 300,\n        \"circuit_breaker\": { \"failure_threshold\": 3, \"cooldown_seconds\": 60 },\n        \"latency_penalty_threshold_ms\": 5000\n      },\n      \"rules\": [\n        {\n          \"name\": \"privacy_local\",\n          \"when\": { \"privacy\": \"strict\" },\n          \"then\": { \"tier\": \"local\", \"provider\": \"ollama\", \"model\": \"llama3\" },\n          \"priority\": 100,\n          \"constraint\": \"never_cloud\"\n 
       },\n        {\n          \"name\": \"reasoning_cloud\",\n          \"when\": { \"capability\": \"reasoning\" },\n          \"then\": { \"tier\": \"cloud\", \"provider\": \"bedrock\", \"model\": \"us.anthropic.claude-sonnet-4-6-v1\" },\n          \"priority\": 50,\n          \"cost_multiplier\": 1.0\n        },\n        {\n          \"name\": \"anthropic_promo\",\n          \"when\": { \"cost\": \"normal\" },\n          \"then\": { \"tier\": \"cloud\", \"provider\": \"anthropic\", \"model\": \"claude-sonnet-4-6\" },\n          \"priority\": 60,\n          \"cost_multiplier\": 0.5,\n          \"schedule\": {\n            \"valid_from\": \"2026-03-15T00:00:00\",\n            \"valid_until\": \"2026-03-29T23:59:59\",\n            \"hours\": [\"00:00-06:00\", \"18:00-23:59\"]\n          },\n          \"note\": \"Double token promotion — off-peak hours only\"\n        }\n      ]\n    }\n  }\n}\n```\n\n#### Routing Rules\n\nEach rule is a hash with:\n\n| Field | Type | Required | Description |\n|-------|------|----------|-------------|\n| `name` | String | Yes | Unique rule identifier |\n| `when` | Hash | Yes | Intent conditions to match (`privacy`, `capability`, `cost`) |\n| `then` | Hash | No | Target: `{ tier:, provider:, model: }` |\n| `priority` | Integer | No (default 0) | Higher wins when multiple rules match |\n| `constraint` | String | No | Hard constraint (e.g., `never_cloud`) |\n| `fallback` | String | No | Fallback tier if primary is unavailable |\n| `cost_multiplier` | Float | No (default 1.0) | Lower = cheaper = routing bonus |\n| `schedule` | Hash | No | Time-based activation window |\n| `note` | String | No | Human-readable note |\n\n#### Health Tracking\n\nThe `HealthTracker` adjusts effective priorities at runtime based on provider health signals:\n\n- **Circuit breaker**: After consecutive failures, a provider's circuit opens (penalty: -50) then transitions to half_open (penalty: -25) after a cooldown period\n- **Latency penalty**: Rolling window 
tracks average latency; providers above threshold receive priority penalties\n- **Pluggable signals**: Any LEX can feed custom signals (e.g., GPU utilization, budget tracking) via `register_handler`\n\n```ruby\n# Report signals (typically called by LEX extensions)\ntracker = Legion::LLM::Router.health_tracker\ntracker.report(provider: :anthropic, signal: :error, value: 1)\ntracker.report(provider: :ollama, signal: :latency, value: 1200)\n\n# Check state\ntracker.circuit_state(:anthropic)  # -\u003e :closed, :open, or :half_open\ntracker.adjustment(:anthropic)     # -\u003e Integer (priority offset)\n\n# Add custom signal handler\ntracker.register_handler(:gpu_utilization) { |data| ... }\n```\n\nWhen routing is disabled, `chat`, `llm_chat`, and `llm_session` bypass route resolution and behave like direct provider calls.\n\n#### Local Model Discovery\n\nWhen the Ollama provider is enabled, legion-llm discovers which models are actually pulled and checks available system memory before routing to local models. When vLLM is enabled, legion-llm discovers `/v1/models`, records `max_model_len` as the model context window, and checks `/health` for provider availability. This prevents the router from selecting models that are not installed, unhealthy, or too large for the requested context.\n\nDiscovery uses lazy TTL-based caching (default: 60 seconds). 
At startup, caches are warmed and logged:\n\n```\nOllama: 3 models available (llama3.1:8b, qwen2.5:32b, nomic-embed-text)\nvLLM: 1 model available (qwen3.6-27b ctx=32000)\nSystem: 65536 MB total, 42000 MB available\n```\n\nConfigure under `discovery`:\n\n```json\n{\n  \"llm\": {\n    \"discovery\": {\n      \"enabled\": true,\n      \"refresh_seconds\": 60,\n      \"memory_floor_mb\": 2048\n    }\n  }\n}\n```\n\n| Key | Type | Default | Description |\n|-----|------|---------|-------------|\n| `enabled` | Boolean | `true` | Master switch for discovery checks |\n| `refresh_seconds` | Integer | `60` | TTL for discovery caches |\n| `memory_floor_mb` | Integer | `2048` | Minimum free MB to reserve for OS |\n\nWhen a routing rule targets a local Ollama model that isn't pulled or won't fit in available memory (minus `memory_floor_mb`), the rule is silently skipped and the next best candidate is used. If discovery fails (Ollama not running, unknown OS), checks are bypassed permissively.\n\n### Model Escalation\n\nWhen an LLM call fails (API error, timeout, or quality issue), the escalation system automatically retries with more capable models. If all attempts fail, `Legion::LLM::EscalationExhausted` is raised.\n\n```ruby\n# Enable escalation and ask in one call\nresponse = Legion::LLM.chat(\n  message: \"Generate a SQL query for user analytics\",\n  escalate: true,\n  max_escalations: 3,\n  quality_check: -\u003e(r) { r.content.include?('SELECT') }\n)\n\n# Check if escalation occurred (true only when more than one attempt was made)\nresponse.escalated?          
# =\u003e true if \u003e1 attempt was made\nresponse.escalation_history  # =\u003e [{model:, provider:, tier:, outcome:, failures:, duration_ms:}, ...]\nresponse.final_resolution    # =\u003e Resolution that succeeded\nresponse.escalation_chain    # =\u003e EscalationChain used for this call\n```\n\nConfigure globally in settings:\n\n```yaml\nllm:\n  routing:\n    escalation:\n      enabled: true\n      max_attempts: 3\n      quality_threshold: 50\n```\n\n### Prompt Compression\n\n`Legion::LLM::Context::Compressor` strips low-signal words from prompts before sending to the API, reducing input token count and cost. Compression is deterministic (same input always produces the same output), preserving prompt caching compatibility.\n\n#### Levels\n\n| Level | Name | What It Removes |\n|-------|------|-----------------|\n| 0 | None | Nothing |\n| 1 | Light | Articles (a, an, the), filler adverbs (just, very, really, basically, ...) |\n| 2 | Moderate | + sentence connectives (however, moreover, furthermore, ...) |\n| 3 | Aggressive | + low-signal words (also, then, please, note, that, ...) + whitespace normalization |\n\nCode blocks (fenced and inline) are never modified. 
Negation words are never removed.\n\n#### Usage\n\n```ruby\n# Direct API\ntext = Legion::LLM::Context::Compressor.compress(\"The very important system prompt\", level: 2)\n\n# Via llm_chat helper (compresses both message and instructions)\nresult = llm_chat(\"Analyze the data\", instructions: \"Be very concise\", compress: 2)\n```\n\n#### Router Integration\n\nRouting rules can specify `compress_level` in their target to auto-compress for cost-sensitive tiers:\n\n```json\n{\n  \"name\": \"cloud_compressed\",\n  \"priority\": 50,\n  \"when\": { \"capability\": \"chat\" },\n  \"then\": { \"tier\": \"cloud\", \"provider\": \"bedrock\", \"model\": \"claude-sonnet-4-6\", \"compress_level\": 2 }\n}\n```\n\n### Building an LLM-Powered LEX\n\nA complete example of a LEX extension that uses LLM for intelligent processing:\n\n```ruby\n# lib/legion/extensions/smart_alerts/runners/evaluate.rb\nmodule Legion::Extensions::SmartAlerts::Runners\n  module Evaluate\n    def evaluate(alert_data:, **_opts)\n      session = llm_session(model: 'us.anthropic.claude-sonnet-4-6-v1')\n      session.with_instructions(\u003c\u003c~PROMPT)\n        You are an alert triage system. Given alert data, determine:\n        1. Severity (critical, warning, info)\n        2. Whether it requires immediate human attention\n        3. Suggested remediation steps\n      PROMPT\n\n      result = session.ask(\"Evaluate this alert: #{alert_data.to_json}\")\n\n      {\n        evaluation: result.content,\n        timestamp: Time.now.utc,\n        model: 'us.anthropic.claude-sonnet-4-6-v1'\n      }\n    end\n  end\nend\n```\n\n## Backward Compatibility\n\nv0.8.0 reorganizes modules extensively. 
All old constant names continue to work via `lib/legion/llm/compat.rb`, which uses `const_missing` to resolve old names to new locations and emits a deprecation warning on first access.\n\nKey aliases:\n\n| Old Name | New Name |\n|----------|----------|\n| `Legion::LLM::Pipeline` | `Legion::LLM::Inference` |\n| `Legion::LLM::ConversationStore` | `Legion::LLM::Inference::Conversation` |\n| `Legion::LLM::NativeDispatch` | `Legion::LLM::Call::Dispatch` |\n| `Legion::LLM::ProviderRegistry` | `Legion::LLM::Call::Registry` |\n| `Legion::LLM::CostEstimator` | `Legion::LLM::Metering::Pricing` |\n| `Legion::LLM::CostTracker` | `Legion::LLM::Metering::Recorder` |\n| `Legion::LLM::TokenTracker` | `Legion::LLM::Metering::Tokens` |\n| `Legion::LLM::QualityChecker` | `Legion::LLM::Quality::Checker` |\n| `Legion::LLM::Compressor` | `Legion::LLM::Context::Compressor` |\n| `Legion::LLM::ResponseCache` | `Legion::LLM::Cache::Response` |\n| `Legion::LLM::DaemonClient` | `Legion::LLM::Call::DaemonClient` |\n| `Legion::LLM::ShadowEval` | `Legion::LLM::Quality::ShadowEval` |\n\nConsumers do not need to change any code immediately. 
The aliases will be maintained for at least one major version.\n\n## Providers\n\n| Provider | Config Key | Credential Source | Notes |\n|----------|-----------|-------------------|-------|\n| AWS Bedrock | `bedrock` | `vault://`, `env://`, or direct | Default region: us-east-2, SigV4 or Bearer Token auth |\n| Anthropic | `anthropic` | `vault://`, `env://`, or direct | Direct API access |\n| OpenAI | `openai` | `vault://`, `env://`, or direct | GPT models |\n| Google Gemini | `gemini` | `vault://`, `env://`, or direct | Gemini models |\n| Azure AI | `azure` | `vault://`, `env://`, or direct | Azure OpenAI endpoint; `api_base` + `api_key` or `auth_token` |\n| Ollama | `ollama` | Local, no credentials needed | Local inference and embeddings |\n| vLLM | `vllm` | Optional API key | OpenAI-compatible local/fleet inference with `/health` and `/v1/models` discovery |\n| MLX | `mlx` | Optional API key | Local Apple Silicon inference through lex-llm provider adapters |\n\n`env://NAME` credential placeholders resolve at provider configuration time, including array fallbacks such as `[\"env://OPENAI_API_KEY\", \"env://CODEX_API_KEY\"]`. Unresolved placeholders do not auto-enable hosted providers.\n\n## Integration with LegionIO\n\nlegion-llm follows the standard core gem lifecycle:\n\n```\nLegion::Service#initialize\n  ...\n  setup_data           # Legion::Data\n  setup_llm            # Legion::LLM  \u003c-- here\n  setup_supervision    # Legion::Supervision\n  load_extensions      # LEX extensions (can use LLM if available)\n```\n\nLegionIO hosts these routes through `mount_library_routes('llm', Routes::Llm, 'Legion::LLM::Routes')`. 
The route modules remain owned by `legion-llm`; LegionIO no longer registers provider gateway fallback routes when the library is available.\n\n- **Service**: `setup_llm` called between data and supervision in startup sequence\n- **Extensions**: `llm_required?` method on extension module, checked at load time\n- **Helpers**: `Legion::Extensions::Helpers::LLM` auto-loaded when gem is present\n- **Readiness**: Registers as `:llm` in `Legion::Readiness`\n- **Shutdown**: `Legion::LLM.shutdown` called during service shutdown (reverse order)\n\n## Development\n\n```bash\ngit clone https://github.com/LegionIO/legion-llm.git\ncd legion-llm\nbundle install\nbundle exec rspec --format json --out tmp/rspec_results.json --format progress --out tmp/rspec_progress.txt\n```\n\n### Running Tests\n\nTests run against real `Legion::Logging` and `Legion::Settings` implementations (hard dependencies, never stubbed). Each test resets settings to defaults via `before(:each)`. No full LegionIO stack required.\n\n```bash\nbundle exec rspec --format json --out tmp/rspec_results.json --format progress --out tmp/rspec_progress.txt\nbundle exec rubocop -A\n```\n\n## Dependencies\n\n| Gem | Purpose |\n|-----|---------|\n| `concurrent-ruby` | Thread-safe primitives for routing and fleet coordination |\n| `faraday` | HTTP client for provider and API calls |\n| `legion-cache` | Shared and local cache integration |\n| `legion-json` | Legion JSON serialization |\n| `legion-logging` | Logging |\n| `legion-settings` | Configuration defaults and file overrides |\n| `legion-transport` (\u003e= 1.4.14) | AMQP transport for fleet dispatch, metering, and audit |\n| `lex-knowledge` | Optional knowledge chunking integration when loaded |\n| `lex-llm` (\u003e= 0.4.3) | Provider-neutral contract, model offerings, response normalization, fleet envelopes, and responder-side fleet execution helpers |\n| `pdf-reader` | PDF extraction support |\n| `tzinfo` (\u003e= 2.0) | IANA timezone conversion for schedule 
windows |\n\n## License\n\nApache-2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flegionio%2Flegion-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flegionio%2Flegion-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flegionio%2Flegion-llm/lists"}