https://github.com/kckempf/yallmap

An OpenTelemetry-instrumented gateway for Anthropic-compatible LLMs
Host: GitHub
URL: https://github.com/kckempf/yallmap
Owner: kckempf
License: mit
Created: 2026-05-27T21:16:33.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2026-06-04T21:11:04.000Z (about 1 month ago)
Last Synced: 2026-06-04T22:15:01.284Z (about 1 month ago)
Topics: ai-gateway, anthropic, claude, claude-code, langfuse, llm-gateway, llm-observability, llm-proxy, ollama, opentelemetry, otel, typescript
Language: TypeScript
Homepage: http://grokkist.com
Size: 527 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md

README

# yallmap

![CI](https://github.com/kckempf/yallmap/actions/workflows/ci.yml/badge.svg)

![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)

![Node](https://img.shields.io/badge/node-%3E%3D20-blue)

An OpenTelemetry-instrumented gateway for Anthropic-compatible LLMs. Drop it in front of [Claude Code](https://claude.ai/code) or any Anthropic SDK client to get per-request token tracking, cost attribution, and latency observability in [Langfuse](https://langfuse.com), with no client changes required.

## Why this exists

LiteLLM, Helicone, and Portkey already exist, which makes this Yet Another LLM Proxy (YALLMAP). It occupies a different point in the design space: TypeScript-native, Anthropic API-first (not OpenAI-shaped), with routing rules expressed as typed functions instead of YAML or CEL. It is built around Claude Code as the primary client and optimized for streaming and tool use. It's the LLM proxy I need, so I built it and shared it, since I can't be the only one who works this way.

## Try it in 2 minutes (if you have Claude Code installed)

```bash
git clone https://github.com/kckempf/yallmap && cd yallmap
npm install
npm run dev                          # starts on :3001

# in another shell:
ANTHROPIC_BASE_URL=http://localhost:3001 claude
```

Langfuse is **optional**: the gateway works without it, you just lose the telemetry half. The telemetry exporter warns on startup if `OTEL_EXPORTER_OTLP_ENDPOINT` is unset, then continues running normally.

## Status

**v0.8** — Production hardening. See roadmap below.

## How it works

```text
Claude Code ──► yallmap :3001 ──► api.anthropic.com
                     │          └───► Ollama (ollama/* models)
                     │
                     └──► Langfuse (via OTLP)
                          gen_ai.system
                          gen_ai.request.model
                          gen_ai.usage.input_tokens
                          gen_ai.usage.output_tokens
                          gen_ai.response.finish_reasons
```

Every request to `POST /v1/messages` is routed to the appropriate provider based on TypeScript routing rules. SSE streaming is piped through without buffering. A transform stream reads SSE events in flight to extract token usage, which is emitted on a `gen_ai.request` span when the response completes.

Ollama requests are automatically translated between the Anthropic Messages API format and Ollama's OpenAI-compatible API, so the client always speaks Anthropic.
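The in-flight usage extraction described above could look roughly like the sketch below. It is not yallmap's actual code: `UsageAccumulator` and `createUsageTap` are hypothetical names, and the sketch assumes the Anthropic streaming format in which `message_start` carries the input token count and `message_delta` carries the cumulative output token count and the stop reason.

```typescript
import { Transform } from 'node:stream';
import type { TransformCallback } from 'node:stream';
import { StringDecoder } from 'node:string_decoder';

interface Usage {
  inputTokens: number;
  outputTokens: number;
  stopReason: string | null;
}

// Hypothetical helper: accumulates token usage from Anthropic-style SSE text.
class UsageAccumulator {
  readonly usage: Usage = { inputTokens: 0, outputTokens: 0, stopReason: null };
  private pending = '';

  // Feed decoded text. Complete lines are parsed; a trailing partial line is kept.
  push(text: string): void {
    this.pending += text;
    const lines = this.pending.split('\n');
    this.pending = lines.pop() ?? '';
    for (const line of lines) this.readLine(line);
  }

  // Parse whatever is left once the stream has ended.
  end(): void {
    if (this.pending) this.readLine(this.pending);
    this.pending = '';
  }

  private readLine(line: string): void {
    if (!line.startsWith('data:')) return;
    let event: any;
    try {
      event = JSON.parse(line.slice(5).trim());
    } catch {
      return; // not JSON, ignore
    }
    if (typeof event !== 'object' || event === null) return;
    if (event.type === 'message_start') {
      this.usage.inputTokens = event.message?.usage?.input_tokens ?? 0;
      this.usage.outputTokens = event.message?.usage?.output_tokens ?? 0;
    } else if (event.type === 'message_delta') {
      this.usage.outputTokens = event.usage?.output_tokens ?? this.usage.outputTokens;
      this.usage.stopReason = event.delta?.stop_reason ?? this.usage.stopReason;
    }
  }
}

// Pass-through Transform: forwards every chunk unchanged and taps it for usage.
function createUsageTap(acc: UsageAccumulator): Transform {
  const decoder = new StringDecoder('utf8');
  return new Transform({
    transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback): void {
      acc.push(decoder.write(chunk));
      callback(null, chunk);
    },
    flush(callback: TransformCallback): void {
      acc.push(decoder.end());
      acc.end();
      callback();
    },
  });
}
```

Because the chunk is handed to `callback` untouched, the client sees the same bytes it would have received directly; only a partial trailing line is held back for parsing.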

## Prerequisites

- Node.js 20+
- A running [Langfuse](https://langfuse.com/docs/deployment/self-host) instance (optional; only needed for the telemetry half. Docker Compose quickstart: `docker compose up -d` from the Langfuse repo)
- [Ollama](https://ollama.ai) (optional; only needed for `ollama/*` model routing)

## Setup

```bash
npm install
cp .env.example .env
```

Edit `.env`:

```env
# Assuming Langfuse is running on port 3000
PORT=3001

# Langfuse OTLP endpoint
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:3000/api/public/otel/v1/traces

# Langfuse project keys (Settings → API Keys)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...

# Optional overrides (defaults shown)
# ANTHROPIC_BASE_URL=https://api.anthropic.com
# OLLAMA_BASE_URL=http://localhost:11434
```

## Running

```bash
# Development (watch mode, loads .env)
npm run dev

# Production
npm run build
npm start
```

## Pointing Claude Code at the gateway

```bash
ANTHROPIC_BASE_URL=http://localhost:3001 claude
```

Or export it in your shell profile to make it permanent.

## Authentication

The gateway supports a multi-key allowlist with per-key identity. Configure clients via the `GATEWAY_API_KEYS` environment variable as comma-separated `label:secret` pairs:

```env
GATEWAY_API_KEYS=alice:abc123,bob:def456,ci:ghj789
```

Clients send their secret in the `x-gateway-key` header (separate from Anthropic's `x-api-key`, which is forwarded upstream untouched):

```bash
curl -X POST http://localhost:3001/v1/messages \
  -H 'x-gateway-key: abc123' \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H 'anthropic-version: 2023-06-01' \
  -H 'content-type: application/json' \
  -d '{"model":"claude-sonnet-4-6","max_tokens":100,"messages":[{"role":"user","content":"hi"}]}'
```

From the Anthropic SDK, pass the header via `defaultHeaders`:

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic({
  baseURL: 'http://localhost:3001',
  defaultHeaders: { 'x-gateway-key': process.env.GATEWAY_KEY },
});
```

The authenticated `keyId` (the label half of the pair) propagates downstream:

- **Request logs** — appears as `keyId` in the structured JSON log line.

- **Rate limiting** — the default `rateLimit` keys on `keyId` when set.

- **Langfuse traces** — emitted as the `user.id` span attribute, surfaced as the

  user filter in the Langfuse UI.

When `GATEWAY_API_KEYS` is unset, the gateway runs **unauthenticated**. This keeps the two-minute quickstart frictionless but is unsafe for any deployment with a public endpoint. A warning is logged on startup if `NODE_ENV=production` and no keys are configured.
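As an illustration of the `label:secret` format, the sketch below parses a `GATEWAY_API_KEYS` value into a lookup from secret to label. The function names are hypothetical and the parsing rules (split on the first colon, tolerate a trailing comma, reject malformed entries) are assumptions, not yallmap's documented behaviour.

```typescript
// Hypothetical parser for the GATEWAY_API_KEYS format described above.
// Returns a map from secret to label (the keyId).
function parseGatewayKeys(raw: string | undefined): Map<string, string> {
  const labelBySecret = new Map<string, string>();
  if (!raw) return labelBySecret; // unset: no keys configured
  for (const entry of raw.split(',')) {
    const pair = entry.trim();
    if (pair === '') continue; // tolerate a trailing comma (assumption)
    const sep = pair.indexOf(':');
    if (sep <= 0 || sep === pair.length - 1) {
      throw new Error('Malformed GATEWAY_API_KEYS entry; expected label:secret');
    }
    labelBySecret.set(pair.slice(sep + 1), pair.slice(0, sep));
  }
  return labelBySecret;
}

// Resolve an x-gateway-key header value to a keyId, or null if missing or unknown.
function resolveKeyId(keys: Map<string, string>, headerValue: string | undefined): string | null {
  if (headerValue === undefined) return null;
  return keys.get(headerValue) ?? null;
}
```

With the example value above, `resolveKeyId(parseGatewayKeys('alice:abc123,bob:def456,ci:ghj789'), 'abc123')` resolves to `'alice'`. A production implementation would likely also want a constant-time comparison of secrets.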

## Middleware

Middleware runs before every upstream call. It can inspect or modify the request, reject it early, or observe the response. Middleware is configured in `src/middleware/config.ts`.

```typescript
import { costGuard, rateLimit, piiRedactor } from './index';
import type { MiddlewareFn } from './types';

export const middlewares: MiddlewareFn[] = [
  costGuard(0.10),                                 // reject if worst-case cost > $0.10
  rateLimit({ requests: 100, windowMs: 60_000 }),  // 100 req/min per API key
  piiRedactor([/\b\d{3}-\d{2}-\d{4}\b/g]),         // redact SSNs from messages
];
```

### Built-in middleware

| Factory | Description |
| --- | --- |
| `costGuard(limitUsd)` | Rejects with 429 when worst-case cost (model × max_tokens) exceeds `limitUsd`. Uses the built-in pricing table; unknown models pass through. |
| `apiKeyAuth({ keys, headerName? })` | Allowlist authentication. Rejects with 401 if `x-gateway-key` is missing or unknown. On success, sets `ctx.auth.keyId` for downstream middleware. See [Authentication](#authentication). |
| `rateLimit({ requests, windowMs, keyFn? })` | In-memory fixed-window counter. Keys on `ctx.auth.keyId` when present, otherwise `x-api-key`. Override with `keyFn`. **State is per-process and resets on restart — do not deploy behind a load balancer without a shared store.** |
| `piiRedactor(patterns, replacement?)` | Regex-replaces matches in message `text` content blocks before forwarding. |
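To make the "in-memory fixed-window counter" concrete, here is a minimal sketch of that technique. `FixedWindowLimiter` is a hypothetical class, not yallmap's implementation, and it assumes each key's window starts at that key's first request rather than on a wall-clock boundary.

```typescript
// Hypothetical fixed-window limiter: at most `requests` calls per key per window.
class FixedWindowLimiter {
  private readonly windows = new Map<string, { start: number; count: number }>();
  private readonly requests: number;
  private readonly windowMs: number;

  constructor(requests: number, windowMs: number) {
    this.requests = requests;
    this.windowMs = windowMs;
  }

  // Returns true if the call is allowed; `now` is injectable for testing.
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || now - current.start >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= this.requests) return false;
    current.count += 1;
    return true;
  }
}
```

The `Map` lives in process memory, which is why state of this kind is lost on restart and is not shared between instances behind a load balancer.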

### Writing custom middleware

Middleware is a `(ctx, next) => Promise<Response>` function:

```typescript
import type { MiddlewareFn } from './types';

const myMiddleware: MiddlewareFn = async (ctx, next) => {
  // inspect: ctx.model, ctx.maxTokens, ctx.body, ctx.clientHeaders
  if (ctx.model.startsWith('claude-opus')) {
    return new Response(JSON.stringify({ type: 'error', error: { type: 'forbidden', message: 'Opus not allowed' } }), {
      status: 403, headers: { 'content-type': 'application/json' },
    });
  }
  return next();  // or: const res = await next(); then inspect res
};
```
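For intuition about how a list of `(ctx, next)` functions becomes a single handler, the sketch below shows one common way to compose such a chain. It is an assumption about the general technique, not yallmap's source; `compose`, `ToyCtx`, and the numeric result type are hypothetical stand-ins for the real context and `Response`.

```typescript
type Next<R> = () => Promise<R>;
type Middleware<C, R> = (ctx: C, next: Next<R>) => Promise<R>;

// Compose middleware left to right; `terminal` stands in for the upstream call.
function compose<C, R>(
  middlewares: ReadonlyArray<Middleware<C, R>>,
  terminal: (ctx: C) => Promise<R>,
): (ctx: C) => Promise<R> {
  return (ctx: C): Promise<R> => {
    const dispatch = (i: number): Promise<R> => {
      const mw = middlewares[i];
      // Past the end of the list: run the terminal handler.
      return mw === undefined ? terminal(ctx) : mw(ctx, () => dispatch(i + 1));
    };
    return dispatch(0);
  };
}

// Toy usage: the result is a status code instead of a Response.
interface ToyCtx {
  model: string;
  trail: string[];
}

const blockOpus: Middleware<ToyCtx, number> = async (ctx, next) => {
  ctx.trail.push('blockOpus');
  return ctx.model.startsWith('claude-opus') ? 403 : next(); // short-circuit or continue
};

const observe: Middleware<ToyCtx, number> = async (ctx, next) => {
  ctx.trail.push('observe');
  const status = await next(); // inspect the result after the upstream call
  return status;
};

const handle = compose<ToyCtx, number>([blockOpus, observe], async (ctx) => {
  ctx.trail.push('upstream');
  return 200;
});
```

A middleware that returns without calling `next()` stops the chain, so later middleware and the upstream call never run for that request.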

## Routing

Routing rules live in `src/routing/config.ts`. Rules are TypeScript functions: no YAML, no DSL.

```typescript
import { firstMatch, whenModel, chain, anthropic, ollama } from './index';

export const router = firstMatch([
  // Route ollama/* models to local Ollama, fall back to Anthropic if unavailable
  whenModel(/^ollama\//i, chain(ollama, anthropic)),
]);
```

### Helpers

| Helper | Description |
| --- | --- |
| `whenModel(pattern, provider)` | Match on model name (string or regex) |
| `chain(p1, p2, ...)` | Try providers left-to-right; fall back on 5xx or network error |
| `firstMatch(rules, fallback?)` | Evaluate rules top-to-bottom; first match wins |

### Fallback behaviour

When a provider list is returned (via `chain`), the proxy tries each in order:

- **429 / 503 / 529** — retry the same provider with exponential backoff (see [Retries](#retries))

- **Other 5xx** — drain the body, try the next provider immediately

- **Network error** — try the next provider immediately

- **4xx** — forward to the client immediately (no retry)

- **All providers exhausted** — return 502
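The status handling in the list above can be summarised as a small classification function. This is a sketch of the decision table only; `classifyStatus` and the action names are hypothetical, and network errors (which have no status code) would map to trying the next provider.

```typescript
type UpstreamAction = 'retry-same-provider' | 'try-next-provider' | 'forward-to-client';

// Hypothetical classifier mirroring the fallback rules listed above.
function classifyStatus(status: number): UpstreamAction {
  // Retryable statuses: back off and retry the same provider first.
  if (status === 429 || status === 503 || status === 529) return 'retry-same-provider';
  // Any other 5xx: give up on this provider and move to the next one.
  if (status >= 500) return 'try-next-provider';
  // Success and 4xx responses go straight back to the client.
  return 'forward-to-client';
}
```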

## Agent sessions

When an agent makes many LLM calls in a loop, the gateway can correlate them into a single session in Langfuse using either of two mechanisms:

### `x-session-id` header — simple loops

Set the same UUID on every call in an agent run. The gateway attaches it as a `session.id` span attribute (standard OTel; also recognised by Langfuse) and strips the header before forwarding upstream.

```typescript
import Anthropic from '@anthropic-ai/sdk';
import { randomUUID } from 'crypto';

const client = new Anthropic({ baseURL: 'http://localhost:3001' });
const sessionId = randomUUID();

// agentSteps is a placeholder for your own list of message-create params.
for (const step of agentSteps) {
  await client.messages.create(step, {
    headers: { 'x-session-id': sessionId },
  });
}
```

### W3C `traceparent` — OTel-instrumented frameworks

If your agent framework (LangChain, CrewAI, custom OTel setup) propagates W3C trace context, the gateway automatically nests its `gen_ai.request` spans as children of the incoming trace. No code changes needed on the client side.
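For readers unfamiliar with the header, the sketch below parses a version `00` `traceparent` value into its four fields. It is illustrative only: `parseTraceparent` is a hypothetical function, it handles only the version `00` layout, and a gateway built on the OpenTelemetry SDK would more plausibly rely on the SDK's own W3C propagator than on hand-written parsing.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags field
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical parser; returns null for anything that is not a valid header.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (match === null) return null;
  const version = match[1] ?? '';
  const traceId = match[2] ?? '';
  const parentId = match[3] ?? '';
  const flags = match[4] ?? '00';
  // Version ff and all-zero ids are invalid per the W3C Trace Context spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

A span created as a child of this context reuses `traceId` and records `parentId` as its parent span id, which is what makes the gateway's span appear nested under the caller's trace.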

## Retries

The proxy retries 429 (rate limited), 503 (service unavailable), and 529 (Anthropic overloaded) on the same provider before falling back to the next one.

**Backoff**: full jitter, `random(0, baseDelay × 2^attempt)`. If the upstream sends a `Retry-After` header (≤ 60 s), that value is used instead.

**Environment variables:**

| Variable | Default | Description |
| --- | --- | --- |
| `MAX_RETRIES` | `3` | Per-provider retry attempts |
| `RETRY_BASE_DELAY_MS` | `1000` | Base delay for backoff (ms) |
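The backoff rule above can be sketched as a single function. Several details here are assumptions rather than documented behaviour: `backoffDelayMs` is a hypothetical name, `attempt` is treated as zero-based, `Retry-After` is taken as a number of seconds, and a `Retry-After` above 60 s is ignored in favour of the jittered delay.

```typescript
// Hypothetical full-jitter backoff: random(0, baseDelay × 2^attempt).
// `random` is injectable so the result can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseDelayMs: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
): number {
  // Honour Retry-After when present and within the 60 s cap.
  if (retryAfterSeconds !== undefined && retryAfterSeconds >= 0 && retryAfterSeconds <= 60) {
    return retryAfterSeconds * 1000;
  }
  return random() * baseDelayMs * 2 ** attempt;
}
```

With the default `RETRY_BASE_DELAY_MS` of 1000, the upper bound of the random delay under these assumptions is 1 s, 2 s, and 4 s for attempts 0, 1, and 2.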

## Upstream timeouts

The gateway uses [`undici`](https://github.com/nodejs/undici) for upstream calls. Two granular timeouts cap how long it waits for an upstream provider. Upstream calls also honor a per-request `AbortSignal`, so a client disconnect cancels the upstream call in flight.

| Variable | Default | Description |
| --- | --- | --- |
| `UPSTREAM_HEADERS_TIMEOUT_MS` | `30000` | Max time to wait for the complete response headers (ms). Mirrors undici's `headersTimeout`. |
| `UPSTREAM_BODY_TIMEOUT_MS` | `300000` | Max idle time between chunks of response body data (ms). Mirrors undici's `bodyTimeout`. |

## Graceful shutdown

On `SIGTERM` or `SIGINT` the gateway:

1. Aborts in-flight upstream calls (the per-request `AbortSignal` is chained to a process-wide shutdown signal).
2. Stops accepting new connections via `server.close()`.
3. If `server.close()` hasn't completed within `SHUTDOWN_TIMEOUT_MS`, forces sockets shut with `server.closeAllConnections()`.
4. Flushes the OpenTelemetry SDK so trailing spans reach Langfuse.
5. Exits.

The default 25 s timeout stays under ECS's default 30 s `SIGKILL` window so the orchestrator sees a clean exit during rolling deploys.

| Variable | Default | Description |
| --- | --- | --- |
| `SHUTDOWN_TIMEOUT_MS` | `25000` | Max drain time before sockets are force-closed (ms). |
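Chaining a per-request `AbortSignal` to a process-wide shutdown signal, as in step 1 above, might be done along these lines. `linkSignals` is a hypothetical helper written for illustration; on recent Node versions the built-in `AbortSignal.any()` serves the same purpose.

```typescript
// Hypothetical helper: returns a signal that aborts as soon as any input signal aborts.
function linkSignals(...signals: AbortSignal[]): AbortSignal {
  const controller = new AbortController();
  for (const signal of signals) {
    if (signal.aborted) {
      // Already aborted: propagate immediately and stop wiring listeners.
      controller.abort(signal.reason);
      break;
    }
    signal.addEventListener('abort', () => controller.abort(signal.reason), { once: true });
  }
  return controller.signal;
}

// Usage sketch: one shutdown controller for the process, one controller per request.
const shutdown = new AbortController();
const request = new AbortController();
const upstreamSignal = linkSignals(request.signal, shutdown.signal);
// upstreamSignal would be passed to the upstream fetch/request call;
// a SIGTERM handler would call shutdown.abort().
```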

## Logging

Request logs are written as structured JSON to stdout, compatible with CloudWatch, Datadog, or any log aggregation tool.

```json
{"level":30,"time":1748470913,"requestId":"a3f7b912","method":"POST","path":"/v1/messages",
 "status":200,"latencyMs":487,"model":"claude-sonnet-4-6","provider":"anthropic",
 "inputTokens":343,"outputTokens":13,"costUsd":0.000224}
```

In development (`NODE_ENV=development`), set `LOG_LEVEL=debug` and logs are formatted with `pino-pretty` for readability.

**Environment variables:**

| Variable | Default | Description |
| --- | --- | --- |
| `LOG_LEVEL` | `info` | `trace` \| `debug` \| `info` \| `warn` \| `error` \| `fatal` |
| `CAPTURE_CONTENT` | _(unset)_ | Set to `true` to record prompt and completion in Langfuse traces (`gen_ai.prompt` / `gen_ai.completion` span attributes). Off by default, so message content stays out of telemetry. |

## Cost tracking

The `gen_ai.usage.cost_usd` span attribute is set on every non-streaming response where the model is in the pricing table. Cost also appears in the request log as `costUsd`.
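The arithmetic behind a per-request cost figure is a lookup plus a weighted sum. The sketch below assumes prices are stored in USD per million tokens; the `ModelPrice` shape, the `estimateCostUsd` name, and the prices themselves are made up for illustration and do not reflect yallmap's generated pricing file or any real model's rates.

```typescript
interface ModelPrice {
  inputUsdPerMTok: number;  // USD per million input tokens
  outputUsdPerMTok: number; // USD per million output tokens
}

// Illustrative prices only, for a model name that does not exist.
const EXAMPLE_PRICES: Record<string, ModelPrice> = {
  'example-model': { inputUsdPerMTok: 3, outputUsdPerMTok: 15 },
};

// Returns undefined when the model is not in the pricing table.
function estimateCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
  prices: Record<string, ModelPrice> = EXAMPLE_PRICES,
): number | undefined {
  const price = prices[model];
  if (price === undefined) return undefined;
  return (inputTokens * price.inputUsdPerMTok + outputTokens * price.outputUsdPerMTok) / 1_000_000;
}
```

Returning `undefined` for unknown models matches the idea that cost is only reported when the model appears in the pricing table.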

Pricing data lives in `src/pricing/anthropic.ts` (auto-generated). To refresh it:

```bash
npm run update-pricing
```

The script fetches the [LiteLLM community pricing registry](https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json), validates the schema, prints a human-readable diff, and regenerates the file. It exits non-zero if the upstream schema changes in a breaking way, so CI fails loudly.

A GitHub Actions workflow (`.github/workflows/update-pricing.yml`) runs this every Monday and opens a PR when prices change.

## What you see in Langfuse

![Langfuse trace](docs/langfuse-trace.png "Screenshot of a langfuse trace including Metadata attributes")

Each request produces a `gen_ai.request` span with:

| Attribute | Example |
| --- | --- |
| `gen_ai.system` | `anthropic` or `ollama` |
| `gen_ai.request.model` | `claude-sonnet-4-6` |
| `gen_ai.request.max_tokens` | `32000` |
| `gen_ai.response.model` | `claude-sonnet-4-6` |
| `gen_ai.usage.input_tokens` | `343` |
| `gen_ai.usage.output_tokens` | `13` |
| `gen_ai.usage.cost_usd` | `0.000224` |
| `gen_ai.response.finish_reasons` | `["end_turn"]` |
| `gen_ai.prompt` | `[{"role":"user","content":"Hello"}]` _(opt-in: `CAPTURE_CONTENT=true`)_ |
| `gen_ai.completion` | `[{"type":"text","text":"Hi there"}]` _(opt-in: `CAPTURE_CONTENT=true`)_ |

`gen_ai.system` reflects the provider that actually handled the request, which is useful for distinguishing local vs. cloud inference in Langfuse dashboards.

## Docker

```bash
docker build -t yallmap .

docker run -p 3001:3001 \
  -e OTEL_EXPORTER_OTLP_ENDPOINT=http://host.docker.internal:3000/api/public/otel/v1/traces \
  -e LANGFUSE_PUBLIC_KEY=pk-lf-... \
  -e LANGFUSE_SECRET_KEY=sk-lf-... \
  yallmap
```

The multi-stage Dockerfile builds in `node:22-alpine` and copies only compiled output into the final image, so the production image contains no dev dependencies or TypeScript source.

For AWS deployment, see the companion CDK construct: [cdk-yallmap](https://github.com/kevinkempf/cdk-yallmap).

## Adding a provider

Implement `ProviderAdapter` from `src/adapters/types.ts`:

```typescript
// src/adapters/my-provider.ts
import type { ProviderAdapter } from './types';

export const myProviderAdapter: ProviderAdapter = {
  path: '/v1/chat/completions',           // upstream path
  translateRequest: (body) => { /* ... */ return translated; },
  translateResponse: (body) => { /* ... */ return translated; },
  createStreamTranslator: () => new MyStreamTransform(),
};
```

Then add a `Provider` entry in `src/routing/index.ts` and reference it in `src/routing/config.ts`. The existing Ollama adapter is the reference implementation.

## Design decisions

**Provider adapters as a formal interface.** `ProviderAdapter` defines the three translation surfaces (request body, response body, SSE stream) so new providers are drop-in files with no changes to the router or proxy. The `anthropicAdapter` is an identity pass-through; the `ollamaAdapter` is the reference implementation of a full translation.

**Routing policies as TypeScript functions.** Rules are built from typed helpers: `whenModel`, `chain`, `firstMatch`. No YAML DSL, no CEL expressions. Adding a rule is adding a line of code with full type safety and IDE autocomplete.

**Anthropic API surface preserved end-to-end.** Ollama uses an OpenAI-compatible API; the gateway translates requests and responses transparently so all clients speak the Anthropic Messages API regardless of which provider handles the request.

**SSE never buffered.** The streaming response is piped through a Transform stream that reads events in flight. The client receives bytes as they arrive; nothing is held in memory waiting for the response to complete.

**OTel Gen AI semantic conventions.** Spans use the [`gen_ai.*` attribute namespace](https://opentelemetry.io/docs/specs/semconv/gen-ai/) so traces are interoperable with any OTel-compatible backend, not just Langfuse.

**`accept-encoding: identity` enforced upstream.** Compressed responses can't be parsed for telemetry without decompressing them first, so the gateway requests uncompressed responses from upstream and forwards them uncompressed to the client.

**Middleware as a compile-time chain.** Middleware is a list of typed `(ctx, next) => Promise<Response>` functions composed at startup. Each function either calls `next()` to continue or returns its own `Response` to short-circuit. This keeps the proxy loop clean: policy decisions (cost guards, rate limiting, PII redaction) live outside the retry/fallback logic and are trivially testable in isolation.

## Roadmap

- [x] v0.1 — transparent Anthropic proxy + OTel observability

- [x] v0.2 — TypeScript routing policies, Ollama adapter, fallback chains

- [x] v0.3 — cost tracking, retries with exponential backoff, structured pino logging

- [x] v0.4 — CDK construct for ECS Fargate deployment ([cdk-yallmap](https://github.com/kevinkempf/cdk-yallmap))

- [x] v0.5 — formalized `ProviderAdapter` interface; drop-in provider plugins; agent session groundwork (`x-session-id`, W3C trace context)

- [x] v0.6 — compile-time middleware chain (`costGuard`, `rateLimit`, `piiRedactor`; custom middleware support); opt-in content capture (`CAPTURE_CONTENT`)

- [x] v0.7 — multi-key auth (`apiKeyAuth`) with identity propagation to rate limit, logs, and Langfuse `user.id`; request body-size limit; clamped `Retry-After` + equal-jitter backoff; OSS publication scaffolding (LICENSE, CONTRIBUTING, CI, Dependabot)

- [x] v0.8 — production hardening: streaming body-size cap, configurable upstream timeouts + per-request `AbortSignal`, graceful HTTP shutdown

- [ ] v0.9 — TBD (candidates: persistent rate-limit state, per-key cost budgets, Prometheus `/metrics`)

## License

MIT