{"id":50680295,"url":"https://github.com/hyparam/collectivus","last_synced_at":"2026-06-08T18:03:40.918Z","repository":{"id":351633642,"uuid":"1211822284","full_name":"hyparam/collectivus","owner":"hyparam","description":"Zero-dependency OTLP/HTTP JSON collector that writes to JSONL files","archived":false,"fork":false,"pushed_at":"2026-05-14T23:15:53.000Z","size":1338,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-05-14T23:25:57.697Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyparam.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-04-15T19:31:03.000Z","updated_at":"2026-05-14T21:15:32.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hyparam/collectivus","commit_stats":null,"previous_names":["hyparam/collectivus"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/hyparam/collectivus","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcollectivus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcollectivus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcollectivus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2
Fcollectivus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyparam","download_url":"https://codeload.github.com/hyparam/collectivus/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fcollectivus/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34073810,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-08T18:03:39.998Z","updated_at":"2026-06-08T18:03:40.907Z","avatar_url":"https://github.com/hyparam.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Collectivus\n\n![collectivus](collectivus.jpg)\n\n[![npm](https://img.shields.io/npm/v/collectivus)](https://www.npmjs.com/package/collectivus)\n[![minzipped](https://img.shields.io/bundlephobia/minzip/collectivus)](https://www.npmjs.com/package/collectivus)\n[![workflow status](https://github.com/hyparam/collectivus/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/collectivus/actions)\n[![mit 
license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)\n[![container](https://img.shields.io/badge/container-ghcr.io%2Fhyparam%2Fcollectivus-blue)](https://github.com/orgs/hyparam/packages/container/package/collectivus)\n\nCollectivus records AI-agent and application telemetry into local files you\ncan query. Run it as a transparent LLM proxy for Claude Code, Codex,\nAnthropic, and OpenAI-compatible APIs; accept OTLP traces, metrics, and logs;\nsubscribe to gascity supervisor transcripts; or register arbitrary JSONL as a\nSQL table. The default path is Standalone: config, recordings, and query cache\nstay on this machine.\n\n- **LLM proxy capture**: full request/response and SSE event recordings for\n  Claude Code, Codex, Anthropic, and OpenAI-compatible APIs.\n- **Agent transcripts**: gascity supervisor capture writes one queryable row\n  per content block with agent identity and token usage.\n- **Application telemetry**: OTLP traces, metrics, and logs over HTTP,\n  normalized to JSONL by signal and service.\n- **Local query**: Iceberg-backed `ctvs query` commands and SQL over\n  `logs`, `traces`, `metrics`, `proxy_messages`, `gascity_messages`, and\n  registered JSONL collections.\n- **Optional operations path**: export to local Parquet, archive daily\n  snapshots to S3, or run Gateway/Central server deployments when many hosts\n  need one control plane.\n\n## Quick start: record Claude Code\n\nThe fastest path is the `npx` walkthrough:\n\n```bash\nnpx collectivus\n```\n\nChoose Claude Code when asked what to collect, or press Enter to collect all\navailable sources. 
The walkthrough writes `~/.hyp/collectivus.json`, stores\nrecordings under `~/.hyp/collectivus/`, installs a background daemon, and\nattaches Claude Code when selected.\n\nTo run the proxy in the foreground with an existing config:\n\n```bash\nnpx collectivus --config ~/.hyp/collectivus.json\n```\n\nThen point Claude Code at it from another terminal:\n\n```bash\nANTHROPIC_BASE_URL=http://127.0.0.1:8787 claude\n```\n\nAfter a prompt, inspect the JSONL recording:\n\n```bash\ntail -f \"$HOME/.hyp/collectivus/$USER/proxy/$(date -u +%F).jsonl\"\n```\n\nOr write a config by hand:\n\n```bash\n# 1. Save examples/claude-code.json (proxy on 127.0.0.1:8787 → api.anthropic.com)\nnpx collectivus --config examples/claude-code.json\n\n# 2. In another terminal:\nANTHROPIC_BASE_URL=http://127.0.0.1:8787 claude\n\n# 3. Run a prompt, then watch the recording:\ntail -f \"collectivus-data/$USER/proxy/$(date -u +%F).jsonl\"\n```\n\nFull step-by-step: [`docs/walkthrough-claude-code.md`](docs/walkthrough-claude-code.md).\n\nTo capture agent-attributed transcripts from a gascity supervisor (separate\nfrom the proxy capture above), attach a city to the same daemon:\n\n```bash\nnpx collectivus gascity attach hyptown\nnpx collectivus query sql \"select gascity_template, count(*) as parts from gascity_messages group by 1 order by parts desc\"\n```\n\nSee [Gascity source (`gascity_messages`)](#gascity-source-gascity_messages) below.\n\n## Installation\n\nUse `npx collectivus` for first-run setup and foreground CLI commands:\n\n```bash\nnpx collectivus --help\n```\n\nThe package also publishes the shorter `ctvs` binary. Install globally only\nwhen you want to run `ctvs` directly:\n\n```bash\nnpm install -g collectivus\nctvs install --config ~/.hyp/collectivus.json\n```\n\nInstall as a project dependency when you want the programmatic API:\n\n```bash\nnpm install collectivus\n```\n\nThe GHCR image is available for containerized Standalone, Gateway, Central\nserver, and rendezvous deployments. 
See\n[Advanced deployments](#advanced-deployments) when you need that path.\n\n## Configuration\n\nPass a JSON config with `--config \u003cpath\u003e` (a local path or url). The schema:\n\n```json\n{\n  \"version\": 1,\n  \"otel\":  { \"listen\": \"0.0.0.0:4318\" },\n  \"proxy\": {\n    \"listen\": \"127.0.0.1:8787\",\n    \"upstreams\": [\n      {\n        \"name\": \"anthropic\",\n        \"base_url\": \"https://api.anthropic.com\",\n        \"match\": { \"path_prefix\": \"/v1/messages\" }\n      }\n    ],\n    \"redact_headers\": [\"authorization\", \"x-api-key\", \"anthropic-api-key\", \"cookie\", \"set-cookie\"]\n  },\n  \"sink\": { \"type\": \"file\", \"dir\": \"./collectivus-data\" },\n  \"query\": { \"cache\": { \"enabled\": true } }\n}\n```\n\n| Block | Purpose |\n|-------|---------|\n| `version` | Schema version. Required. Currently `1`. |\n| `otel`    | Enable the OTLP receiver. Omit to disable. |\n| `proxy`   | Enable the LLM proxy. Omit to disable. Requires `sink` in Standalone mode. |\n| `sink`    | Root directory for Standalone JSONL recordings. Proxy rows land under `\u003csink.dir\u003e/\u003cgateway_id\u003e/proxy/`; OTLP rows land under `\u003csink.dir\u003e/\u003cgateway_id\u003e/\u003csignal\u003e/`. Required when `otel` or `proxy` is set in Standalone mode. Accepted but unused in Gateway mode. |\n| `central_server` | Gateway-mode Central server URL, identity settings, config poll interval, and optional `outbox_dir`. Gateway rows are first fsynced to this durable local outbox, then shipped to Central ingest. |\n| `upload`  | Optional. Enables the daily S3 parquet drain. See [S3 upload](#s3-upload). |\n| `query`   | Optional. Configures the local `ctvs query` query cache. `query.cache.enabled` defaults to `true`; `query.cache.dir` defaults to `\u003crecording-root\u003e/.collectivus-query/cache`. 
|\n\n`--print-config` loads, validates, and pretty-prints the resolved config:\n\n```bash\nnpx collectivus --config collectivus.json --print-config\n```\n\n### v1 schema\n\n`version: 1` introduces array-shape `upstreams`, an optional `upload` block,\nand makes `sink` mandatory whenever `otel` or `proxy` is set in Standalone\nmode. v0 configs (missing the `version` field) hard-fail with a clear error —\nthe walkthrough writes v1 only.\n\n## LLM proxy mode\n\nThe proxy is a transparent reverse proxy for Anthropic's Messages API and\nOpenAI-compatible APIs. With `ANTHROPIC_BASE_URL=http://127.0.0.1:8787`,\nevery Claude Code call routes through collectivus, gets forwarded to\n`https://api.anthropic.com`, and is recorded to JSONL.\n\nTwo row kinds in `\u003csink.dir\u003e/\u003cgateway_id\u003e/proxy/\u003cUTC-date\u003e.jsonl`:\n\n**Per stream event** (one per SSE event for streamed responses):\n\n```json\n{\n  \"exchange_id\": \"01HX...\",\n  \"kind\": \"stream_event\",\n  \"t_ms\": 137,\n  \"event\": \"content_block_delta\",\n  \"data\": \"{\\\"type\\\":\\\"content_block_delta\\\",\\\"delta\\\":{\\\"text\\\":\\\"...\\\"}}\"\n}\n```\n\n**Per exchange** (always emitted, after the response completes):\n\n```json\n{\n  \"exchange_id\": \"01HX...\",\n  \"kind\": \"exchange\",\n  \"ts_start\": \"...\",\n  \"ts_end\":   \"...\",\n  \"duration_ms\": 14444,\n  \"upstream\": \"anthropic\",\n  \"client\":  { \"ip\": \"127.0.0.1\", \"user_agent\": \"claude-code/...\" },\n  \"request\": {\n    \"method\": \"POST\",\n    \"path\":   \"/v1/messages\",\n    \"headers\": { \"x-api-key\": \"REDACTED:abcd\", \"...\": \"...\" },\n    \"body\":    \"{...}\"\n  },\n  \"response\": { \"status\": 200, \"headers\": {}, \"body\": null },\n  \"stream_event_count\": 47,\n  \"error\": null\n}\n```\n\nFor non-streaming responses, only the `kind: \"exchange\"` row is written, with\n`response.body` populated.\n\n### Header redaction\n\nRequest and response headers in `proxy.redact_headers` are 
rewritten as\n`REDACTED:\u003clast 4 chars\u003e`. The default list (`authorization`, `x-api-key`,\n`anthropic-api-key`, `cookie`, `set-cookie`) is always applied — you can\nextend it but not shrink it.\n\nBodies are never auto-redacted: full visibility is the intended behavior.\n\n### Auth\n\nPass-through. The client's `x-api-key` is forwarded to upstream verbatim;\ncollectivus does not hold a credential.\n\n### Codex\n\nCodex can route through the same proxy by configuring a Codex model provider.\nUse an OpenAI upstream whose path prefix matches `/v1/responses`:\n\n```json\n{\n  \"version\": 1,\n  \"proxy\": {\n    \"listen\": \"127.0.0.1:8787\",\n    \"upstreams\": [\n      {\n        \"name\": \"openai\",\n        \"base_url\": \"https://api.openai.com\",\n        \"match\": { \"path_prefix\": \"/v1\" }\n      }\n    ]\n  },\n  \"sink\": { \"type\": \"file\", \"dir\": \"./collectivus-data\" }\n}\n```\n\nAttach or detach Codex explicitly:\n\n```bash\nctvs attach --config collectivus.json --client codex\nctvs detach --client codex\n```\n\nThis writes a managed provider to `~/.codex/config.toml` using Codex's\ndocumented `model_provider` / `model_providers.\u003cid\u003e` configuration format:\n\n```toml\nmodel_provider = \"collectivus\"\n\n[model_providers.collectivus]\nname = \"Collectivus OpenAI Proxy\"\nbase_url = \"http://127.0.0.1:8787/v1\"\nrequires_openai_auth = true\nwire_api = \"responses\"\nsupports_websockets = false\n```\n\n`supports_websockets = false` keeps Codex on HTTP/SSE requests, which is the\nproxy path collectivus records today. See OpenAI's Codex docs for the\nunderlying [configuration file](https://developers.openai.com/codex/config-basic#codex-configuration-file)\nand [provider fields](https://developers.openai.com/codex/config-reference#model_providers).\n\n## Local query\n\n`ctvs query` reads local recordings only. 
It never contacts S3 and it does not\nauto-refresh its query cache unless you ask for that explicitly.\n\n```bash\nctvs query refresh /path/to/gw1/logs/2026-05-11.jsonl --config collectivus.json\nctvs query refresh --all logs --config collectivus.json\nctvs query logs --config collectivus.json --since 1h\nctvs query traces slow --config collectivus.json --limit 20\nctvs query metrics series latency.ms --config collectivus.json\nctvs query proxy get \u003cconversation-id\u003e --config collectivus.json --format json\nctvs query sql \"select serviceName, count(*) as logs from logs group by serviceName\"\nctvs collect random-log.jsonl --name random-log --config collectivus.json\nctvs collect --glob '.gc/runtime/**/*.jsonl' --name session-segments --config collectivus.json\nctvs query sql \"select * from random_log\" --config collectivus.json\n```\n\nCache cursors are written under\n`\u003crecording-root\u003e/.collectivus-query/cache/datasets/\u003cdataset\u003e/gateway_id=\u003cid\u003e/date=\u003cYYYY-MM-DD\u003e/cursor.json`.\nRows live in local Iceberg tables under the same partition directory. Refreshes\nappend from the last recorded JSONL cursor when possible; truncation, rewrite,\nor schema drift starts a new source epoch.\n\nFreshness is treated asymmetrically (since v1.7.0):\n\n| Partition state | Behavior |\n| --- | --- |\n| `fresh` | Query proceeds silently. |\n| `stale` (cache exists, source changed since refresh) | Query proceeds; a `warning: query cache last refreshed at …` line is written to stderr. Stdout is unchanged. |\n| `missing` (no cache table/cursor) | Query exits with the exact file-targeted `ctvs query refresh …` command to run when the source file is known. |\n\nUse `ctvs query refresh \u003cfile.jsonl\u003e` to refresh selected source files, or\n`ctvs query refresh --all [dataset]` when you explicitly want the broader\nwalk. 
Repeat `--date` to query or refresh several UTC date partitions at once.\nUse `--refresh always` to force a refresh before the query runs. Use\n`--strict-freshness` to restore the pre-1.7 behavior where stale partitions\nare a hard error (useful in CI / scheduled jobs that must never read\noutdated data).\n\n\u003e **Migration note (v1.7.0).** Stale partitions no longer exit non-zero\n\u003e by default — scripts that depended on that exit code must add\n\u003e `--strict-freshness`. Stdout formats (table, json, jsonl, markdown) are\n\u003e unchanged; the new warning is written only to stderr. `missing`\n\u003e partitions still error.\n\nLogical datasets are `logs`, `traces`, `metrics`, `proxy_messages`, and `gascity_messages`. `ctvs collect \u003cfile.jsonl\u003e --name \u003cname\u003e` registers an external JSONL file as a dynamic table; `ctvs collect --glob '\u003cpattern\u003e' --name \u003cname\u003e` backs one table with many source files. Names are normalized for SQL, so `--name random-log` becomes table `random_log`; quoted SQL can also reference the original collection name as `\"random-log\"`. Collection tables include `_ctvs_source_path`, `_ctvs_line_number`, `_ctvs_raw`, and inferred top-level JSON fields. Deleted glob sources remain queryable from their cache-only partitions until the collection is removed. `ctvs query schema \u003ctable\u003e` prints the schema, and `ctvs query catalog` shows which datasets have source and cached rows.\n\n```bash\nctvs query sql \"select date, count(*) from proxy_messages group by date\" \\\n  --date 2026-05-14 --date 2026-05-15 --refresh always\n```\n\n### Conversation log model\n\nRecorded LLM proxy traffic is exposed as a single logical dataset, `proxy_messages`. Each row is one content part — a text block, reasoning block, tool call, tool result, image, file, or error — so a single assistant turn that contains text + a tool call + more text becomes three rows. 
The grain is per-part on purpose: callers can filter, count, and join parts without unpacking nested JSON, and downstream analytics (`SUM(usage)`, conversation walks, tool-call/result joins) become single-table SQL.\n\nRows are globally deduplicated by `message_id` — a 16-character hex prefix of `sha256(conversation_id : role : canonicalJson(content))`. Identical content in the same conversation always produces the same id, so the user-history blocks that Anthropic replays on every request are written once. The walker also tracks `previous_message_id` across exchanges so callers can reconstruct conversation order even after dedup.\n\n`conversation_id` is resolved tiered — Claude Code's `metadata.user_id.session_id` when present, otherwise a stable 16-hex hash of the first user message's content, otherwise a hash of `exchange_id` (so even single-shot malformed exchanges get a deterministic id). `conversation_source` is `claude_code` when the recorded user-agent starts with `claude-cli/`, else `api`. When Claude Code is configured through `ctvs attach`, a local hook records `cwd` and `git_branch` into the proxy JSONL so those fields survive Gateway/Central server shipping; local query/export also scans Claude Code transcripts to enrich matching rows with JSONL metadata such as `provider_uuid`, `parent_uuid`, `request_id`, `entrypoint`, `client_version`, and `user_type`.\n\nJSON columns (`attributes`, `status`, `tools`, `tool_args`, `compact_metadata`, `raw_frame`) carry sparse structured data; scalars are accessed with `JSON_VALUE(\u003ccol\u003e, '$.path')`. 
`attributes` holds request settings, per-message `usage` (assistant only), `timing.latency_ms`, and `client.claude_version` when available; `status` holds `tool_status` on tool results, `finish_reason` on the last assistant part, and `error_code` / `error_message` on error parts.\n\nFor the full per-column derivation table see [skills/collectivus-query/references/query-cli.md](skills/collectivus-query/references/query-cli.md).\n\n### Gascity source (`gascity_messages`)\n\n`ctvs gascity` is a separate listener that subscribes to a gascity supervisor's\nREST API, normalizes provider frames (Claude / Codex), and writes one row per\ncontent block (text / thinking / tool_use / tool_result / attachment) directly\nto Parquet at `~/.collectivus/sink/gascity_messages/date=\u003cYYYY-MM-DD\u003e/city=\u003cname\u003e/`.\nThere is no JSONL stage and no `.meta.json` sidecar: the sink IS the queryable\nstore, so `ctvs query gascity_messages` is always reading what the daemon has\nflushed up to the moment of the call.\n\n```bash\nctvs gascity attach hyptown\nctvs gascity list\nctvs query schema gascity_messages --format markdown\nctvs query sql \"select gascity_template, count(*) from gascity_messages group by 1\"\n```\n\n`gascity_messages` carries agent-identity columns the proxy can't see —\n`gascity_template`, `gascity_rig`, `gascity_alias` — plus per-frame token usage\nwith cache breakdown (`input_tokens`, `cache_read_input_tokens`,\n`cache_creation_input_tokens`). Use it when you need agent-attributed cost\nanalysis or tool-call inspection; use `proxy_messages` for HTTP-level retry\nvisibility and request timing. 
They UNION cleanly via `gateway_id` (a constant\n`gascity-scribe` on every gascity row tags the source).\n\nThe bundled [`ctvs-gascity` skill](src/cli/init_presets/gascity_skill.md) — installed per-workspace by\n`ctvs init gascity` — teaches Claude Code and Codex how to query all three\ngascity-aware tables (`events`, `session_segments`, `gascity_messages`) and\ntheir cross-source joins with `proxy_messages`.\n\n### LLM skill\n\nInstall the bundled `collectivus-query` skill so Claude Code and Codex know how\nto inspect local recordings with `ctvs query`:\n\n```bash\nctvs skills install --client all\n```\n\nThe skill assumes the default `~/.hyp/collectivus.json` config unless the agent\ndiscovers a non-default service config from `ctvs status` or the service unit.\n\n## OTLP receiver\n\nThe OTLP receiver accepts JSON and protobuf payloads on the standard endpoints:\n\n- `POST /v1/traces`\n- `POST /v1/metrics`\n- `POST /v1/logs`\n\nOutput layout under `sink.dir`:\n\n```\ncollectivus-data/\n└── \u003cgateway_id\u003e/\n    ├── raw/\n    │   ├── traces/\u003cUTC-date\u003e.jsonl       # raw export envelope\n    │   ├── metrics/\u003cUTC-date\u003e.jsonl\n    │   └── logs/\u003cUTC-date\u003e.jsonl\n    ├── traces/\u003cUTC-date\u003e.jsonl           # one row per span\n    ├── metrics/\u003cUTC-date\u003e.jsonl          # one row per data point\n    └── logs/\u003cUTC-date\u003e.jsonl             # one row per log record\n```\n\nEach normalized row includes the source `service.name`, while files are\npartitioned by `gateway_id`, signal, and date.\n\n### Verify the OTLP receiver\n\n```bash\nnpx collectivus --config collectivus.json \u0026\ncurl -X POST localhost:4318/v1/traces \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"resourceSpans\":[]}'\n```\n\n## CLI\n\n```text\nctvs --config \u003cpath\u003e                         Run with config file\nctvs --config \u003cpath\u003e --print-config          Validate + print resolved config\nctvs query 
\u003ccommand\u003e [...]                   Query local recordings\nctvs collect \u003cfile.jsonl\u003e|--glob \u003cpattern\u003e --name \u003cname\u003e\n                                             Add external JSONL as a query table\nctvs export --config \u003cpath\u003e [...]            Convert recorded JSONL to local Parquet (one-shot)\nctvs --help                                  Show usage\n```\n\n`SIGINT` and `SIGTERM` trigger graceful shutdown: stop accepting new requests,\ndrain in-flight, fsync sinks, exit 0.\n\n### Export to Parquet on demand\n\n`ctvs export` walks the configured sink dir and converts what it finds\ninto local Parquet. Runs once and exits — independent of the daily upload\nscheduler, and includes today's open files (which the upload pipeline\ndeliberately skips).\n\nTwo sinks are drained:\n\n| Source | Destination |\n| --- | --- |\n| `\u003csink.dir\u003e/\u003cgateway_id\u003e/proxy/\u003cdate\u003e.jsonl` (proxy recorder) | `\u003cout\u003e/proxy/messages.parquet` |\n| `\u003csink.dir\u003e/\u003cgateway_id\u003e/\u003csignal\u003e/\u003cdate\u003e.jsonl` (OTLP) | `\u003cout\u003e/\u003cgateway_id\u003e/\u003csignal\u003e/date=\u003cYYYY-MM-DD\u003e/data.parquet` |\n\nProxy export walks each gateway's days chronologically so the conversation\nwalker can dedupe `message_id`s across day boundaries, then concatenates the\nresult into a single `messages.parquet` file. The per-day `kind: \"exchange\"` /\n`kind: \"stream_event\"` JSONL rows on disk are unchanged — only the Parquet\nprojection was reshaped.\n\n```text\nctvs export --config \u003cpath\u003e [--out \u003cdir\u003e] [--date YYYY-MM-DD]\n                            [--gateway-id \u003cid\u003e] [--signal logs|traces|metrics]\n```\n\n`--date`, `--gateway-id`, and `--signal` only filter the OTLP path; proxy\nJSONL is always drained when present.\n\n## S3 upload\n\nStandalone and Central server modes write JSONL to their configured local\nrecording root. 
When the `upload` block is configured, a daily scheduler drains\nthe previous day's JSONL into Parquet partitions in S3. Object keys are\nHive-partitioned:\n\n```\n\u003cprefix\u003e/\u003cgateway_id\u003e/\u003csignal-or-dataset\u003e/date=\u003cYYYY-MM-DD\u003e/data.parquet\n```\n\nOTLP signals use `logs`, `traces`, or `metrics` as the middle segment. Proxy\ntraffic is materialized as the `proxy_messages` dataset.\n\nThis is useful for long-term retention, columnar queries with\nAthena / DuckDB / Snowflake, and offsite backup of recordings that would\notherwise live only on the daemon host. The local JSONL is the source of\ntruth; the S3 drain is additive and idempotent (a per-(gateway_id, signal,\ndate) ledger and a HEAD check on the destination key prevent duplicate\nuploads).\n\nAdd an `upload` block to drain JSONL to S3 once a day:\n\n```json\n{\n  \"version\": 1,\n  \"proxy\": { \"listen\": \"127.0.0.1:8787\", \"upstreams\": [] },\n  \"sink\":   { \"type\": \"file\", \"dir\": \"./collectivus-data\" },\n  \"upload\": {\n    \"bucket\": \"my-collectivus-archive\",\n    \"prefix\": \"collectivus\",\n    \"region\": \"us-east-1\",\n    \"time\":   \"00:10\",\n    \"signals\": [\"logs\", \"traces\", \"metrics\", \"proxy\"]\n  }\n}\n```\n\n| Field         | Required | Default     | Notes                                         |\n|---------------|----------|-------------|-----------------------------------------------|\n| `bucket`      | yes      | -           | Destination S3 bucket name.                   |\n| `prefix`      | no       | `collectivus` | Key prefix under the bucket.               |\n| `region`      | no       | `AWS_REGION` env, or `us-east-1` | AWS region. |\n| `time`        | no       | `00:10`     | Daily run time, `HH:MM` UTC.                  |\n| `signals`     | no       | all four    | Subset of `logs`, `traces`, `metrics`, `proxy`. |\n| `catchupDays` | no       | `30`        | Look back this many days for unuploaded JSONL. 
|\n| `endpoint`    | no       | -           | Custom S3-compatible endpoint (e.g. MinIO).   |\n\n### Credentials\n\nCredentials are never stored in the config. They are resolved at daemon start\nfrom one of these sources:\n\n- `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` for local/dev or explicit\n  static credentials.\n- ECS task-role credentials exposed through\n  `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` or\n  `AWS_CONTAINER_CREDENTIALS_FULL_URI`.\n- `AWS_CONTAINER_AUTHORIZATION_TOKEN` or\n  `AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE` when the container credential\n  endpoint requires an auth token.\n- `AWS_SESSION_TOKEN` (optional, for temporary credentials)\n- `AWS_REGION` (optional; the `upload.region` config field overrides this)\n\nWhen `upload` is set in the config but no supported AWS credential source is\navailable, the daemon fails fast at startup rather than at the first daily tick.\n\n## Programmatic use\n\n```javascript\nimport { Collector } from 'collectivus'\n\nconst collector = new Collector({ port: 4318, outputDir: './otel-data' })\nawait collector.start()\n// ...\nawait collector.stop()\n```\n\nThe proxy is config-driven only — programmatic embedding of the proxy is not\nyet a supported public API.\n\n## Install as a daemon (macOS, Linux)\n\nKeep collectivus running across reboots by registering it with the system\nprocess supervisor — a user LaunchAgent on macOS, a systemd user unit on\nLinux — and (optionally) route [Claude Code](https://docs.claude.com/en/docs/claude-code/overview)\nthrough the proxy in the same step.\n\n### Quickstart (macOS)\n\nThe recommended path is the top-level walkthrough:\n\n```bash\nnpx collectivus\n# What do you want to collect? 
Press Enter for all available sources,\n# or choose Claude Code.\n# ✓ Daemon installed (LaunchAgent: com.hyparam.collectivus)\n# ✓ Claude Code attached (~/.claude/settings.json)\n```\n\nIf you prefer direct `ctvs` commands:\n\n```bash\nnpm install -g collectivus\nctvs install --config /path/to/collectivus.json --yes\n# ✓ Daemon installed (LaunchAgent: com.hyparam.collectivus)\n# ✓ Claude Code attached (~/.claude/settings.json)\n# Logs: ~/.hyp/collectivus/collectivus.log\n```\n\nThe LaunchAgent is set with `RunAtLoad=true` and `KeepAlive=true`, so the\ndaemon starts at login and launchd restarts it if it exits. Logs land in\n`~/.hyp/collectivus/`:\n\n- `collectivus.log` — stdout\n- `collectivus.err.log` — stderr\n\n### Quickstart (Linux, systemd)\n\nThe same walkthrough creates the systemd user unit:\n\n```bash\nnpx collectivus\n# What do you want to collect? Press Enter for all available sources,\n# or choose Claude Code.\n# ✓ Daemon installed (systemd unit: com.hyparam.collectivus.service)\n# ✓ Claude Code attached (~/.claude/settings.json)\n```\n\nOr use direct `ctvs` commands:\n\n```bash\nnpm install -g collectivus\nctvs install --config /path/to/collectivus.json\n# ✓ Daemon installed (systemd unit: com.hyparam.collectivus.service)\n# ✓ Claude Code attached (~/.claude/settings.json)\n```\n\nOn Linux, `install` writes a systemd user unit to\n`~/.config/systemd/user/com.hyparam.collectivus.service`, runs\n`systemctl --user daemon-reload`, then `enable` and `restart`. The unit is\nconfigured with `Restart=always`, `RestartSec=5`, and\n`WantedBy=default.target` so systemd starts it at login and respawns it on\nexit. Logs are written via `StandardOutput=append:` /\n`StandardError=append:` to the directory you pass as `logDir` (the CLI\ndefaults to `~/.hyp/collectivus`).\n\n\u003e **Linger required for non-login boots.** User-level systemd services run\n\u003e only while the user has a session. 
To keep the daemon up across reboots\n\u003e when you are not logged in (e.g. a headless server), enable lingering\n\u003e once:\n\u003e\n\u003e ```bash\n\u003e sudo loginctl enable-linger \"$USER\"\n\u003e ```\n\u003e\n\u003e Without this, the unit stops when your last login session ends and only\n\u003e restarts when you log back in.\n\nSystem-level systemd units (root-owned, in `/etc/systemd/system/`) and\nnon-systemd init systems (Alpine's OpenRC, Void's runit, etc.) are not\nsupported in this build.\n\n### Subcommands\n\n| Command | Purpose |\n|---------|---------|\n| `ctvs install --config \u003cpath\u003e [--yes\\|--no]` | Install the daemon (LaunchAgent on macOS, systemd user unit on Linux) and (optionally) attach Claude Code |\n| `ctvs uninstall` | Stop and remove the daemon; revert any attached clients (Claude Code, Codex) |\n| `ctvs attach (--config \u003cpath\u003e \\| --port \u003cn\u003e) [--client claude\\|codex\\|all]` | Route Claude Code and/or Codex through the proxy without touching the daemon |\n| `ctvs detach [--client claude\\|codex\\|all]` | Revert Claude Code and/or Codex without uninstalling the daemon |\n| `ctvs status` | Print daemon (loaded / PID) and Claude Code (attached) state |\n| `ctvs export --config \u003cpath\u003e [...]` | Convert recorded JSONL to local Parquet without invoking the upload scheduler |\n| `ctvs query \u003ccommand\u003e [...]` | Query local recordings through the explicit query cache |\n| `ctvs collect \u003cfile.jsonl\u003e\\|--glob \u003cpattern\u003e --name \u003cname\u003e` | Register external JSONL as a dynamic query table |\n| `ctvs skills install [--client claude\\|codex\\|all]` | Install the bundled Collectivus query LLM skill |\n\nIf stdin is not a TTY, `install` refuses to guess: pass `--yes` to attach\nClaude Code unattended, or `--no` to skip the attach step.\n\n### Status\n\n```bash\nctvs status\n# Daemon\n#   Status: loaded (PID 12345)\n#   Plist: /Users/you/Library/LaunchAgents/com.hyparam.collectivus.plist\n#   Config: 
/path/to/collectivus.json\n#   Logs:\n#     stdout: /Users/you/Library/Logs/Collectivus/collectivus.log\n#     stderr: /Users/you/Library/Logs/Collectivus/collectivus.err.log\n#\n# Claude Code\n#   Status: attached\n#   Attached at: 2026-05-07T16:30:00.000Z\n#   Port: 8787\n#   Marker version: 1.1.0\n#   Settings: /Users/you/.claude/settings.json\n```\n\n\u003e On Linux, `ctvs status` does not yet report systemd unit state.\n\u003e Use `systemctl --user status com.hyparam.collectivus.service` for the\n\u003e daemon view; `ctvs status` will still report Claude Code attach\n\u003e state correctly.\n\n### Reverting\n\n```bash\nctvs detach [--client claude|codex|all]          # un-route, leave daemon running\nctvs uninstall                                   # remove daemon and revert attached clients\n```\n\nAll revert paths are idempotent and tolerate already-reverted state.\n\n## Advanced deployments\n\nStandalone is the default mode: one machine owns its config, local proxy,\nrecordings, and query cache. Use Gateway and Central server only when a fleet\nneeds central config vending, durable gateway outboxes, and one canonical\ningest store. 
The JSON schema still uses `role: \"server\"` for the Central\nserver role and `role: \"gateway\"` for managed hosts.\n\n### Containers\n\nThe GHCR image uses `ctvs` as its entrypoint, so container commands mirror the\nCLI:\n\n```bash\ndocker pull ghcr.io/hyparam/collectivus:latest\ndocker run --rm ghcr.io/hyparam/collectivus:latest --help\n\n# Standalone, Gateway, or Central server: selected by role in the config file.\ndocker run --rm ghcr.io/hyparam/collectivus:latest --config /config/collectivus.json\n\n# Same, but with config JSON injected as an environment variable.\ndocker run --rm -e COLLECTIVUS_CONFIG_JSON ghcr.io/hyparam/collectivus:latest \\\n  --config-env COLLECTIVUS_CONFIG_JSON\n\n# Hosted-discovery rendezvous service.\ndocker run --rm ghcr.io/hyparam/collectivus:latest rendezvous --help\n```\n\n### Config vending (multi-host deployments)\n\nCentral server vends per-gateway configs over `GET /v1/config`, accepts\ngateway ingest, and can print one-line setup commands for Gateway hosts.\nGateways poll every `central_server.poll_interval_seconds` seconds and\nhot-reload only the listener whose section changed.\n\n```bash\n# Central server host.\nnpx collectivus --config /etc/collectivus-server.json\n\n# Operator workflow on the Central server host.\nctvs config bootstrap-token issue gw-prod-1 --server-config /etc/collectivus-server.json\nctvs config set gw-prod-1 --server-config /etc/collectivus-server.json --file gw-prod-1.json\nctvs config list --server-config /etc/collectivus-server.json\n\n# Gateway host, using the command printed by the token issuer.\nnpx collectivus --config-endpoint='https://collectivus.internal:8788/v1/bootstrap-config?token=bt_abc123...'\n```\n\nGateway mode treats Central server as the canonical recording store. 
Proxy and\nOTLP rows are first fsynced to `central_server.outbox_dir`, then shipped to\n`POST /v1/ingest/\u003csignal\u003e`.\n\n### Self-hosting and rendezvous\n\nThe repo ships a reference [`docker-compose.yml`](docker-compose.yml) and\n[`.env.example`](.env.example) that run Central server plus hosted-discovery\nrendezvous for the `ctvs invite create` to `ctvs join` flow:\n\n```bash\ncp .env.example .env\n# Fill in COLLECTIVUS_ADMIN_TOKEN, COLLECTIVUS_IDENTITY_SECRET,\n# COLLECTIVUS_RENDEZVOUS_REGISTRATION_TOKEN, COLLECTIVUS_RENDEZVOUS_URL,\n# and COLLECTIVUS_PUBLIC_URL.\ndocker compose up -d\n```\n\nRendezvous stores only join-code hashes and Central server connect metadata;\nit does not store plaintext join codes, configs, telemetry, JWTs, issuer\nsecrets, or bootstrap tokens. The full Docker walkthrough covers TLS,\nsecret rotation, backups, and troubleshooting in\n[`docs/self-hosting-docker.md`](docs/self-hosting-docker.md). The Claude Code\nwalkthrough has the shorter Gateway/Central path in\n[`docs/walkthrough-claude-code.md`](docs/walkthrough-claude-code.md#multi-host-gateway-pulling-its-config-from-a-central-server).\n\n### AWS ECS\n\nAn optional CDK app under [`infra/aws/`](infra/aws/) deploys Central server and\nrendezvous as ECS Fargate tasks behind ALBs, with encrypted EFS state and a\nprivate S3 archive bucket. Use it only when AWS is already the deployment\ntarget; the Docker Compose path is the simpler self-hosting default.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fcollectivus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyparam%2Fcollectivus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fcollectivus/lists"}