{"id":51139460,"url":"https://github.com/jacovinus/snoutdb","last_synced_at":"2026-06-25T21:01:15.015Z","repository":{"id":364134093,"uuid":"1266560532","full_name":"jacovinus/snoutdb","owner":"jacovinus","description":"Local columnar analytics for CSV, JSONL, logs, and .snout files","archived":false,"fork":false,"pushed_at":"2026-06-11T18:45:38.000Z","size":261,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-11T20:15:12.792Z","etag":null,"topics":["analytics","cli","columnar","csv","embedded-database","jsonl","logs","odin","snout"],"latest_commit_sha":null,"homepage":"","language":"Odin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jacovinus.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-11T18:27:48.000Z","updated_at":"2026-06-11T18:55:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jacovinus/snoutdb","commit_stats":null,"previous_names":["jacovinus/snoutdb"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/jacovinus/snoutdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacovinus%2Fsnoutdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacovinus%2Fsnoutdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/jacovinus%2Fsnoutdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacovinus%2Fsnoutdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jacovinus","download_url":"https://codeload.github.com/jacovinus/snoutdb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacovinus%2Fsnoutdb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34792207,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-25T02:00:05.521Z","response_time":101,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","cli","columnar","csv","embedded-database","jsonl","logs","odin","snout"],"created_at":"2026-06-25T21:01:13.425Z","updated_at":"2026-06-25T21:01:14.997Z","avatar_url":"https://github.com/jacovinus.png","language":"Odin","funding_links":[],"categories":[],"sub_categories":[],"readme":"# SnoutDB\n\n[![CI](https://github.com/jacovinus/snoutdb/actions/workflows/ci.yml/badge.svg)](https://github.com/jacovinus/snoutdb/actions/workflows/ci.yml)\n![Version: v0.2.1](https://img.shields.io/badge/version-v0.2.1-4c6ef5)\n![Tests: 343 passing](https://img.shields.io/badge/tests-343_passing-2f9e44)\n![License: AGPL v3](https://img.shields.io/badge/license-AGPL_v3-blue)\n![Language: 
Odin](https://img.shields.io/badge/language-Odin-3882d2)\n\n\u003e **From an unfamiliar file to ranked findings and reproducible queries.**\n\nMost data tools are excellent once you know the schema and the question.\nSnoutDB targets the step before that.\n\nPoint it at an unfamiliar CSV, JSONL, log, or `.snout` file:\n\n```bash\n./snout hunt application.log\n```\n\n![Hunt compact output — severity overview bar, frequent log patterns, and ranked findings each with a one-line temporal sparkline](docs/assets/hunt-compact-overview.png)\n\n### What Hunt returns\n\nEvery `hunt` run answers five questions at once, in a single page of output you\ncan read in seconds:\n\n1. **What kind of file is this?** A severity stack and per-level counts\n   summarise the distribution of events at a glance.\n2. **What is normal in here?** The *frequent patterns* block lists the most\n   common message templates per severity, with a time range so you know whether\n   they happened all day or only inside a window.\n3. **What should I look at first?** The *attention* block ranks findings by a\n   severity-aware score — errors, critical bursts, anomalous concentrations,\n   missing data, metric tails, time spikes, and dominant contributors all flow\n   into the same ranking.\n4. **When did each finding happen?** Every ranked finding ships with a\n   one-line sparkline showing where its events fall inside the file's full\n   timeline — distinguishing a sharp burst, a slow ramp, or a steady scatter\n   without leaving the page.\n5. 
**How do I reproduce the evidence?** Each finding comes with a shell-safe\n   SnoutDB command that re-runs the underlying slice of data, so the answer is\n   always one command away from being verified.\n\n`--verbose` expands each finding with a full-width histogram, peak count and\ntime, first/last match timestamps, a representative sample (with `\u003cuuid\u003e`,\n`\u003cn\u003e`, URLs preserved), and a grouped reproduce footer:\n\n```bash\n./snout hunt application.log --verbose\n./snout hunt application.log --format json\n./snout hunt application.log --verbose -o incident-report.md\n```\n\n![Hunt verbose output — severity overview bar, frequent log patterns, and the first attention findings with detailed temporal histograms](docs/assets/hunt-severity-and-critical-burst.png)\n\n`sniff` remains the lightweight schema and profiling command when you only need\ncolumn roles, statistics, and query suggestions:\n\n```bash\n./snout sniff -f requests.csv\n```\n\nThis is not a claim that SnoutDB replaces DuckDB, Miller, qsv, VisiData, or\n`jq`. It is a focused reconnaissance tool for the moment when a file lands in\nfront of you and you do not yet know what matters inside it.\n\n## Hunt: Automatic Local Analysis\n\nHunt is SnoutDB's primary discovery workflow. 
It is deterministic, explainable,\nand runs locally without an account, network service, data upload, telemetry,\nor an LLM dependency.\n\n| Capability | What Hunt reports |\n|---|---|\n| Severity overview | Normalized `CRITICAL`, `ERROR`, `WARN`, `INFO`, `DEBUG`, `TRACE`, and unknown levels |\n| Frequent context | Common message templates with counts and time ranges |\n| Attention ranking | Severity-aware, deterministic ordering of the strongest findings |\n| Generic analytics | Concentration, error hotspots, metric outliers, null anomalies, temporal shifts, and top contributors |\n| Log analytics | Message normalization, representative samples, temporal histograms, peaks, and first/last matches |\n| Reproduction | Shell-safe SnoutDB commands for investigating the underlying evidence |\n| Structured output | Stable JSON and JSONL without ANSI escape sequences |\n| Reports | Color-free `.txt` or structured `.md` export with `-o` / `--output` |\n\nThe compact report is designed for quick triage. The verbose report is designed\nfor investigation:\n\n```bash\n./snout hunt app.log\n./snout hunt app.log --verbose\n./snout hunt app.log --limit 20 --min-score 70\n./snout hunt app.log --verbose -o hunt-report.md\n```\n\nHunt accepts CSV, JSONL/NDJSON, supported log files, and `.snout` files. It does\nnot currently accept stdin; import piped data to `.snout` first. Exported\nreports can contain samples derived from the input, so review them before\nsharing outside the environment where the data is authorized for use.\n\n## The Specific Advantage\n\n- **It identifies what deserves attention.** Hunt separates common background\n  activity from ranked findings and preserves the evidence behind each result.\n- **It proposes the first useful questions.** Type inference alone tells you\n  that a column is numeric. 
SnoutDB also decides whether it looks like a metric,\n  dimension, identifier, or timestamp and uses that role to generate queries.\n- **The result is executable, not just descriptive.** Suggestions are ordinary\n  CLI commands that can be inspected, changed, scripted, and rerun.\n- **Operational files are first-class inputs.** It reads CSV, JSONL, stdin,\n  CLF, Combined, logfmt, syslog, and custom regex logs without a server or\n  import step for discovery.\n- **Discovery and repeated analysis stay in one workflow.** Once a file is\n  understood, it can be queried directly or saved as a typed `.snout` snapshot\n  for repeated local work.\n- **It is deliberately narrow.** There is no service, account, notebook, or SQL\n  dialect between the file and the first answer.\n\n## Choose the Right Tool\n\n| Situation | Better fit |\n|---|---|\n| “I received this file and do not know its schema or what to investigate.” | **SnoutDB** |\n| “I know the question and want SQL, joins, extensions, or broad analytical power.” | **DuckDB** |\n| “I need mature record transformations in a Unix pipeline.” | **Miller or qsv** |\n| “I want to explore the data interactively in a terminal UI.” | **VisiData** |\n| “I primarily need to select or transform JSON documents.” | **jq** |\n\nSnoutDB should earn its place by shortening **unknown file → useful\ninvestigation**. If you already know the schema and query, a more mature tool\nwill often be the better choice.\n\n## Try It in One Minute\n\nRequirements: [Odin](https://odin-lang.org/docs/install/) and a shell.\n\n```bash\ngit clone https://github.com/jacovinus/snoutdb.git\ncd snoutdb\n./scripts/quickstart.sh\n```\n\nThe script builds SnoutDB, creates a temporary six-row dataset, runs `sniff`,\nruns Hunt against a bundled application-log fixture, executes a filtered\npercentile query, and creates a `.snout` snapshot. 
It uses no downloaded\ndataset or package-manager dependency.\n\n```text\ncolumn      type       role       nulls  distinct  details\n----------  ---------  ---------  -----  --------  ----------------------------\nservice     String     Dimension      0         3  top: checkout (3), users (2)\nlatency_ms  Int64      Metric         0         6  min=27 mean=169.83 max=441\n\nsuggested queries\n-----------------\n1. compare latency_ms across region\n```\n\n## Current Limits\n\n- SnoutDB is pre-`v1.0.0`; the CLI, C ABI, and `.snout` format may evolve.\n- Automated CI currently runs on macOS; other platforms should be considered\n  experimental until they have dedicated runners.\n- Grouped queries and some transforms materialize typed tables in memory.\n- Percentiles are exact and retain values for each aggregate group.\n- `.snout` stores chunk statistics, but query-time chunk skipping is not yet\n  implemented.\n- Hunt currently materializes supported inputs as a typed table and does not\n  accept stdin.\n- Hunt configuration files and historical baselines are not yet implemented.\n- Findings are evidence-based analytical signals, not claims of root cause.\n\nSee [benchmarks/README.md](benchmarks/README.md) for reproducible measurements.\nThe current development baseline profiles a generated 5-million-row, 751 MiB\nCSV in approximately 6.8 seconds on an Apple M4 Pro.\n\n## Contents\n\n- [Hunt: automatic local analysis](#hunt-automatic-local-analysis)\n- [The specific advantage](#the-specific-advantage)\n- [Choose the right tool](#choose-the-right-tool)\n- [Try it in one minute](#try-it-in-one-minute)\n- [Current limits](#current-limits)\n- [Version](#version)\n- [How it works](#how-it-works)\n- [What is a `.snout` file?](#what-is-a-snout-file)\n- [Community](#community)\n- [License and data handling](#license-and-data-handling)\n- [Build](#build)\n- [Step 1 — Look at your data](#step-1--look-at-your-data)\n- [Step 2 — Get statistics on a 
column](#step-2--get-statistics-on-a-column)\n- [Step 3 — Explore an unfamiliar file (sniff)](#step-3--explore-an-unfamiliar-file-sniff)\n- [Hunt — Discover what deserves attention](#hunt--discover-what-deserves-attention)\n- [Step 4 — Ask questions about your data](#step-4--ask-questions-about-your-data)\n- [Step 5 — Save your data as a .snout file](#step-5--save-your-data-as-a-snout-file)\n- [Step 6 — Combine multiple files](#step-6--combine-multiple-files)\n- [Step 7 — Reshape your data (transform)](#step-7--reshape-your-data-transform)\n- [Step 8 — Analyze log files](#step-8--analyze-log-files)\n- [Step 9 — Embed SnoutDB in your application (C API)](#step-9--embed-snoutdb-in-your-application-c-api)\n- [Large files](#large-files)\n- [Timing](#timing)\n- [Quick reference](#quick-reference)\n- [Real-world use cases](docs/USE-CASES.md)\n- [Benchmarks](benchmarks/README.md)\n- [Roadmap](ROADMAP.md)\n\n---\n\n## Version\n\nCurrent version: `v0.2.1`.\n\nSnoutDB is early-stage software. The CLI, C ABI, and `.snout` format may still\nchange before `v1.0.0`.\n\nCheck the installed CLI version with:\n\n```bash\n./snout version\n```\n\nSee [CHANGELOG.md](CHANGELOG.md) for the contents of this snapshot.\n\n---\n\n## How it works\n\nSnoutDB is a local, layered analytics pipeline. 
Raw files enter through\nstreaming readers, become typed columnar data, and then flow through profiling,\nquery, transformation, merge, or persistence operations.\n\n```mermaid\nflowchart LR\n    subgraph Inputs\n        CSV[\"CSV\"]\n        JSONL[\"JSONL / NDJSON\"]\n        LOG[\"Logs\u003cbr/\u003eCLF, Combined, logfmt,\u003cbr/\u003esyslog, app, bracketed, regex\"]\n        SNOUT[\".snout\"]\n        STDIN[\"stdin\"]\n    end\n\n    subgraph Engine[\"SnoutDB engine\"]\n        INGEST[\"Ingest\u003cbr/\u003escan + infer schema\"]\n        CORE[\"Core table\u003cbr/\u003etyped column slices\"]\n        SNIFF[\"Sniff\u003cbr/\u003eprofile + suggestions\"]\n        HUNT[\"Hunt\u003cbr/\u003eanalyze + rank + reproduce\"]\n        QUERY[\"Query\u003cbr/\u003efilter + group + sort\"]\n        TRANSFORM[\"Transform\u003cbr/\u003ereshape columns\"]\n        MERGE[\"Merge\u003cbr/\u003eappend + consolidate + rollup\"]\n        STORAGE[\"Storage\u003cbr/\u003echunked columnar format\"]\n    end\n\n    subgraph Outputs\n        TERMINAL[\"Terminal table\"]\n        DATA[\"CSV / JSON / JSONL\"]\n        REPORT[\"TXT / Markdown report\"]\n        FILE[\".snout file\"]\n        ABI[\"C ABI\u003cbr/\u003ePython, Go\"]\n    end\n\n    CSV --\u003e INGEST\n    JSONL --\u003e INGEST\n    LOG --\u003e INGEST\n    STDIN --\u003e INGEST\n    SNOUT --\u003e STORAGE\n    INGEST --\u003e|\"load or query\"| CORE\n    INGEST -.-\u003e|\"streaming profile\"| SNIFF\n    STORAGE \u003c--\u003e CORE\n    CORE --\u003e SNIFF\n    CORE --\u003e HUNT\n    CORE --\u003e QUERY\n    CORE --\u003e TRANSFORM\n    CORE --\u003e MERGE\n    TRANSFORM --\u003e STORAGE\n    MERGE --\u003e STORAGE\n    SNIFF --\u003e TERMINAL\n    SNIFF --\u003e DATA\n    HUNT --\u003e TERMINAL\n    HUNT --\u003e DATA\n    HUNT --\u003e REPORT\n    QUERY --\u003e TERMINAL\n    QUERY --\u003e DATA\n    STORAGE --\u003e FILE\n    CORE --\u003e ABI\n```\n\n### Query lifecycle\n\nA grouped query is intentionally small and 
deterministic: load typed columns,\nselect matching rows, build groups, update aggregate state, then sort and\nrender.\n\n```mermaid\nsequenceDiagram\n    participant User\n    participant CLI as cmd/snout\n    participant Loader as ingest/storage\n    participant Query as query\n    participant Aggregate as exec\n    participant Output as output\n\n    User-\u003e\u003eCLI: snout -f data group=region -- p95=latency count=rows\n    CLI-\u003e\u003eLoader: load input as core.Table\n    Loader--\u003e\u003eCLI: typed column slices\n    CLI-\u003e\u003eQuery: filters + group columns + aggregate specs\n    Query-\u003e\u003eQuery: build row selection\n    Query-\u003e\u003eQuery: hash rows into groups\n    loop selected rows\n        Query-\u003e\u003eAggregate: update aggregate state\n    end\n    Aggregate--\u003e\u003eQuery: count and percentile values\n    Query-\u003e\u003eQuery: sort and apply limit\n    Query--\u003e\u003eOutput: Group_Result_Set\n    Output--\u003e\u003eUser: table, CSV, JSON, or JSONL\n```\n\n### `.snout` storage\n\nThe native format is versioned and chunked. Each chunk contains one block per\ncolumn, including null information and numeric min/max statistics. 
String and\ntimestamp blocks use dictionary encoding when it is smaller than plain\nencoding.\n\n```mermaid\nflowchart LR\n    HEADER[\"Header\u003cbr/\u003emagic, version,\u003cbr/\u003esize, footer offset\"]\n    META[\"Table metadata\u003cbr/\u003ename, rows,\u003cbr/\u003ecolumns, chunks\"]\n    DESC[\"Column descriptors\u003cbr/\u003ename, type, nullable\"]\n    CHUNKS[\"Row chunks\u003cbr/\u003eup to 65,536 rows\"]\n    COLS[\"Column blocks\u003cbr/\u003eencoding, null count,\u003cbr/\u003emin/max, payload\"]\n    FOOTER[\"Footer\u003cbr/\u003emagic + file size\"]\n\n    HEADER --\u003e META --\u003e DESC --\u003e CHUNKS --\u003e COLS --\u003e FOOTER\n```\n\n| Capability | Main packages | What they do |\n|---|---|---|\n| Read data | `ingest`, `storage` | Stream CSV, JSONL, logs, or read `.snout` |\n| Represent data | `core` | Typed structure-of-arrays tables with explicit ownership |\n| Understand data | `sniff` | Cardinality, roles, statistics, outliers, suggestions |\n| Discover findings | `hunt` | Severity, patterns, anomalies, ranking, evidence, reproduction |\n| Analyze data | `query`, `exec` | Filters, groups, sorting, counts, averages, percentiles |\n| Reshape data | `transform` | Rename, cast, derive, bucket, truncate, extract |\n| Combine data | `merge` | Append, consolidate, compact, and roll up |\n| Present results | `output`, `terminal` | Tables, CSV, JSON, and JSONL |\n| Embed SnoutDB | `cabi` | Experimental C ABI for native integrations |\n\nThe implementation is a dependency DAG with `core` at the bottom and no\ncircular package dependencies. See [ARCHITECTURE.md](ARCHITECTURE.md) for the\npackage-level design and ownership rules.\n\n---\n\n## What is a `.snout` file?\n\nA `.snout` file is SnoutDB's native binary table format. 
It is the persisted\nform of a typed `core.Table`: column names, column types, null values, and data\nare stored together in one local file.\n\nThink of it as a reusable, query-ready snapshot of a CSV, JSONL file, log, or\nrollup result:\n\n```mermaid\nflowchart LR\n    RAW[\"Raw input\u003cbr/\u003eCSV, JSONL, logs\"]\n    IMPORT[\"Import once\u003cbr/\u003eparse + infer schema\"]\n    SNOUT[\"dataset.snout\u003cbr/\u003etyped columnar snapshot\"]\n    QUERY[\"Query repeatedly\"]\n    TRANSFORM[\"Transform\"]\n    MERGE[\"Merge / rollup\"]\n\n    RAW --\u003e IMPORT --\u003e SNOUT\n    SNOUT --\u003e QUERY\n    SNOUT --\u003e TRANSFORM\n    SNOUT --\u003e MERGE\n```\n\nFor example:\n\n```bash\n# Parse and infer the raw file once\n./snout csv-import calls.csv calls.snout\n\n# Reuse the typed snapshot for later analysis\n./snout info calls.snout\n./snout stats calls.snout jitter_ms\n./snout -f calls.snout group=region -- p95=jitter_ms count=rows\n```\n\n### Why use it?\n\n| Benefit | What it means |\n|---|---|\n| Schema is preserved | Types and nullable columns do not need to be inferred again |\n| Text parsing is avoided | Repeated queries read typed binary values instead of reparsing CSV, JSON, or logs |\n| Columnar organization | Values of the same column and type are stored together |\n| Compact repeated strings | String and timestamp columns use dictionary encoding when it saves space |\n| Safer local persistence | Files include magic bytes, versioning, size validation, and a footer |\n| Ready for data workflows | `.snout` files can be queried, transformed, appended, consolidated, compacted, or rolled up |\n| Language embedding | The C ABI can open `.snout` files directly with `snout_open` |\n\n### What is inside?\n\nThe format divides rows into chunks of up to **65,536 rows**. 
Within each\nchunk, every column has its own typed block:\n\n- a null mask;\n- the encoded column values;\n- a null count;\n- numeric minimum and maximum values;\n- an encoding marker (`Plain` or `Dictionary`).\n\nThis structure prepares the format for optimizations such as skipping chunks\nwhose min/max range cannot match a filter. The metadata is already stored, but\nquery-time chunk skipping is **not yet implemented** in `v0.2.1`.\n\n### When should you create one?\n\nUse `.snout` when:\n\n- you will query the same raw dataset more than once;\n- parsing or schema inference is a noticeable part of the workflow;\n- you need to merge files collected by day, service, region, or source;\n- you want to save a transformed dataset or a compact rollup;\n- you are opening the data through the C ABI.\n\nQuery the raw file directly when:\n\n- it is a one-off inspection;\n- you only need `sniff`, which can profile CSV, JSONL, and logs as a bounded-memory stream;\n- keeping the original text format is more important than reusable typed storage.\n\n`.snout` is an application format, not a general interchange standard. 
Keep\nthe original source files when you need interoperability with other tools.\nThe format is versioned and the v2 reader retains v1 compatibility, but the\nformat may still evolve before SnoutDB `v1.0.0`.\n\n---\n\n## Community\n\nContributions and focused technical discussion are welcome.\n\n| Resource | Purpose |\n|---|---|\n| [Contributing Guide](CONTRIBUTING.md) | Setup, branches, commits, tests, benchmarks, and PR expectations |\n| [Code of Conduct](CODE_OF_CONDUCT.md) | Expected behavior and enforcement |\n| [Security Policy](SECURITY.md) | Private vulnerability reporting and supported versions |\n| [Support Guide](SUPPORT.md) | Usage questions, bugs, and feature requests |\n| [Repository Guide](docs/REPOSITORY-GUIDE.md) | Branch protection, merge strategy, labels, and releases |\n| [Benchmarks](benchmarks/README.md) | Reproducible performance methodology and current baseline |\n| [Roadmap](ROADMAP.md) | Near-term priorities, path to v1.0, and non-goals |\n\nThe project uses short-lived branches, Conventional Commits, focused pull\nrequests, strict Odin checks, and squash merges. Every behavior change should\ninclude tests; hot-path changes should include benchmark evidence.\n\n---\n\n## License And Data Handling\n\nSnoutDB is distributed under the\n[GNU Affero General Public License v3](LICENSE). The license text is the\nauthoritative source for redistribution and modification terms; this summary\nis not legal advice. The experimental C ABI is part of the same AGPL-licensed\nproject and is not distributed under a separate permissive exception.\n\nSnoutDB runs locally and does not include an account system, hosted service,\ndata upload, or telemetry. Hunt reports may contain input-derived values,\nmessage samples, timestamps, paths, and reproduction commands. Treat exports\nas potentially sensitive and review or redact them before sharing.\n\n---\n\n## Build\n\n### 1. 
Install Odin\n\nInstall a current Odin release using the\n[official installation guide](https://odin-lang.org/docs/install/).\n\nOn macOS, Homebrew provides the shortest setup:\n\n```bash\nbrew install odin\nodin version\n```\n\nOdin publishes builds for macOS, Linux, Windows, and several BSD targets.\nSnoutDB's automated validation currently runs on macOS.\n\n### 2. Build SnoutDB\n\n```bash\n# CLI binary\nodin build ./cmd/snout -out:snout\n\n# Shared C library (optional — needed for FFI / embedding)\nodin build ./cabi -build-mode:shared -out:libsnout\n```\n\n### 3. Run tests\n\n```bash\nodin test ./tests -out:tests/snout_tests\n```\n\nAll 343 tests should pass in under a second. Tests must run from the repo root\nbecause fixture paths are relative to `tests/fixtures/`.\n\n---\n\n## Step 1 — Look at your data\n\nBefore doing anything else, let SnoutDB tell you what's in a file:\n\n```bash\n./snout csv-info mydata.csv\n./snout jsonl-info mydata.jsonl\n./snout log-info access.log       # auto-detects CLF, logfmt, syslog, …\n```\n\nThis shows you the column names, their types (number, text, true/false, date), and whether any values are missing.\n\n**Example output — CSV:**\n```\ntable: calls\nrows: 500\ncolumns:\n  region       String    nullable=false\n  carrier      String    nullable=false\n  jitter_ms    Float64   nullable=true\n  roaming      Bool      nullable=true\n  result       String    nullable=false\n```\n\n**Example output — access log (CLF auto-detected):**\n```\ntable: access\nrows: 12847\ncolumns:\n  ip           String      nullable=false\n  timestamp    Timestamp   nullable=false\n  method       String      nullable=false\n  path         String      nullable=false\n  status       Int64       nullable=false\n  bytes        Int64       nullable=true\n```\n\n**Reading from stdin:** pipe data in by passing `-f -`. SnoutDB auto-detects CSV vs JSONL from the first line. 
Log files can also be piped — use `log-import` to write them to a temp `.snout` file first, or pipe directly into `sniff`:\n\n```bash\n# Read CSV from stdin\ncat mydata.csv | ./snout -f - group=region -- count=rows\n```\n```\nregion    count\n--------  -----\nus-east      89\nus-west      79\neu-west      87\neu-east      64\nap-south     71\nap-north    110\n```\n\n```bash\n# Profile a log directly from stdin\ncat access.log | ./snout sniff -f -\n```\n```\ncolumn    type       role        nulls   distinct  details\n--------  ---------  ----------  ------  --------  -----------------------------------------\nip        String     Identifier      0      8231  (high cardinality — 8231 unique values)\nmethod    String     Dimension       0         5  top: GET (9115), PUT (990), POST (894)\npath      String     Identifier      0      2341  (high cardinality — 2341 unique values)\nstatus    Int64      Metric          0         6  min=200 mean=231 max=504 σ=82 outliers=0\nbytes     Int64      Metric          0      4821  min=0 mean=3723 max=982341 σ=14821 outliers=23\ntimestamp Timestamp  Timestamp       0     12847  2026-06-11T00:00:03Z → 2026-06-11T23:59:58Z\n```\n\n```bash\n# Import a log from stdin, then query it\ncat access.log | ./snout log-import - access_tmp.snout \u0026\u0026 \\\n  ./snout -f access_tmp.snout group=status -- count=rows --sort count=rows desc\n```\n```\nwritten: access_tmp.snout\ntable: access_tmp\nrows: 12847\ncolumns: 6\n\nstatus  count\n------  -----\n200      9104\n404      1823\n301       892\n403       421\n500       312\n304       295\n```\n\n---\n\n## Step 2 — Get statistics on a column\n\nWant to know the range and distribution of a column?\n\n```bash\n./snout csv-stats mydata.csv jitter_ms\n\n# For log files, import first then run stats on the .snout file\n./snout log-import access.log access.snout\n./snout stats access.snout bytes\n```\n\n**Example output — CSV column:**\n```\ncolumn: jitter_ms\ntype: Float64\ncount: 487\nnulls: 
13\nsum: 27431.200000\navg: 56.330000       ← average value\nmin: 0.500000        ← lowest value\nmax: 99.800000       ← highest value\np50: 55.200000       ← half of values are below this (the \"middle\")\np95: 93.100000       ← 95% of values are below this\np99: 98.400000       ← 99% of values are below this\n```\n\n**Example output — log column (bytes transferred):**\n```\ncolumn: bytes\ntype: Int64\ncount: 12705\nnulls: 142\nsum: 61254321\navg: 4821.000000     ← average response size in bytes\nmin: 0.000000        ← empty responses (304 Not Modified, etc.)\nmax: 982341.000000   ← largest single response\np50: 2048.000000     ← half of responses are smaller than 2 KB\np95: 48291.000000    ← 95% of responses are under ~47 KB\np99: 412847.000000   ← the heaviest 1% of responses\n```\n\n`p50` is the median. `p95` and `p99` are useful for spotting worst-case outliers — for example, even if the average response size is fine, `p99` tells you what the heaviest 1% of responses look like.\n\n---\n\n## Step 3 — Explore an unfamiliar file (sniff)\n\nIf you have a file you've never seen before, `sniff` profiles every column automatically and suggests useful queries:\n\n```bash\n./snout sniff -f mydata.csv\n./snout sniff -f mydata.jsonl\n./snout sniff -f mydata.snout\n./snout sniff -f access.log        # log files work too — format auto-detected\n./snout sniff -f app.log\n```\n\n**Example output:**\n```\ncolumn      type     role       nulls  distinct  details\n----------  -------  ---------  -----  --------  ------------------------------------------\nregion      String   Dimension      0         6  top: us-east (89), us-west (79), eu-west (87)\njitter_ms   Float64  Metric        13       487  min=0.5 mean=56.3 max=99.8 σ=28.4 outliers=3\nroaming     Bool     Metric        22         2  true=68, false=410\nresult      String   Dimension      0         3  top: completed (320), failed (110), dropped (70)\n\nsuggested queries\n-----------------\n1. 
compare jitter_ms across region\n   ./snout -f mydata.csv group=region -- avg=jitter_ms count=rows --sort avg=jitter_ms desc\n```\n\nSnoutDB reads the column names and classifies each one as a **Dimension** (a category you can group by, like region or country) or a **Metric** (a number you can measure, like delay or price). It then generates ready-to-run query commands for you.\n\nMetric columns also show `σ` (standard deviation) and `outliers` (values more than 3σ from the mean) — a quick signal for anomalous data before you write a single query.\n\n**Example output — access log:**\n```\ncolumn    type       role        nulls   distinct  details\n--------  ---------  ----------  ------  --------  -----------------------------------------\nip        String     Identifier      0     12823  top: 10.0.0.5 (314), 10.0.0.12 (271)\nmethod    String     Dimension       0         4  top: GET (9821), POST (2134), PUT (892)\npath      String     Dimension       0      1047  top: /api/v1/health (823), /api/v1/data (612)\nstatus    Int64      Dimension       0         6  top: 200 (9104), 404 (1823), 500 (312)\nbytes     Int64      Metric        142     11203  min=0 mean=4821 max=982341 σ=12847 outliers=18\ntimestamp Timestamp  Timestamp       0     12847  2026-06-11T00:00:01Z → 2026-06-11T23:59:58Z\n\nsuggested queries\n-----------------\n1. count requests by status\n   ./snout -f access.snout group=status -- count=rows --sort count=rows desc\n2. 
p95 response size by path\n   ./snout -f access.snout group=path -- p95=bytes count=rows --sort p95=bytes desc --limit 10\n```\n\nOptions:\n- `--top 5` — show the 5 most common values per text column (default: 10)\n- `--suggestions 3` — limit to 3 suggested queries\n- `--format json` — output as JSON for piping to other tools\n\n---\n\n## Hunt — Discover what deserves attention\n\nUse `hunt` when you want SnoutDB to move beyond profiling and rank the strongest\nsignals automatically:\n\n```bash\n./snout hunt mydata.csv\n./snout hunt events.jsonl\n./snout hunt application.log\n./snout hunt dataset.snout\n```\n\nThe default (compact) view fits a triage decision on one screen — severity\noverview, frequent message templates per level, and a ranked list of attention\nfindings where each row carries its own sparkline showing when the events\nhappened inside the file's full timeline:\n\n![Hunt compact triage view — severity stack, frequent patterns, and findings with one-line sparklines](docs/assets/hunt-compact-overview.png)\n\nEach compact finding row is laid out so the most important signal is closest\nto the eye:\n\n```\n  [71]  ERROR   │___________________▁▇▁__________│  (35×)  cache miss key=session:\u003cuuid\u003e\n   │      │              │                        │       │\n   score  severity tag   sparkline over the      events   normalized message template\n                         file's full timeline    counted\n```\n\nSwitch to `--verbose` when you have decided to investigate one of those rows:\n\n```bash\n./snout hunt application.log --verbose\n```\n\nVerbose mode includes:\n\n- findings ordered by severity;\n- full-width temporal histograms;\n- event count and share;\n- peak count and time;\n- first and last match;\n- bounded representative samples;\n- grouped commands to reproduce the evidence.\n\nEach finding shows when the events happened, how concentrated they are, and a\nsample with the variable parts (`\u003cuuid\u003e`, `\u003cn\u003e`, URLs) 
intact so the original\ncontext is preserved:\n\n![ERROR and WARN log_pattern findings — a steady scatter, a ramp window, and a sharp burst all visible in their sparklines](docs/assets/hunt-error-and-warn-patterns.png)\n\n`--verbose` also lifts the INFO-pattern filter so frequent informational\ntemplates surface alongside the attention findings. The reproduce block is\ngrouped by command — one entry per shared query rather than a repeated line\nper finding:\n\n![INFO patterns from --verbose and the grouped reproduce footer covering multiple findings at once](docs/assets/hunt-info-patterns-and-reproduce.png)\n\nUseful options:\n\n| Option | Meaning |\n|---|---|\n| `--limit \u003cn\u003e` | Maximum ranked findings; `0` means no cap |\n| `--min-score \u003c0..100\u003e` | Minimum score required for a finding |\n| `--verbose` | Detailed evidence, INFO patterns, histograms, and reproduction commands |\n| `--color auto\\|always\\|never` | ANSI color policy for terminal output |\n| `--format table\\|json\\|jsonl` | Human-readable or structured output |\n| `-o report.txt` | Save a color-free text report |\n| `-o report.md` | Save a structured Markdown report |\n| `--logformat \u003cname\u003e` | Override log format detection |\n| `--logpattern \u003cpattern\u003e` | Named-group pattern used with `--logformat regex` |\n| `--strict` | Fail on malformed log records |\n\n`-o` uses the file extension to choose TXT or Markdown and cannot be combined\nwith `--format`.\n\n---\n\n## Step 4 — Ask questions about your data\n\nThe core command groups rows by a category and computes numbers for each group. 
It works on CSV, JSONL, and `.snout` files, and on log files too after a quick `log-import`.\n\n**Basic pattern:**\n```bash\n./snout -f mydata.csv   group=COLUMN  --  CALCULATION=COLUMN  ...\n./snout -f mydata.snout group=COLUMN  --  CALCULATION=COLUMN  ...\n```\n\n**Log file workflow:** import once, then query as many times as you like:\n```bash\n./snout log-import access.log access.snout\n./snout -f access.snout group=status -- count=rows --sort count=rows desc\n```\n```\nwritten: access.snout\ntable: access\nrows: 12847\ncolumns: 6\n\nstatus  count\n------  -----\n200      9104\n404      1823\n301       892\n403       421\n500       312\n304       295\n```\n\n### Grouping and counting\n\n```bash\n# How many rows per region?\n./snout -f mydata.csv group=region -- count=rows\n```\n```\nregion    count\n--------  -----\nus-east      89\nus-west      79\neu-west      87\neu-east      64\nap-south     71\nap-north    110\n```\n\n```bash\n# How many rows per region AND carrier?\n./snout -f mydata.csv group=region,carrier -- count=rows\n```\n```\nregion    carrier   count\n--------  --------  -----\nus-east   AT\u0026T         31\nus-east   Verizon      28\nus-east   T-Mobile     30\nus-west   AT\u0026T         25\nus-west   Verizon      27\n...\n```\n\n```bash\n# How many requests per HTTP status code?\n./snout -f access.snout group=status -- count=rows --sort count=rows desc\n```\n```\nstatus  count\n------  -----\n200      9104\n404      1823\n301       892\n403       421\n500       312\n304       295\n```\n\n```bash\n# Requests broken down by method AND status\n./snout -f access.snout group=method,status -- count=rows\n```\n```\nmethod  status  count\n------  ------  -----\nGET     200      8210\nGET     301       892\nGET     304       295\nGET     404      1823\nPOST    200       894\nPOST    500       312\n```\n\n### Averages, totals, min, max\n\n```bash\n# Average delay per region\n./snout -f mydata.csv group=region -- avg=jitter_ms\n```\n```\nregion    
avg_jitter_ms\n--------  -------------\nus-east           48.30\nus-west           61.20\neu-west           52.70\neu-east           58.90\nap-south          64.10\nap-north          53.80\n```\n\n```bash\n# Average, total, min, and max all at once\n./snout -f mydata.csv group=region -- avg=jitter_ms sum=jitter_ms min=jitter_ms max=jitter_ms\n```\n```\nregion    avg_jitter_ms  sum_jitter_ms  min_jitter_ms  max_jitter_ms\n--------  -------------  -------------  -------------  -------------\nus-east           48.30        4299.70           0.50          98.20\nus-west           61.20        4834.80           1.20          99.80\neu-west           52.70        4584.90           0.80          97.40\neu-east           58.90        3769.60           2.10          96.30\nap-south          64.10        4551.10           1.50          99.10\nap-north          53.80        5918.00           0.50          98.70\n```\n\n```bash\n# Average response size per HTTP method\n./snout -f access.snout group=method -- avg=bytes sum=bytes count=rows\n```\n```\nmethod  avg_bytes  sum_bytes   count\n------  ---------  ----------  -----\nDELETE    1024.00      82944     81\nGET       3821.00   34820810   9115\nHEAD         0.00          0     12\nPOST      8412.00    7522332    894\nPUT       4821.00    4773759    990\n```\n\n### Percentiles — understanding your worst cases\n\n`p50` is the median (the middle value). `p95` means \"95% of values are below this number\" — it tells you what the worst 5% of cases look like. 
`p99` is the worst 1%.\n\n```bash\n# What does the worst 5% of delay look like per region?\n./snout -f mydata.csv group=region -- p95=jitter_ms p50=jitter_ms count=rows\n```\n```\nregion    p95_jitter_ms  p50_jitter_ms  count\n--------  -------------  -------------  -----\nap-south          97.10          62.40     71\nus-west           96.80          58.90     79\neu-east           95.30          56.20     64\nap-north          94.10          51.30    110\neu-west           93.80          50.10     87\nus-east           91.20          45.80     89\n```\n\n```bash\n# Which endpoints have the largest responses at the 95th percentile?\n./snout -f access.snout group=path -- p95=bytes p50=bytes count=rows \\\n  --sort p95=bytes desc \\\n  --limit 10\n```\n```\npath                   p95_bytes  p50_bytes  count\n---------------------  ---------  ---------  -----\n/api/v1/export            982341      48291    312\n/api/v1/upload            721834      24182    891\n/api/v1/reports           312481       8192    421\n/api/v1/data              124821       4821   2134\n/api/v1/search             48291       2048   3821\n/api/v1/users              24182       1024   1203\n/api/v1/health              1024        512   4823\n```\n\nYou can use any number from 0 to 99: `p50`, `p75`, `p90`, `p95`, `p99`.\n\n### Error rate — fraction of \"true\" values\n\nIf a column holds true/false values (like `roaming` or `failed`), `error_rate` tells you what fraction of rows are `true`:\n\n```bash\n# What fraction of calls were roaming, per region?\n./snout -f mydata.csv group=region -- error_rate=roaming count=rows\n```\n```\nregion    error_rate_roaming  count\n--------  ------------------  -----\nap-south                0.28     71\nus-west                 0.21     79\nap-north                0.18    110\neu-east                 0.14     64\nus-east                 0.12     89\neu-west                 0.09     87\n```\n\n```bash\n# What fraction of requests ended in error, per service? 
(logfmt logs)\n./snout -f app.snout group=service -- error_rate=error count=rows \\\n  --sort error_rate=error desc\n```\n```\nservice    error_rate_error  count\n---------  ----------------  -----\npayments              0.410    744\ninventory             0.120    321\nauth                  0.020   1187\ngateway               0.010   2341\n```\n\nA result of `0.41` means 41% of rows in that group had `error=true`.\n\n### Distinct count — how many unique values per group?\n\nUse `count_distinct` to count unique values of one column within each group, without pulling all the data out:\n\n```bash\n# How many distinct carriers appear per region?\n./snout -f mydata.csv group=region -- count_distinct=carrier count=rows\n```\n```\nregion    count_distinct_carrier  count\n--------  ----------------------  -----\nus-east                        4     89\nus-west                        4     79\neu-west                        3     87\neu-east                        3     64\nap-south                       2     71\nap-north                       3    110\n```\n\n```bash\n# How many unique IPs hit each endpoint?\n./snout -f access.snout group=path -- count_distinct=ip count=rows \\\n  --sort count=rows desc \\\n  --limit 5\n```\n```\npath                   count_distinct_ip  count\n---------------------  -----------------  -----\n/api/v1/health                      8231   4823\n/api/v1/search                      3214   3821\n/api/v1/data                        1821   2134\n/api/v1/users                       1102   1203\n/api/v1/upload                       312    891\n```\n\n```bash\n# Combine with other aggregates\n./snout -f mydata.csv group=region -- avg=jitter_ms count_distinct=carrier count=rows \\\n  --sort avg=jitter_ms desc\n```\n```\nregion    avg_jitter_ms  count_distinct_carrier  count\n--------  -------------  ----------------------  -----\nap-south          64.10                       2     71\nus-west           61.20                       4     79\neu-east    
       58.90                       3     64\nap-north          53.80                       3    110\neu-west           52.70                       3     87\nus-east           48.30                       4     89\n```\n\nThe result column is named `count_distinct_carrier`. You can use `count_distinct` on any column type — strings, numbers, or booleans.\n\n### Filtering rows before counting\n\nUse `--where` to focus on a subset:\n\n```bash\n# Only look at completed calls\n./snout -f mydata.csv group=region -- avg=jitter_ms count=rows \\\n  --where result eq completed\n```\n```\nregion    avg_jitter_ms  count\n--------  -------------  -----\nus-east           44.10    298\nus-west           57.30    241\neu-west           49.80    271\neu-east           54.20    198\nap-south          59.70    204\nap-north          49.20    323\n```\n\n```bash\n# Only 5xx errors — which paths are broken?\n./snout -f access.snout group=path -- count=rows \\\n  --where status ge 500 \\\n  --sort count=rows desc\n```\n```\npath                   count\n---------------------  -----\n/api/v1/export           182\n/api/v1/upload           130\n/api/v1/data              84\n```\n\n```bash\n# Combine filters (all conditions must be true)\n./snout -f mydata.csv group=region -- avg=jitter_ms count=rows \\\n  --where result eq completed \\\n  --where jitter_ms not-null\n```\n```\nregion    avg_jitter_ms  count\n--------  -------------  -----\nus-east           44.10    286\nus-west           57.30    229\neu-west           49.80    259\neu-east           54.20    191\nap-south          59.70    196\nap-north          49.20    311\n```\n\n```bash\n# Search inside log messages\n./snout -f warp.log group=level,message -- count=rows \\\n  --where message icontains telemetry \\\n  --sort count=rows desc\n```\n\n**Filter operators:**\n| Operator | Meaning |\n|----------|---------|\n| `eq`       | equals |\n| `ne`       | not equals |\n| `lt`       | less than |\n| `le`       | less than or equal |\n| 
`gt`       | greater than |\n| `ge`       | greater than or equal |\n| `contains` | string contains text (case-sensitive) |\n| `not-contains` | string does not contain text |\n| `icontains` | string contains text (ASCII case-insensitive) |\n| `is-null`  | value is missing |\n| `not-null` | value is present |\n\n### Sorting and limiting results\n\n```bash\n# Show the 3 regions with the highest average delay\n./snout -f mydata.csv group=region -- avg=jitter_ms count=rows \\\n  --sort avg=jitter_ms desc \\\n  --limit 3\n```\n```\nregion    avg_jitter_ms  count\n--------  -------------  -----\nap-south          64.10     71\nus-west           61.20     79\neu-east           58.90     64\n```\n\n```bash\n# Top 5 paths by request volume in access logs\n./snout -f access.snout group=path -- count=rows p95=bytes \\\n  --sort count=rows desc \\\n  --limit 5\n```\n```\npath                   count  p95_bytes\n---------------------  -----  ---------\n/api/v1/health          4823       1024\n/api/v1/search          3821      48291\n/api/v1/data            2134     124821\n/api/v1/users           1203      24182\n/api/v1/upload           891     721834\n```\n\n```bash\n# Sort log errors by count, break ties by response size\n./snout -f access.snout group=path,status -- count=rows p99=bytes \\\n  --where status ge 400 \\\n  --sort count=rows desc \\\n  --sort p99=bytes desc\n```\n```\npath                   status  count  p99_bytes\n---------------------  ------  -----  ---------\n/api/v1/data             404      84    982341\n/api/v1/export           500      82    821934\n/api/v1/upload           500      48    721834\n/api/v1/search           404      31     48291\n```\n\n### Output formats\n\nBy default results are shown as a table. 
For scripts or piping to other tools:\n\n```bash\n./snout -f mydata.csv group=region -- avg=jitter_ms --format csv\n```\n```\nregion,avg_jitter_ms\nap-south,64.100000\nus-west,61.200000\neu-east,58.900000\nap-north,53.800000\neu-west,52.700000\nus-east,48.300000\n```\n\n```bash\n./snout -f mydata.csv group=region -- avg=jitter_ms --format json\n```\n```json\n[\n  {\"region\": \"ap-south\", \"avg_jitter_ms\": 64.1},\n  {\"region\": \"us-west\",  \"avg_jitter_ms\": 61.2},\n  {\"region\": \"eu-east\",  \"avg_jitter_ms\": 58.9}\n]\n```\n\n```bash\n# Export log analysis as JSONL for a dashboard or downstream script\n./snout -f access.snout group=status -- count=rows --format jsonl\n```\n```\n{\"status\": \"200\", \"count\": 9104}\n{\"status\": \"404\", \"count\": 1823}\n{\"status\": \"301\", \"count\": 892}\n{\"status\": \"403\", \"count\": 421}\n{\"status\": \"304\", \"count\": 295}\n{\"status\": \"500\", \"count\": 312}\n```\n\n---\n\n## Step 5 — Save your data as a .snout file\n\nImport a raw file once to create a typed, reusable snapshot. 
See\n[What is a `.snout` file?](#what-is-a-snout-file) for its layout, benefits,\nand current limitations.\n\n```bash\n# Convert CSV → .snout\n./snout csv-import mydata.csv mydata.snout\n```\n```\nwritten: mydata.snout\ntable: mydata\nrows: 500\ncolumns: 5\n```\n\n```bash\n# Convert JSONL → .snout\n./snout jsonl-import events.jsonl events.snout\n```\n```\nwritten: events.snout\ntable: events\nrows: 14821\ncolumns: 5\n```\n\n```bash\n# Convert a log file → .snout (format auto-detected)\n./snout log-import access.log access.snout\n```\n```\nwritten: access.snout\ntable: access\nrows: 12847\ncolumns: 6\n```\n\n```bash\n# Inspect what's inside\n./snout info mydata.snout\n```\n```\ntable: mydata\nrows: 500\ncolumns:\n  region       String    nullable=false\n  carrier      String    nullable=false\n  jitter_ms    Float64   nullable=true\n  roaming      Bool      nullable=true\n  result       String    nullable=false\n```\n\n```bash\n./snout stats mydata.snout jitter_ms\n```\n```\ncolumn: jitter_ms\ntype: Float64\ncount: 500\nnulls: 12\nsum: 27849.400000\navg: 55.981124\nmin: 0.500000\nmax: 99.800000\np50: 54.200000\np95: 94.100000\np99: 98.700000\n```\n\n```bash\n./snout stats access.snout bytes\n```\n```\ncolumn: bytes\ntype: Int64\ncount: 12847\nnulls: 0\nsum: 47821904\navg: 3723\nmin: 0\nmax: 982341\np50: 2048\np95: 124821\np99: 721834\n```\n\n```bash\n# Query it just like a CSV\n./snout -f mydata.snout group=region -- avg=jitter_ms count=rows\n./snout -f access.snout group=status -- count=rows --sort count=rows desc\n```\n```\nregion    avg_jitter_ms  count\n--------  -------------  -----\nus-east           48.30     89\nus-west           61.20     79\neu-west           52.70     87\neu-east           58.90     64\nap-south          64.10     71\nap-north          53.80    110\n\nstatus  count\n------  -----\n200      9104\n404      1823\n301       892\n403       421\n500       312\n304       295\n```\n\nThe file can now be reused by query, transform, merge, rollup, 
and C ABI\nworkflows without repeating raw-text schema inference.\n\n---\n\n## Step 6 — Combine multiple files\n\nIf you collect data in separate files (one per day, per source, etc.), you can merge them. A common pattern with logs: import each daily log file into `.snout`, then consolidate:\n\n```bash\n# Import three days of access logs\n./snout log-import access-2026-06-09.log day1.snout\n./snout log-import access-2026-06-10.log day2.snout\n./snout log-import access-2026-06-11.log day3.snout\n```\n```\nwritten: day1.snout  table: access  rows: 12847  columns: 6\nwritten: day2.snout  table: access  rows: 13201  columns: 6\nwritten: day3.snout  table: access  rows: 12493  columns: 6\n```\n\n```bash\n# Combine into one file\n./snout consolidate day1.snout day2.snout day3.snout week.snout\n```\n```\nwritten: week.snout\ntable: week\nrows: 38541\ncolumns: 6\n```\n\n```bash\n# Or append one more day to an existing archive\n./snout append archive.snout day3.snout updated.snout\n```\n```\nwritten: updated.snout\ntable: updated\nrows: 25340\ncolumns: 6\n```\n\n```bash\n# Merge and aggregate in one step — daily request totals by status\n./snout rollup day1.snout day2.snout day3.snout summary.snout \\\n  group=status -- count=rows\n```\n```\nwritten: summary.snout\ntable: summary\nrows: 6\ncolumns: 2\n```\n\n```bash\n# Or classic metrics rollup\n./snout rollup jan.snout feb.snout mar.snout q1.snout group=region -- count=rows avg=latency_ms\n```\n```\nwritten: q1.snout\ntable: q1\nrows: 6\ncolumns: 3\n```\n\nColumns don't need to match exactly. If one file has a column the other doesn't, the missing rows are filled with empty/null values automatically. If the same column holds different types (say, integers in one file and decimals in another), SnoutDB promotes to the wider type automatically.\n\nThe rollup output is a regular `.snout` file with one row per group. 
Aggregate columns are named after the function and source column (`count`, `avg_latency_ms`, `p95_bytes`), so you can query the result like any other file:\n\n```bash\n# Logs: aggregate three days of access logs by status, then inspect the weekly totals\n./snout rollup day1.snout day2.snout day3.snout week_summary.snout \\\n  group=status -- count=rows p95=bytes\n```\n```\nwritten: week_summary.snout\ntable: week_summary\nrows: 6\ncolumns: 3\n```\n\n```bash\n# The rollup above wrote its aggregates as `count` and `p95_bytes`; reference those names here\n./snout -f week_summary.snout group=status -- sum=count avg=p95_bytes \\\n  --sort sum=count desc\n```\n```\nstatus  sum_count  avg_p95_bytes\n------  ---------  -------------\n200         27312         2048.0\n404          5469        48291.0\n301          2676         1024.0\n403          1263         4096.0\n500           936       821934.0\n304           885          512.0\n```\n\n---\n\n## Step 7 — Reshape your data (transform)\n\nOnce data is in a `.snout` file, you can reshape it before querying. 
Multiple operations can be chained in a single command:\n\n```bash\n# Rename a column\n./snout transform in.snout out.snout rename=duration_seconds:duration_s\n# Log: rename the CLF \"bytes\" column to something clearer\n./snout transform access.snout access.snout rename=bytes:response_bytes\n\n# Change a column's type\n./snout transform in.snout out.snout cast=sip_code:string\n# Log: cast the status code to string so it groups as a label\n./snout transform access.snout access.snout cast=status:string\n\n# Add a computed column (binary expression: +, -, *, /)\n./snout transform in.snout out.snout derive=total_delay:jitter_ms+rtt_ms\n# Log: compute kilobytes from bytes\n./snout transform access.snout access.snout derive=response_kb:bytes/1024\n\n# Bin a numeric column into labelled tiers\n# Format: bucket=col:edge1,edge2,...:label1,label2,...:output_col\n# Values below the first edge or above the last get NULL\n./snout transform in.snout out.snout bucket=latency_ms:0,100,500:fast,slow:speed_tier\n# Log: classify HTTP status codes into ok / redirect / client_error / server_error\n./snout transform access.snout access.snout \\\n  bucket=status:0,300,400,500,600:ok,redirect,client_error,server_error:status_class\n\n# Truncate timestamps to a time unit (year, month, day, hour, minute)\n./snout transform in.snout out.snout date_trunc=timestamp:hour\n# Log: group access log entries by hour for time-series analysis\n./snout transform access.snout access_by_hour.snout date_trunc=timestamp:hour\n\n# Extract a regex capture group into a new column\n# Format: regex_extract=source_col:pattern:group_number:output_col\n./snout transform in.snout out.snout regex_extract=path:/users/([0-9]+)/:1:user_id\n# Log: extract the top-level endpoint from paths like /api/v1/users/42\n./snout transform access.snout access.snout regex_extract=path:^(/[^/?]+):1:endpoint\n\n# Extract a field from a JSON string column\n# Format: json_extract=source_col:key:output_col\n./snout transform in.snout 
out.snout json_extract=meta:env:environment\n# Log: logfmt logs often have a JSON \"meta\" field — extract the service name\n./snout transform app.snout app.snout json_extract=meta:service:service_name\n```\n\nYou can chain multiple operations in one pass:\n\n```bash\n# Generic metrics pipeline\n./snout transform raw.snout clean.snout \\\n  rename=duration_seconds:duration_s \\\n  derive=total_delay:jitter_ms+rtt_ms \\\n  bucket=total_delay:0,100,500:fast,slow:speed_tier \\\n  date_trunc=timestamp:hour\n```\n```\nwritten: clean.snout\ntable: clean\nrows: 500\ncolumns: 7\n```\n\n```bash\n# Access log enrichment pipeline — one command, one pass\n./snout log-import access.log access.snout\n./snout transform access.snout access_enriched.snout \\\n  date_trunc=timestamp:hour \\\n  regex_extract=path:^(/[^/?]+):1:endpoint \\\n  bucket=status:0,300,400,500,600:ok,redirect,client_error,server_error:status_class \\\n  derive=response_kb:bytes/1024\n```\n```\nwritten: access.snout\ntable: access\nrows: 12847\ncolumns: 6\n\nwritten: access_enriched.snout\ntable: access_enriched\nrows: 12847\ncolumns: 10\n```\n\n**Log file example** — enrich an access log after import, then query the enriched file:\n\n```bash\n./snout log-import access.log access.snout\n\n./snout transform access.snout access_enriched.snout \\\n  date_trunc=timestamp:hour \\\n  regex_extract=path:^(/[^/?]+):1:endpoint \\\n  bucket=status:0,400,500,600:ok,client_error,server_error:status_class\n```\n```\nwritten: access_enriched.snout\ntable: access_enriched\nrows: 12847\ncolumns: 9\n```\n\n```bash\n# Now query by hour and endpoint\n./snout -f access_enriched.snout group=timestamp,endpoint -- count=rows p95=bytes \\\n  --where status_class eq server_error \\\n  --sort count=rows desc\n```\n```\ntimestamp             endpoint      count  p95_bytes\n--------------------  ------------  -----  ---------\n2026-06-11T15:00:00Z  /api           182    821934\n2026-06-11T14:00:00Z  /api           130    
721834\n2026-06-11T16:00:00Z  /api            84    124821\n2026-06-11T13:00:00Z  /static         31     48291\n```\n\n---\n\n## Step 8 — Analyze log files\n\nSnoutDB auto-detects the format of `.log`, `.access`, and `.error` files. You only need `--format` when the file extension is ambiguous or when using a custom regex pattern:\n\n```bash\n# Start with an automatic ranked investigation\n./snout hunt application.log\n\n# Expand every finding and save a shareable Markdown report\n./snout hunt application.log --verbose -o application-hunt.md\n```\n\n```bash\n# Schema inspection — format is auto-detected from content\n./snout log-info access.log\n```\n```\ntable: access\nrows: 12847\nparse_errors: 0\nformat: combined\ncolumns:\n  ip           String     nullable=false\n  timestamp    Timestamp  nullable=false\n  method       String     nullable=false\n  path         String     nullable=false\n  status       Int64      nullable=false\n  bytes        Int64      nullable=true\n  referer      String     nullable=true\n  user_agent   String     nullable=true\n```\n\n```bash\n./snout log-info app.log\n```\n```\ntable: app\nrows: 4593\nparse_errors: 0\nformat: logfmt\ncolumns:\n  timestamp    Timestamp  nullable=false\n  level        String     nullable=false\n  service      String     nullable=false\n  msg          String     nullable=false\n  latency_ms   Float64    nullable=true\n  error        Bool       nullable=true\n```\n\n```bash\n# Import to .snout for fast querying\n./snout log-import access.log access.snout\n```\n```\nwritten: access.snout\ntable: access\nrows: 12847\ncolumns: 8\n```\n\n```bash\n# Profile directly (no import needed)\n./snout sniff -f access.log\n```\n```\ncolumn       type       role        nulls   distinct  details\n-----------  ---------  ----------  ------  --------  --------------------------------------------------------\nip           String     Identifier      0      8231  (high cardinality — 8231 unique values)\ntimestamp    Timestamp  
Timestamp       0     12847  2026-06-11T00:00:03Z → 2026-06-11T23:59:58Z\nmethod       String     Dimension       0         5  top: GET (9115), POST (894), PUT (990)\npath         String     Identifier      0      2341  (high cardinality — 2341 unique values)\nstatus       Int64      Metric          0         6  min=200 mean=231 max=504 σ=82 outliers=0\nbytes        Int64      Metric          0      4821  min=0 mean=3723 max=982341 σ=14821 outliers=23\nreferer      String     Dimension     891       412  top: https://example.com (1203), - (4821)\nuser_agent   String     Identifier      0      1821  (high cardinality — 1821 unique values)\n\nsuggested queries\n-----------------\n1. compare bytes across method\n   ./snout -f access.snout group=method -- avg=bytes p95=bytes count=rows\n2. compare bytes across status\n   ./snout -f access.snout group=status -- avg=bytes p95=bytes count=rows\n3. find outlier bytes values (23 detected beyond 3σ)\n   ./snout -f access.snout group=path -- count=rows --where bytes gt 58086 --sort count=rows desc\n```\n\n```bash\n./snout sniff -f app.log\n```\n```\ncolumn       type       role        nulls   distinct  details\n-----------  ---------  ----------  ------  --------  --------------------------------------------------------\ntimestamp    Timestamp  Timestamp       0      4593  2026-06-11T13:00:01Z → 2026-06-11T16:59:58Z\nlevel        String     Dimension       0         4  top: info (2841), warn (891), error (744), debug (117)\nservice      String     Dimension       0         4  top: gateway (2341), auth (1187), inventory (321), payments (744)\nmsg          String     Identifier      0       892  (high cardinality — 892 unique values)\nlatency_ms   Float64    Metric        214      3821  min=0.20 mean=42.10 max=8921.00 σ=312.40 outliers=19\nerror        Bool       Metric          0         2  true=892, false=3701\n\nsuggested queries\n-----------------\n1. 
compare latency_ms across service\n   ./snout -f app.snout group=service -- avg=latency_ms p95=latency_ms count=rows\n2. error rate by service\n   ./snout -f app.snout group=service -- error_rate=error count=rows --sort error_rate=error desc\n3. find outlier latency_ms values (19 detected beyond 3σ)\n   ./snout -f app.snout group=service -- count=rows --where latency_ms gt 979 --sort count=rows desc\n```\n\n```bash\n# Override auto-detect when needed\n./snout log-info app.log --format logfmt\n```\n```\ntable: app\nrows: 4593\nparse_errors: 0\nformat: logfmt\ncolumns:\n  timestamp    Timestamp  nullable=false\n  level        String     nullable=false\n  service      String     nullable=false\n  msg          String     nullable=false\n  latency_ms   Float64    nullable=true\n  error        Bool       nullable=true\n```\n\n```bash\n# Custom format with named regex groups\n./snout log-import custom.log out.snout \\\n  --format regex \\\n  --pattern '(?P\u003cip\u003e\\S+) \\[(?P\u003cts\u003e[^\\]]+)\\] \"(?P\u003cmethod\u003e\\S+) (?P\u003cpath\u003e\\S+)\" (?P\u003cstatus\u003e\\d+)'\n```\n```\nwritten: out.snout\ntable: out\nrows: 8421\ncolumns: 5\n```\n\n**Supported formats:**\n- `clf` — Apache/Nginx Common Log Format\n- `combined` — CLF plus `referer` and `user_agent` columns\n- `logfmt` — `key=value` pairs (used by Logrus, Zap, etc.)\n- `syslog` — RFC 3164 (`Jun 11 10:00:01 host app[pid]: message`), with or without PRI prefix (`\u003c134\u003e`)\n- `app` — application logs in `YYYY-MM-DD HH:MM:SS [level] message` format\n- `bracketed` — application logs with bracketed levels and mixed ISO timestamps\n- `regex` — custom format with `(?P\u003cname\u003e...)` named groups\n\nCLF timestamps are converted to ISO-8601 UTC automatically. 
Syslog timestamps use a `0000-MM-DD` year placeholder (RFC 3164 does not include a year).\n\n---\n\n## Step 9 — Embed SnoutDB in your application (C API)\n\nSnoutDB ships a shared library (`libsnout`) with an experimental C ABI so you\ncan call it from any language that supports FFI. The ABI may change before\n`v1.0.0`.\n\n**Build the library:**\n\n```bash\n./scripts/build-cabi.sh          # → libsnout.dylib (macOS) / libsnout.so (Linux)\n```\n\n**Include the header:**\n\n```c\n#include \"include/snoutdb.h\"\n```\n\n**Example — load a CSV and run a group query from C:**\n\n```c\n#include \u003cstdio.h\u003e\n#include \"include/snoutdb.h\"\n\nint main(void) {\n    // Load the CSV into an in-memory table.\n    SnoutTable* t = snout_import_csv(\"calls.csv\");\n    if (t == NULL) {\n        // Assumption: a NULL return signals failure; check snoutdb.h for the exact contract.\n        fprintf(stderr, \"import failed: %s\\n\", snout_last_error());\n        return 1;\n    }\n\n    // avg(jitter_ms) + count(*) by region, sorted desc\n    SnoutResult* r = snout_query(t,\n        \"region\",               // group by\n        \"avg=jitter_ms count=rows\",\n        NULL, 0,                // no filters\n        \"avg=jitter_ms desc\",   // sort\n        0                       // no limit\n    );\n    if (r == NULL) {\n        fprintf(stderr, \"query failed: %s\\n\", snout_last_error());\n        snout_close(t);\n        return 1;\n    }\n\n    // Print every result cell as text, one result row per line.\n    int rows = snout_result_row_count(r);\n    int cols = snout_result_col_count(r);\n    for (int row = 0; row \u003c rows; row++) {\n        for (int col = 0; col \u003c cols; col++) {\n            printf(\"%s  \", snout_result_get_string(r, row, col));\n        }\n        printf(\"\\n\");\n    }\n\n    // Free the result before the table it was computed from.\n    snout_result_free(r);\n    snout_close(t);\n    return 0;\n}\n```\n\nLog ingestion is currently exposed through the CLI. Convert a log to `.snout`\nwith `snout log-import`, then open it from the C API with `snout_open`.\n\nThe same API works from Python (`ctypes`), Go (`cgo`), and any other language\nwith C FFI. 
See [`examples/`](examples/README.md) for ready-to-run demos.\n\n**API overview:**\n\n| Function | Description |\n|---|---|\n| `snout_import_csv(path)` | Load a CSV file into an in-memory table |\n| `snout_import_jsonl(path)` | Load a JSONL file |\n| `snout_open(path)` | Open an existing `.snout` file |\n| `snout_close(t)` | Free the table |\n| `snout_row_count(t)` | Number of rows |\n| `snout_column_count(t)` | Number of columns |\n| `snout_column_name(t, col)` | Column name by index |\n| `snout_column_type(t, col)` | Column type (`SNOUT_TYPE_*` constant) |\n| `snout_is_null(t, row, col)` | 1 if the cell is null |\n| `snout_get_string/int64/float64/bool(t, row, col)` | Read a cell value |\n| `snout_query(t, groups, aggs, where, n, sort, limit)` | Group-by aggregation |\n| `snout_result_free(r)` | Free a query result |\n| `snout_result_row/col_count(r)` | Result dimensions |\n| `snout_result_get_*(r, row, col)` | Read a result cell |\n| `snout_last_error()` | Last error message (thread-local) |\n\nColumn type constants: `SNOUT_TYPE_STRING=0`, `SNOUT_TYPE_INT64=1`, `SNOUT_TYPE_FLOAT64=2`, `SNOUT_TYPE_BOOL=3`, `SNOUT_TYPE_TIMESTAMP=4`.\n\nThe full header is in [`include/snoutdb.h`](include/snoutdb.h).\n\n---\n\n## Large files\n\nSnoutDB profiles large CSV, JSONL, and log files through streaming readers\nwithout first materializing a complete `core.Table`. Exact cardinality tracking\nis bounded by the `--max-distinct` setting.\n\n```bash\n# Profile a large file\n./snout sniff -f bigfile.csv\n./snout sniff -f bigfile.jsonl\n./snout sniff -f bigfile.snout\n./snout sniff -f bigfile.log      # log files stream too\n```\n\nSee [benchmarks/README.md](benchmarks/README.md) for the current environment,\nmethodology, commands, and results.\n\n---\n\n## Timing\n\nEvery command prints how long it took to stderr, so your stdout stays clean. 
This works for every file type — CSV, JSONL, log files, and `.snout`:\n\n```bash\n./snout -f access.snout group=status -- count=rows\n# stdout → the result table\n# stderr → Elapsed: 1.42ms.\n\n./snout sniff -f access.log\n# stdout → the sniff report\n# stderr → Elapsed: 38.7ms.\n```\n\nYou can safely redirect stdout to a file or pipe without capturing the timing line:\n\n```bash\n./snout -f access.snout group=status -- count=rows --format json \u003e report.json\n# report.json gets only the data; the timing line never enters the file\n```\n\n---\n\n## Quick reference\n\n| What you want | Command |\n|---|---|\n| Show the current version | `./snout version` |\n| See column names and types (CSV) | `./snout csv-info file.csv` |\n| See column names and types (log) | `./snout log-info access.log` |\n| Stats on one column (CSV) | `./snout csv-stats file.csv column` |\n| Stats on one column (log) | `./snout log-import f.log f.snout \u0026\u0026 ./snout stats f.snout bytes` |\n| Auto-explore and get query ideas (CSV) | `./snout sniff -f file.csv` |\n| Auto-explore and get query ideas (log) | `./snout sniff -f access.log` |\n| Automatically rank findings | `./snout hunt file.log` |\n| Inspect full Hunt evidence | `./snout hunt file.log --verbose` |\n| Export a Markdown Hunt report | `./snout hunt file.log --verbose -o report.md` |\n| Export a text Hunt report | `./snout hunt file.log -o report.txt` |\n| Emit Hunt JSON | `./snout hunt file.log --format json` |\n| Sniff from stdin (auto-detects CSV/JSONL) | `cat file.csv \\| ./snout sniff -f -` |\n| Query data from stdin | `cat file.csv \\| ./snout -f - group=col -- count=rows` |\n| Query application logs from stdin | `cat app.log \\| ./snout -f - group=level,message -- count=rows --logformat app` |\n| Count rows per group | `./snout -f file.csv group=col -- count=rows` |\n| Count log requests by status | `./snout -f access.snout group=status -- count=rows` |\n| Average per group | `./snout -f file.csv group=col -- avg=col2` 
|\n| Average response size per endpoint | `./snout -f access.snout group=path -- avg=bytes count=rows` |\n| Worst-case percentile per group | `./snout -f file.csv group=col -- p95=col2` |\n| Worst-case response size per endpoint | `./snout -f access.snout group=path -- p95=bytes p99=bytes` |\n| Error/true rate per group | `./snout -f file.csv group=col -- error_rate=bool_col` |\n| Error rate per service (logfmt) | `./snout -f app.snout group=service -- error_rate=error count=rows` |\n| Distinct values per group | `./snout -f file.csv group=col -- count_distinct=col2` |\n| Unique IPs per endpoint | `./snout -f access.snout group=path -- count_distinct=ip count=rows` |\n| Filter then aggregate | add `--where col op value` |\n| Filter only 5xx errors | add `--where status ge 500` |\n| Sort results | add `--sort agg=col desc` |\n| Output as JSON | add `--format json` |\n| Save as .snout | `./snout csv-import file.csv out.snout` |\n| Merge two .snout files | `./snout append a.snout b.snout out.snout` |\n| Merge many .snout files | `./snout consolidate a.snout b.snout c.snout out.snout` |\n| Compact a .snout file | `./snout compact messy.snout clean.snout` |\n| Merge + aggregate into a summary | `./snout rollup a.snout b.snout out.snout group=col -- count=rows avg=col2` |\n| Query a rollup summary | `./snout -f summary.snout group=col -- sum=count avg=avg_col2` |\n| Rename a column | `./snout transform in.snout out.snout rename=old:new` |\n| Cast column type | `./snout transform in.snout out.snout cast=col:float64` |\n| Compute new column (binary expr) | `./snout transform in.snout out.snout derive=total:col1+col2` |\n| Bin values into labels | `./snout transform in.snout out.snout bucket=latency:0,100,500:fast,slow:tier` |\n| Truncate timestamps | `./snout transform in.snout out.snout date_trunc=ts:day` |\n| Extract regex group | `./snout transform in.snout out.snout regex_extract=url:/users/([0-9]+)/:1:uid` |\n| Extract JSON field | `./snout transform in.snout 
out.snout json_extract=payload:env:environment` |\n| Inspect a log file (auto-detect) | `./snout log-info access.log` |\n| Import log to .snout (auto-detect) | `./snout log-import access.log out.snout` |\n| Override log format | `./snout log-info app.log --format logfmt` |\n| Profile a log file | `./snout sniff -f access.log` |\n| Stats on a log column | `./snout log-import f.log f.snout \u0026\u0026 ./snout stats f.snout bytes` |\n| Count requests by status | `./snout -f access.snout group=status -- count=rows` |\n| Top endpoints by error count | `./snout -f access.snout group=path -- count=rows --where status ge 500 --sort count=rows desc` |\n| Unique IPs per endpoint | `./snout -f access.snout group=path -- count_distinct=ip count=rows` |\n| Requests per hour | `./snout transform in.snout out.snout date_trunc=timestamp:hour \u0026\u0026 ./snout -f out.snout group=timestamp -- count=rows` |\n| Combine daily log imports | `./snout consolidate day1.snout day2.snout day3.snout week.snout` |\n| Build the C shared library | `./scripts/build-cabi.sh` |\n| Run Python example | `python3 examples/python/snout_example.py` |\n| Run Go example | `cd examples/go \u0026\u0026 go run main.go` |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacovinus%2Fsnoutdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjacovinus%2Fsnoutdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacovinus%2Fsnoutdb/lists"}