{"id":26288914,"url":"https://github.com/clojure-finance/datajure","last_synced_at":"2026-04-20T06:13:22.642Z","repository":{"id":103950031,"uuid":"456021891","full_name":"clojure-finance/datajure","owner":"clojure-finance","description":"Clojure data manipulation DSL — composable query syntax built on tech.ml.dataset","archived":false,"fork":false,"pushed_at":"2026-04-17T03:55:05.000Z","size":286,"stargazers_count":14,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-17T04:17:43.450Z","etag":null,"topics":["clojure","data-manipulation","data-science","dataframe","dsl","empirical-research","query-dsl","tech-ml-dataset"],"latest_commit_sha":null,"homepage":"https://clojure-finance.github.io/datajure-website","language":"Clojure","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clojure-finance.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-02-06T00:42:57.000Z","updated_at":"2026-04-17T03:55:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"deac434e-3ca7-493b-ba1f-ed00e1f3a6dd","html_url":"https://github.com/clojure-finance/datajure","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/clojure-finance/datajure","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clojure-finance%2Fdatajure","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clojure-finance%2Fdatajure/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clojure-finance%2Fdatajure/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clojure-finance%2Fdatajure/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clojure-finance","download_url":"https://codeload.github.com/clojure-finance/datajure/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clojure-finance%2Fdatajure/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31955943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T00:39:45.007Z","status":"online","status_checked_at":"2026-04-18T02:00:07.018Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clojure","data-manipulation","data-science","dataframe","dsl","empirical-research","query-dsl","tech-ml-dataset"],"created_at":"2025-03-14T22:15:33.357Z","updated_at":"2026-04-20T06:13:22.626Z","avatar_url":"https://github.com/clojure-finance.png","language":"Clojure","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datajure v2\n\n[![Clojars Project](https://img.shields.io/clojars/v/com.github.clojure-finance/datajure.svg)](https://clojars.org/com.github.clojure-finance/datajure)\n[![CI](https://github.com/clojure-finance/datajure/actions/workflows/tests.yml/badge.svg)](https://github.com/clojure-finance/datajure/actions/workflows/tests.yml)\n[![cljdoc](https://cljdoc.org/badge/com.github.clojure-finance/datajure)](https://cljdoc.org/d/com.github.clojure-finance/datajure/CURRENT)\n\n**One function. Seven keywords. Two expression modes.**\n\nDatajure is a Clojure data manipulation library built on [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset). It provides a clean, composable query DSL for filtering, transforming, grouping, and aggregating tabular data.\n\n```clojure\n(require '[datajure.core :refer [dt nrow asc desc]])\n\n;; Filter, group, aggregate — one call\n(dt ds\n  :where #dt/e (\u003e :year 2008)\n  :by [:species]\n  :agg {:n nrow :avg #dt/e (mn :mass)})\n\n;; Window functions — same keywords, no new concepts\n(dt ds\n  :by [:species]\n  :within-order [(desc :mass)]\n  :set {:rank #dt/e (win/rank :mass)})\n\n;; OHLC bars in one call — :within-order with :agg sorts each group first\n(dt trades\n  :by [:sym]\n  :within-order [(asc :time)]\n  :agg {:open  #dt/e (first-val :price)\n        :close #dt/e (last-val :price)\n        :hi    #dt/e (mx :price)\n        :lo    #dt/e (mi :price)\n        :vol   #dt/e (sm :size)})\n\n;; Thread for multi-step pipelines\n(-\u003e ds\n    (dt :set {:bmi #dt/e (/ :mass (sq :height))})\n    (dt :by [:species] :agg {:avg-bmi #dt/e (mn :bmi)})\n    (dt :order-by [(desc :avg-bmi)]))\n```\n\nDatajure is a **syntax layer**, not an engine — it compiles `#dt/e` expressions to vectorized operations and delegates all computation to `tech.v3.dataset`. Every result is a standard `tech.v3.dataset` dataset. Full interop with tablecloth, Clerk, Clay, and the Scicloj ecosystem.\n\n## Why Datajure\n\nDatajure takes inspiration from whichever library or language got a given idea right — R's `data.table` (terse query form, single-expression semantics), APL/q/kdb+ (first-class primitives for time-series operations you use every day), Polars (expressions as values, composable vocabulary), Julia's `DataFramesMeta.jl` (one function with keyword arguments, not twenty-eight verbs). The goal is not to be any of them. It is to combine the parts that were genuinely revelations.\n\nConcretely, if you've used:\n\n- **R's `data.table`** — you'll find `DT[i, j, by]` maps directly onto `(dt ds :where i :set-or-agg j :by by)`. Nil handling is cleaner than data.table's `NA`. There is no in-place mutation (Datajure is immutable) and no secondary indexes (`setkey`); tech.v3.dataset's columnar layout is fast enough without them.\n- **Python's pandas/Polars** — you get expression objects as values (like Polars' `Expr`), nil-safe comparisons and arithmetic by default, and a single query form rather than a pipeline of a dozen verbs.\n- **R's `dplyr` or tidyverse** — you'll find the same pipe-friendly composition (`-\u003e` is Clojure's pipe), with less verbosity and without the function-per-verb proliferation.\n- **Julia's `DataFramesMeta.jl`** — the `#dt/e` reader tag serves the same role as DFM's `@transform`/`@subset`, but because Clojure has a real reader tag mechanism (rather than macros pretending to parse expressions), it integrates more cleanly with the rest of the language.\n- **q/kdb+** — the `win/*` namespace gives you first-class `deltas`, `ratios`, `mavg`, `msum`, `mdev`, `ema`, `fills`, `scan`, `each-prior`, plus `wavg`, `wsum`, `first`, `last` as aggregation primitives. `xbar` ships for time-series bar generation. As-of joins with `:direction` and `:tolerance` and window joins (`:how :window`) are built in.\n\nDatajure's unique wedge is that `#dt/e` expressions are first-class AST values — you can store them in vars and compose them across queries. Build a shared vocabulary once, reuse it everywhere:\n\n```clojure\n(def ret     #dt/e (- (win/ratio :price) 1))\n(def log-ret #dt/e (log (+ 1 ret)))\n(def vol-20d #dt/e (win/mdev ret 20))\n(def wealth  #dt/e (win/scan * (+ 1 ret)))\n\n(dt prices :by [:permno] :within-order [(asc :date)]\n    :set {:ret ret :log-ret log-ret :vol-20d vol-20d :wealth wealth})\n```\n\nNo equivalent exists in tablecloth, dplyr, pandas, or data.table.\n\n## Installation\n\nAdd to your `deps.edn`:\n\n```clojure\n{:deps {com.github.clojure-finance/datajure {:mvn/version \"2.0.9\"}}}\n```\n\nDatajure requires Clojure 1.12+ and Java 21+.\n\n## The Key Insight: `:by` × `:set`/`:agg`\n\nTwo orthogonal keywords produce four distinct operations with no new concepts:\n\n|            | No `:by`            | With `:by`          |\n|------------|---------------------|---------------------|\n| **`:set`** | Column derivation (+ whole-dataset window if `win/*` present) | **Partitioned window** |\n| **`:agg`** | Whole-table summary | Group aggregation   |\n\n```clojure\n;; Column derivation — add/update columns, keep all rows\n(dt ds :set {:bmi #dt/e (/ :mass (sq :height))})\n\n;; Group aggregation — collapse rows per group\n(dt ds :by [:species] :agg {:n nrow :avg-mass #dt/e (mn :mass)})\n\n;; Whole-table summary — collapse everything\n(dt ds :agg {:total #dt/e (sm :mass) :n nrow})\n\n;; Partitioned window — compute within groups, keep all rows\n(dt ds\n  :by [:species]\n  :within-order [(desc :mass)]\n  :set {:rank #dt/e (win/rank :mass)\n        :cumul #dt/e (win/cumsum :mass)})\n\n;; Whole-dataset window — no :by, entire dataset is one partition\n(dt ds\n  :within-order [(asc :date)]\n  :set {:cumret #dt/e (win/cumsum :ret)\n        :prev   #dt/e (win/lag :price 1)})\n```\n\n`:within-order` also combines with `:agg`, sorting rows within each group before the aggregation runs. This is the one-call OHLC pattern and the reason `first-val` / `last-val` are first-class helpers:\n\n```clojure\n(dt trades\n    :by [:sym :date]\n    :within-order [(asc :time)]\n    :agg {:open  #dt/e (first-val :price)\n          :close #dt/e (last-val :price)\n          :hi    #dt/e (mx :price)\n          :vol   #dt/e (sm :size)})\n\n;; VWAP and weighted sum\n(dt trades :by [:sym :date]\n    :agg {:vwap #dt/e (wavg :size :price)\n          :vol  #dt/e (wsum :size :price)})\n```\n\n## `dt` Dispatch Modes\n\n`dt` runs a single fixed evaluation order: `:where` → `:set`-or-`:agg` → `:select` → `:order-by`. What the middle step does depends on which other keywords are present:\n\n| `:by`  | `:set`  | `:agg`  | `:within-order` | Mode                                                    |\n|--------|---------|---------|-----------------|---------------------------------------------------------|\n| —      | plain   | —       | —               | Derive columns over whole dataset                       |\n| —      | `win/*` | —       | optional        | Whole-dataset window                                    |\n| ✓      | plain   | —       | optional        | Per-group derivation                                    |\n| ✓      | `win/*` | —       | optional        | Partitioned window                                      |\n| —      | —       | ✓       | optional        | Whole-table aggregate (sorted first if `:within-order`) |\n| ✓      | —       | ✓       | optional        | Group aggregate (sorted within group if `:within-order`)|\n\nDisallowed: `:set` and `:agg` in the same call (use `-\u003e` threading); `:within-order` without `:set` or `:agg`.\n\n## Expression Mode: `#dt/e`\n\n`#dt/e` is a reader tag that rewrites bare keywords to column accessors. It returns an AST object that `dt` interprets — vectorized, pre-validated, and nil-literal-safe.\n\n```clojure\n;; With #dt/e — terse, keyword-lifted, vectorized\n(dt ds :where #dt/e (\u003e :mass 4000))\n(dt ds :set {:bmi #dt/e (/ :mass (sq :height))})\n\n;; Without — plain Clojure functions (always works)\n(dt ds :where #(\u003e (:mass %) 4000))\n(dt ds :set {:bmi #(/ (:mass %) (Math/pow (:height %) 2))})\n```\n\n`#dt/e` is opt-in. Users who prefer plain Clojure functions can ignore it entirely. See *Expression Mode vs. Plain Functions* below for when to pick which.\n\n### Nil handling\n\nDatajure has a layered nil story rather than blanket \"nil-safety\". The rules:\n\n| Situation                                             | Behaviour                |\n|-------------------------------------------------------|--------------------------|\n| Comparison op with a nil *literal* in `#dt/e`         | evaluates to `false`     |\n| Arithmetic op with a nil *literal* in `#dt/e`         | returns `nil`            |\n| Column-level nils (nil values within a column)        | depends on the `dfn` op  |\n| Aggregation helpers (`mn`/`sm`/`md`/`sd`/`nrow`/...)  | skip nil; `nil` if all missing (never `0`/`-Inf`/`NaN`) |\n| `win/fills :col`                                      | forward-fill nils        |\n| `coalesce :col default`                               | replace nils with fallback |\n| `div0 num den`                                        | `nil` if denominator is `nil` or zero |\n| `win/ratio :col`                                      | `nil` if previous value is `nil` or zero |\n| Plain Clojure functions                               | **not** automatic; wrap with `pass-nil` |\n\n```clojure\n(dt ds :where #dt/e (\u003e :mass 4000))                  ;; nil-literal → false\n(dt ds :set {:mass #dt/e (coalesce :mass 0)})         ;; nil → 0\n(dt ds :set {:pe   #dt/e (div0 :price :earnings)})    ;; zero denom → nil\n(dt ds :set {:x (pass-nil #(parse-int (:x-str %)))})  ;; wrap plain fn\n```\n\n### Special forms\n\n```clojure\n;; Multi-branch conditional\n(dt ds :set {:size #dt/e (cond\n                           (\u003e :mass 5000) \"large\"\n                           (\u003e :mass 3500) \"medium\"\n                           :else \"small\")})\n\n;; Local bindings\n(dt ds :set {:adj #dt/e (let [bmi (/ :mass (sq :height))\n                              base (if (\u003e :year 2010) 1.1 1.0)]\n                          (* base bmi))})\n\n;; Boolean composition, membership, range\n(dt ds :where #dt/e (and (\u003e :mass 4000) (not (= :species \"Adelie\"))))\n(dt ds :where #dt/e (in :species #{\"Gentoo\" \"Chinstrap\"}))\n(dt ds :where #dt/e (between? :year 2007 2009))\n```\n\n### Reusable expressions\n\n`#dt/e` returns first-class AST values. Store them in vars, reuse across queries, compose them into new expressions:\n\n```clojure\n(def bmi       #dt/e (/ :mass (sq :height)))\n(def high-mass #dt/e (\u003e :mass 4000))\n(def obese     #dt/e (\u003e bmi 30))         ;; composition — bmi appears inside another #dt/e\n\n(dt ds :set {:bmi bmi})\n(dt ds :where high-mass)\n(dt ds :by [:species] :agg {:avg-bmi #dt/e (mn bmi)})\n(dt ds :where obese)\n```\n\nThe mechanism is simple: `#dt/e` returns an AST map, and `(def ...)` captures that value. When the symbol appears inside another `#dt/e`, Clojure evaluates it to its AST value before the outer reader sees it, and the compiler splices it in. No macros, no magic — just values.\n\n### Expression Mode vs. Plain Functions\n\n|                       | `#dt/e` (column-wise)                  | Plain function (context-dependent)     |\n|-----------------------|----------------------------------------|----------------------------------------|\n| Operates on           | Whole column vectors via `dfn`         | Row map in `:set`/`:where`; group dataset in `:agg` |\n| Column access         | Bare keywords: `:mass`                 | `(:mass %)`                            |\n| Performance           | Fast — vectorized                      | Slower — per-row call in `:set`/`:where` |\n| Nil handling          | Automatic (for literals and helpers)   | Manual (`pass-nil` or explicit checks) |\n| Validation            | Pre-execution column checking; Damerau suggestions | Runtime errors only          |\n| Best for              | Arithmetic, comparisons, aggregations  | Complex branching, Java interop, non-vectorizable logic |\n\nPrefer `#dt/e` by default. Fall back to plain functions when the computation doesn't map to vectorized ops.\n\n**Footgun to know about in `:agg`:** plain functions receive the *group dataset*, not a row, so `(:mass %)` returns a column vector rather than a scalar. Datajure detects this and throws a structured error since v2.0.6 — but this is why `#dt/e (mn :mass)` is safer than `#(mean (:mass %))`.\n\n## `:select` — Polymorphic Column Selection\n\n```clojure\n(dt ds :select [:species :mass])                    ;; explicit list\n(dt ds :select :type/numerical)                     ;; all numeric columns\n(dt ds :select :!type/numerical)                    ;; all non-numeric\n(dt ds :select #\"body-.*\")                          ;; regex match\n(dt ds :select [:not :id :timestamp])               ;; exclusion\n(dt ds :select {:species :sp :mass :m})             ;; select + rename\n(dt ds :select (between :month-01 :month-12))       ;; positional range (inclusive)\n```\n\n## Window Functions\n\nAvailable via `win/*` inside `#dt/e`. Work in `:set` context — with `:by` for partitioned windows, or without `:by` for whole-dataset windows:\n\n```clojure\n;; Partitioned window — grouped by permno\n(dt ds\n  :by [:permno]\n  :within-order [(asc :date)]\n  :set {:rank    #dt/e (win/rank :ret)\n        :lag-1   #dt/e (win/lag :ret 1)\n        :cumret  #dt/e (win/cumsum :ret)\n        :regime  #dt/e (win/rleid :sign-ret)})\n\n;; Whole-dataset window — no :by, entire dataset is one partition\n(dt ds\n  :within-order [(asc :date)]\n  :set {:cumret #dt/e (win/cumsum :ret)\n        :prev   #dt/e (win/lag :price 1)})\n```\n\nFunctions: `win/rank`, `win/dense-rank`, `win/row-number`, `win/lag`, `win/lead`, `win/cumsum`, `win/cummin`, `win/cummax`, `win/cummean`, `win/rleid`, `win/delta`, `win/ratio`, `win/differ`, `win/mavg`, `win/msum`, `win/mdev`, `win/mmin`, `win/mmax`, `win/ema`, `win/fills`, `win/scan`, `win/each-prior`.\n\n### Adjacent-Element Ops\n\nInspired by q's `deltas` and `ratios` — eliminate verbose lag patterns:\n\n```clojure\n(dt ds :by [:permno] :within-order [(asc :date)]\n    :set {:ret       #dt/e (- (win/ratio :price) 1)    ;; simple return\n          :price-chg #dt/e (win/delta :price)          ;; first differences\n          :changed   #dt/e (win/differ :signal)})      ;; boolean change flag\n```\n\n`win/ratio` returns `nil` (not `Infinity`) when the previous value is zero or nil — the canonical simple-return idiom `(- (win/ratio :price) 1)` therefore produces `nil` after a zero-price row rather than contaminating downstream calculations.\n\n### Rolling Windows \u0026 EMA\n\n```clojure\n(dt ds :by [:permno] :within-order [(asc :date)]\n    :set {:ma-20   #dt/e (win/mavg :price 20)     ;; 20-day moving average\n          :vol-20  #dt/e (win/mdev :ret 20)       ;; 20-day moving std dev\n          :hi-52w  #dt/e (win/mmax :price 252)    ;; 52-week high\n          :ema-10  #dt/e (win/ema :price 10)})    ;; 10-day EMA\n```\n\n### Forward-Fill\n\n```clojure\n(dt ds :by [:permno] :within-order [(asc :date)]\n    :set {:price #dt/e (win/fills :price)})       ;; carry forward last known\n```\n\n### Cumulative Scan\n\nGeneralized cumulative operation inspired by APL/q's scan (`\\`). Supports `+`, `*`, `max`, `min` — the killer use case is the wealth index:\n\n```clojure\n(dt ds :by [:permno] :within-order [(asc :date)]\n    :set {:wealth  #dt/e (win/scan * (+ 1 :ret))   ;; cumulative compounding\n          :cum-vol #dt/e (win/scan + :volume)       ;; = win/cumsum\n          :runmax  #dt/e (win/scan max :price)})    ;; running maximum\n```\n\n### Generalized Adjacent-Element Ops (`win/each-prior`)\n\n`win/each-prior` is the generalization of `win/delta` and `win/ratio` — applies any binary operator to `f(x[i], x[i-1])`. Supports `+`, `-`, `*`, `/`, `max`, `min`, and comparison operators. First element → nil; nil propagates.\n\n```clojure\n(dt ds :by [:permno] :within-order [(asc :date)]\n    :set {;; subtract: same result as win/delta (without double-casting)\n          :chg     #dt/e (win/each-prior - :price)\n          ;; max with previous — running pairwise high\n          :pw-hi   #dt/e (win/each-prior max :price)\n          ;; boolean: did value increase?\n          :up?     #dt/e (win/each-prior \u003e :price)})\n```\n\nUse `win/delta` when you want the named function with its double-casting; use `win/ratio` when you need the zero-guard (nil instead of Infinity). Use `win/each-prior` when you need a different operator entirely.\n\n## Row-wise Functions\n\nCross-column operations within a single row via `row/*`:\n\n```clojure\n(dt ds :set {:total  #dt/e (row/sum :q1 :q2 :q3 :q4)\n             :avg-q  #dt/e (row/mean :q1 :q2 :q3 :q4)\n             :n-miss #dt/e (row/count-nil :q1 :q2 :q3 :q4)})\n```\n\nFunctions: `row/sum` (nil as 0), `row/mean`, `row/min`, `row/max` (skip nil), `row/count-nil`, `row/any-nil?`.\n\n## Statistical Transforms\n\nColumn-level transforms via `stat/*` inside `#dt/e`. All are nil-safe — nil values are excluded from reference statistics and produce nil outputs.\n\n```clojure\n;; Standardize: (x - mean) / sd — returns all-nil if sd is zero\n(dt ds :set {:z #dt/e (stat/standardize :ret)})\n\n;; Demean: x - mean(x)\n(dt ds :set {:dm #dt/e (stat/demean :ret)})\n\n;; Winsorize at 1% tails — clips to [p, 1-p] percentile bounds\n(dt ds :set {:wr #dt/e (stat/winsorize :ret 0.01)})\n\n;; Compose with arithmetic\n(dt ds :set {:scaled #dt/e (* 2 (stat/demean :x))})\n\n;; Cross-sectional standardization per group\n(dt ds :by [:date] :set {:z #dt/e (stat/standardize :signal)})\n```\n\nFunctions: `stat/standardize`, `stat/demean`, `stat/winsorize`.\n\n## Joins\n\nStandalone function with cardinality validation and merge diagnostics. Supports regular joins (`:inner`, `:left`, `:right`, `:outer`), as-of joins (`:asof`, with `:direction` and `:tolerance`), and window joins (`:window`, aggregates over matched sub-datasets).\n\n```clojure\n(require '[datajure.join :refer [join]])\n\n(join X Y :on :id :how :left)\n(join X Y :on [:firm :date] :how :inner :validate :m:1)\n(join X Y :left-on :id :right-on :key :how :left :report true)\n;; [datajure] join report: 150 matched, 3 left-only, 0 right-only\n\n;; Thread with dt\n(-\u003e (join X Y :on :id :how :left :validate :m:1)\n    (dt :where #dt/e (\u003e :year 2008)\n        :agg {:total #dt/e (sm :revenue)}))\n```\n\n## As-of Joins\n\nInspired by q's `aj`. For each left row, find the last right row where `right-key \u003c= left-key` within an exact-match group. All left rows are always preserved; unmatched rows get nil for right columns.\n\nThe **last column** in `:on` (or `:left-on`/`:right-on`) is the asof column — preceding columns are exact-match keys.\n\n```clojure\n(require '[datajure.join :refer [join]])\n\n;; Trade-quote matching: each trade gets the last prevailing bid/ask.\n;; sym is exact-match, time is asof (last quote where quote-time \u003c= trade-time)\n(join trades quotes :on [:sym :time] :how :asof)\n\n;; Asymmetric key names\n(join trades quotes\n      :left-on  [:sym :trade-time]\n      :right-on [:sym :quote-time]\n      :how :asof)\n\n;; With cardinality validation (right side only)\n(join trades quotes :on [:sym :time] :how :asof :validate :m:1)\n```\n\n**Result schema:** all left columns in original order, plus right non-key columns appended. Conflicting non-key column names are suffixed `:right.\u003cn\u003e` (same convention as regular joins).\n\n**`:validate` for `:asof`:** only the right side is checked (`:1:1` and `:m:1` require unique right keys). The left side is never checked since all left rows always appear.\n\n### Directional and Bounded As-of Joins\n\n`:direction` controls which side of the asof key is matched (default `:backward`). `:tolerance` sets a maximum allowable distance — matches beyond it produce nil.\n\n```clojure\n;; :forward — first right row where right-key \u003e= left-key\n(join left right :on [:sym :time] :how :asof :direction :forward)\n\n;; :nearest — closest right row by absolute distance; ties prefer :backward\n(join left right :on [:sym :time] :how :asof :direction :nearest)\n\n;; :tolerance — reject matches more than 5 time units away\n(join trades quotes :on [:sym :time] :how :asof :tolerance 5)\n\n;; Combine: nearest match within a 3-unit window\n(join left right :on [:time] :how :asof :direction :nearest :tolerance 3)\n```\n\n`:tolerance` requires a numeric asof key. Matches that exceed the tolerance produce nil for right columns — same as having no match.\n\n## Window Joins\n\nInspired by q's `wj`. For each left row, finds **all** right rows whose asof-key falls within a window around the left row's asof-key, then aggregates them with `:agg`. All left rows are preserved.\n\nThe **last column** in `:on` is the asof column — preceding columns are exact-match keys.\n\n```clojure\n(require '[datajure.join :refer [join]])\n\n;; 3-unit lookback: each left row aggregates right rows in [left-t - 3, left-t]\n(join trades quotes\n  :on [:sym :time]\n  :how :window\n  :window [-3 0]\n  :agg {:avg-bid #dt/e (mn :bid)\n        :n-quotes core/nrow})\n\n;; 5-minute lookback using temporal units\n(join trades quotes\n  :on [:sym :time]\n  :how :window\n  :window [-5 0 :minutes]\n  :agg {:avg-bid #dt/e (mn :bid)\n        :avg-ask #dt/e (mn :ask)\n        :n       core/nrow})\n\n;; Symmetric window: 2 units either side\n(join events signals\n  :on [:sym :time]\n  :how :window\n  :window [-2 2]\n  :agg {:mean-signal #dt/e (mn :value)})\n\n;; Asymmetric key names\n(join trades quotes\n  :left-on  [:sym :trade-time]\n  :right-on [:sym :quote-time]\n  :how :window\n  :window [-5 0 :minutes]\n  :agg {:vwap #dt/e (wavg :size :bid)})\n```\n\n**Window spec formats** — all three are equivalent:\n```clojure\n[-5 0 :minutes]   ;; [lo hi unit]  — recommended\n[-5 :minutes 0]   ;; [lo unit hi]  — also accepted\n[-300000 0]       ;; [lo hi]       ;; raw (300000 ms = 5 min)\n```\nSupported units: `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`.\n\n**`:agg` values:**\n- `#dt/e` expressions — apply to the matched sub-dataset; return **nil** for empty windows (avoids NaN from `dfn/mean` on empty columns)\n- Plain fns — receive the 0-row sub-dataset directly; `nrow` naturally returns **0** for empty windows\n\n**Result schema:** all left columns preserved, plus one column per `:agg` entry.\n\n```clojure\n;; VWAP over 5-minute rolling window — thread into dt\n(-\u003e (join trades quotes\n          :on [:sym :time]\n          :how :window\n          :window [-5 0 :minutes]\n          :agg {:vwap  #dt/e (wavg :size :bid)\n                :depth core/nrow})\n    (core/dt :where #dt/e (\u003e :depth 0)\n             :order-by [(core/asc :time)]))\n```\n\n## Reshaping\n\n```clojure\n(require '[datajure.reshape :refer [melt cast]])\n\n;; Wide → long\n(-\u003e ds\n    (melt {:id [:species :year] :measure [:mass :flipper :bill]})\n    (dt :by [:species :variable] :agg {:avg #dt/e (mn :value)}))\n\n;; Long → wide (complement to melt)\n(cast ds {:id [:species :year] :from :variable :value :value})\n\n;; With aggregation for duplicate (id, from) cells\n(cast ds {:id [:date :sym] :from :metric :value :val :agg dfn/mean})\n\n;; Round-trip\n(-\u003e ds\n    (melt {:id [:species :year] :measure [:mass :flipper]})\n    (cast {:id [:species :year] :from :variable :value :value}))\n```\n\n`cast` options: `:id` (required), `:from` (required), `:value` (required), `:agg` (fn applied to a vector of values when multiple rows share the same id+from combination; default: first value), `:fill` (value for missing cells; default: nil).\n\n## Utilities\n\n```clojure\n(require '[datajure.util :as du])\n\n(du/describe ds)                                ;; summary stats → dataset\n(du/describe ds [:mass :height])                ;; subset of columns\n(du/clean-column-names messy-ds)                ;; \"Some Ugly Name!\" → :some-ugly-name (Unicode-aware)\n(du/mark-duplicates ds [:id :date])             ;; adds :duplicate? column\n(du/drop-constant-columns ds)                   ;; remove zero-variance\n(du/coerce-columns ds {:year :int64 :mass :float64})\n```\n\n`clean-column-names` preserves non-ASCII characters (CJK, accented Latin, Cyrillic, Greek) — `\"市值 (HKD millions)!\"` becomes `:市值-hkd-millions`.\n\n## File I/O\n\n```clojure\n(require '[datajure.io :as dio])\n\n(def ds (dio/read \"data.csv\"))\n(def ds (dio/read \"data.parquet\"))    ;; needs tech.v3.libs.parquet\n(def ds (dio/read \"data.tsv.gz\"))     ;; gzip auto-detected\n(dio/write ds \"output.csv\")\n```\n\nSupported: CSV, TSV, Parquet, Arrow, Excel, Nippy. Gzipped variants auto-detected.\n\n## Bucketing with `xbar`\n\nFloor-division bucketing inspired by q's `xbar`. Primary use case is computed `:by` for time-series bar generation:\n\n```clojure\n;; Numeric bucketing in :by — price buckets of width 10\n(dt ds :by [(xbar :price 10)] :agg {:n nrow :avg #dt/e (mn :volume)})\n\n;; 5-minute OHLCV bars\n(dt trades\n    :by [(xbar :time 5 :minutes) :sym]\n    :within-order [(asc :time)]\n    :agg {:open  #dt/e (first-val :price)\n          :close #dt/e (last-val :price)\n          :vol   #dt/e (sm :size)\n          :n     nrow})\n\n;; Also usable inside #dt/e as a column derivation\n(dt ds :set {:bucket #dt/e (xbar :price 5)})\n```\n\nSupported temporal units: `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`. Returns nil for nil input.\n\n## Quantile Binning with `cut`\n\nEqual-count (quantile) binning inside `#dt/e`. The optional `:from` mask computes breakpoints from a **reference subpopulation** and applies them to **all rows** — the reference and binned populations can be different sizes. This directly models the NYSE-breakpoints pattern used in empirical finance:\n\n```clojure\n;; Basic: 5 equal-count bins across all rows\n(dt ds :set {:size-q #dt/e (cut :mktcap 5)})\n\n;; NYSE breakpoints: compute quintile breakpoints from NYSE stocks only,\n;; apply to all stocks (NYSE + AMEX + NASDAQ)\n(dt ds :set {:size-q #dt/e (cut :mktcap 5 :from (= :exchcd 1))})\n\n;; :from accepts any #dt/e boolean expression\n(dt ds :set {:size-q #dt/e (cut :mktcap 5 :from (and (= :exchcd 1) (\u003e :year 2000)))})\n\n;; Per-date NYSE breakpoints — the canonical CRSP usage\n(-\u003e crsp\n    (dt :where #dt/e (= (month :date) 6))\n    (dt :by [:date]\n        :set {:size-q #dt/e (cut :mktcap 5 :from (= :exchcd 1))}))\n```\n\n## Quantile Grouping with `qtile`\n\n`qtile` is the `:by`-friendly companion to `cut` — produces an equal-count bin assignment from a column's distribution. Use it when you want to *group by* quantile, rather than *derive a column of* quantile bins. Inspired by R's `cut` and Stata's `xtile`; named `qtile` to evoke quintile/decile:\n\n```clojure\n;; Global quintile buckets of market cap\n(dt stocks :by [(qtile :mktcap 5)]\n    :agg {:n nrow :mean-ret #dt/e (mn :ret)})\n;; Result column is auto-named :mktcap-q5\n\n;; Per-date size quintiles — the canonical CRSP / Fama-French pattern.\n;; Each date gets its own breakpoints.\n(dt stocks :by [:date (qtile :mktcap 5)]\n    :agg {:mean-ret #dt/e (mn :ret)})\n\n;; Per-date NYSE-style breakpoints applied to all stocks — Fama-French size sort.\n;; For each date, breakpoints are computed from that date's NYSE stocks only,\n;; then applied to all stocks (NYSE + AMEX + NASDAQ) on that date.\n(dt stocks :by [:date (qtile :mktcap 5 :from #dt/e (= :exchcd 1))]\n    :agg {:mean-ret #dt/e (mn :ret)})\n```\n\n**Breakpoint population depends on what else is in `:by`:**\n\n| `:by` shape | Breakpoints |\n|---|---|\n| `qtile` alone | Global — computed once from the whole dataset |\n| `qtile` + exact keys | Per-partition — computed within each exact-key combination (the data.table / dplyr default) |\n| `qtile :from \u003cmask\u003e` | Reference-subpopulation — the mask selects rows for breakpoint computation, applied in whichever population (global or per-partition) the rest of `:by` implies |\n\nFor the same bucketing semantics inside `#dt/e` expressions (`:set` / `:where` / `:agg`) rather than `:by`, use `#dt/e (cut :col n)`.\n\n| | `qtile` | `#dt/e (cut ...)` |\n|---|---|---|\n| Context | `:by` (grouping) | `:set` / `:where` / `:agg` (expression) |\n| Result | Integer bin key (1..n, or nil for nil input) | Column of bin integers |\n| Per-partition via | Exact keys in same `:by` | `:by` + `:set` window mode |\n| `:from` option | Supported (reference subpopulation) | Supported (reference subpopulation) |\n| Result column name | Auto `\u003ccol\u003e-q\u003cn\u003e` (customise via `:datajure/col` metadata) | Whatever you name it in `:set` |\n\nPick `qtile` when the bins are a grouping key; pick `cut` when the bins are a column value you want to keep alongside the original rows.\n\n**Note on small partitions.** If a partition has fewer than `n` non-nil values, breakpoints cannot be computed and all non-nil rows in that partition land in bin 1. Filter out thin partitions upstream or use fewer bins.\n\n## Computed `:by` — Custom Grouping Functions\n\n`:by` accepts a plain function of the row in addition to column keywords. Functions can attach `:datajure/col` metadata to control the result-column name:\n\n```clojure\n;; Simple computed :by\n(dt ds :by (fn [row] {:heavy? (\u003e (:mass row) 4000)})\n    :agg {:n nrow})\n\n;; Custom bucketing function with friendly result column name\n(defn percentile-bucket [col pct]\n  (with-meta\n    (fn [row]\n      (let [v (get row col)]\n        (when (some? v)\n          (int (* pct (/ v 100))))))\n    {:datajure/col (keyword (str (name col) \"-pct-bucket\"))}))\n\n(dt ds :by [(percentile-bucket :score 10)] :agg {:n nrow})\n;; Result column is named :score-pct-bucket\n```\n\n`xbar` uses the same mechanism internally. If no metadata is attached, result columns get synthetic names (`:fn-0`, `:fn-1`, ...).\n\n## Rename\n\n```clojure\n(rename ds {:mass :weight-kg :species :penguin-species})\n```\n\n## Concise Namespace\n\nShort aliases for power users (q / data.table refugees in particular):\n\n```clojure\n(require '[datajure.concise :refer [mn sm md sd ct nuniq fst lst wa ws mx mi N between]])\n\n(dt ds :by [:species] :agg {:n N :avg #dt/e (mn :mass)})\n```\n\n| Symbol | Full name |\n|--------|-----------|\n| `mn`   | mean |\n| `sm`   | sum |\n| `md`   | median |\n| `sd`   | stddev |\n| `mx`   | max (column maximum) |\n| `mi`   | min (column minimum) |\n| `ct`   | element count |\n| `nuniq`| count-distinct |\n| `fst`  | first-val |\n| `lst`  | last-val |\n| `wa`   | wavg (weighted average) |\n| `ws`   | wsum (weighted sum) |\n| `N`    | row count (alias for `nrow`) |\n| `standardize` | stat/stat-standardize |\n| `demean`      | stat/stat-demean |\n| `winsorize`   | stat/stat-winsorize |\n| `between`     | positional range selector |\n\nBoth `nrow` (discoverable) and `N` (terse, q/data.table style) live in `datajure.core`; `N` is also re-exported from `datajure.concise`.\n\n## Notebook Integration\n\n### Clay (Scicloj ecosystem)\n\n```clojure\n(require '[datajure.clay :as dc])\n(dc/install!)   ;; auto-renders datasets, #dt/e exprs, describe output\n\n;; Or explicit wrapping:\n(dc/view ds)\n(dc/view-expr #dt/e (/ :mass (sq :height)))\n(dc/view-describe (du/describe ds))\n```\n\nStart a Clay notebook:\n```clojure\n(require '[scicloj.clay.v2.api :as clay])\n(clay/make! {:source-path \"notebooks/datajure_clay_demo.clj\"})\n```\n\n### Clerk\n\n```clojure\n(require '[datajure.clerk :as dc])\n(dc/install!)   ;; registers custom Clerk viewers\n```\n\n## REPL\n\n`*dt*` holds the last dataset result (like `*1`), bound by nREPL middleware:\n\n```clojure\nuser=\u003e (dt ds :by [:species] :agg {:n nrow})\n;; =\u003e dataset...\n\nuser=\u003e (dt datajure.core/*dt* :order-by [(desc :n)])\n```\n\nEnable in `.nrepl.edn`: `{:middleware [datajure.nrepl/wrap-dt]}`\n\n## Error Messages\n\nStructured `ex-info` with suggestions. All errors carry a `:dt/error` key in `ex-data` for programmatic dispatch.\n\n**Unknown column — Damerau-Levenshtein suggestions catch transpositions:**\n\n```clojure\n(dt ds :set {:bmi #dt/e (/ :mass :hieght)})\n;; =\u003e ExceptionInfo: Unknown column(s) #{:hieght} in :set :bmi expression\n;;    Did you mean: :height (edit distance 1)\n;;    Available: :species :year :mass :height :flipper\n```\n\n**Unknown op — namespace-aware suggestions at read time:**\n\n```clojure\n#dt/e (sqrt :x)\n;; =\u003e ExceptionInfo: Unknown op `sqrt` in #dt/e expression. Did you mean: `sq`?\n\n#dt/e (win/mvag :price 20)\n;; =\u003e ExceptionInfo: Unknown op `win/mvag` in #dt/e expression. Did you mean: `win/mavg`?\n```\n\n**`:agg` plain-function footgun — detected and reported:**\n\n```clojure\n(dt ds :by [:species] :agg {:bad #(:mass %)})\n;; =\u003e ExceptionInfo: :agg plain function for column :bad returned a column, not a scalar.\n;;    In :agg, plain functions receive the group dataset, so `(:col %)` returns a column\n;;    vector. Use `(dfn/mean (:col %))` or prefer `#dt/e (mn :col)` which handles both\n;;    cases uniformly.\n```\n\n**Structural errors:**\n\n```clojure\n(dt ds :set {:a #dt/e (/ :x 1)} :agg {:n nrow})\n;; =\u003e ExceptionInfo: Cannot combine :set and :agg. Use -\u003e threading.\n\n(dt ds :set {:bmi  #dt/e (/ :mass (sq :height))\n             :obese #dt/e (\u003e :bmi 30)})\n;; =\u003e ExceptionInfo: Map-form :set cross-reference.\n;;    :obese references #{:bmi}, which is being derived in the same map.\n;;    Use vector-of-pairs [[:bmi ...] [:obese ...]] for sequential derivation.\n```\n\n## Evaluation Order\n\n`dt` evaluates keywords in this fixed order, regardless of the order they appear in the call:\n\n1. `:where` — filter rows\n2. `:set` or `:agg` — derive or aggregate (mutually exclusive; see dispatch modes above)\n3. `:select` — keep listed columns\n4. `:order-by` — sort final output\n\n## Architecture\n\n```\nUser writes:   #dt/e (/ :mass (sq :height))\n                          ↓\n               AST (pure data, serializable)\n                          ↓\n               compile-expr → fn [ds] → column vector\n                          ↓\n               tech.v3.datatype.functional (dfn)\n                          ↓\n               tech.v3.dataset (columnar, JVM, fast)\n```\n\nDatajure is a syntax layer. `#dt/e` expressions compile to an AST, which `compile-expr` translates to vectorized `dfn` operations on `tech.v3.dataset` column vectors. Computation is entirely delegated to the underlying engine; the DSL itself adds only the parsing and dispatch overhead.\n\n## Namespace Guide\n\n| Namespace | Purpose |\n|-----------|---------|\n| `datajure.core` | `dt`, `N`, `nrow`, `mean`, `sum`, `median`, `stddev`, `variance`, `max*`, `min*`, `count*`, `asc`, `desc`, `pass-nil`, `rename`, `xbar`, `qtile`, `cut`, `between`, `*dt*` |\n| `datajure.expr` | AST nodes, compiler, `#dt/e` reader tag |\n| `datajure.concise` | Short aliases for power users |\n| `datajure.window` | Window function implementations |\n| `datajure.row` | Row-wise function implementations |\n| `datajure.stat` | Statistical transforms: `stat/standardize`, `stat/demean`, `stat/winsorize` |\n| `datajure.util` | `describe`, `clean-column-names`, `duplicate-rows`, etc. |\n| `datajure.io` | Unified `read`/`write` dispatching on file extension |\n| `datajure.reshape` | `melt` for wide→long, `cast` for long→wide |\n| `datajure.join` | `join` with `:validate`, `:report`, `:how :asof` (`:direction`, `:tolerance`), and `:how :window` (`:window`, `:agg`) |\n| `datajure.asof` | As-of/window join engine: `asof-search`, `asof-indices`, `asof-match`, `build-result`, `window-indices` |\n| `datajure.nrepl` | nREPL middleware for `*dt*` auto-binding |\n| `datajure.clerk` | Rich Clerk notebook viewers |\n| `datajure.clay` | Clay/Kindly notebook integration |\n\n## Design Principles\n\n1. **`dt` is a function** — not a macro. Debuggable, composable, predictable.\n2. **`:where` always filters** — conditional updates go inside `:set` via `if`/`cond`.\n3. **Keyword lifting requires `#dt/e`** — no implicit magic in plain Clojure forms.\n4. **Layered nil story** — nil literals are safe in `#dt/e`, aggregation helpers skip nils, `coalesce`/`div0`/`win/fills` handle the rest, `pass-nil` wraps plain functions. Not a blanket \"nil-safe\" claim, but a coherent set of rules that eliminate the common NPE footguns.\n5. **Expressions are values** — `#dt/e` returns an AST, not a function. Store in vars, compose freely, build shared vocabularies.\n6. **One function, not twenty-eight** — one `dt`, seven keywords, two expression modes. Threading for pipelines.\n7. **Errors are data** — structured `ex-info` with `:dt/error` dispatch keys, Damerau-Levenshtein typo suggestions, extensible.\n8. **Syntax layer, not engine** — delegate to tech.v3.dataset. Full interop with tablecloth, Clerk, Clay, and the Scicloj ecosystem.\n9. **Steal the best ideas** — from data.table, q/kdb+, Polars, DataFramesMeta.jl, APL. The goal isn't to be any of them.\n\n## Development\n\nTests run automatically on every push to `main` via GitHub Actions. CI runs the core test suites (core, concise, util, io, reshape, join, asof, stat) via `bin/run-tests.sh`. The nrepl, clerk, and clay test suites require optional deps and are run locally only. When adding a new core test namespace, add it to `bin/run-tests.sh` to include it in CI.\n\n```bash\n# Start nREPL\nclj -A:nrepl\n\n# Run core tests (same as CI)\nbash bin/run-tests.sh\n\n# Run all tests locally (including optional-dep suites)\nclj -A:nrepl -e \"\n  (load-file \\\"test/datajure/core_test.clj\\\")\n  (load-file \\\"test/datajure/concise_test.clj\\\")\n  (load-file \\\"test/datajure/util_test.clj\\\")\n  (load-file \\\"test/datajure/io_test.clj\\\")\n  (load-file \\\"test/datajure/reshape_test.clj\\\")\n  (load-file \\\"test/datajure/join_test.clj\\\")\n  (load-file \\\"test/datajure/asof_test.clj\\\")\n  (load-file \\\"test/datajure/nrepl_test.clj\\\")\n  (load-file \\\"test/datajure/clerk_test.clj\\\")\n  (load-file \\\"test/datajure/clay_test.clj\\\")\n  (load-file \\\"test/datajure/stat_test.clj\\\")\n  (clojure.test/run-tests\n    'datajure.core-test 'datajure.concise-test 'datajure.util-test\n    'datajure.io-test 'datajure.reshape-test 'datajure.join-test\n    'datajure.asof-test 'datajure.nrepl-test 'datajure.clerk-test\n    'datajure.clay-test 'datajure.stat-test)\"\n```\n\n318 tests, 1093 assertions (CI subset: 276 tests, 989 assertions).\n\n## Prior Work\n\nDatajure v1 was a routing layer across three backends (tablecloth, clojask, geni/Spark). v2 takes a different approach: a single, opinionated syntax layer directly on tech.v3.dataset, stealing good ideas from data.table (query form), q/kdb+ (time-series primitives), Polars (expressions as values), and DataFramesMeta.jl (one function, keyword arguments).\n\n- v1 repo: https://github.com/clojure-finance/datajure/tree/v1\n\nSpecial thanks to [YANG Ming-Tian](https://github.com/skylee03) for the original v1 implementation.\n\n## License\n\nCopyright © 2024–2026 Centre for Investment Management, HKU Business School.\n\nDistributed under the Eclipse Public License version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclojure-finance%2Fdatajure","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclojure-finance%2Fdatajure","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclojure-finance%2Fdatajure/lists"}