https://github.com/clojure-finance/datajure
Clojure data manipulation DSL — composable query syntax built on tech.ml.dataset
- Host: GitHub
- URL: https://github.com/clojure-finance/datajure
- Owner: clojure-finance
- License: other
- Created: 2022-02-06T00:42:57.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2026-04-17T03:55:05.000Z (about 2 months ago)
- Last Synced: 2026-04-17T04:17:43.450Z (about 2 months ago)
- Topics: clojure, data-manipulation, data-science, dataframe, dsl, empirical-research, query-dsl, tech-ml-dataset
- Language: Clojure
- Homepage: https://clojure-finance.github.io/datajure-website
- Size: 279 KB
- Stars: 14
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Datajure v2
[Clojars](https://clojars.org/com.github.clojure-finance/datajure)
[Tests](https://github.com/clojure-finance/datajure/actions/workflows/tests.yml)
[cljdoc](https://cljdoc.org/d/com.github.clojure-finance/datajure/CURRENT)
**One function. Seven keywords. Two expression modes.**
Datajure is a Clojure data manipulation library built on [tech.ml.dataset](https://github.com/techascent/tech.ml.dataset). It provides a clean, composable query DSL for filtering, transforming, grouping, and aggregating tabular data.
```clojure
(require '[datajure.core :refer [dt nrow asc desc]])
;; Filter, group, aggregate — one call
(dt ds
:where #dt/e (> :year 2008)
:by [:species]
:agg {:n nrow :avg #dt/e (mn :mass)})
;; Window functions — same keywords, no new concepts
(dt ds
:by [:species]
:within-order [(desc :mass)]
:set {:rank #dt/e (win/rank :mass)})
;; OHLC bars in one call — :within-order with :agg sorts each group first
(dt trades
:by [:sym]
:within-order [(asc :time)]
:agg {:open #dt/e (first-val :price)
:close #dt/e (last-val :price)
:hi #dt/e (mx :price)
:lo #dt/e (mi :price)
:vol #dt/e (sm :size)})
;; Thread for multi-step pipelines
(-> ds
(dt :set {:bmi #dt/e (/ :mass (sq :height))})
(dt :by [:species] :agg {:avg-bmi #dt/e (mn :bmi)})
(dt :order-by [(desc :avg-bmi)]))
```
Datajure is a **syntax layer**, not an engine — it compiles `#dt/e` expressions to vectorized operations and delegates all computation to `tech.v3.dataset`. Every result is a standard `tech.v3.dataset` dataset. Full interop with tablecloth, Clerk, Clay, and the Scicloj ecosystem.
## Why Datajure
Datajure takes inspiration from whichever library or language got a given idea right — R's `data.table` (terse query form, single-expression semantics), APL/q/kdb+ (first-class primitives for time-series operations you use every day), Polars (expressions as values, composable vocabulary), Julia's `DataFramesMeta.jl` (one function with keyword arguments, not twenty-eight verbs). The goal is not to be any of them. It is to combine the parts that were genuinely revelations.
Concretely, if you've used:
- **R's `data.table`**: `DT[i, j, by]` maps directly onto `(dt ds :where i :by by ...)`, with `j` supplied through `:set` or `:agg`. Nil handling is cleaner than data.table's `NA`. There is no in-place mutation (Datajure is immutable) and no secondary indexes (`setkey`); tech.v3.dataset's columnar layout is fast enough without them.
- **Python's pandas/Polars**: you get expression objects as values (like Polars' `Expr`), nil-aware comparisons and arithmetic in `#dt/e` (see *Nil handling* below for the exact rules), and a single query form rather than a pipeline of a dozen verbs.
- **R's `dplyr` or tidyverse**: you'll find the same pipe-friendly composition (`->` is Clojure's pipe), with less verbosity and without the function-per-verb proliferation.
- **Julia's `DataFramesMeta.jl`**: the `#dt/e` reader tag serves the same role as DFM's `@transform`/`@subset`, but because Clojure has a real reader tag mechanism (rather than macros pretending to parse expressions), it integrates more cleanly with the rest of the language.
- **q/kdb+**: the `win/*` namespace provides `win/delta` and `win/ratio` (q's `deltas` and `ratios`) alongside `win/mavg`, `win/msum`, `win/mdev`, `win/ema`, `win/fills`, `win/scan`, and `win/each-prior`, plus `wavg`, `wsum`, `first-val`, and `last-val` as aggregation primitives. `xbar` ships for time-series bar generation. As-of joins with `:direction` and `:tolerance` and window joins (`:how :window`) are built in.
Datajure's unique wedge is that `#dt/e` expressions are first-class AST values — you can store them in vars and compose them across queries. Build a shared vocabulary once, reuse it everywhere:
```clojure
(def ret #dt/e (- (win/ratio :price) 1))
(def log-ret #dt/e (log (+ 1 ret)))
(def vol-20d #dt/e (win/mdev ret 20))
(def wealth #dt/e (win/scan * (+ 1 ret)))
(dt prices :by [:permno] :within-order [(asc :date)]
:set {:ret ret :log-ret log-ret :vol-20d vol-20d :wealth wealth})
```
No equivalent exists in tablecloth, dplyr, pandas, or data.table.
## Installation
Add to your `deps.edn`:
```clojure
{:deps {com.github.clojure-finance/datajure {:mvn/version "2.0.9"}}}
```
Datajure requires Clojure 1.12+ and Java 21+.
## The Key Insight: `:by` × `:set`/`:agg`
Two orthogonal keywords produce four distinct operations with no new concepts:
| | No `:by` | With `:by` |
|------------|---------------------|---------------------|
| **`:set`** | Column derivation (+ whole-dataset window if `win/*` present) | **Partitioned window** |
| **`:agg`** | Whole-table summary | Group aggregation |
```clojure
;; Column derivation — add/update columns, keep all rows
(dt ds :set {:bmi #dt/e (/ :mass (sq :height))})
;; Group aggregation — collapse rows per group
(dt ds :by [:species] :agg {:n nrow :avg-mass #dt/e (mn :mass)})
;; Whole-table summary — collapse everything
(dt ds :agg {:total #dt/e (sm :mass) :n nrow})
;; Partitioned window — compute within groups, keep all rows
(dt ds
:by [:species]
:within-order [(desc :mass)]
:set {:rank #dt/e (win/rank :mass)
:cumul #dt/e (win/cumsum :mass)})
;; Whole-dataset window — no :by, entire dataset is one partition
(dt ds
:within-order [(asc :date)]
:set {:cumret #dt/e (win/cumsum :ret)
:prev #dt/e (win/lag :price 1)})
```
`:within-order` also combines with `:agg`, sorting rows within each group before the aggregation runs. This is the one-call OHLC pattern and the reason `first-val` / `last-val` are first-class helpers:
```clojure
(dt trades
:by [:sym :date]
:within-order [(asc :time)]
:agg {:open #dt/e (first-val :price)
:close #dt/e (last-val :price)
:hi #dt/e (mx :price)
:lo #dt/e (mi :price)
:vol #dt/e (sm :size)})
;; VWAP and weighted sum
(dt trades :by [:sym :date]
:agg {:vwap #dt/e (wavg :size :price)
:vol #dt/e (wsum :size :price)})
```
## `dt` Dispatch Modes
`dt` runs a single fixed evaluation order: `:where` → `:set`-or-`:agg` → `:select` → `:order-by`. What the middle step does depends on which other keywords are present:
| `:by` | `:set` | `:agg` | `:within-order` | Mode |
|--------|---------|---------|-----------------|---------------------------------------------------------|
| — | plain | — | — | Derive columns over whole dataset |
| — | `win/*` | — | optional | Whole-dataset window |
| ✓ | plain | — | optional | Per-group derivation |
| ✓ | `win/*` | — | optional | Partitioned window |
| — | — | ✓ | optional | Whole-table aggregate (sorted first if `:within-order`) |
| ✓ | — | ✓ | optional | Group aggregate (sorted within group if `:within-order`)|
Disallowed: `:set` and `:agg` in the same call (use `->` threading); `:within-order` without `:set` or `:agg`.
## Expression Mode: `#dt/e`
`#dt/e` is a reader tag that rewrites bare keywords to column accessors. It returns an AST object that `dt` interprets — vectorized, pre-validated, and nil-literal-safe.
```clojure
;; With #dt/e — terse, keyword-lifted, vectorized
(dt ds :where #dt/e (> :mass 4000))
(dt ds :set {:bmi #dt/e (/ :mass (sq :height))})
;; Without — plain Clojure functions (always works)
(dt ds :where #(> (:mass %) 4000))
(dt ds :set {:bmi #(/ (:mass %) (Math/pow (:height %) 2))})
```
`#dt/e` is opt-in. Users who prefer plain Clojure functions can ignore it entirely. See *Expression Mode vs. Plain Functions* below for when to pick which.
### Nil handling
Datajure has a layered nil story rather than blanket "nil-safety". The rules:
| Situation | Behaviour |
|-------------------------------------------------------|--------------------------|
| Comparison op with a nil *literal* in `#dt/e` | evaluates to `false` |
| Arithmetic op with a nil *literal* in `#dt/e` | returns `nil` |
| Column-level nils (nil values within a column) | depends on the `dfn` op |
| Aggregation helpers (`mn`/`sm`/`md`/`sd`/`nrow`/...) | skip nil; `nil` if all missing (never `0`/`-Inf`/`NaN`) |
| `win/fills :col` | forward-fill nils |
| `coalesce :col default` | replace nils with fallback |
| `div0 num den` | `nil` if denominator is `nil` or zero |
| `win/ratio :col` | `nil` if previous value is `nil` or zero |
| Plain Clojure functions | **not** automatic; wrap with `pass-nil` |
```clojure
(dt ds :where #dt/e (> :mass 4000)) ;; nil-literal → false
(dt ds :set {:mass #dt/e (coalesce :mass 0)}) ;; nil → 0
(dt ds :set {:pe #dt/e (div0 :price :earnings)}) ;; zero denom → nil
(dt ds :set {:x (pass-nil #(parse-int (:x-str %)))}) ;; wrap plain fn
```
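The rules in the table can be restated as a few lines of plain Clojure. This is only a sketch of the documented behaviour, not Datajure's implementation: the `*-sketch` names are made up for illustration, and returning `nil` for a nil numerator in `div0-sketch` is an assumption.

```clojure
;; Plain-Clojure restatement of three nil rules from the table above.
;; Hypothetical helper names; NOT Datajure's own code.
(defn mean-sketch
  "Mean of the non-nil values; nil when every value is missing."
  [xs]
  (let [vs (remove nil? xs)]
    (when (seq vs)
      (/ (reduce + vs) (double (count vs))))))

(defn coalesce-sketch
  "Returns default when x is nil, otherwise x."
  [x default]
  (if (nil? x) default x))

(defn div0-sketch
  "num / den, or nil when den is nil or zero.
  A nil num also yields nil here, which is an assumption."
  [num den]
  (when (and (some? num) (some? den) (not (zero? den)))
    (/ num (double den))))

(mean-sketch [1 nil 3])   ;; => 2.0
(mean-sketch [nil nil])   ;; => nil
(coalesce-sketch nil 0)   ;; => 0
(div0-sketch 10 4)        ;; => 2.5
(div0-sketch 10 0)        ;; => nil
```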
### Special forms
```clojure
;; Multi-branch conditional
(dt ds :set {:size #dt/e (cond
(> :mass 5000) "large"
(> :mass 3500) "medium"
:else "small")})
;; Local bindings
(dt ds :set {:adj #dt/e (let [bmi (/ :mass (sq :height))
base (if (> :year 2010) 1.1 1.0)]
(* base bmi))})
;; Boolean composition, membership, range
(dt ds :where #dt/e (and (> :mass 4000) (not (= :species "Adelie"))))
(dt ds :where #dt/e (in :species #{"Gentoo" "Chinstrap"}))
(dt ds :where #dt/e (between? :year 2007 2009))
```
### Reusable expressions
`#dt/e` returns first-class AST values. Store them in vars, reuse across queries, compose them into new expressions:
```clojure
(def bmi #dt/e (/ :mass (sq :height)))
(def high-mass #dt/e (> :mass 4000))
(def obese #dt/e (> bmi 30)) ;; composition — bmi appears inside another #dt/e
(dt ds :set {:bmi bmi})
(dt ds :where high-mass)
(dt ds :by [:species] :agg {:avg-bmi #dt/e (mn bmi)})
(dt ds :where obese)
```
The mechanism is simple: `#dt/e` returns an AST map, and `(def ...)` captures that value. When the symbol appears inside another `#dt/e`, the reader leaves it in place as an ordinary symbol; Clojure resolves it to its AST value when the enclosing form is evaluated, and the compiler splices that value into the outer expression. No macros, no magic: just values.
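To make the "just values" idea concrete, here is a toy model in plain Clojure. It is not Datajure's actual AST or compiler: the `{:op ... :args ...}` map shape, the `ops` table, and `eval-expr` are all hypothetical, chosen only to show how an expression held in a var can be embedded in another expression and then evaluated column-wise.

```clojure
;; Toy model only: the {:op ... :args ...} shape and eval-expr are hypothetical,
;; not Datajure's real AST or compiler.
(def ops {:div / :gt > :sq (fn [x] (* x x))})

(defn eval-expr
  "Evaluates a toy expression against a map of column keyword -> vector,
  returning one value per row."
  [expr cols]
  (cond
    (keyword? expr) (get cols expr)                 ; column reference
    (map? expr)     (apply mapv (get ops (:op expr))
                           (map #(eval-expr % cols) (:args expr)))
    :else           (mapv (constantly expr)         ; literal, broadcast to every row
                          (val (first cols)))))

(def bmi   {:op :div :args [:mass {:op :sq :args [:height]}]})
(def obese {:op :gt  :args [bmi 30]})   ; bmi is embedded as an ordinary value

(def cols {:mass [80 124] :height [2 2]})

(eval-expr bmi cols)    ;; => [20 31]
(eval-expr obese cols)  ;; => [false true]
```

Because `bmi` is plain data, `obese` needs no special support to reuse it; the evaluator simply walks into the nested map.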
### Expression Mode vs. Plain Functions
| | `#dt/e` (column-wise) | Plain function (context-dependent) |
|-----------------------|----------------------------------------|----------------------------------------|
| Operates on | Whole column vectors via `dfn` | Row map in `:set`/`:where`; group dataset in `:agg` |
| Column access | Bare keywords: `:mass` | `(:mass %)` |
| Performance | Fast — vectorized | Slower — per-row call in `:set`/`:where` |
| Nil handling | Automatic (for literals and helpers) | Manual (`pass-nil` or explicit checks) |
| Validation | Pre-execution column checking; Damerau suggestions | Runtime errors only |
| Best for | Arithmetic, comparisons, aggregations | Complex branching, Java interop, non-vectorizable logic |
Prefer `#dt/e` by default. Fall back to plain functions when the computation doesn't map to vectorized ops.
**Footgun to know about in `:agg`:** plain functions receive the *group dataset*, not a row, so `(:mass %)` returns a column vector rather than a scalar. Datajure detects this and throws a structured error since v2.0.6 — but this is why `#dt/e (mn :mass)` is safer than `#(mean (:mass %))`.
## `:select` — Polymorphic Column Selection
```clojure
(dt ds :select [:species :mass]) ;; explicit list
(dt ds :select :type/numerical) ;; all numeric columns
(dt ds :select :!type/numerical) ;; all non-numeric
(dt ds :select #"body-.*") ;; regex match
(dt ds :select [:not :id :timestamp]) ;; exclusion
(dt ds :select {:species :sp :mass :m}) ;; select + rename
(dt ds :select (between :month-01 :month-12)) ;; positional range (inclusive)
```
## Window Functions
Available via `win/*` inside `#dt/e`. Work in `:set` context — with `:by` for partitioned windows, or without `:by` for whole-dataset windows:
```clojure
;; Partitioned window — grouped by permno
(dt ds
:by [:permno]
:within-order [(asc :date)]
:set {:rank #dt/e (win/rank :ret)
:lag-1 #dt/e (win/lag :ret 1)
:cumret #dt/e (win/cumsum :ret)
:regime #dt/e (win/rleid :sign-ret)})
;; Whole-dataset window — no :by, entire dataset is one partition
(dt ds
:within-order [(asc :date)]
:set {:cumret #dt/e (win/cumsum :ret)
:prev #dt/e (win/lag :price 1)})
```
Functions: `win/rank`, `win/dense-rank`, `win/row-number`, `win/lag`, `win/lead`, `win/cumsum`, `win/cummin`, `win/cummax`, `win/cummean`, `win/rleid`, `win/delta`, `win/ratio`, `win/differ`, `win/mavg`, `win/msum`, `win/mdev`, `win/mmin`, `win/mmax`, `win/ema`, `win/fills`, `win/scan`, `win/each-prior`.
### Adjacent-Element Ops
Inspired by q's `deltas` and `ratios` — eliminate verbose lag patterns:
```clojure
(dt ds :by [:permno] :within-order [(asc :date)]
:set {:ret #dt/e (- (win/ratio :price) 1) ;; simple return
:price-chg #dt/e (win/delta :price) ;; first differences
:changed #dt/e (win/differ :signal)}) ;; boolean change flag
```
`win/ratio` returns `nil` (not `Infinity`) when the previous value is zero or nil — the canonical simple-return idiom `(- (win/ratio :price) 1)` therefore produces `nil` after a zero-price row rather than contaminating downstream calculations.
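The zero-guard can be sketched in plain Clojure as follows. This is an illustration of the rule described above rather than the library's code: `ratio-sketch` is a made-up name, and treating the first element and a nil current value as `nil` are assumptions.

```clojure
;; Sketch of a zero-guarded ratio to the previous element.
;; Hypothetical helper; first-element and nil-current handling are assumptions.
(defn ratio-sketch
  "x[i] / x[i-1]; nil for the first element, and nil whenever either value
  is nil or the previous value is zero."
  [xs]
  (if (empty? xs)
    []
    (into [nil]
          (map (fn [prev cur]
                 (when (and (some? prev) (some? cur) (not (zero? prev)))
                   (double (/ cur prev))))
               xs (rest xs)))))

(ratio-sketch [100 150 0 50 nil 60])
;; => [nil 1.5 0.0 nil nil nil]
```

The row after the zero price gets `nil` instead of `Infinity`, so subtracting 1 to form a simple return cannot introduce an infinite value.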
### Rolling Windows & EMA
```clojure
(dt ds :by [:permno] :within-order [(asc :date)]
:set {:ma-20 #dt/e (win/mavg :price 20) ;; 20-day moving average
:vol-20 #dt/e (win/mdev :ret 20) ;; 20-day moving std dev
:hi-52w #dt/e (win/mmax :price 252) ;; 52-week high
:ema-10 #dt/e (win/ema :price 10)}) ;; 10-day EMA
```
### Forward-Fill
```clojure
(dt ds :by [:permno] :within-order [(asc :date)]
:set {:price #dt/e (win/fills :price)}) ;; carry forward last known
```
### Cumulative Scan
Generalized cumulative operation inspired by APL/q's scan (`\`). Supports `+`, `*`, `max`, `min` — the killer use case is the wealth index:
```clojure
(dt ds :by [:permno] :within-order [(asc :date)]
:set {:wealth #dt/e (win/scan * (+ 1 :ret)) ;; cumulative compounding
:cum-vol #dt/e (win/scan + :volume) ;; = win/cumsum
:runmax #dt/e (win/scan max :price)}) ;; running maximum
```
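For readers who have not met scan before, Clojure's built-in `reductions` has the same shape: it returns every intermediate result of a reduce. The snippet below uses only `clojure.core` and says nothing about how `win/scan` is implemented or how it treats nils; the `rets` data is made up.

```clojure
;; clojure.core/reductions as a stand-in for scan semantics (illustrative data).
(def rets [0.10 -0.05 0.20])

;; Wealth index: running product of (1 + r).
(def wealth (reductions * (map #(+ 1 %) rets)))

(reductions + [1 2 3 4])      ;; => (1 3 6 10)   running sum
(reductions max [3 1 4 1 5])  ;; => (3 3 4 4 5)  running maximum
```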
### Generalized Adjacent-Element Ops (`win/each-prior`)
`win/each-prior` is the generalization of `win/delta` and `win/ratio` — applies any binary operator to `f(x[i], x[i-1])`. Supports `+`, `-`, `*`, `/`, `max`, `min`, and comparison operators. First element → nil; nil propagates.
```clojure
(dt ds :by [:permno] :within-order [(asc :date)]
:set {;; subtract: same result as win/delta (without double-casting)
:chg #dt/e (win/each-prior - :price)
;; max with previous — running pairwise high
:pw-hi #dt/e (win/each-prior max :price)
;; boolean: did value increase?
:up? #dt/e (win/each-prior > :price)})
```
Use `win/delta` when you want the named function with its double-casting; use `win/ratio` when you need the zero-guard (nil instead of Infinity). Use `win/each-prior` when you need a different operator entirely.
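The `f(x[i], x[i-1])` rule is small enough to sketch in plain Clojure. The helper below is hypothetical (`each-prior-sketch` is not a Datajure function) and only mirrors the behaviour stated above: first element nil, nil propagates.

```clojure
;; Sketch of the each-prior rule over a plain vector. Hypothetical helper.
(defn each-prior-sketch
  "Applies f to (current, previous) for every element after the first.
  The first element is nil, and a nil on either side yields nil."
  [f xs]
  (if (empty? xs)
    []
    (into [nil]
          (map (fn [prev cur]
                 (when (and (some? prev) (some? cur))
                   (f cur prev)))
               xs (rest xs)))))

(each-prior-sketch - [10 12 11 nil 15])  ;; => [nil 2 -1 nil nil]
(each-prior-sketch max [10 12 11])       ;; => [nil 12 12]
(each-prior-sketch > [10 12 11])         ;; => [nil true false]
```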
## Row-wise Functions
Cross-column operations within a single row via `row/*`:
```clojure
(dt ds :set {:total #dt/e (row/sum :q1 :q2 :q3 :q4)
:avg-q #dt/e (row/mean :q1 :q2 :q3 :q4)
:n-miss #dt/e (row/count-nil :q1 :q2 :q3 :q4)})
```
Functions: `row/sum` (nil as 0), `row/mean`, `row/min`, `row/max` (skip nil), `row/count-nil`, `row/any-nil?`.
## Statistical Transforms
Column-level transforms via `stat/*` inside `#dt/e`. All are nil-safe — nil values are excluded from reference statistics and produce nil outputs.
```clojure
;; Standardize: (x - mean) / sd — returns all-nil if sd is zero
(dt ds :set {:z #dt/e (stat/standardize :ret)})
;; Demean: x - mean(x)
(dt ds :set {:dm #dt/e (stat/demean :ret)})
;; Winsorize at 1% tails — clips to [p, 1-p] percentile bounds
(dt ds :set {:wr #dt/e (stat/winsorize :ret 0.01)})
;; Compose with arithmetic
(dt ds :set {:scaled #dt/e (* 2 (stat/demean :x))})
;; Cross-sectional standardization per group
(dt ds :by [:date] :set {:z #dt/e (stat/standardize :signal)})
```
Functions: `stat/standardize`, `stat/demean`, `stat/winsorize`.
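As a rough guide to the nil behaviour described above, here is a plain-Clojure sketch of demeaning and standardizing a vector. The helper names are made up, and the use of the sample standard deviation (dividing by n-1) is an assumption; the README does not say which estimator `stat/standardize` uses.

```clojure
;; Sketches only; hypothetical names, and the n-1 denominator is an assumption.
(defn demean-sketch
  "x - mean(x), where the mean ignores nils and nil inputs stay nil."
  [xs]
  (let [vs (remove nil? xs)
        mu (when (seq vs) (/ (reduce + vs) (double (count vs))))]
    (mapv #(when (and mu (some? %)) (- % mu)) xs)))

(defn standardize-sketch
  "(x - mean) / sd over the non-nil values; all-nil when sd is zero
  or cannot be computed."
  [xs]
  (let [vs (remove nil? xs)
        n  (count vs)
        mu (when (pos? n) (/ (reduce + vs) (double n)))
        sd (when (> n 1)
             (Math/sqrt (/ (reduce + (map #(let [d (- % mu)] (* d d)) vs))
                           (dec n))))]
    (mapv #(when (and (some? %) sd (pos? sd)) (/ (- % mu) sd)) xs)))

(demean-sketch [1 nil 3])       ;; => [-1.0 nil 1.0]
(standardize-sketch [1 2 3])    ;; => [-1.0 0.0 1.0]
(standardize-sketch [5 5 nil])  ;; => [nil nil nil]
```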
## Joins
Standalone function with cardinality validation and merge diagnostics. Supports regular joins (`:inner`, `:left`, `:right`, `:outer`), as-of joins (`:asof`, with `:direction` and `:tolerance`), and window joins (`:window`, aggregates over matched sub-datasets).
```clojure
(require '[datajure.join :refer [join]])
(join X Y :on :id :how :left)
(join X Y :on [:firm :date] :how :inner :validate :m:1)
(join X Y :left-on :id :right-on :key :how :left :report true)
;; [datajure] join report: 150 matched, 3 left-only, 0 right-only
;; Thread with dt
(-> (join X Y :on :id :how :left :validate :m:1)
(dt :where #dt/e (> :year 2008)
:agg {:total #dt/e (sm :revenue)}))
```
## As-of Joins
Inspired by q's `aj`. For each left row, find the last right row where `right-key <= left-key` within an exact-match group. All left rows are always preserved; unmatched rows get nil for right columns.
The **last column** in `:on` (or `:left-on`/`:right-on`) is the asof column — preceding columns are exact-match keys.
```clojure
(require '[datajure.join :refer [join]])
;; Trade-quote matching: each trade gets the last prevailing bid/ask.
;; sym is exact-match, time is asof (last quote where quote-time <= trade-time)
(join trades quotes :on [:sym :time] :how :asof)
;; Asymmetric key names
(join trades quotes
:left-on [:sym :trade-time]
:right-on [:sym :quote-time]
:how :asof)
;; With cardinality validation (right side only)
(join trades quotes :on [:sym :time] :how :asof :validate :m:1)
```
**Result schema:** all left columns in original order, plus right non-key columns appended. Conflicting non-key column names are prefixed with `:right.` (same convention as regular joins).
**`:validate` for `:asof`:** only the right side is checked (`:1:1` and `:m:1` require unique right keys). The left side is never checked since all left rows always appear.
### Directional and Bounded As-of Joins
`:direction` controls which side of the asof key is matched (default `:backward`). `:tolerance` sets a maximum allowable distance — matches beyond it produce nil.
```clojure
;; :forward — first right row where right-key >= left-key
(join left right :on [:sym :time] :how :asof :direction :forward)
;; :nearest — closest right row by absolute distance; ties prefer :backward
(join left right :on [:sym :time] :how :asof :direction :nearest)
;; :tolerance — reject matches more than 5 time units away
(join trades quotes :on [:sym :time] :how :asof :tolerance 5)
;; Combine: nearest match within a 3-unit window
(join left right :on [:time] :how :asof :direction :nearest :tolerance 3)
```
`:tolerance` requires a numeric asof key. Matches that exceed the tolerance produce nil for right columns — same as having no match.
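The matching rules for `:direction` and `:tolerance` can be sketched over a sorted vector of keys. This is a linear-scan illustration in plain Clojure, not the `datajure.asof` engine: `asof-match-sketch` is a hypothetical name, it assumes integer keys, and it ignores the exact-match columns.

```clojure
;; Sketch of as-of matching on one sorted key column. Hypothetical helper;
;; assumes integer keys and ignores exact-match grouping.
(defn asof-match-sketch
  "Returns the key from sorted-keys that matches left-key under the given
  direction, or nil if nothing qualifies or the match exceeds tolerance."
  [sorted-keys left-key {:keys [direction tolerance] :or {direction :backward}}]
  (let [back (last (take-while #(<= % left-key) sorted-keys))   ; last key <= left
        fwd  (first (drop-while #(< % left-key) sorted-keys))   ; first key >= left
        dist (fn [k] (Math/abs (long (- k left-key))))
        hit  (case direction
               :backward back
               :forward  fwd
               :nearest  (cond
                           (nil? back) fwd
                           (nil? fwd)  back
                           (<= (dist back) (dist fwd)) back     ; ties prefer backward
                           :else fwd))]
    (when (and (some? hit)
               (or (nil? tolerance) (<= (dist hit) tolerance)))
      hit)))

(def quote-times [1 4 9])

(asof-match-sketch quote-times 5 {})                       ;; => 4
(asof-match-sketch quote-times 5 {:direction :forward})    ;; => 9
(asof-match-sketch quote-times 5 {:direction :nearest})    ;; => 4
(asof-match-sketch quote-times 5 {:direction :forward
                                  :tolerance 3})           ;; => nil
(asof-match-sketch quote-times 0 {})                       ;; => nil
```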
## Window Joins
Inspired by q's `wj`. For each left row, finds **all** right rows whose asof-key falls within a window around the left row's asof-key, then aggregates them with `:agg`. All left rows are preserved.
The **last column** in `:on` is the asof column — preceding columns are exact-match keys.
```clojure
(require '[datajure.join :refer [join]])
(require '[datajure.core :as core]) ;; for core/nrow, core/dt, core/asc below
;; 3-unit lookback: each left row aggregates right rows in [left-t - 3, left-t]
(join trades quotes
:on [:sym :time]
:how :window
:window [-3 0]
:agg {:avg-bid #dt/e (mn :bid)
:n-quotes core/nrow})
;; 5-minute lookback using temporal units
(join trades quotes
:on [:sym :time]
:how :window
:window [-5 0 :minutes]
:agg {:avg-bid #dt/e (mn :bid)
:avg-ask #dt/e (mn :ask)
:n core/nrow})
;; Symmetric window: 2 units either side
(join events signals
:on [:sym :time]
:how :window
:window [-2 2]
:agg {:mean-signal #dt/e (mn :value)})
;; Asymmetric key names
(join trades quotes
:left-on [:sym :trade-time]
:right-on [:sym :quote-time]
:how :window
:window [-5 0 :minutes]
:agg {:vwap #dt/e (wavg :size :bid)})
```
**Window spec formats** — all three are equivalent:
```clojure
[-5 0 :minutes] ;; [lo hi unit] — recommended
[-5 :minutes 0] ;; [lo unit hi] — also accepted
[-300000 0] ;; [lo hi], raw units (300000 ms = 5 min)
```
Supported units: `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`.
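One way to picture the equivalence of the three spec formats is a small normalizer that converts each to raw `[lo hi]` bounds. The sketch below is hypothetical (`normalize-window` and `unit->ms` are not Datajure names) and assumes millisecond keys, in line with the "300000 ms = 5 min" comment above.

```clojure
;; Sketch: normalize a window spec to raw [lo hi] in milliseconds.
;; Hypothetical helper names; millisecond resolution is an assumption.
(def unit->ms
  {:seconds 1000
   :minutes 60000
   :hours   3600000
   :days    86400000
   :weeks   604800000})

(defn normalize-window
  "Accepts [lo hi], [lo hi unit] or [lo unit hi] and returns [lo hi] in ms."
  [spec]
  (let [unit    (first (filter keyword? spec))
        [lo hi] (remove keyword? spec)
        factor  (if unit
                  (or (get unit->ms unit)
                      (throw (ex-info "Unsupported unit" {:unit unit})))
                  1)]
    [(* lo factor) (* hi factor)]))

(normalize-window [-5 0 :minutes])  ;; => [-300000 0]
(normalize-window [-5 :minutes 0])  ;; => [-300000 0]
(normalize-window [-300000 0])      ;; => [-300000 0]
```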
**`:agg` values:**
- `#dt/e` expressions: applied to the matched sub-dataset; they return **nil** for empty windows (avoiding the NaN that `dfn/mean` gives on empty columns)
- Plain fns: receive the matched sub-dataset directly, which has 0 rows for an empty window, so `nrow` naturally returns **0**
**Result schema:** all left columns preserved, plus one column per `:agg` entry.
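The window-join rule, including the difference between a nil mean and a zero count for empty windows, can be sketched over plain row maps. This is an illustration only: `window-agg-sketch` is a made-up helper, it hard-codes a `:time` key and a `:bid` value, and it skips the exact-match columns.

```clojure
;; Sketch of one left row's window aggregation. Hypothetical helper;
;; :time and :bid are hard-coded and exact-match keys are ignored.
(defn window-agg-sketch
  "Aggregates right rows whose :time lies in [left-t + lo, left-t + hi]."
  [left-t right-rows [lo hi]]
  (let [hits (filterv #(<= (+ left-t lo) (:time %) (+ left-t hi)) right-rows)
        bids (keep :bid hits)]
    {:n-quotes (count hits)                          ; 0 for an empty window
     :avg-bid  (when (seq bids)                      ; nil for an empty window
                 (/ (reduce + bids) (double (count bids))))}))

(def quotes [{:time 1 :bid 10} {:time 3 :bid 12} {:time 7 :bid 20}])

(window-agg-sketch 4 quotes [-3 0])   ;; => {:n-quotes 2, :avg-bid 11.0}
(window-agg-sketch 20 quotes [-3 0])  ;; => {:n-quotes 0, :avg-bid nil}
```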
```clojure
;; VWAP over 5-minute rolling window — thread into dt
(-> (join trades quotes
:on [:sym :time]
:how :window
:window [-5 0 :minutes]
:agg {:vwap #dt/e (wavg :size :bid)
:depth core/nrow})
(core/dt :where #dt/e (> :depth 0)
:order-by [(core/asc :time)]))
```
## Reshaping
```clojure
(require '[datajure.reshape :refer [melt cast]])
(require '[tech.v3.datatype.functional :as dfn]) ;; for dfn/mean below
;; Wide → long
(-> ds
(melt {:id [:species :year] :measure [:mass :flipper :bill]})
(dt :by [:species :variable] :agg {:avg #dt/e (mn :value)}))
;; Long → wide (complement to melt)
(cast ds {:id [:species :year] :from :variable :value :value})
;; With aggregation for duplicate (id, from) cells
(cast ds {:id [:date :sym] :from :metric :value :val :agg dfn/mean})
;; Round-trip
(-> ds
(melt {:id [:species :year] :measure [:mass :flipper]})
(cast {:id [:species :year] :from :variable :value :value}))
```
`cast` options: `:id` (required), `:from` (required), `:value` (required), `:agg` (fn applied to a vector of values when multiple rows share the same id+from combination; default: first value), `:fill` (value for missing cells; default: nil).
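The roles of `:agg` and `:fill` are easier to see on plain data. The following sketch pivots a sequence of row maps; it is not how `datajure.reshape/cast` is implemented, `cast-sketch` is a hypothetical name, and the row order of the result is unspecified (hence the `set` in the examples).

```clojure
;; Sketch of long -> wide pivoting over row maps. Hypothetical helper.
(defn cast-sketch
  "Pivots rows so each distinct value of `from` becomes a column.
  agg receives a vector of values per cell (default: first); missing cells get fill."
  [rows {:keys [id from value agg fill] :or {agg first}}]
  (let [new-cols (distinct (map from rows))]
    (for [[id-vals group] (group-by #(select-keys % id) rows)]
      (let [cells (group-by from group)]
        (into id-vals
              (for [c new-cols]
                [c (if-let [rs (get cells c)]
                     (agg (mapv value rs))
                     fill)]))))))

(def long-rows
  [{:date 1 :metric :open  :val 10}
   {:date 1 :metric :close :val 12}
   {:date 1 :metric :open  :val 14}   ; duplicate (id, from) cell
   {:date 2 :metric :open  :val 20}])

(set (cast-sketch long-rows {:id [:date] :from :metric :value :val}))
;; => #{{:date 1 :open 10 :close 12} {:date 2 :open 20 :close nil}}

(set (cast-sketch long-rows {:id [:date] :from :metric :value :val
                             :agg #(/ (reduce + %) (count %))
                             :fill 0}))
;; => #{{:date 1 :open 12 :close 12} {:date 2 :open 20 :close 0}}
```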
## Utilities
```clojure
(require '[datajure.util :as du])
(du/describe ds) ;; summary stats → dataset
(du/describe ds [:mass :height]) ;; subset of columns
(du/clean-column-names messy-ds) ;; "Some Ugly Name!" → :some-ugly-name (Unicode-aware)
(du/mark-duplicates ds [:id :date]) ;; adds :duplicate? column
(du/drop-constant-columns ds) ;; remove zero-variance
(du/coerce-columns ds {:year :int64 :mass :float64})
```
`clean-column-names` preserves non-ASCII characters (CJK, accented Latin, Cyrillic, Greek) — `"市值 (HKD millions)!"` becomes `:市值-hkd-millions`.
## File I/O
```clojure
(require '[datajure.io :as dio])
(def ds (dio/read "data.csv"))
(def ds (dio/read "data.parquet")) ;; needs tech.v3.libs.parquet
(def ds (dio/read "data.tsv.gz")) ;; gzip auto-detected
(dio/write ds "output.csv")
```
Supported: CSV, TSV, Parquet, Arrow, Excel, Nippy. Gzipped variants auto-detected.
## Bucketing with `xbar`
Floor-division bucketing inspired by q's `xbar`. Primary use case is computed `:by` for time-series bar generation:
```clojure
;; Numeric bucketing in :by — price buckets of width 10
(dt ds :by [(xbar :price 10)] :agg {:n nrow :avg #dt/e (mn :volume)})
;; 5-minute OHLCV bars
(dt trades
:by [(xbar :time 5 :minutes) :sym]
:within-order [(asc :time)]
:agg {:open #dt/e (first-val :price)
:close #dt/e (last-val :price)
:vol #dt/e (sm :size)
:n nrow})
;; Also usable inside #dt/e as a column derivation
(dt ds :set {:bucket #dt/e (xbar :price 5)})
```
Supported temporal units: `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`. Returns nil for nil input.
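Floor-division bucketing itself is a one-liner. The sketch below shows the arithmetic on integers in plain Clojure; `xbar-sketch` is a hypothetical stand-in, not the library function, and representing times as epoch milliseconds is an assumption made for the last example.

```clojure
;; Sketch of floor-division bucketing on integers. Hypothetical helper.
(defn xbar-sketch
  "Rounds x down to the nearest multiple of width; nil stays nil."
  [x width]
  (when (some? x)
    (* width (Math/floorDiv (long x) (long width)))))

(xbar-sketch 37 10)           ;; => 30
(xbar-sketch -3 10)           ;; => -10  (floors toward negative infinity)
(xbar-sketch nil 10)          ;; => nil
(xbar-sketch 1000000 300000)  ;; => 900000  (5-minute bars on epoch ms)
```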
## Quantile Binning with `cut`
Equal-count (quantile) binning inside `#dt/e`. The optional `:from` mask computes breakpoints from a **reference subpopulation** and applies them to **all rows** — the reference and binned populations can be different sizes. This directly models the NYSE-breakpoints pattern used in empirical finance:
```clojure
;; Basic: 5 equal-count bins across all rows
(dt ds :set {:size-q #dt/e (cut :mktcap 5)})
;; NYSE breakpoints: compute quintile breakpoints from NYSE stocks only,
;; apply to all stocks (NYSE + AMEX + NASDAQ)
(dt ds :set {:size-q #dt/e (cut :mktcap 5 :from (= :exchcd 1))})
;; :from accepts any #dt/e boolean expression
(dt ds :set {:size-q #dt/e (cut :mktcap 5 :from (and (= :exchcd 1) (> :year 2000)))})
;; Per-date NYSE breakpoints — the canonical CRSP usage
(-> crsp
(dt :where #dt/e (= (month :date) 6))
(dt :by [:date]
:set {:size-q #dt/e (cut :mktcap 5 :from (= :exchcd 1))}))
```
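The reference-subpopulation idea can be sketched in plain Clojure: compute breakpoints from the rows that pass the mask, then bin every row against them. Everything here is illustrative. The helper names are made up, and the nearest-rank breakpoint rule and the "value equal to a breakpoint goes in the lower bin" rule are assumptions; Datajure's `cut` may use different conventions.

```clojure
;; Sketch of quantile binning with a reference subpopulation.
;; Hypothetical helpers; nearest-rank breakpoints and tie rule are assumptions.
(defn breakpoints
  "n-1 nearest-rank quantile breakpoints of ref-values
  (assumes at least n reference values)."
  [ref-values n]
  (let [sorted (vec (sort ref-values))
        m      (count sorted)]
    (mapv (fn [k]
            (nth sorted (dec (long (Math/ceil (/ (* k m) (double n)))))))
          (range 1 n))))

(defn assign-bin
  "1-based bin: one more than the number of breakpoints strictly below v."
  [bps v]
  (when (some? v)
    (inc (count (filter #(< % v) bps)))))

(defn cut-sketch
  "Bins (col row) for every row, using breakpoints from rows where (ref? row)."
  [rows col n ref?]
  (let [bps (breakpoints (keep col (filter ref? rows)) n)]
    (mapv #(assign-bin bps (col %)) rows)))

(def stocks
  [{:exchcd 1 :mktcap 10} {:exchcd 1 :mktcap 20}
   {:exchcd 1 :mktcap 30} {:exchcd 1 :mktcap 40}
   {:exchcd 3 :mktcap 5}  {:exchcd 3 :mktcap 25}
   {:exchcd 3 :mktcap nil}])

;; Median split with breakpoints taken from :exchcd 1 rows only.
(cut-sketch stocks :mktcap 2 #(= 1 (:exchcd %)))
;; => [1 1 2 2 1 2 nil]
```

The reference rows (four of them) and the binned rows (all seven) differ in size, which is the point of the `:from` mask.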
## Quantile Grouping with `qtile`
`qtile` is the `:by`-friendly companion to `cut` — produces an equal-count bin assignment from a column's distribution. Use it when you want to *group by* quantile, rather than *derive a column of* quantile bins. Inspired by R's `cut` and Stata's `xtile`; named `qtile` to evoke quintile/decile:
```clojure
;; Global quintile buckets of market cap
(dt stocks :by [(qtile :mktcap 5)]
:agg {:n nrow :mean-ret #dt/e (mn :ret)})
;; Result column is auto-named :mktcap-q5
;; Per-date size quintiles — the canonical CRSP / Fama-French pattern.
;; Each date gets its own breakpoints.
(dt stocks :by [:date (qtile :mktcap 5)]
:agg {:mean-ret #dt/e (mn :ret)})
;; Per-date NYSE-style breakpoints applied to all stocks — Fama-French size sort.
;; For each date, breakpoints are computed from that date's NYSE stocks only,
;; then applied to all stocks (NYSE + AMEX + NASDAQ) on that date.
(dt stocks :by [:date (qtile :mktcap 5 :from #dt/e (= :exchcd 1))]
:agg {:mean-ret #dt/e (mn :ret)})
```
**Breakpoint population depends on what else is in `:by`:**
| `:by` shape | Breakpoints |
|---|---|
| `qtile` alone | Global — computed once from the whole dataset |
| `qtile` + exact keys | Per-partition — computed within each exact-key combination (the data.table / dplyr default) |
| `qtile` with `:from <mask>` | Reference-subpopulation: the mask selects rows for breakpoint computation, applied in whichever population (global or per-partition) the rest of `:by` implies |
For the same bucketing semantics inside `#dt/e` expressions (`:set` / `:where` / `:agg`) rather than `:by`, use `#dt/e (cut :col n)`.
| | `qtile` | `#dt/e (cut ...)` |
|---|---|---|
| Context | `:by` (grouping) | `:set` / `:where` / `:agg` (expression) |
| Result | Integer bin key (1..n, or nil for nil input) | Column of bin integers |
| Per-partition via | Exact keys in same `:by` | `:by` + `:set` window mode |
| `:from` option | Supported (reference subpopulation) | Supported (reference subpopulation) |
| Result column name | Auto `:<col>-q<n>`, e.g. `:mktcap-q5` (customise via `:datajure/col` metadata) | Whatever you name it in `:set` |
Pick `qtile` when the bins are a grouping key; pick `cut` when the bins are a column value you want to keep alongside the original rows.
**Note on small partitions.** If a partition has fewer than `n` non-nil values, breakpoints cannot be computed and all non-nil rows in that partition land in bin 1. Filter out thin partitions upstream or use fewer bins.
## Computed `:by` — Custom Grouping Functions
`:by` accepts a plain function of the row in addition to column keywords. Functions can attach `:datajure/col` metadata to control the result-column name:
```clojure
;; Simple computed :by
(dt ds :by (fn [row] {:heavy? (> (:mass row) 4000)})
:agg {:n nrow})
;; Custom bucketing function with friendly result column name
(defn percentile-bucket [col pct]
(with-meta
(fn [row]
(let [v (get row col)]
(when (some? v)
(int (* pct (/ v 100))))))
{:datajure/col (keyword (str (name col) "-pct-bucket"))}))
(dt ds :by [(percentile-bucket :score 10)] :agg {:n nrow})
;; Result column is named :score-pct-bucket
```
`xbar` uses the same mechanism internally. If no metadata is attached, result columns get synthetic names (`:fn-0`, `:fn-1`, ...).
## Rename
```clojure
(rename ds {:mass :weight-kg :species :penguin-species})
```
## Concise Namespace
Short aliases for power users (q / data.table refugees in particular):
```clojure
(require '[datajure.concise :refer [mn sm md sd ct nuniq fst lst wa ws mx mi N between]])
(dt ds :by [:species] :agg {:n N :avg #dt/e (mn :mass)})
```
| Symbol | Full name |
|--------|-----------|
| `mn` | mean |
| `sm` | sum |
| `md` | median |
| `sd` | stddev |
| `mx` | max (column maximum) |
| `mi` | min (column minimum) |
| `ct` | element count |
| `nuniq`| count-distinct |
| `fst` | first-val |
| `lst` | last-val |
| `wa` | wavg (weighted average) |
| `ws` | wsum (weighted sum) |
| `N` | row count (alias for `nrow`) |
| `standardize` | stat/stat-standardize |
| `demean` | stat/stat-demean |
| `winsorize` | stat/stat-winsorize |
| `between` | positional range selector |
Both `nrow` (discoverable) and `N` (terse, q/data.table style) live in `datajure.core`; `N` is also re-exported from `datajure.concise`.
## Notebook Integration
### Clay (Scicloj ecosystem)
```clojure
(require '[datajure.clay :as dc])
(dc/install!) ;; auto-renders datasets, #dt/e exprs, describe output
;; Or explicit wrapping:
(dc/view ds)
(dc/view-expr #dt/e (/ :mass (sq :height)))
(dc/view-describe (du/describe ds))
```
Start a Clay notebook:
```clojure
(require '[scicloj.clay.v2.api :as clay])
(clay/make! {:source-path "notebooks/datajure_clay_demo.clj"})
```
### Clerk
```clojure
(require '[datajure.clerk :as dc])
(dc/install!) ;; registers custom Clerk viewers
```
## REPL
`*dt*` holds the last dataset result (like `*1`), bound by nREPL middleware:
```clojure
user=> (dt ds :by [:species] :agg {:n nrow})
;; => dataset...
user=> (dt datajure.core/*dt* :order-by [(desc :n)])
```
Enable in `.nrepl.edn`: `{:middleware [datajure.nrepl/wrap-dt]}`
## Error Messages
Structured `ex-info` with suggestions. All errors carry a `:dt/error` key in `ex-data` for programmatic dispatch.
**Unknown column — Damerau-Levenshtein suggestions catch transpositions:**
```clojure
(dt ds :set {:bmi #dt/e (/ :mass :hieght)})
;; => ExceptionInfo: Unknown column(s) #{:hieght} in :set :bmi expression
;; Did you mean: :height (edit distance 1)
;; Available: :species :year :mass :height :flipper
```
**Unknown op — namespace-aware suggestions at read time:**
```clojure
#dt/e (sqrt :x)
;; => ExceptionInfo: Unknown op `sqrt` in #dt/e expression. Did you mean: `sq`?
#dt/e (win/mvag :price 20)
;; => ExceptionInfo: Unknown op `win/mvag` in #dt/e expression. Did you mean: `win/mavg`?
```
**`:agg` plain-function footgun — detected and reported:**
```clojure
(dt ds :by [:species] :agg {:bad #(:mass %)})
;; => ExceptionInfo: :agg plain function for column :bad returned a column, not a scalar.
;; In :agg, plain functions receive the group dataset, so `(:col %)` returns a column
;; vector. Use `(dfn/mean (:col %))` or prefer `#dt/e (mn :col)` which handles both
;; cases uniformly.
```
**Structural errors:**
```clojure
(dt ds :set {:a #dt/e (/ :x 1)} :agg {:n nrow})
;; => ExceptionInfo: Cannot combine :set and :agg. Use -> threading.
(dt ds :set {:bmi #dt/e (/ :mass (sq :height))
:obese #dt/e (> :bmi 30)})
;; => ExceptionInfo: Map-form :set cross-reference.
;; :obese references #{:bmi}, which is being derived in the same map.
;; Use vector-of-pairs [[:bmi ...] [:obese ...]] for sequential derivation.
```
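Because every error carries a `:dt/error` key in its `ex-data`, callers can branch on it instead of parsing messages. The sketch below uses only `clojure.core`; the `with-dt-errors` helper is hypothetical, and `:unknown-column` is a made-up value, since the README does not list the actual dispatch keys.

```clojure
;; Sketch of programmatic dispatch on :dt/error. The helper name and the
;; :unknown-column value are hypothetical; real key values are not documented here.
(defn with-dt-errors
  "Runs thunk. Returns [:ok result], or [:dt-error kind message] for an
  ExceptionInfo that carries :dt/error; anything else is rethrown."
  [thunk]
  (try
    [:ok (thunk)]
    (catch clojure.lang.ExceptionInfo e
      (if-let [kind (:dt/error (ex-data e))]
        [:dt-error kind (ex-message e)]
        (throw e)))))

;; Stand-in for a failing dt call.
(with-dt-errors
  #(throw (ex-info "Unknown column(s) #{:hieght}" {:dt/error :unknown-column})))
;; => [:dt-error :unknown-column "Unknown column(s) #{:hieght}"]

(with-dt-errors #(+ 1 2))
;; => [:ok 3]
```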
## Evaluation Order
`dt` evaluates keywords in this fixed order, regardless of the order they appear in the call:
1. `:where` — filter rows
2. `:set` or `:agg` — derive or aggregate (mutually exclusive; see dispatch modes above)
3. `:select` — keep listed columns
4. `:order-by` — sort final output
## Architecture
```
User writes: #dt/e (/ :mass (sq :height))
↓
AST (pure data, serializable)
↓
compile-expr → fn [ds] → column vector
↓
tech.v3.datatype.functional (dfn)
↓
tech.v3.dataset (columnar, JVM, fast)
```
Datajure is a syntax layer. `#dt/e` expressions compile to an AST, which `compile-expr` translates to vectorized `dfn` operations on `tech.v3.dataset` column vectors. Computation is entirely delegated to the underlying engine; the DSL itself adds only the parsing and dispatch overhead.
## Namespace Guide
| Namespace | Purpose |
|-----------|---------|
| `datajure.core` | `dt`, `N`, `nrow`, `mean`, `sum`, `median`, `stddev`, `variance`, `max*`, `min*`, `count*`, `asc`, `desc`, `pass-nil`, `rename`, `xbar`, `qtile`, `cut`, `between`, `*dt*` |
| `datajure.expr` | AST nodes, compiler, `#dt/e` reader tag |
| `datajure.concise` | Short aliases for power users |
| `datajure.window` | Window function implementations |
| `datajure.row` | Row-wise function implementations |
| `datajure.stat` | Statistical transforms: `stat/standardize`, `stat/demean`, `stat/winsorize` |
| `datajure.util` | `describe`, `clean-column-names`, `duplicate-rows`, etc. |
| `datajure.io` | Unified `read`/`write` dispatching on file extension |
| `datajure.reshape` | `melt` for wide→long, `cast` for long→wide |
| `datajure.join` | `join` with `:validate`, `:report`, `:how :asof` (`:direction`, `:tolerance`), and `:how :window` (`:window`, `:agg`) |
| `datajure.asof` | As-of/window join engine: `asof-search`, `asof-indices`, `asof-match`, `build-result`, `window-indices` |
| `datajure.nrepl` | nREPL middleware for `*dt*` auto-binding |
| `datajure.clerk` | Rich Clerk notebook viewers |
| `datajure.clay` | Clay/Kindly notebook integration |
## Design Principles
1. **`dt` is a function** — not a macro. Debuggable, composable, predictable.
2. **`:where` always filters** — conditional updates go inside `:set` via `if`/`cond`.
3. **Keyword lifting requires `#dt/e`** — no implicit magic in plain Clojure forms.
4. **Layered nil story** — nil literals are safe in `#dt/e`, aggregation helpers skip nils, `coalesce`/`div0`/`win/fills` handle the rest, `pass-nil` wraps plain functions. Not a blanket "nil-safe" claim, but a coherent set of rules that eliminate the common NPE footguns.
5. **Expressions are values** — `#dt/e` returns an AST, not a function. Store in vars, compose freely, build shared vocabularies.
6. **One function, not twenty-eight** — one `dt`, seven keywords, two expression modes. Threading for pipelines.
7. **Errors are data** — structured `ex-info` with `:dt/error` dispatch keys, Damerau-Levenshtein typo suggestions, extensible.
8. **Syntax layer, not engine** — delegate to tech.v3.dataset. Full interop with tablecloth, Clerk, Clay, and the Scicloj ecosystem.
9. **Steal the best ideas** — from data.table, q/kdb+, Polars, DataFramesMeta.jl, APL. The goal isn't to be any of them.
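
Principle 7 can be sketched with ordinary `ex-info` handling. The example assumes only what is stated above, namely that errors carry a `:dt/error` key in their `ex-data`; the specific value of that key is not shown here because it is not documented in this section. The dataset and the `err` name are hypothetical.

```clojure
(require '[datajure.core :refer [dt nrow]]
         '[tech.v3.dataset :as tmd])

;; Hypothetical data, for illustration only
(def ds (tmd/->dataset {:x [1.0 2.0 3.0]}))

;; Trigger the :set + :agg structural error and capture it as data
(def err
  (try
    (dt ds :set {:a #dt/e (/ :x 1)} :agg {:n nrow})
    (catch clojure.lang.ExceptionInfo e
      {:message (ex-message e)
       :data    (ex-data e)})))

;; Dispatch on the error kind instead of parsing the message string
(:dt/error (:data err))
```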
## Development
Tests run automatically on every push to `main` via GitHub Actions. CI runs the core test suites (core, concise, util, io, reshape, join, asof, stat) via `bin/run-tests.sh`. The nrepl, clerk, and clay test suites require optional deps and are run locally only. When adding a new core test namespace, add it to `bin/run-tests.sh` to include it in CI.
```bash
# Start nREPL
clj -A:nrepl
# Run core tests (same as CI)
bash bin/run-tests.sh
# Run all tests locally (including optional-dep suites)
clj -A:nrepl -e "
(load-file \"test/datajure/core_test.clj\")
(load-file \"test/datajure/concise_test.clj\")
(load-file \"test/datajure/util_test.clj\")
(load-file \"test/datajure/io_test.clj\")
(load-file \"test/datajure/reshape_test.clj\")
(load-file \"test/datajure/join_test.clj\")
(load-file \"test/datajure/asof_test.clj\")
(load-file \"test/datajure/nrepl_test.clj\")
(load-file \"test/datajure/clerk_test.clj\")
(load-file \"test/datajure/clay_test.clj\")
(load-file \"test/datajure/stat_test.clj\")
(clojure.test/run-tests
'datajure.core-test 'datajure.concise-test 'datajure.util-test
'datajure.io-test 'datajure.reshape-test 'datajure.join-test
'datajure.asof-test 'datajure.nrepl-test 'datajure.clerk-test
'datajure.clay-test 'datajure.stat-test)"
```
The full local suite comprises 318 tests and 1093 assertions; the CI subset covers 276 tests and 989 assertions.
## Prior Work
Datajure v1 was a routing layer across three backends (tablecloth, clojask, geni/Spark). v2 takes a different approach: a single, opinionated syntax layer directly on tech.v3.dataset, stealing good ideas from data.table (query form), q/kdb+ (time-series primitives), Polars (expressions as values), and DataFramesMeta.jl (one function, keyword arguments).
- v1 repo: https://github.com/clojure-finance/datajure/tree/v1
Special thanks to [YANG Ming-Tian](https://github.com/skylee03) for the original v1 implementation.
## License
Copyright © 2024–2026 Centre for Investment Management, HKU Business School.
Distributed under the Eclipse Public License version 2.0.