{"id":50794170,"url":"https://github.com/singhpratech/samkhya","last_synced_at":"2026-06-12T13:30:19.185Z","repository":{"id":358563849,"uuid":"1240429591","full_name":"singhpratech/samkhya","owner":"singhpratech","description":null,"archived":false,"fork":false,"pushed_at":"2026-05-18T01:44:08.000Z","size":624,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-18T02:17:51.493Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/singhpratech.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE-APACHE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-16T05:54:12.000Z","updated_at":"2026-05-18T01:49:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/singhpratech/samkhya","commit_stats":null,"previous_names":["singhpratech/samkhya"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/singhpratech/samkhya","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singhpratech%2Fsamkhya","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singhpratech%2Fsamkhya/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singhpratech%2Fsamkhya/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singhpratech%2Fs
amkhya/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/singhpratech","download_url":"https://codeload.github.com/singhpratech/samkhya/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/singhpratech%2Fsamkhya/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34247459,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-12T13:30:17.986Z","updated_at":"2026-06-12T13:30:19.174Z","avatar_url":"https://github.com/singhpratech.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# samkhya — सांख्य\n\n\u003e **samkhya is the engine-agnostic Rust SDK for feedback-driven cardinality correction in embedded analytical engines.** Plug GBT, TabPFN-2.5, or any LLM as your corrector backend. Measured **40.95× wallclock speedup on star-5 join topologies** (BCa 95% CI [30.93, 47.45], Wilcoxon p=1.73×10⁻⁶) over native DataFusion 46 LpBound tightness; provably-tighter `LpJoinBound` theorem (strict over AGM, p\u003c10⁻⁵ every cell). 
**13-crate SDK**: DataFusion, DuckDB, Polars, Postgres, Iceberg, Arrow, GPU, Python.\n\n[![CI](https://github.com/singhpratech/samkhya/workflows/CI/badge.svg)](https://github.com/singhpratech/samkhya/actions)\n[![crates.io](https://img.shields.io/crates/v/samkhya-core.svg)](https://crates.io/crates/samkhya-core)\n[![docs.rs](https://img.shields.io/docsrs/samkhya-core)](https://docs.rs/samkhya-core)\n[![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](#license)\n\nThe name is the Sanskrit word सांख्य — *\"enumeration / counting\"* — a classical\ndarshana whose discipline is counting reality's constituents honestly. The\nlibrary's only job is to make row counts accurate for the engines that have\nbeen left without an answer: DuckDB, DataFusion, Polars, Postgres, Iceberg,\nand gpudb.\n\n---\n\n## Why samkhya?\n\n- **Portability via Iceberg Puffin sidecars.** Classical sketches (HLL, Bloom,\n  Count-Min, equi-depth histogram, 2D correlated histogram) are serialized to\n  versioned, `KIND`-tagged blobs inside [Iceberg Puffin](https://iceberg.apache.org/puffin-spec/)\n  files. The same sidecar a Python ELT job writes at midnight is the sidecar\n  DataFusion reads at noon and DuckDB reads at three. No engine owns the stats;\n  the sidecar does.\n- **Safety via the LpBound clamp.** Every corrected estimate is bounded above\n  by a provable pessimistic ceiling derived from Zhang et al., SIGMOD 2025 Best\n  Paper — LP relaxation over ℓp-norms of degree sequences, no machine learning\n  involved. Cold start equals the native plan or better, never worse.\n- **Pluggable corrector backend (GBT default · TabPFN-2.5 · LLM-pluggable, all shipping).**\n  The `Corrector` trait is the pluggable surface and the *contribution*: one\n  trait, multiple production backends. Default ships a sub-MB\n  gradient-boosted-tree backend (gbdt-rs, Baidu). 
TabPFN-2.5 (Hollmann ICLR\n  2023 + Prior Labs 2026 update) opt-in behind `tabpfn_http` feature —\n  **measured P95 31.15 ms at B=8 L=128 on RTX 4090 Laptop, BCa 95% CI [29.39,\n  35.32]**, q-error reduction 7.84% vs GBT on synthetic. **LLM-pluggable\n  HTTP corrector ships dual transport in v1.0:** canonical Python FastAPI\n  server (`samkhya-gpudb/scripts/llm_infer_server.py`, port 8766 —\n  this is what `bench-results/19_llm_corrector.md` §4.1 measured) and a\n  parity Node TypeScript port (`llm_infer_server.ts`, port 8767, same\n  wire contract, broader operator appeal). Four reference backends in\n  each: Anthropic, OpenAI, local Ollama, dummy. The TS port's 30-trial\n  paired benchmark campaign is a v1.1 item (smoke-tested at v1.0).\n  Every backend is gated behind a Cargo feature flag and capped from above\n  by the LpBound safety envelope.\n\nsamkhya is a library, not a service. No daemon, no background thread, no GPU\nrequirement in the default build. The entire workspace builds in under two\nminutes on a laptop with no network access.\n\n---\n\n## Quick start\n\nAdd the core crate to a Rust project:\n\n```bash\ncargo add samkhya-core\n```\n\nBuild a Puffin sidecar from a column in five lines:\n\n```rust\nuse samkhya_core::sketches::{HllSketch, Sketch};\nuse samkhya_core::puffin::PuffinWriter;\n\nlet mut hll = HllSketch::new(12)?;\nfor v in \u0026column { hll.add(v); }\nlet mut w = PuffinWriter::create(\"orders.puffin\")?;\nw.add_blob(HllSketch::KIND, \u0026hll.to_bytes()?)?;\nw.finish()?;\n```\n\nConsume those stats from DataFusion via the table-provider adapter:\n\n```rust\nuse std::sync::Arc;\n\nuse datafusion::prelude::SessionContext;\nuse samkhya_datafusion::{SamkhyaTableProvider, SamkhyaOptimizerRule};\n\nlet ctx = SessionContext::new();\nctx.state().add_optimizer_rule(Arc::new(SamkhyaOptimizerRule::default()));\nlet provider = SamkhyaTableProvider::wrap(inner_provider)\n    .with_puffin_sidecar(\"orders.puffin\")?;\nctx.register_table(\"orders\", 
Arc::new(provider))?;\n```\n\nThe `samkhya_leaves_seen` diagnostic on the optimizer rule confirms the\ncorrected stats reached the physical plan.\n\n---\n\n## What's in 1.0\n\n**Thirteen crates** in one Cargo workspace. Licensed under Apache-2.0\n(explicit patent grant per §3). Edition 2024. MSRV Rust 1.85; CI tests\non 1.94 (the pinned project toolchain).\n\nLayer 1 — portable stats foundation:\n- `samkhya-core` — portable stats layer, feedback recorder, LpBound envelope,\n  `Corrector` trait. No engine dependencies. 5 sketches all shipping: HLL,\n  Bloom, Count-Min, equi-depth histogram, 2D correlated histogram.\n\nLayer 2 — engine adapters (5 production engines + 2 reservations):\n- `samkhya-datafusion` — `SamkhyaTableProvider` + `SamkhyaStatsExec` +\n  `SamkhyaOptimizerRule` three-layer integration into DataFusion 46.\n- `samkhya-duckdb` — Rust-client integration against DuckDB 1.x via\n  `bundled` feature flag.\n- `samkhya-duckdb-ext` — cxx extension scaffold (staticlib+rlib in v1.0;\n  cdylib + runtime LOAD waits on DuckDB Issue #11638).\n- `samkhya-polars` — Series-to-sketch helpers + `lazy_collect_with_feedback`\n  on polars 0.44, behind `engine` feature.\n- `samkhya-postgres` — pgrx-shaped extension, double-gated behind\n  `pg_extension` feature + `samkhya_pgrx_enabled` rustc cfg (pg17 pin per\n  WAVE5-A).\n- `samkhya-iceberg` — Puffin sidecar reader/writer with KIND-tag registration.\n- `samkhya-arrow` — Arrow IPC round-trip helpers for all 5 sketch types.\n\nLayer 3 — corrector backends + GPU + Python:\n- `samkhya-gpudb` — Layer 4 reservation. `GpuCorrector` trait +\n  `CpuFallbackCorrector` reference impl. TabPFN-2.5 backend via opt-in\n  HTTP transport (`tabpfn_http` feature). **LLM-pluggable HTTP\n  corrector** ships dual transport (Python FastAPI + Node TypeScript,\n  same wire contract) under `scripts/llm_infer_server.{py,ts}`, with\n  Anthropic / OpenAI / local Ollama / dummy backends for each. 
See\n  `bench-results/19_llm_corrector.md` for the end-to-end campaign.\n- `samkhya-py` — PyO3 0.22 bindings, single abi3-py39 wheel, published to\n  PyPI as `samkhya`.\n\nLayer 4 — tools:\n- `samkhya-cli` — single-binary evaluator: `build`, `decode`, `stats`,\n  `info`, `compare`.\n- `samkhya-bench` — clap CLI: `list-queries`, `run`, `compare`, `report`,\n  `train`, `calibrate`, `build-puffin`.\n- `samkhya-it` — cross-crate integration test harness (`publish = false`).\n\nWorkspace clippy `-D warnings` clean. ~266 `#[test]` blocks + 17 property\ntests; cargo-fuzz workspace (~31 M execs, 0 crashes); criterion\nmicrobenchmarks for sketches and Puffin I/O.\n\n---\n\n## Measured headlines (WAVE4-F + WAVE5-L2)\n\nsamkhya v1.0 reports the *honest* head-to-head measurement, not a projection.\n\n| Headline | Measured | CI / significance | Receipt |\n|---|---|---|---|\n| **LpJoinBound vs AGM on star-5, p=1** | **40.95×** speedup | BCa 95% CI [30.93, 47.45]; Wilcoxon W=0 paired vs AGM p=1.73×10⁻⁶, n=30 | `bench-results/07_lpbound_tightness.md` |\n| **JOB-Slow head-to-head vs DataFusion 46 (n=55 paired warm-cache, SF=1 IMDb)** | geomean **1.038×** wallclock; **17 wins / 38 ties / 0 losses**; BH-FDR rejects 24/55 | BCa 95% CI [1.026, 1.056]; Wilcoxon W=212 p=3.00×10⁻⁶ | `bench-results/18_vs_native_datafusion_wallclock.md` (WAVE4-F) |\n| **TabPFN-2.5 inference latency** (RTX 4090 Laptop, B=8 L=128) | P95 **31.15 ms** (H1-A PASS) | BCa 95% CI [29.39, 35.32], strictly below 50 ms bar | `bench-results/14_tabpfn_4090_latency.md` (WAVE5-L2) |\n| **HLL precision** (p=14, n=10⁶) | RSE **0.676%** | BCa 95% CI [0.535%, 0.848%] vs Flajolet 2007 0.8125% envelope | `bench-results/03_hll_precision_sweep.md` |\n| **L4 v3 ablation** (A2→A3) | **−1.7%** median q-error reduction (BH-sig improvement) | BCa 95% CI [−2.8%, −0.7%], Wilcoxon p=0.0209 | WAVE5-E |\n\n**Honest disclosures.** Pre-registered JOB-Slow upper bounds (≥1.6× join-heavy, ≥1.35× aggregate, ≥1.50× headline) all **FALSIFIED** by 
WAVE4-F. The corrector path is statistically real but the effect size is small; attributions are named in `bench-results/EVIDENCE.md` §4.2 (warm-cache only, CSV-not-Parquet, n=2 budget cap, OOM past q16a). TabPFN-2.5 q-error reduction over GBT is 7.84% (BCa [2.21, 14.62], p=1.04×10⁻⁵) — effect-direction confirmed, magnitude half the 15% pre-reg target (H1-B FALSIFIED on magnitude).\n\n### The 1000 → 42 demonstration (kept for the mechanism it proves)\n\nWithout samkhya, a 1000-row table wrapped only in DataFusion 46's default\nTableProvider reports `num_rows = 1000` to the physical plan. Wrap the same\nprovider with `SamkhyaTableProvider` plus the optimizer rule, and the physical\nplan reports `num_rows = 42`. The `stats_propagation_demo` example prints:\n*\"without rule: 1000, with rule: 42\"* — proving the corrected estimate, clamped\nby LpBound, propagates through `SamkhyaStatsExec::statistics()`. Mechanism, not\nheadline.\n\n---\n\n## Architecture\n\nThe five layers — each replaceable, each failing safely toward the engine's\nnative plan:\n\n```\n+----------------------------------------------------------------+\n| Layer 5  Pluggable corrector backend  (Corrector trait surface)\n|          GBT default · TabPFN-2.5 opt-in · LLM dual transport  |\n|          (FastAPI :8766 + TypeScript :8767), all shipping v1.0 |\n+----------------------------------------------------------------+\n| Layer 4  GPU Batch Inference  (optional, via gpudb)            |\n|          one CUDA / Metal launch scores thousands of subplans  |\n+----------------------------------------------------------------+\n| Layer 3  LpBound Envelope  (NEVER REGRESS)                     |\n|          provable upper bound; corrections clamped from above  |\n+----------------------------------------------------------------+\n| Layer 2  Feedback Recorder  (LEO / Bao / AutoSteer pattern)    |\n|          SQLite (plan, estimate, actual); residual GBT trained 
|\n+----------------------------------------------------------------+\n| Layer 1  Portable Stats  (Iceberg Puffin + classical sketches) |\n|          HLL / Bloom / CMS / equi-depth / correlated2D         |\n+----------------------------------------------------------------+\n```\n\nSee [ARCHITECTURE.md](./ARCHITECTURE.md) for the full developer-facing design,\nincluding data-flow diagrams and the `samkhya-core` module map.\n\n---\n\n## Cross-engine matrix\n\n| Engine     | Adapter             | Status        | Notes                                                                  |\n|------------|---------------------|---------------|------------------------------------------------------------------------|\n| DataFusion | `samkhya-datafusion`| Production    | Three-layer integration against DataFusion 46; first-class target.     |\n| DuckDB     | `samkhya-duckdb` / `samkhya-duckdb-ext` | Beta + scaffold | Rust-client path behind `bundled`; cxx extension v1.0 staticlib+rlib only; cdylib + runtime LOAD waits on DuckDB Issue #11638. |\n| Polars     | `samkhya-polars`    | Beta          | Series-to-sketch helpers behind `engine`; optimizer hook pending upstream Polars Issue #23345. |\n| Postgres   | `samkhya-postgres`  | Scaffold      | pgrx-shaped stub. Double-gated behind `pg_extension` feature + `samkhya_pgrx_enabled` rustc cfg, pg17 pin (per WAVE5-A); real planner / executor hooks v1.1 after pgrx ≥ 0.13. |\n| Iceberg    | `samkhya-iceberg`   | Production    | Puffin sidecar reader/writer with KIND-tag registration for all 5 sketch types. |\n| Arrow      | `samkhya-arrow`     | Production    | Arrow IPC round-trip helpers; byte-identical for all 5 sketch types. |\n| GPU        | `samkhya-gpudb`     | CPU prod + GPU opt-in | `GpuCorrector` trait + `CpuFallbackCorrector` reference impl. TabPFN-2.5 HTTP backend via `tabpfn_http` feature (measured P95 31.15 ms on RTX 4090 Laptop). 
LLM-pluggable HTTP corrector dual transport — Python FastAPI :8766 + Node TypeScript :8767, same wire contract. |\n\n---\n\n## Documentation\n\nPublic, tracked files only:\n\n- **v1.0 launch — first published on The AI Vibe:**\n  - **[Launch blog post](https://theaivibe.org/blog/samkhya-portable-cardinality-correction-rust-sdk-launch)**\n    — \"The Stats Layer Embedded Databases Have Been Waiting Eight Years\n    For.\" Punchy, narrative-first, ~10 min read. Start here.\n  - **[Formal publication page](https://theaivibe.org/publications/samkhya-portable-feedback-driven-cardinality-correction-embedded-analytics)**\n    — academic-titled companion: motivation, architecture, the honest\n    1.038× falsification, what samkhya is actually for.\n- [ARCHITECTURE.md](./ARCHITECTURE.md) — five-layer design, crate layout, data\n  flow, integration surfaces, safety guarantees, glossary.\n- [SECURITY.md](./SECURITY.md) — supported versions, disclosure policy, and\n  the GitHub Security Advisories channel.\n- [CHANGELOG.md](./CHANGELOG.md) — release history (v0.0.1 → v1.0.0).\n- [CONTRIBUTING.md](./CONTRIBUTING.md) — how to file bugs, PRs, and run the\n  test suite.\n- [REPRODUCIBILITY.md](./REPRODUCIBILITY.md) — ACM AE v1.1 reviewer entry,\n  5-step reproducer workflow.\n- [CITATION.cff](./CITATION.cff) — academic citation metadata (cff-1.2.0).\n\nSource repository: \u003chttps://github.com/singhpratech/samkhya\u003e.\n\n---\n\n## Prior work, fairly framed\n\nsamkhya stands on the shoulders of a substantial body of cardinality-estimation\nresearch — MSCN, Naru, NeuroCard, DeepDB, BayesCard, FLAT, FACE, Neo, Balsa,\nRTOS, Bao, AutoSteer, Lero, ALECE, ByteCard, PRICE, TiCard, LpBound. These are\nnot dead ends; they are prior attempts that hit the embedded-tier budget limit.\nThe 2018-2020 wave assumed a server-class DBMS with a long-lived optimizer\nprocess that could amortize a 40-300 MB model and 5-50 ms inference. 
The\nembedded reality — sub-50 ms cold start, sub-200 MB total memory, sub-ms\nper-estimate latency, single-query lifetimes — was outside that envelope. The\n2021-2022 critique papers (*\"Are We Ready For Learned CE?\"*, *\"In-depth Study\nof Learned CE\"*) were honest about the limitations; the production-database\nfield routed around them via adaptive query execution, a technique that is\nstructurally inapplicable to engines without a long-lived process.\n\nsamkhya's design exists to transcend the embedded-tier limitations: portable\nstats survive between sessions; the feedback recorder borrows the\n*observe-and-hint* pattern from Bao and AutoSteer (the only learned-QO pattern\nwith documented production deployment); the LpBound envelope makes cold-start\nsafety provable rather than aspirational; and the residual-correction interface\nis designed so a future foundation-model backend drops in without churn. The\nprior insights are the ones samkhya extends; the prior limitations are the\nones it is built to bypass.\n\n---\n\n## Security\n\nReport vulnerabilities through\n[GitHub Security Advisories](https://github.com/singhpratech/samkhya/security/advisories/new).\nDo not file public issues for security reports. The disclosure policy and the\nlist of supported versions are documented in [SECURITY.md](./SECURITY.md).\n\n---\n\n## License\n\nLicensed under **Apache License 2.0** (single license, explicit patent\ngrant per §3). Sole author: Prateek Singh.\n\nMatches the licensing posture of the surrounding analytical-engine\necosystem — DataFusion, Iceberg, ClickHouse, Apache Arrow itself — and\ngives every downstream user the same explicit patent grant rather than\nmaking it optional via a dual-license toggle. Full text in\n[LICENSE-APACHE](./LICENSE-APACHE).\n\n## Citations (industry-standard anchors)\n\n- Hollmann et al. 
— **TabPFN: Transformers solve small tabular problems.** ICLR 2023.\n- Atserias, Grohe, Marx — **Size bounds and query plans for relational joins.** PODS 2008.\n- Zhang et al. — **LpBound polynomial families.** SIGMOD 2025.\n- Leis et al. — **How good are query optimizers, really?** VLDB 2015 (Join Order Benchmark).\n- Moerkotte et al. — **Preventing bad plans by bounding the impact of cardinality estimation errors.** VLDB 2009 (q-error).\n- Efron \u0026 Tibshirani — **An Introduction to the Bootstrap**, ch. 14 (BCa). Chapman \u0026 Hall, 1993.\n- Wilcoxon — **Individual comparisons by ranking methods.** Biometrics Bulletin 1945.\n- Benjamini \u0026 Hochberg — **Controlling the false discovery rate.** JRSSB 1995.\n- Flajolet et al. — **HyperLogLog.** DMTCS 2007.\n- Bloom — **Space/time trade-offs in hash coding with allowable errors.** CACM 1970.\n- Cormode \u0026 Muthukrishnan — **An improved data stream summary: the Count-Min Sketch.** J. Algorithms 2005.\n- Ioannidis \u0026 Poosala — **Balancing histogram optimality and practicality.** SIGMOD 1996 (MaxDiff).\n- Jagadish et al. — **Optimal histograms with quality guarantees.** VLDB 1998 (V-Optimal).\n- Stillger et al. — **LEO: DB2's LEarning Optimizer.** SIGMOD 2001 (feedback-driven QO).\n- ACM Artifact Evaluation v1.1 — reproducibility-badge methodology.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsinghpratech%2Fsamkhya","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsinghpratech%2Fsamkhya","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsinghpratech%2Fsamkhya/lists"}