{"id":32370225,"url":"https://github.com/maltzsama/sumeh","last_synced_at":"2026-03-09T21:23:53.131Z","repository":{"id":263497696,"uuid":"887434569","full_name":"maltzsama/sumeh","owner":"maltzsama","description":"Sumeh — Unified Data Quality Framework Sumeh is a unified data quality validation framework supporting multiple backends (PySpark, Dask, Polars, DuckDB, Pandas) with centralized rule configuration.","archived":false,"fork":false,"pushed_at":"2026-03-09T13:04:19.000Z","size":2600,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-03-09T17:54:36.652Z","etag":null,"topics":["dask-dataframes","data","data-quality","data-quality-analysis","data-quality-assessment","data-quality-checks","data-quality-framework","data-quality-measurement","data-quality-report","duckdb","duckdb-extension","pandas","pandas-library","polars","polars-dataframe","polars-extensions","pyspark","pyspark-dataframes"],"latest_commit_sha":null,"homepage":"https://maltzsama.github.io/sumeh/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/maltzsama.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-11-12T18:31:25.000Z","updated_at":"2025-11-03T15:43:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"c3515433-685b-4635-8571-88ccc846c3ba","html_url":"https://github.com/maltzsama/sumeh","commit_stats":null,"previous_names":["maltzsama/sumeh"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/maltzsama/sumeh","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maltzsama%2Fsumeh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maltzsama%2Fsumeh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maltzsama%2Fsumeh/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maltzsama%2Fsumeh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/maltzsama","download_url":"https://codeload.github.com/maltzsama/sumeh/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/maltzsama%2Fsumeh/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30312174,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-09T20:05:46.299Z","status":"ssl_error","status_checked_at":"2026-03-09T19:57:04.425Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask-dataframes","data","data-quality","data-quality-analysis","data-quality-assessment","data-quality-checks","data-quality-framework","data-quality-measurement","data-quality-report","duckdb","duckdb-extension","pandas","pandas-library","polars","polars-dataframe","polars-extensions","pyspark","pyspark-dataframes"],"created_at":"2025-10-24T20:24:37.565Z","updated_at":"2026-03-09T21:23:53.098Z","avatar_url":"https://github.com/maltzsama.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Python Version](https://img.shields.io/badge/python-3.10%2B-blue.svg?logo=python\u0026logoColor=white)](https://www.python.org/downloads/)\n[![Build Status](https://github.com/maltzsama/sumeh/workflows/Publish%20Python%20Package/badge.svg)](https://github.com/maltzsama/sumeh/actions)\n[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg?logo=open-source-initiative\u0026logoColor=white)](https://opensource.org/licenses/Apache-2.0)\n[![Coverage](https://codecov.io/gh/maltzsama/sumeh/branch/main/graph/badge.svg)](https://codecov.io/gh/maltzsama/sumeh)\n[![Downloads](https://img.shields.io/pypi/dm/sumeh?logo=pypi\u0026logoColor=white)](https://pypi.org/project/sumeh/)\n[![PyPI Version](https://img.shields.io/pypi/v/sumeh?color=blue\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/sumeh/)\n[![Version](https://img.shields.io/github/v/release/maltzsama/sumeh?color=blue\u0026label=version\u0026logo=github)](https://github.com/maltzsama/sumeh/releases)\n\n# \u003ch1 style=\"display: flex; align-items: center; gap: 0.5rem;\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/maltzsama/sumeh/refs/heads/main/docs/img/sumeh.svg\" alt=\"Logo\" style=\"height: 40px; width: auto; vertical-align: middle;\" /\u003e \u003cspan\u003eSumeh DQ\u003c/span\u003e \u003c/h1\u003e\n\n**Unified Data Quality Validation Framework**\n\n\n\n*One API. Fifty-plus rules. Fourteen engines. Zero compromise.*\n\n[Documentation](https://maltzsama.github.io/sumeh/) · [PyPI](https://pypi.org/project/sumeh/) · [Changelog](CHANGELOG.md)\n\n---\n\n## What is Sumeh?\n\nData quality validation is a solved problem — until you have to run the same checks on Pandas today, migrate to PySpark next quarter, and push results to BigQuery in production. Every engine has its own API, its own quirks, and its own way of breaking.\n\nSumeh provides a single, consistent interface that compiles to whatever engine is underneath. You define rules once. You run them everywhere.\n\n```python\nfrom sumeh import pandas, polars, duckdb, bigquery\nfrom sumeh.core.rules.rule_model import RuleDefinition\n\nrules = [\n    RuleDefinition(field=\"user_id\",  check_type=\"is_unique\",      threshold=1.0),\n    RuleDefinition(field=\"email\",    check_type=\"is_complete\",     threshold=1.0),\n    RuleDefinition(field=\"email\",    check_type=\"has_pattern\",     value=r\"^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$\"),\n    RuleDefinition(field=\"age\",      check_type=\"is_between\",      min_value=18, max_value=120),\n    RuleDefinition(field=\"status\",   check_type=\"is_contained_in\", allowed_values=[\"active\",\"inactive\",\"pending\"]),\n    RuleDefinition(field=\"revenue\",  check_type=\"has_mean\",        value=50_000.0, threshold=0.1),\n]\n\n# These are interchangeable — same rules, same report, different engine underneath\nreport = pandas.validate(df, rules)\nreport = polars.validate(df, rules)\nreport = duckdb.validate(con=con, df=\"orders\", rules=rules)\nreport = bigquery.validate(client=bq_client, table=\"project.dataset.orders\", rules=rules)\n\nprint(f\"Pass rate: {report.pass_rate:.2%}\")  # 83.33%\nprint(f\"Failed:    {len(report.failed)} / {report.total_rules} rules\")\n\ngood_df, bad_df = report.split()  # bifurcate clean and quarantine data\n```\n\n---\n\n## What Changed in v2.0\n\nv2.0 is a complete rewrite. The architecture is different, the API is different, and there is no dependency on cuallee.\n\n| | v1.x | v2.0 |\n|---|---|---|\n| **API style** | `validate.pandas(df, rules)` | `from sumeh import pandas; pandas.validate(df, rules)` |\n| **Return type** | `(df_errors, violations, table_summary)` tuple | `ValidationReport` object |\n| **Rule class** | `RuleDef` | `RuleDefinition` |\n| **Engines** | 6 | 14 |\n| **SQL generation** | String concatenation | SQLGlot AST — zero injection risk |\n| **PySpark bifurcation** | `.collect()` on driver | `fail_condition` Column expressions — never collects |\n| **cuallee dependency** | Required | Removed |\n| **Open SQL mode** | ❌ | ✅ Generate SQL without executing |\n| **Profiler** | ❌ | ✅ Column-level statistics |\n| **OpenMetadata exporter** | ❌ | ✅ Zero-SDK payload generation |\n\n\u003e **Migrating from v1.x?** See the [Migration Guide](#migrating-from-v1x) at the bottom.\n\n---\n\n## Table of Contents\n\n- [ Sumeh DQ ](#-sumeh-dq-)\n  - [What is Sumeh?](#what-is-sumeh)\n  - [What Changed in v2.0](#what-changed-in-v20)\n  - [Table of Contents](#table-of-contents)\n  - [Installation](#installation)\n  - [Quickstart](#quickstart)\n  - [Engines](#engines)\n    - [Batch DataFrame Engines](#batch-dataframe-engines)\n    - [SQL Engines](#sql-engines)\n    - [Streaming \\\u0026 ML Engines](#streaming--ml-engines)\n  - [Validation Rules](#validation-rules)\n    - [Completeness](#completeness)\n    - [Uniqueness](#uniqueness)\n    - [Numeric \\\u0026 Comparison](#numeric--comparison)\n    - [Membership](#membership)\n    - [Pattern](#pattern)\n    - [Date](#date)\n    - [Custom SQL](#custom-sql)\n    - [Aggregations *(Table-level)*](#aggregations-table-level)\n    - [Schema](#schema)\n  - [Defining Rules](#defining-rules)\n    - [Loading from CSV](#loading-from-csv)\n  - [The ValidationReport](#the-validationreport)\n  - [Bifurcation](#bifurcation)\n  - [Open SQL Mode](#open-sql-mode)\n  - [Schema Validation](#schema-validation)\n    - [Schema Registry DDL](#schema-registry-ddl)\n    - [Extract and Validate](#extract-and-validate)\n  - [Data Profiling](#data-profiling)\n  - [OpenMetadata Integration](#openmetadata-integration)\n  - [SQL DDL Generator](#sql-ddl-generator)\n  - [CLI](#cli)\n  - [Architecture](#architecture)\n    - [Design Decisions](#design-decisions)\n  - [Migrating from v1.x](#migrating-from-v1x)\n    - [Import pattern](#import-pattern)\n    - [Rule class](#rule-class)\n- [Loading Rules from Databases](#loading-rules-from-databases)\n  - [PostgreSQL / MySQL (via Pandas \\\u0026 SQLAlchemy)](#postgresql--mysql-via-pandas--sqlalchemy)\n  - [BigQuery (via Google Cloud Client)](#bigquery-via-google-cloud-client)\n  - [DuckDB (Native)](#duckdb-native)\n  - [Passing Rules to the Engine](#passing-rules-to-the-engine)\n    - [Rules table DDL](#rules-table-ddl)\n    - [Working with results](#working-with-results)\n    - [cuallee](#cuallee)\n  - [Contributing](#contributing)\n  - [License](#license)\n\n---\n\n## Installation\n\n```bash\n# Core (Pandas is included by default)\npip install sumeh\n\n# Batch DataFrame Engines\npip install sumeh[polars]       # Polars\npip install sumeh[pyspark]      # Apache Spark\npip install sumeh[dask]         # Dask\n\n# SQL Engines\npip install sumeh[duckdb]       # DuckDB\npip install sumeh[bigquery]     # Google BigQuery\npip install sumeh[snowflake]    # Snowflake\npip install sumeh[redshift]     # Amazon Redshift\npip install sumeh[athena]       # Amazon Athena\npip install sumeh[trino]        # Trino\npip install sumeh[doris]        # Apache Doris\n\n# Streaming \u0026 ML Engines\npip install sumeh[pyflink]      # Apache Flink\npip install sumeh[ray]          # Ray Data\n\n# Example: Installing multiple engines at once\npip install sumeh[polars,duckdb,bigquery]\n\n```\n\n**Requirements:** Python 3.10+\n\n---\n\n## Quickstart\n\n```python\nfrom sumeh import pandas as sumeh_pandas\nfrom sumeh.core.rules.rule_model import RuleDefinition\nimport pandas as pd\n\ndf = pd.read_csv(\"customers.csv\")\n\nrules = [\n    RuleDefinition(field=\"customer_id\", check_type=\"is_unique\",      threshold=1.0),\n    RuleDefinition(field=\"email\",       check_type=\"is_complete\",     threshold=1.0),\n    RuleDefinition(field=\"age\",         check_type=\"is_positive\",     threshold=0.99),\n    RuleDefinition(field=\"country\",     check_type=\"is_contained_in\", allowed_values=[\"BR\",\"US\",\"DE\",\"FR\"]),\n    RuleDefinition(field=\"revenue\",     check_type=\"has_mean\",        value=3_500.0, threshold=0.15),\n]\n\nreport = sumeh_pandas.validate(df, rules)\n\n# Summary\nprint(f\"Pass rate: {report.pass_rate:.2%}\")\nfor r in report.failed:\n    print(f\"  ✗ [{r.check_type}] {r.field} — {r.message}\")\n\n# Annotated DataFrame (_dq_errors column added per row)\nannotated = report.df\n\n# Split into clean and quarantine\ngood_df, bad_df = report.split()\nbad_df.to_parquet(\"quarantine/customers.parquet\")\n```\n\n---\n\n## Engines\n\nSumeh supports **fourteen engines** across three tiers. Every engine exposes the same `validate()` function and returns the same `ValidationReport`.\n\n### Batch DataFrame Engines\n\n| Engine | Import | Bifurcation | Notes |\n|--------|--------|:-----------:|-------|\n| **Pandas** | `from sumeh import pandas` | ✅ | Boolean mask bifurcation |\n| **Polars** | `from sumeh import polars` | ✅ | Rust-powered; `list.len()` bifurcation |\n| **PySpark** | `from sumeh import pyspark` | ✅ | `fail_condition` Column expressions — no `.collect()` |\n| **Dask** | `from sumeh import dask` | ✅ | Out-of-core parallel computing |\n\n### SQL Engines\n\nAll SQL engines share the `sql_core` compiler. Queries are built as SQLGlot AST and compiled to the target dialect at call time.\n\n| Engine | Import | Bifurcation | Notes |\n|--------|--------|:-----------:|-------|\n| **DuckDB** | `from sumeh import duckdb` | ✅ | Embedded; in-process SQL bifurcation |\n| **BigQuery** | `from sumeh import bigquery` | — | Pushes compiled SQL to BQ |\n| **Snowflake** | `from sumeh import snowflake` | — | Aggregation mode |\n| **Redshift** | `from sumeh import redshift` | — | Aggregation mode |\n| **Athena** | `from sumeh import athena` | — | Serverless S3 queries |\n| **Trino** | `from sumeh import trino` | — | Distributed SQL federation |\n| **Apache Doris** | `from sumeh import doris` | — | Real-time OLAP |\n| **Generic SQL** | `from sumeh import sql_core` | — | Query generation without execution |\n\n### Streaming \u0026 ML Engines\n\n| Engine | Import | Notes |\n|--------|--------|-------|\n| **PyFlink** | `from sumeh import pyflink` | Unbounded streams; row-level rules only |\n| **Ray Data** | `from sumeh import ray_data` | ML/AI pipelines; GPU acceleration |\n\n\u003e **Streaming note:** Table-level aggregation rules (`has_mean`, `has_cardinality`, etc.) require a full dataset. They are not compatible with unbounded streaming sources.\n\n---\n\n## Validation Rules\n\nSumeh ships **50+ rules** in 7 categories. Every rule is defined in `sumeh/core/rules/manifest.json` and runs on every engine.\n\n### Completeness\n\n| Rule | Description |\n|------|-------------|\n| `is_complete` | No null values in column |\n| `are_complete` | All specified columns are non-null |\n\n### Uniqueness\n\n| Rule | Description |\n|------|-------------|\n| `is_unique` | All values in column are distinct |\n| `are_unique` | Combination of columns is globally unique |\n| `is_primary_key` | Alias for `is_unique` |\n| `is_composite_key` | Alias for `are_unique` |\n\n### Numeric \u0026 Comparison\n\n| Rule | Description |\n|------|-------------|\n| `is_positive` | Value \u003e 0 |\n| `is_negative` | Value \u003c 0 |\n| `is_equal` | Value == `value` |\n| `is_equal_than` | Value == another column |\n| `is_greater_than` | Value \u003e `value` |\n| `is_less_than` | Value \u003c `value` |\n| `is_greater_or_equal_than` | Value \u003e= `value` |\n| `is_less_or_equal_than` | Value \u003c= `value` |\n| `is_between` | `min_value` \u003c= value \u003c= `max_value` |\n| `is_in_millions` | Value \u003e= 1,000,000 |\n| `is_in_billions` | Value \u003e= 1,000,000,000 |\n\n### Membership\n\n| Rule | Description |\n|------|-------------|\n| `is_contained_in` / `is_in` | Value is in an allowed set |\n| `not_contained_in` / `not_in` | Value is not in a forbidden set |\n\n### Pattern\n\n| Rule | Description |\n|------|-------------|\n| `has_pattern` | Value matches a regex |\n| `is_legit` | Value is non-null and non-whitespace |\n\n### Date\n\n| Rule | Description |\n|------|-------------|\n| `is_today` | Date equals today |\n| `is_yesterday` / `is_t_minus_1` | Date equals T-1 |\n| `is_t_minus_2` | Date equals T-2 |\n| `is_t_minus_3` | Date equals T-3 |\n| `is_past_date` | Date is before today |\n| `is_future_date` | Date is after today |\n| `is_date_between` | Date within a range |\n| `is_date_after` | Date after a reference |\n| `is_date_before` | Date before a reference |\n| `is_on_weekday` | Date falls on Mon–Fri |\n| `is_on_weekend` | Date falls on Sat–Sun |\n| `is_on_monday` … `is_on_sunday` | Date falls on a specific day of the week |\n| `validate_date_format` | Date string matches expected format |\n| `all_date_checks` | Runs the full date validation suite |\n\n### Custom SQL\n\n| Rule | Description |\n|------|-------------|\n| `satisfies` | Arbitrary SQL condition, e.g. `\"age \u003e= 18 AND status != 'banned'\"` |\n\n### Aggregations *(Table-level)*\n\nThese check a single metric across the full column. A rule passes when the measured value is within `threshold` percent of `value`.\n\n| Rule | Metric |\n|------|--------|\n| `has_min` | Column minimum |\n| `has_max` | Column maximum |\n| `has_mean` | Column mean |\n| `has_sum` | Column sum |\n| `has_std` | Standard deviation |\n| `has_cardinality` | Count of distinct values |\n| `has_entropy` | Shannon entropy |\n| `has_infogain` | Information gain |\n\n### Schema\n\n| Rule | Description |\n|------|-------------|\n| `validate_schema` | Validates DataFrame structure against a registered schema |\n\n---\n\n## Defining Rules\n\nRules are `RuleDefinition` dataclasses.\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `field` | `str \\| list[str]` | Column(s) to validate. Use a list for multi-column rules. |\n| `check_type` | `str` | Rule identifier from the manifest |\n| `threshold` | `float` | For row-level rules: minimum pass rate (0.0–1.0). For aggregations: maximum allowed relative deviation from `value`. Default `1.0`. |\n| `value` | `Any` | Expected value for aggregation and pattern rules |\n| `min_value` / `max_value` | `Any` | Bounds for `is_between` and range rules |\n| `allowed_values` | `list` | Allowed set for membership rules |\n| `execute` | `bool` | `False` to skip without removing the rule |\n\n```python\nfrom sumeh.core.rules.rule_model import RuleDefinition\n\n# Row-level: threshold = minimum pass rate across all rows\nRuleDefinition(field=\"email\",        check_type=\"is_complete\",     threshold=1.0)\nRuleDefinition(field=[\"name\",\"dob\"], check_type=\"are_complete\",    threshold=0.95)\nRuleDefinition(field=\"user_id\",      check_type=\"is_unique\",       threshold=1.0)\nRuleDefinition(field=[\"id\",\"date\"],  check_type=\"are_unique\",      threshold=1.0)\nRuleDefinition(field=\"age\",          check_type=\"is_positive\",     threshold=0.99)\nRuleDefinition(field=\"score\",        check_type=\"is_between\",      min_value=0, max_value=100)\nRuleDefinition(field=\"status\",       check_type=\"is_contained_in\", allowed_values=[\"A\",\"B\",\"C\"])\nRuleDefinition(field=\"email\",        check_type=\"has_pattern\",     value=r\"^[\\w.-]+@[\\w.-]+\\.[a-zA-Z]{2,}$\")\nRuleDefinition(field=\"created_at\",   check_type=\"is_past_date\",    threshold=1.0)\nRuleDefinition(field=\"*\",            check_type=\"satisfies\",       value=\"age \u003e= 18 AND status != 'banned'\")\n\n# Table-level: threshold = allowed relative deviation from expected value\n# threshold=0.1 → actual metric must be within ±10% of value\nRuleDefinition(field=\"age\",          check_type=\"has_mean\",        value=35.0,    threshold=0.10)\nRuleDefinition(field=\"salary\",       check_type=\"has_min\",         value=1_000.0, threshold=0.05)\nRuleDefinition(field=\"category\",     check_type=\"has_cardinality\", value=5,       threshold=0.0)\n```\n\n### Loading from CSV\n\n```csv\nfield,check_type,threshold,value,execute\ncustomer_id,is_unique,1.0,,true\nemail,is_complete,1.0,,true\nemail,has_pattern,1.0,\"^[\\w\\.-]+@[\\w\\.-]+\\.[a-zA-Z]{2,}$\",true\nage,is_between,1.0,\"[18, 120]\",true\nstatus,is_contained_in,1.0,\"['active','inactive','pending']\",true\n\"[first_name,last_name]\",are_complete,0.95,,true\nsalary,has_mean,0.1,50000,true\n```\n\n```python\nfrom sumeh.config.csv import load_rules_csv\nrules = load_rules_csv(\"rules.csv\")\n```\n\nValues are automatically parsed to the correct Python type (int, float, list, regex, date range). Multi-column fields use the `\"[col1,col2]\"` notation.\n\n---\n\n## The ValidationReport\n\nEvery `validate()` call returns a `ValidationReport`. This is the core object of v2.0.\n\n```python\nreport = pandas.validate(df, rules)\n\n# Aggregate metrics\nreport.pass_rate           # float — fraction of rules that passed\nreport.total_rules         # int\nreport.passed              # list[ValidationResult]\nreport.failed              # list[ValidationResult]\n\n# Per-rule results\nfor result in report.results:\n    result.check_type      # \"is_complete\"\n    result.field           # \"email\"\n    result.status          # ValidationStatus.PASS | FAIL\n    result.pass_rate       # 0.973 — 97.3% of rows passed\n    result.actual_value    # measured metric\n    result.expected_value  # expected metric\n    result.message         # \"27 null values found in 1000 rows\"\n\n# Annotated DataFrame: adds _dq_errors column to each row\nannotated_df = report.df\n\n```\n\n---\n\n## Bifurcation\n\nBifurcation splits the validated dataset into **clean rows** and **quarantine rows** in a single pass. No double scanning, no extra joins.\n\n```python\nreport = pandas.validate(df, rules)\ngood_df, bad_df = report.split()\n\n# good_df — original columns, _dq_errors removed\n# bad_df  — original columns + _dq_errors (list of failed rule names per row)\n\nbad_df.to_parquet(\"quarantine/2024-06-01.parquet\")\n```\n\nThe same pattern works across all bifurcation-capable engines:\n\n```python\n# Polars\ngood_df, bad_df = polars.validate(df, rules).split()\n# → pl.DataFrame, pl.DataFrame\n\n# PySpark — fully lazy, no .collect(), no driver OOM\nreport = pyspark.validate(spark, df, rules)\ngood_df, bad_df = report.split()\ngood_df.write.parquet(\"s3://bucket/clean/\")\nbad_df.write.parquet(\"s3://bucket/quarantine/\")\n\n# Dask\ngood_df, bad_df = dask.validate(df, rules).split()\n# → dask.DataFrame, dask.DataFrame\n\n# DuckDB — bifurcation at the SQL layer\nreport = duckdb.validate(con=con, df=\"stg_orders\", rules=rules, bifurcate=True)\ngood_rel, bad_rel = report.split()\n# → DuckDBPyRelation, DuckDBPyRelation\ngood_rel.write_parquet(\"clean.parquet\")\n```\n\n\u003e Aggregation rules (`has_mean`, `has_cardinality`, etc.) are table-level checks — they are evaluated and reported, but they have no row-level counterpart and do not affect which rows end up in `bad_df`.\n\n---\n\n## Open SQL Mode\n\nGenerate the full validation SQL for any dialect without executing it. Useful for auditing, CI dry-runs, or submitting to an external scheduler.\n\n```python\nfrom sumeh import sql_core\nfrom sumeh.core.rules.rule_model import RuleDefinition\n\nrules = [\n    RuleDefinition(field=\"user_id\", check_type=\"is_unique\",   threshold=1.0),\n    RuleDefinition(field=\"email\",   check_type=\"is_complete\", threshold=1.0),\n    RuleDefinition(field=\"age\",     check_type=\"has_mean\",    value=35.0, threshold=0.1),\n]\n\nsql = sql_core.get_validation_sql(\n    table=\"bronze.stg_transactions\",\n    rules=rules,\n    dialect=\"bigquery\",  # \"snowflake\", \"duckdb\", \"trino\", \"spark\", \"postgres\", ...\n)\n\nprint(sql)\n# SELECT\n#   CAST(COUNT(user_id) AS FLOAT64) / COUNT(*) AS is_unique__user_id,\n#   CAST(COUNT(email) AS FLOAT64) / COUNT(*) AS is_complete__email,\n#   AVG(age) AS has_mean__age\n# FROM bronze.stg_transactions\n```\n\nAll SQL is built through SQLGlot's AST — no string concatenation, no injection surface, full dialect normalization.\n\n---\n\n## Schema Validation\n\nValidate the structure of a DataFrame against a schema stored in any supported backend.\n\n### Schema Registry DDL\n\n```sql\nCREATE TABLE schema_registry (\n    id            INTEGER PRIMARY KEY,\n    environment   VARCHAR(50),     -- 'prod', 'staging', 'dev'\n    source_type   VARCHAR(50),     -- 'bigquery', 'mysql', etc.\n    database_name VARCHAR(100),\n    catalog_name  VARCHAR(100),    -- Databricks Unity Catalog\n    schema_name   VARCHAR(100),    -- PostgreSQL schema\n    table_name    VARCHAR(100),\n    field         VARCHAR(100),\n    data_type     VARCHAR(50),\n    nullable      BOOLEAN,\n    max_length    INTEGER,\n    comment       TEXT,\n    created_at    TIMESTAMP,\n    updated_at    TIMESTAMP\n);\n```\n\n### Extract and Validate\n\n```python\nfrom sumeh import extract_schema, validate_schema, get_schema_config\nimport pandas as pd\n\ndf = pd.read_csv(\"users.csv\")\n\n# Extract what the DataFrame actually has\nactual_schema = extract_schema.pandas(df)\n\n# Load what it should have\nexpected_schema = get_schema_config.csv(\"schema_registry.csv\", table=\"users\")\n# or: get_schema_config.bigquery(project_id=..., dataset_id=..., table_id=\"users\")\n# or: get_schema_config.postgresql(host=..., schema=\"public\", table=\"users\")\n\n# Compare\nis_valid, errors = validate_schema.pandas(df, expected_schema)\n\nif not is_valid:\n    for field, error in errors:\n        print(f\"  ✗ {field}: {error}\")\n```\n\nAvailable for all engines: `extract_schema.{engine}()` and `validate_schema.{engine}()` for pandas, polars, pyspark, and duckdb.\n\n---\n\n## Data Profiling\n\nGenerate column-level statistics — without writing any validation rules.\n\n```python\nfrom sumeh.core.services.profiler import profile\n\nstats = profile(df)\n\nprint(stats[\"table_stats\"])\n# { \"total_rows\": 10000, \"columns_count\": 12 }\n\nfor col, s in stats[\"column_profiles\"].items():\n    print(f\"{col}: nulls={s['null_count']}, distinct={s['distinct_count']}, mean={s.get('mean')}\")\n```\n\nThe profiler output is directly consumable by the OpenMetadata exporter — see below.\n\n---\n\n## OpenMetadata Integration\n\nExport validation results and profiling statistics to OpenMetadata **without the `openmetadata-ingestion` SDK**.\n\n```python\nfrom sumeh.exporters.openmetadata import OpenMetadataExport\nfrom sumeh.core.services.profiler import profile\nimport requests\n\nexporter = OpenMetadataExport(table_fqn=\"iceberg.bronze.stg_transactions\")\n\n# --- Validation payloads ---\npayload = exporter.validation(report)\n\n# payload[\"definitions\"] → list of CreateTestCaseRequest dicts\nfor definition in payload[\"definitions\"]:\n    requests.post(\n        f\"{om_url}/api/v1/dataQuality/testCases\",\n        json=definition, headers=auth_headers\n    )\n\n# payload[\"results\"] → list of TestCaseResult dicts\nfor result in payload[\"results\"]:\n    fqn = result[\"test_case_fqn\"]\n    requests.post(\n        f\"{om_url}/api/v1/dataQuality/testCases/{fqn}/testCaseResult\",\n        json=result[\"payload\"], headers=auth_headers\n    )\n\n# --- Profiling payload ---\nstats = profile(df)\nprofile_payload = exporter.profile(stats)\nrequests.put(\n    f\"{om_url}/api/v1/tables/{table_id}/tableProfile\",\n    json=profile_payload, headers=auth_headers\n)\n```\n\nThe exporter is pure Python with zero I/O. It generates dicts. You own the HTTP calls, the auth, and the retry logic. Nothing is ever sent without your explicit call.\n\n---\n\n## SQL DDL Generator\n\nGenerate `rules` and `schema_registry` table DDL for 17+ SQL dialects.\n\n```python\nfrom sumeh.generators import SQLGenerator\n\n# PostgreSQL\nprint(SQLGenerator.generate(table=\"rules\", dialect=\"postgres\", schema=\"public\"))\n\n# BigQuery with partitioning and clustering\nprint(SQLGenerator.generate(\n    table=\"schema_registry\",\n    dialect=\"bigquery\",\n    schema=\"my_dataset\",\n    partition_by=\"DATE(created_at)\",\n    cluster_by=[\"table_name\", \"environment\"]\n))\n\n# Snowflake with clustering key\nprint(SQLGenerator.generate(\n    table=\"rules\",\n    dialect=\"snowflake\",\n    cluster_by=[\"environment\", \"table_name\"]\n))\n\n# Redshift with distribution and sort keys\nprint(SQLGenerator.generate(\n    table=\"rules\",\n    dialect=\"redshift\",\n    distkey=\"table_name\",\n    sortkey=[\"created_at\", \"environment\"]\n))\n\n# Transpile SQL between dialects\ntranspiled = SQLGenerator.transpile(\n    \"SELECT * FROM users WHERE created_at \u003e= CURRENT_DATE - 7\",\n    from_dialect=\"postgres\",\n    to_dialect=\"bigquery\"\n)\n\n# Introspect\nprint(SQLGenerator.list_dialects())  # ['athena', 'bigquery', 'databricks', 'duckdb', ...]\nprint(SQLGenerator.list_tables())    # ['rules', 'schema_registry']\n```\n\n---\n\n## CLI\n\n```bash\n# Validate a file — defaults to Pandas engine\nsumeh validate data.csv rules.csv\n\n# Choose engine\nsumeh validate data.parquet rules.csv --engine polars\nsumeh validate data.csv rules.csv     --engine duckdb --format json\n\n# Save clean and quarantine splits\nsumeh validate data.csv rules.csv \\\n  --output     clean/data.csv     \\\n  --quarantine quarantine/data.csv\n\n# CI/CD gate — exits with code 1 if any rule fails\nsumeh validate data.csv rules.csv --fail-on-error\n\n# Generate validation SQL without executing it\nsumeh sql rules.csv --table bronze.orders --dialect bigquery\nsumeh sql rules.csv --table bronze.orders --dialect snowflake\n\n# Schema operations\nsumeh schema extract  --data data.csv --output schema.json\nsumeh schema validate --data data.csv --registry schema_registry.csv\n\n# DDL generation\nsumeh ddl generate --table rules           --dialect postgres\nsumeh ddl generate --table schema_registry --dialect bigquery\n\n# Rule introspection\nsumeh rules list\nsumeh rules info is_complete\nsumeh rules search \"date\"\nsumeh rules template\n\n\n# System info\nsumeh info\n```\n\n---\n\n## Architecture\n\n```\nsumeh/\n│\n├── core/\n│   ├── base/\n│   │   └── protocols.py        # IDataFrameValidator, IExporter — engine contracts\n│   ├── models/\n│   │   ├── validation.py       # ValidationReport, ValidationResult, ValidationStatus\n│   │   └── metrics.py          # MetricResult\n│   ├── rules/\n│   │   ├── manifest.json       # 50+ rule definitions — single source of truth\n│   │   └── rule_model.py       # RuleDefinition dataclass\n│   ├── logic/\n│   │   └── comparators.py      # Constraint classes per category\n│   ├── services/\n│   │   └── profiler/           # Column-level statistics\n│   └── io.py                   # load_data / save_data helpers\n│\n├── engines/\n│   ├── sql_core/               # Shared SQL compilation layer\n│   │   ├── analyzers.py        # check_type → SQLGlot AST expression\n│   │   ├── compiler.py         # Assembles SELECT from a rule list\n│   │   ├── validator.py        # Maps SQL result row → ValidationResult\n│   │   └── registry.py         # check_type → (Analyzer, Constraint)\n│   │\n│   ├── pandas/                 # Boolean mask bifurcation\n│   ├── polars/                 # list.len() bifurcation\n│   ├── pyspark/                # fail_condition Column expressions — no .collect()\n│   ├── dask/                   # Out-of-core parallel\n│   ├── duckdb/                 # sql_core + in-process SQL bifurcation\n│   ├── bigquery/               # sql_core + BigQuery client\n│   ├── snowflake/              # sql_core + Snowflake connector\n│   ├── redshift/               # sql_core + Redshift\n│   ├── athena/                 # sql_core + Athena\n│   ├── trino/                  # sql_core + Trino\n│   ├── doris/                  # sql_core + Apache Doris\n│   ├── pyflink/                # PyFlink streaming UDF engine\n│   └── ray_data/               # Ray Data ML engine\n│\n├── config/                     # Rule loading backends\n├── exporters/\n│   └── openmetadata.py         # Zero-SDK OpenMetadata payload generator\n├── generators/\n│   ├── ddl.py                  # SQL DDL for 17+ dialects\n│   └── transpiler.py           # SQLGlot-based dialect transpiler\n└── cli/\n    └── commands/               # validate, sql, ddl, schema, rules\n```\n\n### Design Decisions\n\n**Namespace-first API.** `from sumeh import pandas; pandas.validate(df, rules)` — not `validate(df, rules, engine=\"pandas\")`. The engine is resolved at import time. Errors are immediate and specific. IDEs understand the full call signature. There is no internal string dispatcher routing at runtime.\n\n**Analyzer / Constraint separation.** An `Analyzer` knows how to measure a metric (compute the null rate as a SQLGlot expression, or as a vectorized Pandas operation). A `Constraint` knows how to decide whether that metric satisfies the rule. Both are small, independently testable, and replaceable. Adding a new check type means writing one of each.\n\n**SQLGlot AST for all SQL.** No SQL string concatenation anywhere in the codebase. Every expression is a typed SQLGlot AST node compiled to the target dialect at call time. This eliminates SQL injection, escape sequence bugs, and the silent drift that comes from dialect-specific string templates.\n\n**Fail-condition pattern for PySpark.** Analyzers return `Column` expressions applied lazily across the cluster. `.collect()` is never called — not even for sampling. Calling `.collect()` on a dataset of any real scale is a driver OOM waiting to happen. The full validation result is computed in a single distributed aggregation pass.\n\n**Single-pass bifurcation.** Validation and the good/bad row split happen in one scan. The `_dq_errors` array column is populated per row during the same pass that computes per-rule metrics. The dataset is never read twice.\n\n---\n\n## Migrating from v1.x\n\n### Import pattern\n\n```python\n# v1.x\nfrom sumeh import validate, summarize\ndf_errors, violations, table_summary = validate.pandas(df, rules)\n\n# v2.0\nfrom sumeh import pandas\nreport = pandas.validate(df, rules)\n```\n\n### Rule class\n\n```python\n# v1.x\nfrom sumeh.core.rules import RuleDef\nrule = RuleDef(field=\"email\", check_type=\"is_complete\", threshold=0.99)\n\n# v2.0\nfrom sumeh.core.rules.rule_model import RuleDefinition\nrule = RuleDefinition(field=\"email\", check_type=\"is_complete\", threshold=0.99)\n```\n\n# Loading Rules from Databases\n\nIn Sumeh v2.0, built-in proprietary database connectors (`get_rules_config`) have been removed to keep the core library lightweight and give you full control over connection management. \n\nYou now load rule configurations using standard Python data libraries (like Pandas, SQLAlchemy, or DuckDB) and map the results directly to the `RuleDefinition` dataclass.\n\n## PostgreSQL / MySQL (via Pandas \u0026 SQLAlchemy)\n\n```python\nimport pandas as pd\nfrom sqlalchemy import create_engine\nfrom sumeh.core.rules.rule_model import RuleDefinition\n\n# 1. Connect and query your rules table\nengine = create_engine(\"postgresql://user:secret@localhost:5432/mydb\")\nquery = \"SELECT * FROM public.dq_rules WHERE execute = true\"\ndf_rules = pd.read_sql(query, engine)\n\n# 2. Map DataFrame rows to RuleDefinition objects\nrules = [RuleDefinition(**row) for row in df_rules.to_dict(orient=\"records\")]\n\n```\n\n## BigQuery (via Google Cloud Client)\n\n```python\nfrom google.cloud import bigquery\nfrom sumeh.core.rules.rule_model import RuleDefinition\n\nclient = bigquery.Client(project=\"my-project\")\nquery = \"SELECT * FROM `my-project.my_dataset.dq_rules` WHERE execute = TRUE\"\n\n# Fetch results and convert directly\nrecords = client.query(query).result()\nrules = [RuleDefinition(**dict(row)) for row in records]\n\n```\n\n## DuckDB (Native)\n\n```python\nimport duckdb\nfrom sumeh.core.rules.rule_model import RuleDefinition\n\nconn = duckdb.connect(\"warehouse.db\")\n\n# Fetch directly as a Pandas DataFrame\ndf_rules = conn.execute(\"SELECT * FROM dq_rules WHERE execute = true\").df()\n\n# Map to dataclass\nrules = [RuleDefinition(**row) for row in df_rules.to_dict(orient=\"records\")]\n\n```\n\n## Passing Rules to the Engine\n\nOnce your `rules` list is populated, you pass it to any engine exactly the same way:\n\n```python\nfrom sumeh import pyspark\n\n# Validate using the loaded database rules\nreport = pyspark.validate(spark_session, df, rules)\n\n```\n\n### Rules table DDL\n\n```sql\nCREATE TABLE dq_rules (\n    id            INTEGER PRIMARY KEY,\n    environment   VARCHAR(50)  NOT NULL,\n    source_type   VARCHAR(50)  NOT NULL,\n    database_name VARCHAR(255) NOT NULL,\n    catalog_name  VARCHAR(255),\n    schema_name   VARCHAR(255),\n    table_name    VARCHAR(255) NOT NULL,\n    field         VARCHAR(255) NOT NULL,\n    level         VARCHAR(100) NOT NULL,   -- 'ROW' or 'TABLE'\n    category      VARCHAR(100) NOT NULL,   -- 'completeness', 'uniqueness', ...\n    check_type    VARCHAR(100) NOT NULL,\n    value         TEXT,\n    threshold     FLOAT        DEFAULT 1.0,\n    execute       BOOLEAN      DEFAULT TRUE,\n    created_at    TIMESTAMP    DEFAULT CURRENT_TIMESTAMP,\n    updated_at    TIMESTAMP\n);\n```\n\nGenerate this DDL for any dialect with `sumeh sql generate --table rules --dialect \u003cdialect\u003e`.\n\n---\n\n### Working with results\n\n```python\n# v1.x — unpack 3-tuple\ndf_errors, violations, table_summary = validate.pandas(df, rules)\nsummary = summarize.pandas((df_errors, violations, table_summary), rules, len(df))\n\n# v2.0 — everything lives on the report\nreport   = pandas.validate(df, rules)\nsummary  = report.summary()          # dict\ngood_df, bad_df = report.split()     # bifurcation\nannotated = report.df                # DataFrame with _dq_errors column\n```\n\n### cuallee\n\nThe `cuallee` backend has been removed. v2.0 has its own validation engine built from scratch.\n\n---\n\n## Contributing\n\n```bash\ngit clone https://github.com/maltzsama/sumeh.git\ncd sumeh\ngit checkout develop\npoetry install --with dev\n\n# All tests\npoetry run pytest\n\n# Engine-specific\npoetry run pytest tests/engines/test_pandas.py  -v\npoetry run pytest tests/engines/test_polars.py  -v\npoetry run pytest tests/engines/test_duckdb.py  -v\npoetry run pytest tests/engines/test_pyspark.py -v\n```\n\nTo add a new rule:\n\n1. Add the definition to `sumeh/core/rules/manifest.json`\n2. Implement an `Analyzer` in the target engine's `analyzers.py`\n3. Register it in the engine's `registry.py`\n4. Write tests\n\nReleases are automated via semantic-release. Merging to `main` with a conventional commit triggers versioning, changelog generation, and PyPI publishing via Trusted Publishers.\n\n---\n\n## License\n\nLicensed under the [Apache License 2.0](LICENSE).\n\n---\n\nBuilt by [@maltzsama](https://github.com/maltzsama)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaltzsama%2Fsumeh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmaltzsama%2Fsumeh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmaltzsama%2Fsumeh/lists"}