{"id":28135549,"url":"https://github.com/sparkdq-community/sparkdq","last_synced_at":"2025-07-02T21:05:34.211Z","repository":{"id":290004608,"uuid":"972481781","full_name":"sparkdq-community/sparkdq","owner":"sparkdq-community","description":"A declarative PySpark framework for row- and aggregate-level data quality validation.","archived":false,"fork":false,"pushed_at":"2025-05-28T16:39:42.000Z","size":7787,"stargazers_count":46,"open_issues_count":0,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-28T17:48:07.840Z","etag":null,"topics":["data-check","data-engineering","data-quality","data-validation","data-verification","dq-framework","pyspark","pyspark-validation","spark-data-quality"],"latest_commit_sha":null,"homepage":"https://sparkdq-community.github.io/sparkdq/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sparkdq-community.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-25T06:38:50.000Z","updated_at":"2025-05-28T16:38:47.000Z","dependencies_parsed_at":"2025-05-14T15:19:12.837Z","dependency_job_id":"6926b83c-185c-41f9-ab6c-66bd76e29ad2","html_url":"https://github.com/sparkdq-community/sparkdq","commit_stats":null,"previous_names":["sparkdq-community/sparkdq"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/sparkdq-community/sparkdq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparkdq-community%2Fsparkdq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparkdq-community%2Fsparkdq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparkdq-community%2Fsparkdq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparkdq-community%2Fsparkdq/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sparkdq-community","download_url":"https://codeload.github.com/sparkdq-community/sparkdq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sparkdq-community%2Fsparkdq/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263215288,"owners_count":23431894,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-check","data-engineering","data-quality","data-validation","data-verification","dq-framework","pyspark","pyspark-validation","spark-data-quality"],"created_at":"2025-05-14T15:19:05.566Z","updated_at":"2025-07-02T21:05:34.045Z","avatar_url":"https://github.com/sparkdq-community.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![CI Pipeline](https://github.com/sparkdq-community/sparkdq/actions/workflows/ci.yaml/badge.svg)](https://github.com/sparkdq-community/sparkdq/actions/workflows/ci.yaml)\n[![codecov](https://codecov.io/gh/sparkdq-community/sparkdq/branch/main/graph/badge.svg?token=3TVZE8J2DN)](https://codecov.io/gh/sparkdq-community/sparkdq)\n[![Docs](https://img.shields.io/badge/docs-online-green.svg)](https://sparkdq-community.github.io/sparkdq/)\n[![PyPI version](https://badge.fury.io/py/sparkdq.svg)](https://pypi.org/project/sparkdq/)\n[![Python Versions](https://img.shields.io/badge/python-3.10%20|%203.11%20|%203.12-blue.svg)](https://github.com/sparkdq-community/sparkdq)\n[![License: Apache-2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)\n\n# SparkDQ — Data Quality Validation for Apache Spark\n\nMost data quality frameworks weren’t designed with PySpark in mind. They aren’t Spark-native and often lack proper support for declarative pipelines. Instead of integrating seamlessly, they require you to build custom wrappers around them just to fit into production workflows. This adds complexity and makes your pipelines harder to maintain. On top of that, many frameworks only validate data after processing — so you can’t react dynamically or fail early when data issues occur.\n\n**SparkDQ** takes a different approach. It’s built specifically for PySpark — so you can define and run data quality checks directly inside your Spark pipelines, using Python. Whether you're validating incoming data, verifying outputs before persistence, or enforcing assumptions in your dataflow: SparkDQ helps you catch issues early, without adding complexity.\n\n\u003c!-- doc-link-start --\u003e\n🚀  See the [official documentation](https://sparkdq-community.github.io/sparkdq/) to learn more.\n\u003c!-- doc-link-end --\u003e\n\n## Quickstart Examples\n\nSparkDQ lets you define checks either using a **Python-native** interface or via **declarative configuration** (e.g. YAML, JSON, or database-driven). Regardless of how you define them, all checks are added to a `CheckSet` — which you pass to the validation engine. That’s it! Choose the style that fits your use case, and SparkDQ takes care of the rest.\n\n### Python-Native Approach\n\n```python\nfrom pyspark.sql import SparkSession\n\nfrom sparkdq.checks import NullCheckConfig\nfrom sparkdq.engine import BatchDQEngine\nfrom sparkdq.management import CheckSet\n\nspark = SparkSession.builder.getOrCreate()\n\ndf = spark.createDataFrame([\n    {\"id\": 1, \"name\": \"Alice\"},\n    {\"id\": 2, \"name\": None},\n    {\"id\": 3, \"name\": \"Bob\"},\n])\n\n# Define checks using the Python-native interface (no external config needed)\ncheck_set = CheckSet()\ncheck_set.add_check(NullCheckConfig(check_id=\"my-null-check\", columns=[\"name\"]))\n\nresult = BatchDQEngine(check_set).run_batch(df)\nprint(result.summary())\n```\n\n### Declarative Approach\n\n```python\nfrom pyspark.sql import SparkSession\n\nfrom sparkdq.engine import BatchDQEngine\nfrom sparkdq.management import CheckSet\n\nspark = SparkSession.builder.getOrCreate()\n\ndf = spark.createDataFrame(\n    [\n        {\"id\": 1, \"name\": \"Alice\"},\n        {\"id\": 2, \"name\": None},\n        {\"id\": 3, \"name\": \"Bob\"},\n    ]\n)\n\n# Declarative configuration via dictionary\n# Could be loaded from YAML, JSON, or any external system\ncheck_definitions = [\n    {\"check-id\": \"my-null-check\", \"check\": \"null-check\", \"columns\": [\"name\"]},\n]\ncheck_set = CheckSet()\ncheck_set.add_checks_from_dicts(check_definitions)\n\nresult = BatchDQEngine(check_set).run_batch(df)\nprint(result.summary())\n```\n\nSparkDQ is designed to integrate seamlessly into real-world systems. Instead of relying on a custom DSL or\nrigid schemas, it accepts plain Python dictionaries for check definitions. This makes it easy to load checks\nfrom YAML or JSON files, configuration tables in databases, or even remote APIs — enabling smooth integration\ninto orchestration tools, CI pipelines, and data contract workflows.\n\n## Installation\n\nInstall the latest stable version using pip:\n\n```\npip install sparkdq\n```\n\nAlternatively, if you're using uv, a fast and modern Python package manager:\n\n```\nuv add sparkdq\n```\n\nThe framework supports Python 3.10+ and is fully tested with PySpark 3.5.x. If you're running SparkDQ outside\nof managed platforms like Databricks, AWS Glue, or EMR, make sure Spark is installed and properly\nconfigured on your system. You can install it via your package manager or by following the official\n[Installation Guide](https://spark.apache.org/docs/latest/api/python/getting_started/install.html).\n\n## Why SparkDQ?\n\n* ✅ **Robust Validation Layer**: Clean separation of check definition, execution, and reporting\n\n* ✅ **Declarative or Programmatic**: Define checks via config files or directly in Python\n\n* ✅ **Severity-Aware**: Built-in distinction between warning and critical violations\n\n* ✅ **Row \u0026 Aggregate Logic**: Supports both record-level and dataset-wide constraints\n\n* ✅ **Typed \u0026 Tested**: Built with type safety, testability, and extensibility in mind\n\n* ✅ **Zero Overhead**: Pure PySpark, no heavy dependencies\n\n## Typical Use Cases\n\nSparkDQ is built for modern data platforms that demand trust, transparency, and resilience.\nIt helps teams enforce quality standards early and consistently — across ingestion, transformation, and delivery layers.\n\nWhether you're building a real-time ingestion pipeline or curating a data product for thousands of downstream users,\nSparkDQ lets you define and execute checks that are precise, scalable, and easy to maintain.\n\n**Common Scenarios**:\n\n* ✅ Validating raw ingestion data\n\n* ✅ Enforcing schema and content rules before persisting to a lakehouse (Delta, Iceberg, Hudi)\n\n* ✅ Asserting quality conditions before analytics or ML training jobs\n\n* ✅ Flagging critical violations in batch pipelines via structured summaries and alerts\n\n* ✅ Driving Data Contracts: Use declarative checks in CI pipelines to catch issues before deployment\n\n## Let’s Build Better Data Together\n\n⭐️ Found this useful? Give it a star and help spread the word!\n\n📣 Questions, feedback, or ideas? Open an issue or discussion — we’d love to hear from you.\n\n🤝 Want to contribute? Check out [CONTRIBUTING.md](https://github.com/sparkdq-community/sparkdq/blob/main/CONTRIBUTING.md) to get started.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsparkdq-community%2Fsparkdq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsparkdq-community%2Fsparkdq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsparkdq-community%2Fsparkdq/lists"}