{"id":13680766,"url":"https://github.com/awslabs/python-deequ","last_synced_at":"2026-01-16T07:31:48.208Z","repository":{"id":37545664,"uuid":"311469500","full_name":"awslabs/python-deequ","owner":"awslabs","description":"Python API for Deequ","archived":false,"fork":false,"pushed_at":"2026-01-13T18:08:19.000Z","size":3482,"stargazers_count":809,"open_issues_count":122,"forks_count":148,"subscribers_count":15,"default_branch":"master","last_synced_at":"2026-01-13T19:29:27.740Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/awslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-11-09T21:28:29.000Z","updated_at":"2026-01-11T03:36:06.000Z","dependencies_parsed_at":"2023-12-06T03:31:26.750Z","dependency_job_id":"ebb38552-51b3-4cd8-b731-7e81d78d3df5","html_url":"https://github.com/awslabs/python-deequ","commit_stats":{"total_commits":43,"total_committers":11,"mean_commits":3.909090909090909,"dds":0.6511627906976745,"last_synced_commit":"87ca6d5b3b30ea86784d894ea26d8a7ddafd910b"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":"amazon-archives/__template_Apache-2.0","purl":"pkg:github/awslabs/python-deequ","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fpython-deequ","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fpython-deequ/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/awslabs%2Fpython-deequ/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fpython-deequ/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/awslabs","download_url":"https://codeload.github.com/awslabs/python-deequ/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fpython-deequ/sbom","scorecard":{"id":219635,"data":{"date":"2025-08-11","repo":{"name":"github.com/awslabs/python-deequ","commit":"ca8e9e1dba9303a960411df6d58b01ef6dd87c12"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":4.3,"checks":[{"name":"Code-Review","score":8,"reason":"Found 21/25 approved changesets -- score normalized to 8","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/base.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge 
detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a 
license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/base.yml:18: update your workflow using https://app.stepsecurity.io/secureworkflow/awslabs/python-deequ/base.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/base.yml:20: update your workflow using https://app.stepsecurity.io/secureworkflow/awslabs/python-deequ/base.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/base.yml:25: update your workflow using https://app.stepsecurity.io/secureworkflow/awslabs/python-deequ/base.yml/master?enable=pin","Warn: containerImage not pinned by hash: Dockerfile:1: pin your Docker image by updating ubuntu:22.04 to ubuntu:22.04@sha256:1aa979d85661c488ce030ac292876cf6ed04535d3a237e49f61542d8e5de5ae0","Warn: pipCommand not pinned by hash: Dockerfile:17","Warn: pipCommand not pinned by hash: .github/workflows/base.yml:35","Warn: pipCommand not pinned by hash: .github/workflows/base.yml:36","Info:   0 out of   3 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 containerImage dependencies pinned","Info:   0 out of   3 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Security-Policy","score":10,"reason":"security policy file detected","details":["Info: security policy file detected: github.com/awslabs/.github/SECURITY.md:1","Info: Found linked content: github.com/awslabs/.github/SECURITY.md:1","Info: Found disclosure, vulnerability, and/or timelines in security policy: github.com/awslabs/.github/SECURITY.md:1","Info: Found text in security policy: github.com/awslabs/.github/SECURITY.md:1"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":0,"reason":"26 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2020-73","Warn: Project is vulnerable to: PYSEC-2023-44 / GHSA-329j-jfvr-rhr6","Warn: Project is vulnerable to: PYSEC-2022-42976 / GHSA-43xg-8wmj-cw8h","Warn: Project is vulnerable to: PYSEC-2022-236 / GHSA-4x9r-j582-cgr8","Warn: Project is vulnerable to: PYSEC-2018-25 / GHSA-6mqq-8r44-vmjc","Warn: Project is vulnerable to: PYSEC-2017-147 / GHSA-8rhc-48pp-52gr","Warn: Project is vulnerable to: PYSEC-2022-186 / GHSA-9rr6-jpg7-9jg6","Warn: Project is vulnerable to: PYSEC-2019-114 / GHSA-fp5j-3fpf-mhj5","Warn: Project is vulnerable to: PYSEC-2020-95 / GHSA-wgx7-jwwm-cgjv","Warn: Project is vulnerable to: PYSEC-2023-72","Warn: Project is 
vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6","Warn: Project is vulnerable to: GHSA-79v4-65xg-pq4g","Warn: Project is vulnerable to: GHSA-h4gh-qq45-vh27","Warn: Project is vulnerable to: PYSEC-2024-60 / GHSA-jjg7-2v4v-x38h","Warn: Project is vulnerable to: PYSEC-2022-42969","Warn: Project is vulnerable to: PYSEC-2023-117 / GHSA-mrwq-x4v8-fh7p","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2025-49 / GHSA-5rjg-fvgr-3xxf","Warn: Project is vulnerable to: GHSA-g7vv-2v7x-gj9p","Warn: Project is vulnerable to: GHSA-34jh-p97f-mpxf","Warn: Project is vulnerable to: PYSEC-2023-212 / GHSA-g4mx-q9vg-27p4","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v","Warn: Project is vulnerable to: PYSEC-2023-192 / GHSA-v845-jxx5-vc9f","Warn: Project is vulnerable to: PYSEC-2024-187 / GHSA-rqc4-2hc7-8c8v","Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 29 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-17T02:13:46.255Z","repository_id":37545664,"created_at":"2025-08-17T02:13:46.255Z","updated_at":"2025-08-17T02:13:46.255Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28478047,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T06:30:42.265Z","status":"ssl_error","status_checked_at":"2026-01-16T06:30:16.248Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:01:21.681Z","updated_at":"2026-01-16T07:31:48.161Z","avatar_url":"https://github.com/awslabs.png","language":"Jupyter Notebook","readme":"# PyDeequ\n\nPyDeequ is a Python API for [Deequ](https://github.com/awslabs/deequ), a library built on top of Apache Spark for defining \"unit tests for data\", which measure data quality in large datasets. 
PyDeequ is written to support usage of Deequ in Python.\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Coverage](https://img.shields.io/badge/coverage-90%25-green)\n\nDeequ has 4 main components:\n- Metrics Computation:\n    - `Profiles` leverages Analyzers to analyze each column of a dataset.\n    - `Analyzers` serve here as a foundational module that computes metrics for data profiling and validation at scale.\n- Constraint Suggestion:\n    - Specify rules for various groups of Analyzers to be run over a dataset to return a collection of constraints suggested to run in a Verification Suite.\n- Constraint Verification:\n    - Perform data validation on a dataset with respect to various constraints set by you.\n- Metrics Repository:\n    - Allows for persistence and tracking of Deequ runs over time.\n\n![](imgs/pydeequ_architecture.jpg)\n\n## 🎉 Announcements 🎉\n- **NEW!!!** The 1.4.0 release of Python Deequ has been published to PyPI https://pypi.org/project/pydeequ/. This release adds support for Spark 3.5.0.\n- The latest version of Deequ, 2.0.7, is made available with Python Deequ 1.3.0.\n- The 1.1.0 release of Python Deequ has been published to PyPI https://pypi.org/project/pydeequ/. This release brings many recent upgrades including support up to Spark 3.3.0! Feedback is welcome through GitHub issues.\n- With PyDeequ v0.1.8+, we now officially support Spark 3! Just make sure you have an environment variable `SPARK_VERSION` to specify your Spark version!\n- We've released a blog post on integrating PyDeequ with AWS, leveraging services such as AWS Glue, Athena, and SageMaker! 
Check it out: [Monitor data quality in your data lake using PyDeequ and AWS Glue](https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/).\n- Check out the [PyDeequ Release Announcement Blogpost](https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/) with a tutorial walkthrough the Amazon Reviews dataset!\n- Join the PyDeequ community on [PyDeequ Slack](https://join.slack.com/t/pydeequ/shared_invite/zt-te6bntpu-yaqPy7bhiN8Lu0NxpZs47Q) to chat with the devs!\n\n## Quickstart\n\nThe following will quickstart you with some basic usage. For more in-depth examples, take a look in the [`tutorials/`](tutorials/) directory for executable Jupyter notebooks of each module. For documentation on supported interfaces, view the [`documentation`](https://pydeequ.readthedocs.io/).\n\n### Installation\n\nYou can install [PyDeequ via pip](https://pypi.org/project/pydeequ/).\n\n```\npip install pydeequ\n```\n\n### Set up a PySpark session\n```python\nfrom pyspark.sql import SparkSession, Row\nimport pydeequ\n\nspark = (SparkSession\n    .builder\n    .config(\"spark.jars.packages\", pydeequ.deequ_maven_coord)\n    .config(\"spark.jars.excludes\", pydeequ.f2j_maven_coord)\n    .getOrCreate())\n\ndf = spark.sparkContext.parallelize([\n            Row(a=\"foo\", b=1, c=5),\n            Row(a=\"bar\", b=2, c=6),\n            Row(a=\"baz\", b=3, c=None)]).toDF()\n```\n\n### Analyzers\n\n```python\nfrom pydeequ.analyzers import *\n\nanalysisResult = AnalysisRunner(spark) \\\n                    .onData(df) \\\n                    .addAnalyzer(Size()) \\\n                    .addAnalyzer(Completeness(\"b\")) \\\n                    .run()\n\nanalysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)\nanalysisResult_df.show()\n```\n\n### Profile\n\n```python\nfrom pydeequ.profiles import *\n\nresult = ColumnProfilerRunner(spark) \\\n    .onData(df) \\\n    .run()\n\nfor col, profile in 
result.profiles.items():\n    print(profile)\n```\n\n### Constraint Suggestions\n\n```python\nfrom pydeequ.suggestions import *\n\nsuggestionResult = ConstraintSuggestionRunner(spark) \\\n             .onData(df) \\\n             .addConstraintRule(DEFAULT()) \\\n             .run()\n\n# Constraint Suggestions in JSON format\nprint(suggestionResult)\n```\n\n### Constraint Verification\n\n```python\nfrom pydeequ.checks import *\nfrom pydeequ.verification import *\n\ncheck = Check(spark, CheckLevel.Warning, \"Review Check\")\n\ncheckResult = VerificationSuite(spark) \\\n    .onData(df) \\\n    .addCheck(\n        check.hasSize(lambda x: x \u003e= 3) \\\n        .hasMin(\"b\", lambda x: x == 0) \\\n        .isComplete(\"c\")  \\\n        .isUnique(\"a\")  \\\n        .isContainedIn(\"a\", [\"foo\", \"bar\", \"baz\"]) \\\n        .isNonNegative(\"b\")) \\\n    .run()\n\ncheckResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)\ncheckResult_df.show()\n```\n\n### Repository\n\nSave to a Metrics Repository by adding the `useRepository()` and `saveOrAppendResult()` calls to your Analysis Runner.\n```python\nfrom pydeequ.repository import *\nfrom pydeequ.analyzers import *\n\nmetrics_file = FileSystemMetricsRepository.helper_metrics_file(spark, 'metrics.json')\nrepository = FileSystemMetricsRepository(spark, metrics_file)\nkey_tags = {'tag': 'pydeequ hello world'}\nresultKey = ResultKey(spark, ResultKey.current_milli_time(), key_tags)\n\nanalysisResult = AnalysisRunner(spark) \\\n    .onData(df) \\\n    .addAnalyzer(ApproxCountDistinct('b')) \\\n    .useRepository(repository) \\\n    .saveOrAppendResult(resultKey) \\\n    .run()\n```\n\nTo load previous runs, use the `repository` object to load previous results back in.\n\n```python\nresult_metrep_df = repository.load() \\\n    .before(ResultKey.current_milli_time()) \\\n    .forAnalyzers([ApproxCountDistinct('b')]) \\\n    .getSuccessMetricsAsDataFrame()\n```\n\n### Wrapping up\n\nAfter you've ran 
your jobs with PyDeequ, be sure to shut down your Spark session to prevent any hanging processes.\n\n```python\nspark.sparkContext._gateway.shutdown_callback_server()\nspark.stop()\n```\n\n## [Contributing](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md)\nPlease refer to the [contributing doc](https://github.com/awslabs/python-deequ/blob/master/CONTRIBUTING.md) for how to contribute to PyDeequ.\n\n## [License](https://github.com/awslabs/python-deequ/blob/master/LICENSE)\n\nThis library is licensed under the Apache 2.0 License.\n\n******\n\n## Contributing Developer Setup\n\n1. Set up [SDKMAN](#setup-sdkman)\n1. Set up [Java](#setup-java)\n1. Set up [Apache Spark](#setup-apache-spark)\n1. Install [Poetry](#poetry)\n1. Run [tests locally](#running-tests-locally)\n\n### Setup SDKMAN\n\nSDKMAN is a tool for managing parallel versions of multiple Software Development Kits on any Unix-based\nsystem. It provides a convenient command line interface for installing, switching, removing and listing\ncandidates. SDKMAN! installs smoothly on Mac OSX, Linux, WSL, Cygwin, and similar platforms, and supports the Bash and ZSH shells. See the\ndocumentation on the [SDKMAN! 
website](https://sdkman.io).\n\nOpen your favourite terminal and enter the following:\n\n```bash\n$ curl -s https://get.sdkman.io | bash\n# If the environment needs tweaking for SDKMAN to be installed,\n# the installer will prompt you accordingly and ask you to restart.\n\n# Next, open a new terminal or enter:\n\n$ source \"$HOME/.sdkman/bin/sdkman-init.sh\"\n\n# Lastly, run the following command to ensure that installation succeeded:\n\n$ sdk version\n```\n\n### Setup Java\n\nTo install Java, open your favourite terminal and enter the following:\n\n```bash\n# List the AdoptOpenJDK OpenJDK versions\n$ sdk list java\n\n# To install Java 11\n$ sdk install java 11.0.10.hs-adpt\n\n# To install Java 8\n$ sdk install java 8.0.292.hs-adpt\n```\n\n### Setup Apache Spark\n\nTo install Apache Spark, open your favourite terminal and enter the following:\n\n```bash\n# List the Apache Spark versions\n$ sdk list spark\n\n# To install Spark 3\n$ sdk install spark 3.0.2\n```\n\n### Poetry\n\nPoetry [Commands](https://python-poetry.org/docs/cli/#search)\n\n```bash\npoetry install\n\npoetry update\n\n# --tree: List the dependencies as a tree.\n# --latest (-l): Show the latest version.\n# --outdated (-o): Show the latest version but only for packages that are outdated.\npoetry show -o\n```\n\n## Running Tests Locally\n\nTake a look at the tests in `tests/dataquality` and `tests/jobs`.\n\n```bash\n$ poetry run pytest\n```\n\n## Running Tests Locally (Docker)\n\nIf you have issues installing the dependencies listed above, another way to run the tests and verify your changes is through Docker. There is a Dockerfile that will install the required dependencies and run the tests in a container.\n\n```\ndocker build . 
-t spark-3.3-docker-test\ndocker run spark-3.3-docker-test\n```\n\n","funding_links":[],"categories":["📊 Data Validation \u0026 Quality","Packages","Traditional Data","Jupyter Notebook","Table of Contents"],"sub_categories":["Data quality","Tools \u0026 Projects","Frameworks and Libraries"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fpython-deequ","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fawslabs%2Fpython-deequ","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fpython-deequ/lists"}