{"id":13571415,"url":"https://github.com/awslabs/deequ","last_synced_at":"2025-05-13T16:10:21.368Z","repository":{"id":38255091,"uuid":"143925946","full_name":"awslabs/deequ","owner":"awslabs","description":"Deequ is a library built on top of Apache Spark for defining \"unit tests for data\", which measure data quality in large datasets.","archived":false,"fork":false,"pushed_at":"2025-03-27T17:06:38.000Z","size":72797,"stargazers_count":3390,"open_issues_count":157,"forks_count":550,"subscribers_count":77,"default_branch":"master","last_synced_at":"2025-04-01T22:19:44.724Z","etag":null,"topics":["dataquality","scala","spark","unit-testing"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/awslabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-08-07T20:55:14.000Z","updated_at":"2025-04-01T09:27:11.000Z","dependencies_parsed_at":"2023-10-24T22:23:18.844Z","dependency_job_id":"2b748619-f12e-4eab-9f91-62709c7df9d1","html_url":"https://github.com/awslabs/deequ","commit_stats":{"total_commits":242,"total_committers":74,"mean_commits":3.27027027027027,"dds":0.8512396694214877,"last_synced_commit":"4544616b3d8c8f2b212e69e2ed99da2d3eff8f70"},"previous_names":[],"tags_count":30,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdeequ","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdeequ/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/awslabs%2Fdeequ/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/awslabs%2Fdeequ/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/awslabs","download_url":"https://codeload.github.com/awslabs/deequ/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247941704,"owners_count":21022038,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataquality","scala","spark","unit-testing"],"created_at":"2024-08-01T14:01:01.845Z","updated_at":"2025-04-08T23:16:40.186Z","avatar_url":"https://github.com/awslabs.png","language":"Scala","readme":"# Deequ - Unit Tests for Data\n[![GitHub license](https://img.shields.io/github/license/awslabs/deequ.svg)](https://github.com/awslabs/deequ/blob/master/LICENSE)\n[![GitHub issues](https://img.shields.io/github/issues/awslabs/deequ.svg)](https://github.com/awslabs/deequ/issues)\n[![Build Status](https://travis-ci.com/awslabs/deequ.svg?branch=master)](https://travis-ci.com/awslabs/deequ)\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/com.amazon.deequ/deequ/badge.svg)](https://maven-badges.herokuapp.com/maven-central/com.amazon.deequ/deequ)\n\nDeequ is a library built on top of Apache Spark for defining \"unit tests for data\", which measure data quality in large datasets. We are happy to receive feedback and [contributions](CONTRIBUTING.md).\n\nPython users may also be interested in PyDeequ, a Python interface for Deequ. 
You can find PyDeequ on [GitHub](https://github.com/awslabs/python-deequ), [readthedocs](https://pydeequ.readthedocs.io/en/latest/README.html), and [PyPI](https://pypi.org/project/pydeequ/).\n\n## Requirements and Installation\n\n__Deequ__ depends on Java 8. Deequ version 2.x only runs with Spark 3.1, and vice versa. If you rely on a previous Spark version, please use a Deequ 1.x version (the legacy version is maintained in the `legacy-spark-3.0` branch). We provide legacy releases compatible with Apache Spark versions 2.2.x to 3.0.x. The Spark 2.2.x and 2.3.x releases depend on Scala 2.11, and the Spark 2.4.x, 3.0.x, and 3.1.x releases depend on Scala 2.12.\n\nDeequ is available via [Maven Central](http://mvnrepository.com/artifact/com.amazon.deequ/deequ).\n\nChoose the latest release that matches your Spark version from the [available versions](https://repo1.maven.org/maven2/com/amazon/deequ/deequ/). Add the release as a dependency to your project. For example, for Spark 3.1.x:\n\n__Maven__\n```\n\u003cdependency\u003e\n  \u003cgroupId\u003ecom.amazon.deequ\u003c/groupId\u003e\n  \u003cartifactId\u003edeequ\u003c/artifactId\u003e\n  \u003cversion\u003e2.0.0-spark-3.1\u003c/version\u003e\n\u003c/dependency\u003e\n```\n__sbt__\n```\nlibraryDependencies += \"com.amazon.deequ\" % \"deequ\" % \"2.0.0-spark-3.1\"\n```\n\n## Example\n\n__Deequ__'s purpose is to \"unit-test\" data to find errors early, before the data gets fed to consuming systems or machine learning algorithms. In the following, we will walk you through a toy example to showcase the most basic usage of our library. An executable version of the example is available [here](/src/main/scala/com/amazon/deequ/examples/BasicExample.scala).\n\n__Deequ__ works on tabular data, e.g., CSV files, database tables, logs, flattened JSON files, basically anything that you can fit into a Spark dataframe. 
For this example, we assume that we work on some kind of `Item` data, where every item has an id, a productName, a description, a priority and a count of how often it has been viewed.\n\n```scala\ncase class Item(\n  id: Long,\n  productName: String,\n  description: String,\n  priority: String,\n  numViews: Long\n)\n```\n\nOur library is built on [Apache Spark](https://spark.apache.org/) and is designed to work with very large datasets (think billions of rows) that typically live in a distributed filesystem or a data warehouse. For the sake of simplicity in this example, we just generate a few toy records though.\n\n```scala\nval rdd = spark.sparkContext.parallelize(Seq(\n  Item(1, \"Thingy A\", \"awesome thing.\", \"high\", 0),\n  Item(2, \"Thingy B\", \"available at http://thingb.com\", null, 0),\n  Item(3, null, null, \"low\", 5),\n  Item(4, \"Thingy D\", \"checkout https://thingd.ca\", \"low\", 10),\n  Item(5, \"Thingy E\", null, \"high\", 12)))\n\nval data = spark.createDataFrame(rdd)\n```\n\nMost applications that work with data have implicit assumptions about that data, e.g., that attributes have certain types, do not contain NULL values, and so on. If these assumptions are violated, your application might crash or produce wrong outputs. The idea behind __deequ__ is to explicitly state these assumptions in the form of a \"unit-test\" for data, which can be verified on a piece of data at hand. If the data has errors, we can \"quarantine\" and fix it, before we feed it to an application.\n\nThe main entry point for defining how you expect your data to look is the [VerificationSuite](src/main/scala/com/amazon/deequ/VerificationSuite.scala) from which you can add [Checks](src/main/scala/com/amazon/deequ/checks/Check.scala) that define constraints on attributes of the data. 
In this example, we test for the following properties of our data:\n\n  * there are 5 rows in total\n  * values of the `id` attribute are never NULL and always unique\n  * values of the `productName` attribute are never NULL\n  * the `priority` attribute can only contain the values \"high\" or \"low\"\n  * `numViews` should not contain negative values\n  * at least half of the values in `description` should contain a URL\n  * the median of `numViews` should be less than or equal to 10\n\nIn code this looks as follows:\n\n```scala\nimport com.amazon.deequ.VerificationSuite\nimport com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}\n\n\nval verificationResult = VerificationSuite()\n  .onData(data)\n  .addCheck(\n    Check(CheckLevel.Error, \"unit testing my data\")\n      .hasSize(_ == 5) // we expect 5 rows\n      .isComplete(\"id\") // should never be NULL\n      .isUnique(\"id\") // should not contain duplicates\n      .isComplete(\"productName\") // should never be NULL\n      // should only contain the values \"high\" and \"low\"\n      .isContainedIn(\"priority\", Array(\"high\", \"low\"))\n      .isNonNegative(\"numViews\") // should not contain negative values\n      // at least half of the descriptions should contain a URL\n      .containsURL(\"description\", _ \u003e= 0.5)\n      // the median of numViews should be less than or equal to 10\n      .hasApproxQuantile(\"numViews\", 0.5, _ \u003c= 10))\n  .run()\n```\n\nAfter calling `run`, __deequ__ translates your test to a series of Spark jobs, which it executes to compute metrics on the data. Afterwards, it invokes your assertion functions (e.g., `_ == 5` for the size check) on these metrics to see if the constraints hold on the data. 
We can inspect the [VerificationResult](src/main/scala/com/amazon/deequ/VerificationResult.scala) to see if the test found errors:\n\n```scala\nimport com.amazon.deequ.constraints.ConstraintStatus\n\n\nif (verificationResult.status == CheckStatus.Success) {\n  println(\"The data passed the test, everything is fine!\")\n} else {\n  println(\"We found errors in the data:\\n\")\n\n  val resultsForAllConstraints = verificationResult.checkResults\n    .flatMap { case (_, checkResult) =\u003e checkResult.constraintResults }\n\n  resultsForAllConstraints\n    .filter { _.status != ConstraintStatus.Success }\n    .foreach { result =\u003e println(s\"${result.constraint}: ${result.message.get}\") }\n}\n```\n\nIf we run the example, we get the following output:\n```\nWe found errors in the data:\n\nCompletenessConstraint(Completeness(productName)): Value: 0.8 does not meet the requirement!\nPatternConstraint(containsURL(description)): Value: 0.4 does not meet the requirement!\n```\nThe test found that our assumptions are violated! Only 4 out of 5 values (80%) of the `productName` attribute are non-null, and only 2 out of 5 values (40%) of the `description` attribute contain a URL. Fortunately, we ran a test and found the errors; somebody should immediately fix the data :)\n\n## More examples\n\nOur library contains much more functionality than what we showed in the basic example. We are in the process of adding [more examples](src/main/scala/com/amazon/deequ/examples/) for its advanced features. 
So far, we showcase the following functionality:\n\n * [Persistence and querying of computed metrics of the data with a MetricsRepository](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/metrics_repository_example.md)\n * [Data profiling](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md) of large data sets\n * [Anomaly detection](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/anomaly_detection_example.md) on data quality metrics over time\n * [Automatic suggestion of constraints](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/constraint_suggestion_example.md) for large datasets\n * [Incremental metrics computation on growing data and metric updates on partitioned data](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/algebraic_states_example.md) (advanced)\n\n\n## Citation\n\nIf you would like to reference this package in a research paper, please cite:\n\nSebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. [Automating large-scale data quality verification](http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.\n\n## License\n\nThis library is licensed under the Apache 2.0 License.\n","funding_links":[],"categories":["📊 Data Validation \u0026 Quality","Scala","Traditional Data","Data Quality","Tools","分布式机器学习","Industry-strength Anomaly Detection","Packages"],"sub_categories":["Tools \u0026 Projects","Open Source Tools","Data quality"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fdeequ","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fawslabs%2Fdeequ","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fawslabs%2Fdeequ/lists"}