{"id":13571358,"url":"https://github.com/target/data-validator","last_synced_at":"2025-05-13T17:58:38.515Z","repository":{"id":35825317,"uuid":"181898630","full_name":"target/data-validator","owner":"target","description":"A tool to validate data, built around Apache Spark. ","archived":false,"fork":false,"pushed_at":"2025-04-01T09:06:43.000Z","size":602,"stargazers_count":101,"open_issues_count":25,"forks_count":33,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-04-04T08:40:35.407Z","etag":null,"topics":["data-science","data-validation","hacktoberfest"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/target.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-04-17T13:34:23.000Z","updated_at":"2025-03-28T16:34:19.000Z","dependencies_parsed_at":"2024-04-10T22:41:05.195Z","dependency_job_id":"48c65149-282e-48ab-8bf1-527b984f9a36","html_url":"https://github.com/target/data-validator","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/target%2Fdata-validator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/target%2Fdata-validator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/target%2Fdata-validator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/target%2Fdata-validator/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/target","download_url":"https://codeload.github.com/target/data-validator/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253999824,"owners_count":21997336,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","data-validation","hacktoberfest"],"created_at":"2024-08-01T14:01:01.324Z","updated_at":"2025-05-13T17:58:38.481Z","avatar_url":"https://github.com/target.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"# Data Validator\n\n![GitHub release (latest by date)](https://img.shields.io/github/v/release/target/data-validator?style=plastic)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg?style=plastic)](https://opensource.org/licenses/Apache-2.0)\n[![Continuous Integration](https://github.com/target/data-validator/actions/workflows/ci.yaml/badge.svg)](https://github.com/target/data-validator/actions/workflows/ci.yaml)\n[![Release Build Status](https://github.com/target/data-validator/actions/workflows/release.yaml/badge.svg)](https://github.com/target/data-validator/actions/workflows/release.yaml)\n\nA tool to validate data in Spark\n\n## Usage\n\n### Retrieving official releases via direct download or Maven-compatible dependency retrieval, e.g. `spark-submit`\n\nYou can make the jars available in one of two ways for the [example run invocations below](#example-run):\n\n1. 
Get the latest version from [GitHub Packages](https://github.com/orgs/target/packages?repo_name=data-validator)\n   for the project. Place the jars somewhere and pass their path to `--jars` when running `spark-submit`.\n\n1. You can pull in the dependency using `spark-submit`'s `--repositories`, `--packages`, and `--class`\n   options, but it requires setting `spark.jars.ivySettings` and providing this file, populated with a valid\n   [personal access token](https://github.com/settings/tokens) having the `read:packages` scope enabled.\n   N.b. it can be a challenge to secure this file on shared clusters; consider using a public GitHub\n   service account instead of a token from your own personal GitHub account.\n\n    ```xml\n    \u003civysettings\u003e\n      \u003csettings defaultResolver=\"thechain\"\u003e\n        \u003ccredentials host=\"maven.pkg.github.com\" realm=\"GitHub Package Registry\"\n                     username=\"${GITHUB_PACKAGES_USER}\" passwd=\"${GITHUB_PACKAGES_USER_TOKEN}\" /\u003e\n      \u003c/settings\u003e\n      \u003cresolvers\u003e\n        \u003cchain name=\"thechain\"\u003e\n          \u003cibiblio name=\"central\" m2compatible=\"true\" root=\"https://repo1.maven.org/maven2\" /\u003e\n          \u003c!-- add any other repositories here --\u003e\n          \u003cibiblio name=\"ghp-dv\" m2compatible=\"true\" root=\"https://maven.pkg.github.com/target/data-validator\"/\u003e\n        \u003c/chain\u003e\n      \u003c/resolvers\u003e\n    \u003c/ivysettings\u003e\n    ```\n    See also [How do I add a GitHub Package repository when executing spark-submit --repositories?](https://stackoverflow.com/q/70687667/204052)\n\n\n### Building locally\n\nSee [CONTRIBUTING](CONTRIBUTING.md) for development environment setup.\n\nAssemble fat jar: `make build` or `sbt clean assembly`\n\n```\nspark-submit --master local data-validator-assembly-0.14.1.jar --help\n\ndata-validator v0.14.1\nUsage: data-validator [options]\n\n  --version\n  --verbose    
            Print additional debug output.\n  --config \u003cvalue\u003e         required validator config .yaml filename, prefix w/ 'classpath:' to load configuration from JVM classpath/resources, ex. '--config classpath:/config.yaml'\n  --jsonReport \u003cvalue\u003e     optional JSON report filename\n  --htmlReport \u003cvalue\u003e     optional HTML report filename\n  --vars k1=v1,k2=v2...    other arguments\n  --exitErrorOnFail true|false\n                           optional when true, if validator fails, call System.exit(-1) Defaults to True, but will change to False in future version.\n  --emailOnPass true|false\n                           optional when true, sends email on validation success. Default: false\n  --help                   Show this help message and exit.\n```\n\nIf you want to build with Java 11 or newer, set the \"MODERN_JAVA\" environment variable.\nThis may become the default in the future.\n\n## Example Run\n\nWith the JAR directly:\n\n```bash\nspark-submit \\\n  --num-executors 10 \\\n  --executor-cores 2 \\\n  data-validator-assembly-0.14.1.jar \\\n  --config config.yaml \\\n  --jsonReport report.json\n```\n\nUsing `--packages` dependency loading, having created `dv-ivy.xml` as suggested above\nand replaced the placeholders in the example:\n\n```bash\ntouch empty.file \u0026\u0026 \\\nspark-submit \\\n  --class com.target.data_validator.Main \\\n  --packages com.target:data-validator_2.11:0.14.1 \\\n  --conf spark.jars.ivySettings=$(pwd)/dv-ivy.xml \\\n  empty.file \\\n  --config config.yaml \\\n  --jsonReport report.json\n```\n\nSee the [Example Config](#example-config) below for the contents of `config.yaml`.\n\n## Config file Description\n\nThe data-validator config file is YAML-based and has three sections:\nGlobal Settings, Table Sources, and Validators.  Table Sources\nand Validators can use variables in their\nconfiguration. 
These variables are replaced at runtime with the values\nset via the `Global Settings` section or the `--vars` option on the\ncommand line.  Variables start with `$` and must be a word\nstarting with a letter (A-Za-z) followed by zero or more letters\n(A-Za-z), numbers (0-9), or underscores. Variables can optionally be\nwrapped in `{` `}`, e.g. `$foo`, `${foo}`. See the\n[code](src/main/scala/com/target/data_validator/VarSubstitution.scala#L141)\nfor the regular expression used to find them in a string. All the\ntable sources, and all but one validator (`rowCount`), support\nvariables in their configuration parameters. **Note:** Care must be taken\nwith some of the substitutions; some possible values might require\nquoting the variables in the config.\n\n### Global Settings\n\nThe first section is the global settings that are used\nthroughout the program.\n\n| Variable            | Type        | Required |                                           Description                                            |\n|:--------------------|:------------|:---------|:------------------------------------------------------------------------------------------------:|\n| `numKeyCols`        | Int         | Yes      |   The number of columns from the table schema to use to uniquely identify a row in the table.    |\n| `numErrorsToReport` | Int         | Yes      |                The number of detailed errors to include in the Validator Report.                 |\n| `detailedErrors`    | Boolean     | Yes      |     If a check fails, run a second pass and gather `numErrorsToReport` examples of failure.      |\n| `email`             | EmailConfig | No       |                                See [Email Config](#email-config).                                |\n| `vars`              | Map         | No       | A map of (key, value) pairs used for variable substitution in `tables` config. See next section. 
|\n| `outputs`           | Array       | No       |      Describes where to send the `.json` report. See [Validator Output](#validator-output).      |\n| `tables`            | List        | Yes      |                      List of table sources used to load tables to validate.                      |\n\n#### Email Config\n\n| Variable   | Type          | Required |                                    Description                                    |\n|:-----------|:--------------|:---------|:---------------------------------------------------------------------------------:|\n| `smtpHost` | String        | Yes      |                 The SMTP host to send the email message through.                  |\n| `subject`  | String        | Yes      |                           Subject for the email message.                          |\n| `from`     | String        | Yes      |             Email address to appear in the `from` part of the message.            |\n| `to`       | Array[String] | Yes      |       Must specify at least one email address to send the email report to.        |\n| `cc`       | Array[String] | No       |  Optional list of email addresses to send the message to via the `cc` field.      |\n| `bcc`      | Array[String] | No       |  Optional list of email addresses to send the message to via the `bcc` field.     |\n\nNote that Data Validator only sends email on _failure_ by default. 
To send email even on successful runs,\npass `--emailOnPass true` on the command line.\n\n#### Defining Variables\n\nThere are four types of variables that you can specify: simple, environment, shell, and SQL.\n\n##### Simple Variable\n\nSimple variables are specified as `name` and `value` pairs and are very straightforward.\n\n```yaml\nvars:\n  - name: ENV\n    value: prod\n```\n\nThis sets the variable `ENV` to the value `prod`.\n\n##### Environment Variable\n\nEnvironment variables import the value from the [operating system](https://docs.oracle.com/javase/tutorial/essential/environment/env.html).\n\n```yaml\nvars:\n  - name: JAVA_DIR\n    env: JAVA_HOME\n```\n\nThis will set the variable `JAVA_DIR` to the value returned by `System.getenv(\"JAVA_HOME\")`.\nIf `JAVA_HOME` does not exist in the system environment, the data-validator will stop processing and exit with an error.\n\n##### Shell Variable\n\nA shell variable takes the first line of output from a shell command and stores it in a variable.\n\n```yaml\nvars:\n  - name: NEXT_SATURDAY\n    shell: date -d \"next saturday\" +\"%Y-%m-%d\"\n```\n\nThis will set the variable `NEXT_SATURDAY` to the first line of output from the shell command `date -d \"next saturday\" +\"%Y-%m-%d\"`.\n\n##### SQL Variable\n\nA SQL variable takes the first column from the first row of the results of a Spark SQL statement.\n\n```yaml\nvars:\n  - name: MAX_AGE\n    sql: select max(age) from census_income.adult\n```\n\nThis runs the SQL statement, which gets the max value of the column `age` from the table `adult` in the `census_income` database, and stores the result in `MAX_AGE`.\n\n### Validator Output\n\nIn addition to the `--jsonReport` command line option, the `.yaml` config has an `outputs` section that directs the `.json` event report to a file or pipes it to a program. 
There is no current limit on the number of outputs.\n\n#### Filename\n\n```yaml\noutputs:\n  - filename: /user/home/sample.json\n    append: true\n```\n\nIf the `filename` specified begins with a `/` or `local:///` it is written to the local filesystem. If the filename begins with `hdfs://` the report is written to the hdfs path. An optional `append` boolean can be specified, and if it is `true` the current report will be appended to the end of the specified file. The default is `append: false` and the filename is overwritten. The `filename` supports variable substitution; the optional `append` does not. Before the validator starts processing tables, it checks to verify that it can create or append to the `filename`; if it cannot, the data-validator will exit with an error (non-zero value).\n\n#### Pipe\n\n```yaml\noutputs:\n  - pipe: /path/to/program\n    ignoreError: true\n```\n\nA `pipe` is used to send the `.json` event report to another program for processing. This is a very powerful feature, and can enable the data-validator to be integrated with virtually any other system. An optional `ignoreError` boolean can also be specified; if `true`, the exit value of the program will be ignored. If `false` (default) and the program exits with a non-zero status, the data-validator will fail.  The `pipe` supports variable substitution; the optional `ignoreError` does not.\n\nBefore the validator starts processing tables, it checks to see if the `pipe` program is executable; if it is not, the data-validator will exit with an error (non-zero value). The program must be on a local filesystem to be executed.\n\n### Table Sources\n\nTable sources are used to specify how to load the tables to be\nvalidated. Currently supported sources are HiveTable, OrcFile,\nParquet, and the generic `spark.read` format loader. Each table source\nhas three common arguments, `keyColumns`, `condition`,\nand `checks`, plus its own source-specific argument(s). 
The `keyColumns`\nare a list of columns that can be used to uniquely identify a row in the\ntable for the detailed error report when a validator fails. The `condition`\nenables the user to specify a snippet of SQL to pass to the where clause.\nThe `checks` argument is a list of validators to run on this table.\n\n#### HiveTable\n\nTo validate a Hive table, specify the `db` and the `table`, as shown below.\n\n```yaml\n- db: $DB\n  table: table_name\n  condition: \"col1 \u003c 100\"\n  keyColumns:\n    - col1\n    - col2\n  checks:\n```\n\n#### OrcFile\n\nTo validate an `.orc` file, specify `orcFile` and the path to the file, as shown below.\n\n```yaml\n- orcFile: /path/to/orc/file\n  keyColumns:\n    - col1\n    - col2\n  checks:\n```\n\n#### Parquet File\n\nTo validate a `.parquet` file, specify `parquetFile` and the path to the file, as shown below.\n\n```yaml\n- parquetFile: /path/to/parquet/file\n  keyColumns:\n    - col1\n    - col2\n  checks:\n```\n\n#### Core `spark.read` fluent API specified format loader\n\nTo validate data loadable by the Spark DataFrameReader fluent API, use something like this:\n\n```yaml\n  # Some systems require a special format\n  format: llama\n  # You can also pass any valid options\n  options:\n    maxMemory: 8G\n  # This is a string passed to the varargs version of DataFrameReader.load(String*)\n  # If omitted, then DV will call DataFrameReader.load() without parameters.\n  # The DataSource that Spark loads is expected to know how to handle this.\n  loadData:\n    - /path/to/something/camelid.llama\n  keyColumns:\n    - col1\n    - col2\n  condition: \"col1 \u003c 100\"\n  checks:\n```\n\nUnder the hood, the above would be like loading a DataFrame with:\n\n```scala\nspark.read\n  .format(\"llama\")\n  .option(\"maxMemory\", \"8G\")\n  .load(\"/path/to/something/camelid.llama\")\n```\n\n### Validators\n\nThe third section lists the validators. 
To specify a validator, you\nfirst specify the type as one of the validators, then specify the\narguments for that validator. Some of the validators support an error\nthreshold. This option allows the user to specify the number of errors\nor percentage of errors they can tolerate.  In some use cases, it\nmight not be possible to eliminate all errors in the data.\n\n##### Thresholds\n\nThresholds can be specified as an absolute number of errors, or a percentage of the row count.\nIf the threshold is `\u003e= 1` it is considered an absolute number of errors. For example `1000` would fail the check if there are more than 1000 rows that failed the check.\n\nIf the threshold is `\u003c 1` it is considered a fraction of the row count. For example `0.25` would fail the check if more than `rowCount * 0.25` of the rows fail the check.\nIf the threshold ends in a `%` it is considered a percentage of the row count. For example `33%` would fail the check if more than `rowCount * 0.33` of the rows fail the check.\n\nCurrently supported validators are listed below:\n\n#### `columnMaxCheck`\n\nTakes 2 parameters, the column name and a `value`. The check will fail if `max(column)` is **not equal** to the value.\n\n| Arg      | Type   | Description                                                                                                                                                                                            |\n|----------|--------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `column` | String | Column within the table to find the max from.                                                                                                                                                          |\n| `value`  | \\*     | The column max should equal this value or the check will fail.  
**Note:** The type of the value should match the type of the column. If the column is a `NumericType`, the value cannot be a `String`. |\n\n#### `negativeCheck`\n\nTakes a single parameter, the column name to check. The validator will fail if any row has a negative value in that column.\n\n| Arg         | Type   | Description                                                                                                                                                                                    |\n|-------------|--------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `column`    | String | Table column to be checked for negative values.  If it contains a `null`, the validator will fail.  **Note:** Column must be of a `NumericType` or the check will fail during the config check. |\n| `threshold` | String | See above description of threshold.                                                                                                                                                            |\n\n#### `nullCheck`\n\nTakes a single parameter, the column name to check. The validator will fail if that column is `null` in any row.\n\n| Arg         | Type   | Description                                                                               |\n|-------------|--------|-------------------------------------------------------------------------------------------|\n| `column`    | String | Table column to be checked for `null`.  If it contains a `null`, the validator will fail. |\n| `threshold` | String | See above description of threshold.                                                       |\n\n#### `rangeCheck`\n\nTakes 2 to 4 parameters, described below. 
If the value in the column doesn't fall within the range specified by (`minValue`, `maxValue`) the check will fail.\n\n| Arg         | Type    | Description                                                                                   |\n|-------------|---------|-----------------------------------------------------------------------------------------------|\n| `column`    | String  | Table column to be checked.                                                                   |\n| `minValue`  | \\*      | lower bound of the range, or other column in table. Type depends on the type of the `column`. |\n| `maxValue`  | \\*      | upper bound of the range, or other column in table. Type depends on the type of the `column`. |\n| `inclusive` | Boolean | Include `minValue` and `maxValue` as part of the range.                                       |\n| `threshold` | String  | See above description of threshold.                                                           |\n\n**Note:** To specify another column in the table, you must prefix the column name with a **`** (backtick).\n\n#### `stringLengthCheck`\n\nTakes 2 to 4 parameters, described in the table below. If the length of the string in the column doesn't fall within the range specified by (`minLength`, `maxLength`), both inclusive, the check will fail.\nAt least one of `minLength` or `maxLength` must be specified. The data type of `column` must be String.\n\n| Arg         | Type    | Description                                                             |\n|-------------|---------|-------------------------------------------------------------------------|\n| `column`    | String  | Table column to be checked. The DataType of the column must be a String |\n| `minLength` | Integer | Lower bound of the length of the string, inclusive.                     |\n| `maxLength` | Integer | Upper bound of the length of the string, inclusive.                     
|\n| `threshold` | String  | See above description of threshold.                                     |\n\n#### `stringRegexCheck`\n\nTakes 2 to 3 parameters, described in the table below. If the `column` value does not match the pattern specified by the `regex`, the check will fail.\nA value for `regex` must be specified. The data type of `column` must be String.\n\n| Arg         | Type   | Description                                                             |\n|-------------|--------|-------------------------------------------------------------------------|\n| `column`    | String | Table column to be checked. The DataType of the column must be a String |\n| `regex`     | String | POSIX regex.                                                            |\n| `threshold` | String | See above description of threshold.                                     |\n\n#### `rowCount`\n\nThe minimum number of rows a table must have to pass the validator.\n\n| Arg          | Type | Description                                           |\n|--------------|------|-------------------------------------------------------|\n| `minNumRows` | Long | The minimum number of rows a table must have to pass. |\n\nSee the [Example Config](#example-config) below to see how the checks are configured.\n\n#### `uniqueCheck`\n\nThis check is used to make sure all rows in the table are unique; only the columns specified are used to determine uniqueness.\nThis is a costly check and requires an additional pass through the table.\n\n| Arg       | Type          | Description                                         |\n|-----------|---------------|-----------------------------------------------------|\n| `columns` | Array[String] | Each set of values in these columns must be unique. |\n\n#### `columnSumCheck`\n\nThis check sums a column across all rows. 
If the sum applied to the `column` doesn't fall within the range specified by (`minValue`, `maxValue`) the check will fail.\n\n| Arg         | Type        | Description                                                            |\n|-------------|-------------|------------------------------------------------------------------------|\n| `column`    | String      | The column to be checked.                                              |\n| `minValue`  | NumericType | The lower bound of the sum.  Type depends on the type of the `column`. |\n| `maxValue`  | NumericType | The upper bound of the sum. Type depends on the type of the `column`.  |\n| `inclusive` | Boolean     | Include `minValue` and `maxValue` as part of the range.                |\n\n**Note:** If bounds are non-inclusive, and the actual sum is equal to one of the bounds, the relative error percentage will be undefined.\n\n#### `colstats`\n\nThis check generates column statistics about the specified column.\n\n| Arg         | Type        | Description                                |\n|-------------|-------------|--------------------------------------------|\n| `column`    | String      | The column on which to collect statistics. |\n\nThese keys and their corresponding values will appear in the check's JSON summary when using the JSON report output mode:\n\n| Key         | Type        | Description                                                                                                             |\n|-------------|-------------|-------------------------------------------------------------------------------------------------------------------------|\n| `count`     | Integer     | Count of non-null entries in the `column`.                                                                              |\n| `mean`      | Double      | Mean/Average of the values in the `column`.                                                                             
|\n| `min`       | Double      | Smallest value in the `column`.                                                                                         |\n| `max`       | Double      | Largest value in the `column`.                                                                                          |\n| `stdDev`    | Double      | Standard deviation of the values in the `column`.                                                                       |\n| `histogram` | Complex     | Summary of an equi-width histogram, counts of values appearing in 10 equally sized buckets over the range `[min, max]`. |\n\n## Example Config\n\n```yaml\n---\n\n# If keyColumns are not specified for a table, we take the first N columns of a table instead.\nnumKeyCols: 2\n\n# numErrorsToReport: Number of errors per check shown in \"Error Details\" of the report; this is to limit the size of the email.\nnumErrorsToReport: 5\n\n# detailedErrors: If true, a second pass will be made for checks that fail to gather numErrorsToReport examples with offending value and keyColumns to aid in debugging\ndetailedErrors: true\n\nvars:\n  - name: ENV\n    value: prod\n\n  - name: JAVA_DIR\n    env: JAVA_HOME\n\n  - name: TODAY\n    shell: date +\"%Y-%m-%d\"\n\n  - name: MAX_AGE\n    sql: SELECT max(age) FROM census_income.adult\n\noutputs:\n  - filename: /user/home/sample.json\n    append: true\n\n  - pipe: /path/to/program\n    ignoreError: true\n\nemail:\n  smtpHost: smtp.example.com\n  subject: Data Validation Summary\n  from: data-validator-no-reply@example.com\n  to:\n    - person1@example.com\n  cc:\n    - person2@example.com\n    - person3@example.com\n  bcc:\n    - person4@example.com\n\ntables:\n  - db: census_income\n    table: adult\n    # Key Columns are used when errors occur to identify a row, so they should include enough columns to uniquely identify a row.\n    keyColumns:\n      - age\n      - occupation\n    condition: educationNum \u003e= 5\n    checks:\n      # rowCount - checks if the 
number of rows is at least minNumRows\n      - type: rowCount\n        minNumRows: 50000\n\n      # negativeCheck - checks if any values are less than 0\n      - type: negativeCheck\n        column: age\n\n      # colstats - adds basic statistics of the column to the output\n      - type: colstats\n        column: age\n\n      # nullCheck - checks if the column is null, counts number of rows with null for this column.\n      - type: nullCheck\n        column: occupation\n\n      # stringLengthCheck - checks if the length of the string in the column falls within the specified range, counts number of rows in which the length of the string is outside the specified range.\n      - type: stringLengthCheck\n        column: occupation\n        minLength: 1\n        maxLength: 5\n\n      # stringRegexCheck - checks if the string in the column matches the pattern specified by `regex`, counts number of rows in which there is a mismatch.\n      - type: stringRegexCheck\n        column: occupation\n        regex: ^ENGINEER$ # matches the word ENGINEER\n\n      - type: stringRegexCheck\n        column: occupation\n        regex: \\w # matches any alphanumeric string\n```\n\n## Working with OOZIE Workflows\n\nThe data-validator can be used in an oozie workflow to halt the workflow if a check doesn't pass. There are 2 ways to use the data-validator in oozie, and each has its own drawbacks. The method is selected by the `--exitErrorOnFail {true|false}` command line option.\n\n### Setting ExitErrorOnFail to True\n\nThe first option, enabled by `--exitErrorOnFail=true`, is to have the data-validator exit with a non-zero value when a check fails. This enables the workflow to decide how it wants to handle a failed check/error.  The downside of this method is that you can never be sure whether the data-validator exited with an error because of a bad check or because there was a problem with the execution of the data-validator. 
This also pollutes the oozie workflow info with `ERROR`, which some might not like. This is currently the default but is likely to change in `v1.0.0`.

Example oozie wf snippet:

```xml
<action name="RunDataValidator">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>spark-submit</exec>
    <argument>--conf</argument>
    <argument>spark.yarn.maxAppAttempts=1</argument>
    <argument>--class</argument>
    <argument>com.target.data_validator.Main</argument>
    <argument>--master</argument>
    <argument>yarn</argument>
    <argument>--deploy-mode</argument>
    <argument>cluster</argument>
    <argument>--keytab</argument>
    <argument>${keytab}</argument>
    <argument>--principal</argument>
    <argument>${principal}</argument>
    <argument>--files</argument>
    <argument>config.yaml</argument>
    <argument>data-validator-assembly-0.14.1.jar</argument>
    <argument>--config</argument>
    <argument>config.yaml</argument>
    <argument>--exitErrorOnFail</argument>
    <argument>true</argument>
    <argument>--vars</argument>
    <argument>ENV=${ENV},EMAIL_REPORT=${EMAIL_REPORT},SMTP_HOST=${SMTP_HOST}</argument>
    <capture-output/>
  </shell>
  <ok to="ValidatorSuccess" />
  <error to="ValidatorErrorOrCheckFail" />
</action>

<action name="ValidatorErrorOrCheckFail">
  <!-- Check or data-validator failed -->
</action>

<action name="ValidatorSuccess">
  <!-- Everything is wonderful! -->
</action>
```

### Setting ExitErrorOnFail to False

The second option, enabled by `--exitErrorOnFail=false`, is to have the data-validator write `DATA_VALIDATOR_STATUS=PASS` or `DATA_VALIDATOR_STATUS=FAIL` to stdout and `System.exit(0)` when it completes. This enables the workflow to distinguish between a failed check and a runtime error.
The downside is that you must use the oozie shell action with the capture-output option, and run the validator via Spark's client mode. This will likely become the default behavior in `v1.0.0`.

Example oozie wf snippet:

```xml
<action name="RunDataValidator">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>spark-submit</exec>
    <argument>--conf</argument>
    <argument>spark.yarn.maxAppAttempts=1</argument>
    <argument>--class</argument>
    <argument>com.target.data_validator.Main</argument>
    <argument>--master</argument>
    <argument>yarn</argument>
    <argument>--deploy-mode</argument>
    <argument>client</argument>
    <argument>--keytab</argument>
    <argument>${keytab}</argument>
    <argument>--principal</argument>
    <argument>${principal}</argument>
    <argument>data-validator-assembly-0.14.1.jar</argument>
    <argument>--config</argument>
    <argument>config.yaml</argument>
    <argument>--exitErrorOnFail</argument>
    <argument>false</argument>
    <argument>--vars</argument>
    <argument>ENV=${ENV},EMAIL_REPORT=${EMAIL_REPORT},SMTP_HOST=${SMTP_HOST}</argument>
    <capture-output/>
  </shell>
  <ok to="ValidatorDecision" />
  <error to="ValidatorFailure" />
</action>

<decision name="ValidatorDecision">
  <switch>
    <case to="ValidatorCheckFail">${wf:actionData('RunDataValidator')['DATA_VALIDATOR_STATUS'] eq "FAIL"}</case>
    <case to="ValidatorCheckPass">${wf:actionData('RunDataValidator')['DATA_VALIDATOR_STATUS'] eq "PASS"}</case>
    <default to="ValidatorFailure"/>
  </switch>
</decision>

<action name="ValidatorCheckFail">
  <!-- Handle failed check -->
</action>

<action name="ValidatorCheckPass">
  <!-- Everything is wonderful! -->
</action>

<action name="ValidatorFailure">
  <!-- Notify devs of validator failure -->
</action>
```

## Other tools included

### Configuration parser check

`com.target.data_validator.ConfigParser` has an entrypoint that will check that the configuration file is parseable. It _does not_ validate variable substitutions, since those have runtime implications.

```shell
spark-submit \
  --class com.target.data_validator.ConfigParser \
  --files config.yml \
  data-validator-assembly-0.14.1.jar \
  config.yml
```

If there is an error, DV will print a message and exit non-zero.

## Development Tools

### Generate testing data with GenTestData or `sbt generateTestData`

Data Validator includes a tool to generate a sample `.orc` file for use in local development.
This repo's SBT configuration wraps the tool in a convenient SBT task: `sbt generateTestData`.
If you run this program or task, it will generate a file `testData.orc` in the current directory.
You can then use the following config file to test the `data-validator`.
It will generate a `report.json` and `report.html`.

```sh
spark-submit \
  --master "local[*]" \
  data-validator-assembly-0.14.1.jar \
  --config local_validators.yaml \
  --jsonReport report.json \
  --htmlReport report.html
```

#### `local_validators.yaml`

```yaml
---
numKeyCols: 2
numErrorsToReport: 5
detailedErrors: true

tables:
  - orcFile: testData.orc

    checks:
      - type: rowCount
        minNumRows: 1000

      - type: nullCheck
        column: nullCol
```

## History

This tool is based on methods described in _Methodology for Data Validation 1.0_ by Di Zio et al., published by the ESSnet ValiDat Foundation in 2016. You can [download the paper here](https://ec.europa.eu/eurostat/cros/system/files/methodology_for_data_validation_v1.0_rev-2016-06_final.pdf).
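The `DATA_VALIDATOR_STATUS` stdout contract described under "Setting ExitErrorOnFail to False" above is not oozie-specific; any wrapper that can see the validator's stdout (i.e., Spark client mode) can consume it. Below is a minimal, hypothetical shell sketch: the `parse_dv_status` helper name is ours, and the actual `spark-submit` invocation is shown only as a comment since it mirrors the snippets above.

```shell
# parse_dv_status: read validator stdout on stdin, print PASS or FAIL,
# and return non-zero when no status line is present (treat as runtime error).
parse_dv_status() {
  grep -o 'DATA_VALIDATOR_STATUS=[A-Z]*' | tail -n 1 | cut -d= -f2 | grep .
}

# In a real wrapper the input would come from spark-submit in client mode:
#   spark-submit --class com.target.data_validator.Main \
#     --master yarn --deploy-mode client \
#     data-validator-assembly-0.14.1.jar \
#     --config config.yaml --exitErrorOnFail false | parse_dv_status
# Here we simulate the validator's stdout instead:
status=$(printf 'INFO some spark log noise\nDATA_VALIDATOR_STATUS=FAIL\n' | parse_dv_status)

case "$status" in
  PASS) echo "checks passed" ;;
  FAIL) echo "checks failed" ;;  # a real workflow would exit 1 here
  *)    echo "validator runtime error" ;;
esac
```

This is the same three-way branching the oozie `<decision>` node performs, just expressed in plain shell.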