Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.

Data Quality Monitoring Tool
https://github.com/piotr-kalanski/data-quality-monitoring
- Host: GitHub
- URL: https://github.com/piotr-kalanski/data-quality-monitoring
- Owner: piotr-kalanski
- License: apache-2.0
- Created: 2017-08-30T11:48:51.000Z (over 7 years ago)
- Default Branch: development
- Last Pushed: 2017-12-05T09:48:40.000Z (about 7 years ago)
- Last Synced: 2024-10-11T04:47:27.688Z (4 months ago)
- Topics: data-quality, monitoring, scala, spark
- Language: Scala
- Size: 116 KB
- Stars: 16
- Watchers: 2
- Forks: 4
- Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-data-quality - dqm - another data quality monitoring tool implemented using Spark. (Table of Contents / Frameworks and Libraries)
README
# data-quality-monitoring
Data Quality Monitoring Tool for Big Data implemented using Spark

![Build Status](https://api.travis-ci.org/piotr-kalanski/data-quality-monitoring.png?branch=development)
![codecov](http://codecov.io/github/piotr-kalanski/data-quality-monitoring/coverage.svg?branch=development)
[Maven Central](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22data-quality-monitoring_2.11%22)
[License: Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0.txt)

# Table of contents
- [Goals](#goals)
- [Getting started](#getting-started)
- [Data quality monitoring process](#data-quality-monitoring-process)
- [Load configuration](#load-configuration)
- [Example configuration](#example-configuration)
- [Load configuration from file](#load-configuration-from-file)
- [Load configuration from directory](#load-configuration-from-directory)
- [Load configuration from database](#load-configuration-from-database)
- [Validation rules](#validation-rules)
- [Field rules](#field-rules)
- [Group rules](#group-rules)
- [Table trend rules](#table-trend-rules)
- [Log validation results](#log-validation-results)
- [Send alerts](#send-alerts)
- [Full example](#full-example)

# Goals
- Validate data using provided business rules
- Log result
- Send alerts

# Getting started
Include dependency:
```scala
"com.github.piotr-kalanski" % "data-quality-monitoring_2.11" % "0.3.2"
```
or:
```xml
<dependency>
    <groupId>com.github.piotr-kalanski</groupId>
    <artifactId>data-quality-monitoring_2.11</artifactId>
    <version>0.3.2</version>
</dependency>
```
# Data quality monitoring process
The data quality monitoring process consists of the following steps:
- Load configuration with business rules
- Run data validation
- Log validation results
- Send alerts

## Load configuration
Configuration can be loaded from:
- file
- directory
- RDBMS

Additionally, there are plans to support:
- DynamoDB

### Example configuration
```
tablesConfiguration = [
  {
    location = {type = Hive, table = clients}, // location of the first table that should be validated
    rules = { // validation rules
      rowRules = [ // validation rules working on single row level
        {
          field = client_id, // name of the field that should be validated
          rules = [
            {type = NotNull}, // this field shouldn't be null
            {type = min, value = 0} // minimum value for this field is 0
          ]
        },
        {
          field = client_name,
          rules = [
            {type = NotNull} // this field shouldn't be null
          ]
        }
      ]
    }
  },
  {
    location = {type = Hive, table = companies}, // location of the second table that should be validated
    rules = {
      rowRules = [
        {
          field = company_id, // name of the field that should be validated
          rules = [
            {type = NotNull}, // this field shouldn't be null
            {type = max, value = 100} // maximum value for this field is 100
          ]
        },
        {
          field = company_name,
          rules = [
            {type = NotNull} // this field shouldn't be null
          ]
        }
      ]
    }
  }
]
```

### Load configuration from file
Use class `FileSingleTableConfigurationLoader` or `FileMultipleTablesConfigurationLoader`.
Example:
```scala
import com.datawizards.dqm.configuration.loader.FileMultipleTablesConfigurationLoader
val configurationLoader = new FileMultipleTablesConfigurationLoader("configuration.conf")
configurationLoader.loadConfiguration()
```

### Load configuration from directory
Use class: `DirectoryConfigurationLoader`.
Each file should contain the configuration for one table (`TableConfiguration`).
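A minimal sketch, assuming `DirectoryConfigurationLoader` takes the directory path as its single constructor argument, mirroring the file loader above (the path is a placeholder):
```scala
import com.datawizards.dqm.configuration.loader.DirectoryConfigurationLoader

// Assumption: the constructor takes the directory path,
// analogous to FileMultipleTablesConfigurationLoader above.
val configurationLoader = new DirectoryConfigurationLoader("conf/tables")
configurationLoader.loadConfiguration()
```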
### Load configuration from database
Use class: `DatabaseConfigurationLoader`.
Each table row should contain the configuration for one table (`TableConfiguration`).
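A hedged sketch, assuming the loader takes JDBC connection details plus the name of the configuration table; all parameter names below are assumptions, so check the actual constructor in the library sources:
```scala
import java.util.Properties
import com.datawizards.dqm.configuration.loader.DatabaseConfigurationLoader

// All constructor parameters below are assumed by analogy with
// DatabaseValidationResultLogger; the real signature may differ.
val connectionProperties = new Properties()
connectionProperties.setProperty("user", "dqm")
connectionProperties.setProperty("password", "secret")

val configurationLoader = new DatabaseConfigurationLoader(
  driverClassName = "org.h2.Driver",
  dbUrl = "jdbc:h2:mem:dqm",
  connectionProperties = connectionProperties,
  configurationTableName = "DQM_CONFIGURATION" // hypothetical table name
)
configurationLoader.loadConfiguration()
```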
## Validation rules
Currently supported categories of data validation rules:
- field rules - validating value of single field e.g.: not null, min value, max value
- group rules - validating result of group by expression e.g.: expected groups (countries, types)
- table trend rules - validating trends over time, e.g.: comparing current day row count vs previous day row count

### Field rules
Field rules should be defined in section `rules.rowRules`:
```
tablesConfiguration = [
  {
    location = [...],
    rules = {
      rowRules = [
        {
          field = Field name,
          rules = [...]
        }
      ]
    }
  }
]
```

Supported field validation rules (combined in the sketch below):
- not null: `{type = NotNull}`
- dictionary: `{type = dict, values = [1,2,3]}`
- regex: `{type = regex, value = """\s.*"""}`
- min value: `{type = min, value = 0}`
- max value: `{type = max, value = 100}`
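For instance, several of these rules can be attached to one field; a sketch in the configuration format shown above (table and field names are illustrative):
```
tablesConfiguration = [
  {
    location = {type = Hive, table = clients},
    rules = {
      rowRules = [
        {
          field = country_code, // illustrative field name
          rules = [
            {type = NotNull}, // value must be present
            {type = dict, values = [PL, DE, FR]}, // value must come from the dictionary
            {type = regex, value = """[A-Z]{2}"""} // value must match the pattern
          ]
        }
      ]
    }
  }
]
```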
### Group rules
Group rules should be defined in section `groups.rules`:
```
tablesConfiguration = [
  {
    location = [...],
    rules = [...],
    groups = [
      {
        name = Group name,
        field = Group by field name,
        rules = [
          {
            type = NotEmptyGroups,
            expectedGroups = [c1,c2,c3,c4]
          }
        ]
      }
    ]
  }
]
```

Supported group validation rules:
- not empty groups: `{type = NotEmptyGroups, expectedGroups = [c1,c2,c3,c4]}` (see the sketch below)
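Reading the rule from its name (an assumption, not the library's code): it presumably reports every expected group that ends up with no rows after grouping by the configured field, roughly:
```scala
// Illustrative sketch of the assumed semantics of NotEmptyGroups;
// the library's actual implementation may differ.
def emptyExpectedGroups(actualGroups: Set[String], expectedGroups: Seq[String]): Seq[String] =
  expectedGroups.filterNot(actualGroups.contains) // expected groups that have no rows
```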
### Table trend rules
Table trend rules should be defined in section `rules.tableTrendRules`:
```
tablesConfiguration = [
  {
    location = [...],
    rules = {
      rowRules = [...],
      tableTrendRules = [
        {type = CurrentVsPreviousDayRowCountIncrease, tresholdPercentage = 20}
      ]
    }
  }
]
```

Supported table trend validation rules:
- current vs previous day row count: `{type = CurrentVsPreviousDayRowCountIncrease, tresholdPercentage = 20}` (see the sketch below)
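Reading the rule from its name and parameter (an assumption, not the library's code): it presumably flags the table when the row count changes by more than `tresholdPercentage` relative to the previous day, roughly:
```scala
// Illustrative sketch of the assumed semantics of CurrentVsPreviousDayRowCountIncrease;
// the library's actual implementation may differ.
def rowCountTrendViolated(currentDayRowCount: Long,
                          previousDayRowCount: Long,
                          tresholdPercentage: Int): Boolean = {
  val changePercentage =
    math.abs(currentDayRowCount - previousDayRowCount) * 100.0 / previousDayRowCount
  changePercentage > tresholdPercentage
}
```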
## Log validation results
Validation results can be logged to:
- Elasticsearch using class `ElasticsearchValidationResultLogger`:

```scala
val logger = new ElasticsearchValidationResultLogger(
  esUrl = "http://localhost:9200", // Elasticsearch URL
  invalidRecordsIndexName = "invalid_records", // Index name where to store invalid records
  tableStatisticsIndexName = "table_statistics", // Index name where to store table statistics
  columnStatisticsIndexName = "column_statistics", // Index name where to store column statistics
  groupsStatisticsIndexName = "group_statistics", // Index name where to store group statistics
  invalidGroupsIndexName = "invalid_groups" // Index name where to store invalid groups
)
```
- RDBMS using class `DatabaseValidationResultLogger`:

```scala
val logger = new DatabaseValidationResultLogger(
  driverClassName = "org.h2.Driver", // JDBC driver class name
  dbUrl = connectionString, // DB connection string
  connectionProperties = new Properties(), // JDBC connection properties, especially user and password
  invalidRecordsTableName = "INVALID_RECORDS", // name of table where to insert invalid records
  tableStatisticsTableName = "TABLE_STATISTICS", // name of table where to insert table statistics records
  columnStatisticsTableName = "COLUMN_STATISTICS", // name of table where to insert column statistics records
  groupsStatisticsTableName = "GROUP_STATISTICS", // name of table where to insert group by statistics records
  invalidGroupsTableName = "INVALID_GROUPS" // name of table where to insert invalid groups
)
```

## Send alerts
Alerts can be sent to:
- Slack using class `SlackAlertSender` (see the sketch below)

Additionally, there are plans to support other alert channels.
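A minimal usage sketch for `SlackAlertSender`, based on the constructor arguments shown in the full example below (webhook URL, Slack channel, and user name; the values here are placeholders):
```scala
import com.datawizards.dqm.alert.SlackAlertSender

// Arguments: Slack webhook URL, target channel, sender user name (placeholder values)
val alertSender = new SlackAlertSender(
  "https://hooks.slack.com/services/...",
  "#data-quality",
  "dqm-bot"
)
```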
# Full example
```scala
import com.datawizards.dqm.configuration.loader.FileConfigurationLoader
import com.datawizards.dqm.logger.ElasticsearchValidationResultLogger
import com.datawizards.dqm.alert.SlackAlertSender
import com.datawizards.dqm.DataQualityMonitor

val configurationLoader = new FileConfigurationLoader("configuration.conf")
val esUrl = "http://localhost:9200"
val invalidRecordsIndexName = "invalid_records"
val tableStatisticsIndexName = "table_statistics"
val columnStatisticsIndexName = "column_statistics"
val groupsStatisticsIndexName = "group_statistics"
val invalidGroupsIndexName = "invalid_groups"
val logger = new ElasticsearchValidationResultLogger(
  esUrl,
  invalidRecordsIndexName,
  tableStatisticsIndexName,
  columnStatisticsIndexName,
  groupsStatisticsIndexName,
  invalidGroupsIndexName
)
val alertSender = new SlackAlertSender("webhook url", "Slack channel", "Slack user name")
val processingDate = new java.util.Date()
DataQualityMonitor.run(processingDate, configurationLoader, logger, alertSender)
```

configuration.conf:
```
tablesConfiguration = [
  {
    location = {type = Hive, table = clients},
    rules = {
      rowRules = [
        {
          field = client_id,
          rules = [
            {type = NotNull}
          ]
        }
      ]
    }
  }
]
```