https://github.com/pilillo/gilberto
- Host: GitHub
- URL: https://github.com/pilillo/gilberto
- Owner: pilillo
- Created: 2021-03-13T10:30:49.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-10-18T13:07:56.000Z (over 4 years ago)
- Last Synced: 2024-06-12T17:55:20.037Z (over 1 year ago)
- Topics: apache-spark, data-quality, data-quality-monitoring, dataops, k8s, scala, spark
- Language: Shell
- Homepage:
- Size: 199 KB
- Stars: 3
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Gilberto
A nice guy you may rely on for your Data Quality.

## Pipelines
Gilberto uses [AWS Deequ](https://github.com/awslabs/deequ) and [Apache Spark](https://spark.apache.org/) at its core.
Specifically, you can instantiate:
* `profile` to perform profiling of any Spark DataFrame, loaded from a Hive table or a path
* `suggest` to perform constraint suggestion based on the distribution of the selected data
* `validate` to perform a data quality validation step, based on an input ***check*** file
* `detect` to perform anomaly detection, based on an input ***strategy*** file

Gilberto is meant to run as a step within a workflow manager (e.g. with the workflow failing in case of data inconsistencies), pulling data from a remote Hadoop/Hive cluster or an S3/Presto data lake and pushing metrics to specific MetricsRepositories.
## Usage
The input arguments are handled using [scopt](https://github.com/scopt/scopt). Gilberto expects a data source (either a Hive table or a path), an action (or pipeline) to perform on the data, a time interval `(from, to)`, and a destination file path.
For the validator, Gilberto also expects a repository target (an endpoint or a file path), along with a code config file specifying the checks to perform on the source data.
For the profiler and the suggester, you may also specify a list of columns used to partition the resulting DataFrame,
such as `PROC_YEAR,PROC_MONTH,PROC_DAY` to partition by processing date, or `START_*` and `END_*` columns for the beginning and end of the selected date interval.
Specifically:
```bash
-a, --action action              Action is the pipeline to be run
-s, --source source              Source defines where to load data from
-d, --destination destination    Destination defines where to save results
-f, --from date                  Beginning of the time interval (default as yyyy-MM-dd)
-t, --to date                    End of the time interval (default as yyyy-MM-dd)
-r, --repository target          Target folder or endpoint of the repository
-c, --code-config-path path      Path of the file containing the checks to instruct the validator
-p, --partition-by columns       Columns to use to partition the resulting dataframe
```
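As a sketch of how these flags fit together, here is a hypothetical `validate` invocation; the jar name, table and paths are placeholders (not taken from the repository), and the actual launch procedure is described in the deployment guides:

```bash
# Hypothetical example: jar name, table and paths are placeholders.
spark-submit gilberto-assembly.jar \
  --action validate \
  --source my_db.my_table \
  --from 2021-01-01 --to 2021-01-31 \
  --destination hdfs:///dq/results \
  --repository hdfs:///dq/metrics \
  --code-config-path hdfs:///dq/checks.gibo
```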
Here is an example check file `checks.gibo` for the `validate` step:
```scala
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

Seq(
  Check(CheckLevel.Error, "data testing with error level")
    .hasSize(_ > 0)
    .hasMin("numViews", _ > 12)
)
```
The file is interpreted as Scala code using reflection and applied as Checks (see [`addChecks`](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/VerificationRunBuilder.scala#L86)) on a Validator instance.
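For illustration only, a check file may combine several of Deequ's standard constraints in the same way; the column names below (`id`, `numViews`) are hypothetical and should be adapted to the actual schema:

```scala
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// Hypothetical columns: adapt `id` and `numViews` to the actual schema.
Seq(
  Check(CheckLevel.Error, "integrity checks")
    .isComplete("id")            // no NULL values in the id column
    .isUnique("id")              // id acts as a unique key
    .isNonNegative("numViews"),  // counts cannot be negative
  Check(CheckLevel.Warning, "distribution checks")
    .hasMax("numViews", _ < 1000000)
)
```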
Similarly, here is an example strategy file `strategy.gibo` for the `detect` step:
```scala
import com.amazon.deequ.anomalydetection._
import com.amazon.deequ.analyzers._

(
  RelativeRateOfChangeStrategy(maxRateIncrease = Some(2.0)),
  Size()
)
```
which is interpreted as a tuple of type `(AnomalyDetectionStrategy, Analyzer[S, DoubleMetric])` and applied to a Detector instance (see [anomaly detection example](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/anomaly_detection_example.md)).
## Metrics repositories
Deequ provides an [InMemoryMetricsRepository](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/repository/memory/InMemoryMetricsRepository.scala) and a [FileSystemMetricsRepository](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/repository/fs/FileSystemMetricsRepository.scala).
The former is basically a concurrent hash map, while the latter is a connector that writes all metrics to a single JSON file (e.g. `metrics.json`) on HDFS or S3.
This has clear drawbacks: above all, writing all metrics to a single file blob does not scale and does not allow querying from Presto and similar engines.
We provide the following metrics repositories:
* [`MastroMetricsRepository`](https://github.com/pilillo/gilberto/tree/master/src/main/scala/com/amazon/deequ/repository/mastro), pushing metrics to the [Mastro](https://github.com/data-mill-cloud/mastro) catalogue and metrics repo, thereby enabling lineage tracking and source discovery;
* [`QuerableMetricsRepository`](https://github.com/pilillo/gilberto/tree/master/src/main/scala/com/amazon/deequ/repository/querable), based on the existing `FileSystemMetricsRepository` but writing DataFrames as partitioned Parquet files instead of a single JSON file.
## Deployment
* [YARN](YARN_DEPLOY.md)
* [K8s](K8S_DEPLOY.md)