Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/samueleresca/deequ.net
deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
https://github.com/samueleresca/deequ.net
apache-spark bigdata deequ dotnet spark
Last synced: 2 months ago
JSON representation
deequ.NET is a port of the awslabs/deequ library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
- Host: GitHub
- URL: https://github.com/samueleresca/deequ.net
- Owner: samueleresca
- Created: 2020-06-06T18:22:09.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2024-07-09T21:24:19.000Z (6 months ago)
- Last Synced: 2024-11-01T07:08:38.003Z (3 months ago)
- Topics: apache-spark, bigdata, deequ, dotnet, spark
- Language: C#
- Homepage:
- Size: 6.44 MB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 9
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# deequ.NET
[![deequ.NET](https://github.com/samueleresca/deequ.net/workflows/deequ.NET/badge.svg)](https://github.com/samuelereca/deeuqu.NET)
[![codecov](https://codecov.io/gh/samueleresca/deequ.net/branch/master/graph/badge.svg)](https://codecov.io/gh/samueleresca/deequ.net)
[![Nuget](https://img.shields.io/nuget/vpre/deequ)](https://www.nuget.org/packages/deequ)
[![NuGet](https://img.shields.io/nuget/dt/deequ)](https://www.nuget.org/packages/deequ)⚠️*Warning*: The library is still in alpha, and it is not fully tested.
**deequ.NET** is a port of the [awslabs/deequ](https://github.com/awslabs/deequ) library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
deequ.NET runs on [dotnet/spark](https://github.com/dotnet/spark).## Requirements and Installation
deequ.NET runs on Apache Spark and depends on [dotnet/spark](https://github.com/dotnet/spark). Therefore it is required to install the following dependencies locally:
- [.NET Core 3.1 SDK](https://dotnet.microsoft.com/download/dotnet-core/3.1)
- [OpenJDK 8](https://openjdk.java.net/install/)
- [Apache Spark 2.3 +](https://archive.apache.org/dist/spark/)It is also necessary to install the [Microsoft.Spark.Worker](https://github.com/dotnet/spark/releases) on your local machine and configure the path into the `PATH` env var.
For a detailed instructions, see [dotnet/spark - Getting started](https://github.com/dotnet/spark/tree/master/docs/getting-started)## Usage
The following example implements a set of checks on some records and it submits the execution using the `spark-submit` command.
- Use the `dotnet` CLI to create a console application:
```shell
dotnet new console -o DeequExample
```
- Install `Microsoft.Spark` and the `deequ` Nuget packages into the project:```shell
cd DeequExampledotnet add package Microsoft.Spark
dotnet add package deequ
```
- Replace the contents of the `Program.cs` file with the following code:```csharp
using deequ;
using deequ.Checks;
using deequ.Extensions;
using Microsoft.Spark.Sql;namespace DeequExample
{
class Program
{
static void Main(string[] args)
{
SparkSession spark = SparkSession.Builder().GetOrCreate();
DataFrame data = spark.Read().Json("inventory.json");data.Show();
VerificationResult verificationResult = new VerificationSuite()
.OnData(data)
.AddCheck(
new Check(CheckLevel.Error, "integrity checks")
.HasSize(value => value == 5)
.IsComplete("id")
.IsUnique("id")
.IsComplete("productName")
.IsContainedIn("priority", new[] { "high", "low" })
.IsNonNegative("numViews")
)
.AddCheck(
new Check(CheckLevel.Warning, "distribution checks")
.ContainsURL("description", value => value >= .5)
)
.Run();verificationResult.Debug();
}
}
}
```
- Use the `dotnet` CLI to build the application:```shell
dotnet build
```### Running the example
- Open your terminal and navigate into your app folder.
```shell
cd
```
- Create `inventory.json` with the following content:```json
{"id":1, "productName":"Thingy A", "description":"awesome thing. http://thingb.com", "priority":"high", "numViews":0}
{"id":2, "productName":"Thingy B", "description":"available at http://thingb.com","priority":null, "numViews":0}
{"id":3, "productName":"Thingy C", "description": null, "priority":"low", "numViews":5}
{"id":4, "productName":"Thingy D", "description": "checkout https://thingd.ca", "priority":"low","numViews": 10}
{"id":5, "productName":"Thingy E", "description":null, "priority":"high","numViews": 12}
```
- Run your app.```shell
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master local \
microsoft-spark-2.4.x-.jar \
dotnet DeequExample.dll
```
**Note**: This command requires Apache Spark in your PATH environment variable to be able to use `spark-submit`. For detailed instructions, you can see [Building .NET for Apache Spark from Source on Ubuntu](../building/ubuntu-instructions.md).
- The output of the application should look similar to the output below:```text
_ _ _ ______ _______
| | | \ | | ____|__ __|
__| | ___ ___ __ _ _ _ | \| | |__ | |
/ _` |/ _ \/ _ \/ _` | | | | | . ` | __| | |
| (_| | __/ __/ (_| | |_| |_| |\ | |____ | |
\__,_|\___|\___|\__, |\__,_(_)_| \_|______| |_|
| |
|_|Success
```## More examples
The following list shows more examples/showcases of the deequ.NET API:
- [Incremental metrics computation](examples/examples/incremental_metrics_example.md)
- [Anomaly detection](examples/examples/anomaly_detection.md)## Credits
- [awslabs/deequ](https://github.com/awslabs/deequ)
- [dotnet/spark](https://github.com/dotnet/spark)
- [akkadotnet/akka.net - Scala to C# conversion guide](https://github.com/akkadotnet/akka.net/wiki/Scala-to-C%23-Conversion-Guide)## Citation
Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. [Automating large-scale data quality verification](http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf). Proc. VLDB Endow. 11, 12 (August 2018), 1781-1794.