https://github.com/sodadata/soda-spark
Soda Spark is a PySpark library that helps you test your data in Spark DataFrames.
- Host: GitHub
- URL: https://github.com/sodadata/soda-spark
- Owner: sodadata
- License: apache-2.0
- Created: 2021-08-30T09:05:17.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-06-22T20:19:51.000Z (almost 4 years ago)
- Last Synced: 2025-06-14T07:57:16.260Z (10 months ago)
- Topics: data-engineering, data-observability, data-quality, data-testing, pyspark, python, soda-sql, spark
- Language: Python
- Homepage: https://docs.soda.io
- Size: 118 KB
- Stars: 64
- Watchers: 14
- Forks: 8
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Soda Spark
Data testing, monitoring, and profiling for Spark DataFrames.
Soda Spark is an extension of
[Soda SQL](https://docs.soda.io/soda-sql/5_min_tutorial.html) that allows you to run Soda
SQL functionality programmatically on a
[Spark data frame](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html).
Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.
## Requirements
Soda Spark has the same requirements as
[`soda-sql-spark`](https://docs.soda.io/soda-sql/installation.html).
## Install
From your shell, execute the following command.
``` sh
$ pip install soda-spark
```
## Use
From your Python prompt, execute the following commands.
``` python
>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...     {"id": id, "name": "Paula Landry", "size": 3006, "country": "US"},
...     {"id": id, "name": "Kevin Crawford", "size": 7243, "country": "US"},
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
... id:
... valid_format: uuid
... tests:
... - invalid_percentage == 0
... sql_metrics:
... - sql: |
... SELECT sum(size) as total_size_us
... FROM demodata
... WHERE country = 'US'
... tests:
... - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
>>>
```
Or, use a [scan YAML](https://docs.soda.io/soda-sql/scan-yaml.html) file:
``` python
>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>
```
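Here, `static/demodata.yml` is simply a scan definition saved to disk. If you do not have the repository's copy of the file, you could create an equivalent one from the inline definition above (an illustrative snippet; the shipped file may differ in detail):

``` python
>>> from pathlib import Path
>>>
>>> Path("static").mkdir(exist_ok=True)
>>> _ = Path("static/demodata.yml").write_text(scan_definition)
>>>
```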
See the
[scan result object](https://github.com/sodadata/soda-sql/blob/main/core/sodasql/scan/scan_result.py)
for all attributes and methods.
Or, return Spark data frames:
``` python
>>> measurements, test_results, errors = scan.execute(scan_yml, df, as_frames=True)
>>>
>>> measurements # doctest: +ELLIPSIS
DataFrame[metric: string, column_name: string, value: string, ...]
>>> test_results # doctest: +ELLIPSIS
DataFrame[test: struct<...>, passed: boolean, skipped: boolean, values: map, ...]
>>>
```
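Since these are ordinary Spark data frames, you can slice them with the usual DataFrame API. For example, to keep only the failed tests (a minimal sketch, relying on the `passed` column shown above; with the example data all tests pass):

``` python
>>> from pyspark.sql import functions as F
>>>
>>> # tests that did not pass
>>> failed_tests = test_results.filter(~F.col("passed"))
>>> failed_tests.count()
0
>>>
```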
See the `_to_data_frame` functions in [`scan.py`](./src/sodaspark/scan.py)
for how the conversion is done.
### Send results to Soda Cloud
Send the scan result to Soda Cloud:
``` python
>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
... host="cloud.soda.io",
... api_key_id=os.getenv("API_PUBLIC"),
... api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>
```
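Here `API_PUBLIC` and `API_PRIVATE` are environment variables holding the API key ID and secret of your Soda Cloud account; the names are only a convention from this example, and any other way of supplying the credentials works as well.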
## Understand
Under the hood, `soda-spark` does the following:
1. Set up the scan
* Use the Spark dialect
* Use [Spark session](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html)
as [warehouse](https://docs.soda.io/soda-sql/warehouse.html) connection
2. Create (or replace)
[global temporary view](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.createOrReplaceGlobalTempView.html)
for the Spark data frame
3. Execute the scan on the temporary view (sketched below)
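For intuition, steps 2 and 3 boil down to something like the following plain PySpark (the test queries themselves are generated by Soda SQL; this only illustrates the temporary-view mechanics, reusing `df` and `spark_session` from the examples above):

``` python
>>> df.createOrReplaceGlobalTempView("demodata")
>>> # global temporary views live in the reserved `global_temp` database
>>> spark_session.sql("SELECT count(*) AS row_count FROM global_temp.demodata").first().row_count
2
>>>
```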