{"id":28404680,"url":"https://github.com/sodadata/soda-spark","last_synced_at":"2025-07-26T03:07:50.127Z","repository":{"id":37765844,"uuid":"401280781","full_name":"sodadata/soda-spark","owner":"sodadata","description":"Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes","archived":false,"fork":false,"pushed_at":"2022-06-22T20:19:51.000Z","size":121,"stargazers_count":64,"open_issues_count":6,"forks_count":8,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-06-14T07:57:16.260Z","etag":null,"topics":["data-engineering","data-observability","data-quality","data-testing","pyspark","python","soda-sql","spark"],"latest_commit_sha":null,"homepage":"https://docs.soda.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sodadata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-08-30T09:05:17.000Z","updated_at":"2025-06-14T00:43:36.000Z","dependencies_parsed_at":"2022-09-19T10:10:25.972Z","dependency_job_id":null,"html_url":"https://github.com/sodadata/soda-spark","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/sodadata/soda-spark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-spark/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sodadata","download_url":"https://codeload.github.com/sodadata/soda-spark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sodadata%2Fsoda-spark/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259782827,"owners_count":22910278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-observability","data-quality","data-testing","pyspark","python","soda-sql","spark"],"created_at":"2025-06-01T20:37:31.050Z","updated_at":"2025-06-27T12:32:15.950Z","avatar_url":"https://github.com/sodadata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\u003ch1\u003eSoda Spark\u003c/h1\u003e\u003cbr/\u003e\u003cb\u003eData testing, monitoring, and profiling for Spark Dataframes.\u003c/b\u003e\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/sodadata/soda-spark/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache%202-blue.svg\" alt=\"License: Apache 2.0\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://join.slack.com/t/soda-community/shared_invite/zt-m77gajo1-nXJF7JtbbRht2zwaiLb9pg\"\u003e\u003cimg alt=\"Slack\" src=\"https://img.shields.io/badge/chat-slack-green.svg\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/soda-spark/\"\u003e\u003cimg alt=\"Pypi Soda PARK\" src=\"https://img.shields.io/badge/pypi-soda%20spark-green.svg\"\u003e\u003c/a\u003e\n  \u003ca 
href=\"https://github.com/sodadata/soda-spark/actions/workflows/build.yml\"\u003e\u003cimg alt=\"Build soda-spark\" src=\"https://github.com/sodadata/soda-spark/actions/workflows/workflow.yml/badge.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\nSoda Spark is an extension of\n[Soda SQL](https://docs.soda.io/soda-sql/5_min_tutorial.html) that allows you to run Soda\nSQL functionality programmatically on a\n[Spark data frame](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.html).\n\nSoda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface \"bad\" data that you can fix to ensure that downstream analysts are using \"good\" data to make decisions.\n\n\n## Requirements\n\nSoda Spark has the same requirements as\n[`soda-sql-spark`](https://docs.soda.io/soda-sql/installation.html).\n\n## Install\n\nFrom your shell, execute the following command.\n\n``` sh\n$ pip install soda-spark\n```\n\n## Use\n\nFrom your Python prompt, execute the following commands.\n\n``` python\n\u003e\u003e\u003e from pyspark.sql import DataFrame, SparkSession\n\u003e\u003e\u003e from sodaspark import scan\n\u003e\u003e\u003e\n\u003e\u003e\u003e spark_session = SparkSession.builder.getOrCreate()\n\u003e\u003e\u003e\n\u003e\u003e\u003e id = \"a76824f0-50c0-11eb-8be8-88e9fe6293fd\"\n\u003e\u003e\u003e df = spark_session.createDataFrame([\n...\t   {\"id\": id, \"name\": \"Paula Landry\", \"size\": 3006},\n...\t   {\"id\": id, \"name\": \"Kevin Crawford\", \"size\": 7243}\n... ])\n\u003e\u003e\u003e\n\u003e\u003e\u003e scan_definition = (\"\"\"\n... table_name: demodata\n... metrics:\n... - row_count\n... - max\n... - min_length\n... tests:\n... - row_count \u003e 0\n... columns:\n...   id:\n...     valid_format: uuid\n...     tests:\n...     - invalid_percentage == 0\n... sql_metrics:\n... - sql: |\n...    
 SELECT sum(size) as total_size_us\n...     FROM demodata\n...     WHERE country = 'US'\n...   tests:\n...   - total_size_us \u003e 5000\n... \"\"\")\n\u003e\u003e\u003e scan_result = scan.execute(scan_definition, df)\n\u003e\u003e\u003e\n\u003e\u003e\u003e scan_result.measurements  # doctest: +ELLIPSIS\n[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]\n\u003e\u003e\u003e scan_result.test_results  # doctest: +ELLIPSIS\n[TestResult(test=Test(..., expression='row_count \u003e 0', ...), passed=True, skipped=False, ...)]\n\u003e\u003e\u003e\n```\n\nOr, use a [scan YAML](https://docs.soda.io/soda-sql/scan-yaml.html) file\n\n``` python\n\u003e\u003e\u003e scan_yml = \"static/demodata.yml\"\n\u003e\u003e\u003e scan_result = scan.execute(scan_yml, df)\n\u003e\u003e\u003e\n\u003e\u003e\u003e scan_result.measurements  # doctest: +ELLIPSIS\n[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]\n\u003e\u003e\u003e\n```\n\nSee the\n[scan result object](https://github.com/sodadata/soda-sql/blob/main/core/sodasql/scan/scan_result.py)\nfor all attributes and methods.\n\nOr, return Spark data frames:\n\n``` python\n\u003e\u003e\u003e measurements, test_results, errors = scan.execute(scan_yml, df, as_frames=True)\n\u003e\u003e\u003e\n\u003e\u003e\u003e measurements  # doctest: +ELLIPSIS\nDataFrame[metric: string, column_name: string, value: string, ...]\n\u003e\u003e\u003e test_results  # doctest: +ELLIPSIS\nDataFrame[test: struct\u003c...\u003e, passed: boolean, skipped: boolean, values: map\u003cstring,string\u003e, ...]\n\u003e\u003e\u003e\n```\n\nSee the `_to_data_frame` functions in the [`scan.py`](./src/sodaspark/scan.py)\nto see how the conversion is done.\n\n### Send results to Soda cloud\n\nSend the scan result to Soda cloud.\n\n``` python\n\u003e\u003e\u003e import os\n\u003e\u003e\u003e from sodasql.soda_server_client.soda_server_client import SodaServerClient\n\u003e\u003e\u003e\n\u003e\u003e\u003e soda_server_client 
= SodaServerClient(\n...     host=\"cloud.soda.io\",\n...     api_key_id=os.getenv(\"API_PUBLIC\"),\n...     api_key_secret=os.getenv(\"API_PRIVATE\"),\n... )\n\u003e\u003e\u003e scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)\n\u003e\u003e\u003e\n```\n\n## Understand\n\nUnder the hood, `soda-spark` does the following.\n\n1. Set up the scan\n   * Use the Spark dialect\n   * Use the [Spark session](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.html)\n     as the [warehouse](https://docs.soda.io/soda-sql/warehouse.html) connection\n2. Create (or replace) a\n   [global temporary view](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.createOrReplaceGlobalTempView.html)\n   for the Spark data frame\n3. Execute the scan on the temporary view\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodadata%2Fsoda-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsodadata%2Fsoda-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsodadata%2Fsoda-spark/lists"}