{"id":24998975,"url":"https://github.com/tomasfarias/spark-test","last_synced_at":"2025-04-12T07:53:29.720Z","repository":{"id":57469611,"uuid":"194975523","full_name":"tomasfarias/spark-test","owner":"tomasfarias","description":"A collection of assertion functions to test Spark Collections like DataFrames!","archived":false,"fork":false,"pushed_at":"2020-09-25T20:21:43.000Z","size":35,"stargazers_count":3,"open_issues_count":2,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-04-12T07:53:24.364Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomasfarias.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-07-03T03:48:51.000Z","updated_at":"2020-09-25T19:37:11.000Z","dependencies_parsed_at":"2022-09-19T09:50:21.652Z","dependency_job_id":null,"html_url":"https://github.com/tomasfarias/spark-test","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomasfarias%2Fspark-test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomasfarias%2Fspark-test/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomasfarias%2Fspark-test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomasfarias%2Fspark-test/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomasfarias","download_url":"https://codeload.github.com/tomasfarias/spark-test/tar.gz/refs/heads/master","host":{"name":"GitHu
b","url":"https://github.com","kind":"github","repositories_count":248537038,"owners_count":21120691,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-04T18:52:38.550Z","updated_at":"2025-04-12T07:53:29.697Z","avatar_url":"https://github.com/tomasfarias.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"spark-test\n==========\n\n.. image:: https://travis-ci.com/tomasfarias/spark-test.svg?branch=master\n    :target: https://travis-ci.com/tomasfarias/spark-test\n\nA collection of assertion functions to test Spark Collections like DataFrames\n\nMotivation\n----------\n\nAs you develop Spark applications, you can eventually end up writing methods that apply transformations over Spark DataFrames. 
In order to test the results, you can create ``pandas`` DataFrames and use the test functions provided by ``pandas`` as ``pyspark`` does not provide any functions to assist with testing.\n\n``spark-test`` provides testing functions similar to ``pandas`` but geared towards Spark Collections.\n\nLet's say you have a function to apply some transformations on a Spark DataFrame (the full code for this example can be found in tests/test_example.py):\n\n::\n\n  def transform(df):\n      \"\"\"\n      Fill nulls with 0, add 10 to the Age column and only return distinct rows\n      \"\"\"\n\n      df = df.na.fill(0)\n      df = df.withColumn('Age', df['Age'] + 10)\n      df = df.distinct()\n\n      return df\n\nWe can then write a test case with as many test inputs as we need and test the results with ``assert_dataframe_equal``:\n\n::\n\n  from spark_test.testing import assert_dataframe_equal\n\n\n  def test_transform(spark, transform):\n\n      input_df = spark.createDataFrame(\n          [['Tom', 25], ['Tom', 25], ['Charlie', 24], ['Dan', None]],\n          schema=['Name', 'Age']\n      )\n\n      expected = spark.createDataFrame(\n          [['Tom', 35], ['Charlie', 34], ['Dan', 0]],\n          schema=['Name', 'Age']\n      )\n      result = transform(input_df)\n\n      assert_dataframe_equal(expected, result)\n\nOf course, tests are more interesting when they fail, so let's introduce a bug in our ``transform`` function:\n\n::\n\n  def bugged_transform(df):\n      \"\"\"\n      Fill nulls with 0, add 10 to the Age column and only return distinct rows\n      \"\"\"\n\n      df = df.na.fill(1)  # Whoops! 
Should be 0!\n      df = df.withColumn('Age', df['Age'] + 10)\n      df = df.distinct()\n\n      return df\n\nPassing both functions to our test using ``pytest.mark.parametrize`` yields the following output with a nice message on what failed:\n\n::\n\n  $ pytest tests/example.py\n  ============================= test session starts =============================\n  platform linux -- Python 3.7.3, pytest-5.0.0, py-1.8.0, pluggy-0.12.0\n  rootdir: /home/tfarias/repos/spark-test\n  collected 2 items\n\n  tests/example.py .F                                                [100%]\n\n  ================================== FAILURES ===================================\n  _______________________ test_transform[bugged_transform] ________________________\n\n              assert left_d[key] == right_d[key], msg.format(\n  \u003e               field=key, l_value=left_d[key], r_value=right_d[key]\n              )\n  E           AssertionError: Values for Age do not match:\n  E           Left=10\n  E           Right=11\n\n\nLicense\n-------\n\nDistributed under the MIT License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomasfarias%2Fspark-test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomasfarias%2Fspark-test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomasfarias%2Fspark-test/lists"}