{"id":13795378,"url":"https://github.com/mrpowers-io/spark-fast-tests","last_synced_at":"2025-12-16T00:41:57.375Z","repository":{"id":37706560,"uuid":"87477549","full_name":"mrpowers-io/spark-fast-tests","owner":"mrpowers-io","description":"Apache Spark testing helpers (dependency free \u0026 works with Scalatest, uTest, and MUnit)","archived":false,"fork":false,"pushed_at":"2025-04-06T11:41:46.000Z","size":2291,"stargazers_count":443,"open_issues_count":29,"forks_count":78,"subscribers_count":15,"default_branch":"main","last_synced_at":"2025-04-07T21:14:45.747Z","etag":null,"topics":["spark","testing-framework"],"latest_commit_sha":null,"homepage":"https://mrpowers-io.github.io/spark-fast-tests/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrpowers-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-06T21:40:52.000Z","updated_at":"2025-04-06T11:41:50.000Z","dependencies_parsed_at":"2024-08-10T09:15:45.393Z","dependency_job_id":"d6d05567-5187-4535-9a07-f0ec0a82be89","html_url":"https://github.com/mrpowers-io/spark-fast-tests","commit_stats":{"total_commits":320,"total_committers":20,"mean_commits":16.0,"dds":0.328125,"last_synced_commit":"cffd91a9691de2f5603d8ad6ba0e5230c65e1f99"},"previous_names":["mrpowers-io/spark-fast-tests","mrpowers/spark-fast-tests"],"tags_count":46,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fspark-fast-tests","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/mrpowers-io%2Fspark-fast-tests/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fspark-fast-tests/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrpowers-io%2Fspark-fast-tests/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrpowers-io","download_url":"https://codeload.github.com/mrpowers-io/spark-fast-tests/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254374475,"owners_count":22060611,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["spark","testing-framework"],"created_at":"2024-08-03T23:00:55.338Z","updated_at":"2025-12-16T00:41:57.312Z","avatar_url":"https://github.com/mrpowers-io.png","language":"Scala","funding_links":[],"categories":["Packages"],"sub_categories":["Testing"],"readme":"# Spark Fast Tests\n\n[![CI](https://github.com/MrPowers/spark-fast-tests/actions/workflows/ci.yml/badge.svg)](https://github.com/MrPowers/spark-fast-tests/actions/workflows/ci.yml)\n\nA fast Apache Spark testing helper library with beautifully formatted error messages!  Works\nwith [scalatest](https://github.com/scalatest/scalatest), [uTest](https://github.com/lihaoyi/utest),\nand [munit](https://github.com/scalameta/munit).\n\nUse [chispa](https://github.com/MrPowers/chispa) for PySpark applications.\n\nRead [Testing Spark Applications](https://leanpub.com/testing-spark) for a full explanation on the best way to test\nSpark code!  
Good test suites yield higher-quality codebases that are easy to refactor.\n\n## Table of Contents\n- [Install](#install)\n- [Examples](#simple-examples)\n- [Why is this library fast?](#why-is-this-library-fast)\n- [Usage](#usage)\n  - [Local Testing SparkSession](#local-sparksession-for-test)\n  - [DataFrameComparer](#datasetcomparer--dataframecomparer)\n    - [Unordered DataFrames comparison](#unordered-dataframe-equality-comparisons)\n    - [Approximate DataFrames comparison](#approximate-dataframe-equality)\n    - [Ignore Nullable DataFrames comparison](#equality-comparisons-ignoring-the-nullable-flag)\n  - [ColumnComparer](#column-equality)\n  - [SchemaComparer](#schema-equality)\n- [Testing tips](#testing-tips)\n\n\n## Install\n\nFetch the JAR file from Maven.\n\n```scala\n// for Spark 3\nlibraryDependencies += \"com.github.mrpowers\" %% \"spark-fast-tests\" % \"1.3.0\" % \"test\"\n```\n\n**Important: Future versions of spark-fast-tests will no longer support Spark 2.x. We recommend upgrading to Spark 3.x to\nensure compatibility with upcoming releases.**\n\nHere are links to the releases for different Scala versions:\n\n* [Scala 2.11 JAR files](https://repo1.maven.org/maven2/com/github/mrpowers/spark-fast-tests_2.11/)\n* [Scala 2.12 JAR files](https://repo1.maven.org/maven2/com/github/mrpowers/spark-fast-tests_2.12/)\n* [Scala 2.13 JAR files](https://repo1.maven.org/maven2/com/github/mrpowers/spark-fast-tests_2.13/)\n* [Legacy JAR files in Maven](https://mvnrepository.com/artifact/MrPowers/spark-fast-tests?repo=spark-packages).\n\nYou should use Scala 2.11 with Spark 2 and Scala 2.12 / 2.13 with Spark 3.\n\n## Simple examples\n\nThe `assertSmallDataFrameEquality` method can be used to compare two DataFrames.\n\n```scala\nval sourceDF = Seq(\n  (1),\n  (5)\n).toDF(\"number\")\n\nval expectedDF = Seq(\n  (1),\n  (3)\n).toDF(\"number\")\n\nassertSmallDataFrameEquality(sourceDF, expectedDF)\n```\n\n\u003cp\u003e\n    \u003cimg 
src=\"./images/assertSmallDataFrameEquality_DatasetContentMissmatch_message.png\" alt=\"assertSmallDataFrameEquality_DatasetContentMissmatch_message\" width=\"500\", height=\"200\"/\u003e\n\u003c/p\u003e\n\nThe `assertSmallDatasetEquality` method can be used to compare two Datasets or DataFrames (`Dataset[Row]`).\nNicely formatted error messages are displayed when the Datasets are not equal. Here is an example of a content mismatch:\n\n```scala\nval sourceDS = Seq(\n  Person(\"juan\", 5),\n  Person(\"bob\", 1),\n  Person(\"li\", 49),\n  Person(\"alice\", 5)\n).toDS\n\nval expectedDS = Seq(\n  Person(\"juan\", 6),\n  Person(\"frank\", 10),\n  Person(\"li\", 49),\n  Person(\"lucy\", 5)\n).toDS\n\nassertSmallDatasetEquality(sourceDS, expectedDS)\n```\n\n\u003cp\u003e\n    \u003cimg src=\"./images/assertSmallDatasetEquality_error_message.png\" alt=\"assertSmallDatasetEquality_error_message\" width=\"500\", height=\"200\"\u003e\n\u003c/p\u003e\n\nThe colors in the error message make it easy to identify the rows that aren't equal. These methods also support\ncomparing DataFrames with different schemas.\n\n```scala\nval sourceDF = spark.createDF(\n  List(\n    (1, 2.0),\n    (5, 3.0)\n  ),\n  List(\n    (\"number\", IntegerType, true),\n    (\"float\", DoubleType, true)\n  )\n)\n\nval expectedDF = spark.createDF(\n  List(\n    (1, \"word\", 1L),\n    (5, \"word\", 2L)\n  ),\n  List(\n    (\"number\", IntegerType, true),\n    (\"word\", StringType, true),\n    (\"long\", LongType, true)\n  )\n)\n\nassertSmallDataFrameEquality(sourceDF, expectedDF)\n```\n\n\u003cp\u003e\n    \u003cimg src=\"./images/assertSmallDataFrameEquality_DatasetSchemaMisMatch_message.png\" alt=\"assertSmallDataFrameEquality_DatasetSchemaMisMatch_message\" width=\"500\", height=\"200\"\u003e\n\u003c/p\u003e\n\nThe `DatasetComparer` has `assertSmallDatasetEquality` and `assertLargeDatasetEquality` methods to compare either\nDatasets or DataFrames.\n\nIf you only need to compare DataFrames, you can use `DataFrameComparer` with the 
associated\n`assertSmallDataFrameEquality` and `assertLargeDataFrameEquality` methods. Under the hood, `DataFrameComparer` uses the\n`assertSmallDatasetEquality` and `assertLargeDatasetEquality` methods.\n\n*Note: comparing Datasets can be tricky since some column names might be generated by Spark when applying transformations.\nUse the `ignoreColumnNames` boolean to skip name verification.*\n\n## Why is this library fast?\n\nThis library provides three main methods to test your code.\n\nSuppose you'd like to test this function:\n\n```scala\nimport org.apache.spark.sql.Column\nimport org.apache.spark.sql.functions.{lower, regexp_replace}\n\ndef myLowerClean(col: Column): Column = {\n  lower(regexp_replace(col, \"\\\\s+\", \"\"))\n}\n```\n\nHere's how long the tests take to execute:\n\n| test method                    | runtime          |\n|--------------------------------|------------------|\n| `assertLargeDataFrameEquality` | 709 milliseconds |\n| `assertSmallDataFrameEquality` | 166 milliseconds |\n| `assertColumnEquality`         | 108 milliseconds |\n| `evalString`                   | 26 milliseconds  |\n\n`evalString` isn't as robust, but it is the fastest.  `assertColumnEquality` is robust and saves a lot of time.\n\nOther testing libraries don't have methods like `assertSmallDataFrameEquality` or `assertColumnEquality`, so they run\nslower.\n\n## Usage\n\n### Local SparkSession for test\nThe spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one\nyourself.\n\n```scala\nimport org.apache.spark.sql.SparkSession\n\ntrait SparkSessionTestWrapper {\n\n  lazy val spark: SparkSession = {\n    SparkSession\n      .builder()\n      .master(\"local\")\n      .appName(\"spark session\")\n      .config(\"spark.sql.shuffle.partitions\", \"1\")\n      .getOrCreate()\n  }\n\n}\n```\n\nIt's best to set the number of shuffle partitions to a small number like one or four in your test suite. This configuration\ncan make your tests run up to 70% faster. 
You can remove this configuration option or adjust it if you're working with\nbig DataFrames in your test suite.\n\nMake sure to only use the `SparkSessionTestWrapper` trait in your test suite. You don't want to use test specific\nconfiguration (like one shuffle partition) when running production code.\n\n### DatasetComparer / DataFrameComparer\nThe `DatasetComparer` trait defines the `assertSmallDatasetEquality` method. Extend your spec file with the\n`SparkSessionTestWrapper` trait to create DataFrames and the `DatasetComparer` trait to make DataFrame comparisons.\n\n```scala\nimport org.apache.spark.sql.types._\nimport org.apache.spark.sql.Row\nimport org.apache.spark.sql.functions._\nimport com.github.mrpowers.spark.fast.tests.DatasetComparer\n\nclass DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {\n\n  import spark.implicits._\n\n  it(\"aliases a DataFrame\") {\n\n    val sourceDF = Seq(\n      (\"jose\"),\n      (\"li\"),\n      (\"luisa\")\n    ).toDF(\"name\")\n\n    val actualDF = sourceDF.select(col(\"name\").alias(\"student\"))\n\n    val expectedDF = Seq(\n      (\"jose\"),\n      (\"li\"),\n      (\"luisa\")\n    ).toDF(\"student\")\n\n    assertSmallDatasetEquality(actualDF, expectedDF)\n\n  }\n}\n```\n\nTo compare large DataFrames that are partitioned across different nodes in a cluster, use the\n`assertLargeDatasetEquality` method.\n\n```scala\nassertLargeDatasetEquality(actualDF, expectedDF)\n```\n\n`assertSmallDatasetEquality` is faster for test suites that run on your local machine.  
`assertLargeDatasetEquality`\nshould only be used for DataFrames that are split across nodes in a cluster.\n\n#### Unordered DataFrame equality comparisons\n\nSuppose you have the following `actualDF`:\n\n```\n+------+\n|number|\n+------+\n|     1|\n|     5|\n+------+\n```\n\nAnd suppose you have the following `expectedDF`:\n\n```\n+------+\n|number|\n+------+\n|     5|\n|     1|\n+------+\n```\n\nThe DataFrames have the same columns and rows, but the order is different.\n\n`assertSmallDataFrameEquality(actualDF, expectedDF)` will throw a `DatasetContentMismatch` error.\n\nWe can set the `orderedComparison` boolean flag to `false` and spark-fast-tests will sort the DataFrames before\nperforming the comparison.\n\n`assertSmallDataFrameEquality(actualDF, expectedDF, orderedComparison = false)` will not throw an error.\n\n#### Equality comparisons ignoring the nullable flag\n\nYou might also want to make equality comparisons that ignore the nullable flags for the DataFrame columns.\n\nHere is how to use the `ignoreNullable` flag to compare DataFrames without considering the nullable property of each\ncolumn.\n\n```scala\nval sourceDF = spark.createDF(\n  List(\n    (1),\n    (5)\n  ), List(\n    (\"number\", IntegerType, false)\n  )\n)\n\nval expectedDF = spark.createDF(\n  List(\n    (1),\n    (5)\n  ), List(\n    (\"number\", IntegerType, true)\n  )\n)\n\nassertSmallDatasetEquality(sourceDF, expectedDF, ignoreNullable = true)\n```\n\n#### Approximate DataFrame Equality\n\nThe `assertApproximateDataFrameEquality` function is useful for DataFrames that contain `DoubleType` columns. 
The\nprecision threshold must be set when using the `assertApproximateDataFrameEquality` function.\n\n```scala\nval sourceDF = spark.createDF(\n  List(\n    (1.2),\n    (5.1),\n    (null)\n  ), List(\n    (\"number\", DoubleType, true)\n  )\n)\n\nval expectedDF = spark.createDF(\n  List(\n    (1.2),\n    (5.1),\n    (null)\n  ), List(\n    (\"number\", DoubleType, true)\n  )\n)\n\nassertApproximateDataFrameEquality(sourceDF, expectedDF, 0.01)\n```\n\n### Column Equality\n\nThe `assertColumnEquality` method can be used to assess the equality of two columns in a DataFrame.\n\nSuppose you have the following DataFrame with two columns that are not equal.\n\n```\n+-------+-------------+\n|   name|expected_name|\n+-------+-------------+\n|   phil|         phil|\n| rashid|       rashid|\n|matthew|        mateo|\n|   sami|         sami|\n|     li|         feng|\n|   null|         null|\n+-------+-------------+\n```\n\nThe following code will throw a `ColumnMismatch` error:\n\n```scala\nassertColumnEquality(df, \"name\", \"expected_name\")\n```\n\n\u003cp\u003e\n    \u003cimg src=\"./images/assertColumnEquality_error_message.png\" alt=\"Description\" width=\"500\", height=\"200\"\u003e\n\u003c/p\u003e\n\nMix the `ColumnComparer` trait into your test class to access the `assertColumnEquality` method:\n\n```scala\nimport com.github.mrpowers.spark.fast.tests.ColumnComparer\n\nobject MySpecialClassTest\n  extends TestSuite\n    with ColumnComparer\n    with SparkSessionTestWrapper {\n\n  // your tests\n}\n```\n\n### Schema Equality\n\nThe `SchemaComparer` provides the `assertSchemaEqual` API, which is useful for comparing DataFrame schemas.\n\nConsider the following two schemas:\n\n```scala\nval s1 = StructType(\n  Seq(\n    StructField(\"array\", ArrayType(StringType, containsNull = true), true),\n    StructField(\"map\", MapType(StringType, StringType, valueContainsNull = false), true),\n    StructField(\"something\", StringType, true),\n    StructField(\n      
\"struct\",\n      StructType(\n        StructType(\n          Seq(\n            StructField(\"mood\", ArrayType(StringType, containsNull = false), true),\n            StructField(\"something\", StringType, false),\n            StructField(\n              \"something2\",\n              StructType(\n                Seq(\n                  StructField(\"mood2\", ArrayType(DoubleType, containsNull = false), true),\n                  StructField(\"something2\", StringType, false)\n                )\n              ),\n              false\n            )\n          )\n        )\n      ),\n      true\n    )\n  )\n)\nval s2 = StructType(\n  Seq(\n    StructField(\"array\", ArrayType(StringType, containsNull = true), true),\n    StructField(\"something\", StringType, true),\n    StructField(\"map\", MapType(StringType, StringType, valueContainsNull = false), true),\n    StructField(\n      \"struct\",\n      StructType(\n        StructType(\n          Seq(\n            StructField(\"something\", StringType, false),\n            StructField(\"mood\", ArrayType(StringType, containsNull = false), true),\n            StructField(\n              \"something3\",\n              StructType(\n                Seq(\n                  StructField(\"mood3\", ArrayType(StringType, containsNull = false), true)\n                )\n              ),\n              false\n            )\n          )\n        )\n      ),\n      true\n    ),\n    StructField(\"norma2\", StringType, false)\n  )\n)\n\n```\n\nThe `assertSchemaEqual` support two output format `SchemaDiffOutputFormat.Tree` and `SchemaDiffOutputFormat.Table`. 
The Tree\noutput format is useful when the schema is large and contains multi-level nested fields.\n\n```scala\nSchemaComparer.assertSchemaEqual(s1, s2, ignoreColumnOrder = false, outputFormat = SchemaDiffOutputFormat.Tree)\n```\n\n\u003cp\u003e\n    \u003cimg src=\"./images/assertSchemaEquality_tree_message.png\" alt=\"assert_column_equality_error_message\" width=\"600\", height=\"200\"\u003e\n\u003c/p\u003e\n\nBy default, `SchemaDiffOutputFormat.Table` is used internally by all DataFrame/Dataset comparison APIs.\n\n## Testing Tips\n\n* Use column functions instead of UDFs as described\n  in [this blog post](https://medium.com/@mrpowers/spark-user-defined-functions-udfs-6c849e39443b)\n* Try to organize your code\n  as [custom transformations](https://medium.com/@mrpowers/chaining-custom-dataframe-transformations-in-spark-a39e315f903c)\n  so it's easy to test the logic elegantly\n* Don't write tests that read from files or write files. Dependency injection is a great way to avoid file I/O in your\n  test suite.\n\n## Alternatives\n\nThe [spark-testing-base](https://github.com/holdenk/spark-testing-base) project has more features (e.g. streaming\nsupport) and is compiled to support a variety of Scala and Spark versions.\n\nYou might want to use spark-fast-tests instead of spark-testing-base in these cases:\n\n* You want to use uTest or a testing framework other than scalatest\n* You want to run tests in parallel (you need to set `parallelExecution in Test := false` with spark-testing-base)\n* You don't want to include Hive as a project dependency\n* You don't want to restart the SparkSession after each test file executes so the suite runs faster\n\n## Publishing\n\nGPG \u0026 Sonatype need to be set up properly before running these commands. See\nthe [spark-daria](https://github.com/MrPowers/spark-daria) README for more information.\n\nIt's a good idea to always run `clean` before running any publishing commands. 
It's also important to run `clean` between\ndifferent publishing commands.\n\nThere is a two-step process for publishing.\n\nGenerate Scala 2.11 JAR files:\n\n* Run `sbt -Dspark.version=2.4.8`\n* Run `\u003e ; + publishSigned; sonatypeBundleRelease` to create the JAR files and release them to Maven.\n\nGenerate Scala 2.12 \u0026 Scala 2.13 JAR files:\n\n* Run `sbt`\n* Run `\u003e ; + publishSigned; sonatypeBundleRelease`\n\nThe `publishSigned` and `sonatypeBundleRelease` commands are made available by\nthe [sbt-sonatype](https://github.com/xerial/sbt-sonatype) plugin.\n\nWhen the release command is run, you'll be prompted to enter your GPG passphrase.\n\nThe Sonatype credentials should be stored in the `~/.sbt/sonatype_credentials` file in this format:\n\n```\nrealm=Sonatype Nexus Repository Manager\nhost=oss.sonatype.org\nuser=$USERNAME\npassword=$PASSWORD\n```\n\n## Additional Goals\n\n* Use memory efficiently so Spark test runs don't crash\n* Provide readable error messages\n* Easy to use in conjunction with other test suites\n* Give the user control of the SparkSession\n\n## Contributing\n\nOpen an issue or send a pull request to contribute. 
Anyone that makes good contributions to the project will be promoted\nto project maintainer status.\n\n## uTest settings to display color output\n\nCreate a `CustomFramework` class with overrides that turn off the default uTest color settings.\n\n```scala\npackage com.github.mrpowers.spark.fast.tests\n\nclass CustomFramework extends utest.runner.Framework {\n  override def formatWrapWidth: Int = 300\n\n  // turn off the default exception message color, so spark-fast-tests\n  // can send messages with custom colors\n  override def exceptionMsgColor = toggledColor(utest.ufansi.Attrs.Empty)\n\n  override def exceptionPrefixColor = toggledColor(utest.ufansi.Attrs.Empty)\n\n  override def exceptionMethodColor = toggledColor(utest.ufansi.Attrs.Empty)\n\n  override def exceptionPunctuationColor = toggledColor(utest.ufansi.Attrs.Empty)\n\n  override def exceptionLineNumberColor = toggledColor(utest.ufansi.Attrs.Empty)\n}\n```\n\nUpdate the `build.sbt` file to use the `CustomFramework` class:\n\n```scala\ntestFrameworks += new TestFramework(\"com.github.mrpowers.spark.fast.tests.CustomFramework\")\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers-io%2Fspark-fast-tests","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrpowers-io%2Fspark-fast-tests","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrpowers-io%2Fspark-fast-tests/lists"}