{"id":18810382,"url":"https://github.com/absaoss/spark-commons","last_synced_at":"2025-04-13T20:31:01.655Z","repository":{"id":37810955,"uuid":"434170702","full_name":"AbsaOSS/spark-commons","owner":"AbsaOSS","description":null,"archived":false,"fork":false,"pushed_at":"2023-08-25T06:52:25.000Z","size":195,"stargazers_count":7,"open_issues_count":8,"forks_count":0,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-04-12T07:05:56.044Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null}},"created_at":"2021-12-02T10:10:41.000Z","updated_at":"2024-01-18T18:16:07.000Z","dependencies_parsed_at":"2023-02-17T20:45:50.697Z","dependency_job_id":null,"html_url":"https://github.com/AbsaOSS/spark-commons","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-commons","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-commons/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-commons/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-commons/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/spark-commons/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223603
268,"owners_count":17172072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:20:02.377Z","updated_at":"2024-11-07T23:20:03.238Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-commons\n\n[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)\n[![Build](https://github.com/AbsaOSS/spark-commons/actions/workflows/build.yml/badge.svg)](https://github.com/AbsaOSS/spark-commons/actions/workflows/build.yml)\n[![Release](https://github.com/AbsaOSS/spark-commons/actions/workflows/release.yml/badge.svg)](https://github.com/AbsaOSS/spark-commons/actions/workflows/release.yml)\n\n`spark-commons` is a library offering commonly needed routines, classes and functionality. 
It consists of four modules.\n* spark-commons-spark2.4\n* spark-commons-spark3.2\n* spark-commons-spark3.3\n* spark-commons-test\n\n**spark2-commons** and **spark3-commons** both offer the same logic for the respective major versions of Spark, addressing the\nusual needs of Spark applications.\n\n**spark-commons-test** then brings routines to help in testing Spark applications (it is independent of the Spark \nversion used). \n\n\n|              | spark-commons-spark2.4                                                                                                                                                                        | spark-commons-spark3.2                                                                                                                                                                                         | spark-commons-spark3.3                                                                                                                                                                                         | spark-commons-test                                                                                                                                                                                     
|\n|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| _Scala 2.11_ | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark2.4_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark2.4_2.11) |                                                                                                                                                                                                                |                                                                                                                                                                                                                | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-test_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-test_2.11) | \n| _Scala 2.12_ | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark2.4_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark2.4_2.12)  | [![Maven 
Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark3.2_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark3.2_2.12) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark3.3_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-spark3.3_2.12) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-test_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-commons-test_2.12) | \n\n## Spark-Commons\n\n### NonFatalQueryExecutionListenerAdapter\n\nA trait that, when mixed into another `QueryExecutionListener` implementation, \nmakes sure the latter is not called with any fatal exception.   \n\nSee https://github.com/AbsaOSS/commons/issues/50\n\n```scala\nval myListener = new MyQueryExecutionListener with NonFatalQueryExecutionListenerAdapter\nspark.listenerManager.register(myListener)\n```\n\n### TransformAdapter\n\nA trait that brings a Spark-version-independent implementation of the `transform` function.\n\n### SchemaUtils\n\n_SchemaUtils_ provides methods for working with schemas, their comparison and alignment.  \n\n1. Extracts the parent path of a field. Returns an empty string if a root level column name is provided.\n\n    ```scala\n      SchemaUtils.getParentPath(columnName)\n    ```\n\n2. Extracts the field name of a fully qualified column name.\n\n    ```scala\n      SchemaUtils.stripParentPath(columnName)\n    ```\n\n\n3. Gets paths for all array subfields of the given data type.\n\n    ```scala\n      SchemaUtils.getAllArraySubPaths(other)\n    ```\n\n4. For a given list of field paths, determines if any path pair is a subset of one another.\n\n    ```scala\n      SchemaUtils.isCommonSubPath(paths)\n    ```\n\n5. Appends a new attribute to a path or an empty string.\n\n    ```scala\n      SchemaUtils.appendPath(path, fieldName)\n    ```\n\n6. 
Separates the field name components of a fully qualified column name as their hierarchy goes from root down to the\ndeepest one.\n\n    ```scala\n      SchemaUtils.splitPath(columnName, keepEmptyFields = true)\n    ```\n\n\n### JsonUtils\n\n_JsonUtils_ provides methods for working with JSON, both on input and output.\n\n1. Creates a Spark DataFrame from JSON document(s).\n\n    ```scala\n      JsonUtils.getDataFrameFromJson(json)\n      JsonUtils.getDataFrameFromJson(json, schema)(implicit spark)\n    ```\n\n2. Creates a Spark Schema from JSON document(s).\n\n    ```scala\n      JsonUtils.getSchemaFromJson(json)\n    ```\n   \n### ColumnImplicits\n\n_ColumnImplicits_ provides implicit methods for transforming Spark Columns.\n\n1. Transforms the column into a boolean column, checking if values are negative or positive infinity\n\n    ```scala\n      column.isInfinite()\n    ```\n2. Returns a column with the requested substring. It shifts the substring indexation to be in accordance with Scala/Java. \n    If the provided starting position is negative, it is counted from the end.\n\n    ```scala\n      column.zeroBasedSubstr(startPos)\n    ```\n   \n3. Returns a column with the requested substring. It shifts the substring indexation to be in accordance with Scala/Java. \n   If the provided starting position is negative, it is counted from the end. \n   If the requested length is longer than the rest of the string, all the remaining characters are taken.\n\n    ```scala\n      column.zeroBasedSubstr(startPos, length)\n    ```\n\n### StructFieldImplicits\n\n_StructFieldImplicits_ provides implicit methods for working with StructField objects.  \n\nOf them, metadata methods are:\n\n1. Gets the metadata Option[String] value given a key\n\n    ```scala\n      structField.metadata.getOptString(key)\n    ```\n   \n2. 
Gets the metadata Char value for a given key: if the value is a single-character String, it returns the char,\n otherwise None\n\n    ```scala\n      structField.metadata.getOptChar(key)\n    ```\n  \n3. Gets the metadata boolean value of a given key, provided that it can be transformed into a boolean\n\n    ```scala\n      structField.metadata.getStringAsBoolean(key)\n    ```\n\n4. Checks if the StructField has the provided key, returns a boolean\n\n    ```scala\n      structField.metadata.hasKey(key)\n    ```\n   \n### ArrayTypeImplicits\n\n_ArrayTypeImplicits_ provides implicit methods for working with ArrayType objects.  \n\n\n1. Checks if the ArrayType is equivalent to another\n\n    ```scala\n      arrayType.isEquivalentArrayType(otherArrayType)\n    ```   \n\n2. For an array of arrays, gets the final element type at the bottom of the array\n\n    ```scala\n      arrayType.getDeepestArrayType()\n    ```   \n   \n### DataTypeImplicits\n\n_DataTypeImplicits_ provides implicit methods for working with DataType objects.  \n\n\n1. Checks if the DataType is equivalent to another\n\n    ```scala\n      dataType.isEquivalentDataType(otherDt)\n    ```   \n\n2. Checks if a cast between types always succeeds\n\n    ```scala\n      dataType.doesCastAlwaysSucceed(otherDt)\n    ```   \n3. Checks if the type is primitive\n\n    ```scala\n      dataType.isPrimitive()\n    ```\n   \n### StructTypeImplicits\n\n_StructTypeImplicits_ provides implicit methods for working with StructType objects.  \n\n\n1. Gets a field from a text path\n\n    ```scala\n      structType.getField(path)\n    ```\n2. Gets the type of a field from a text path\n\n    ```scala\n      structType.getFieldType(path)\n    ```\n3. Checks if the specified path is an array of structs\n\n    ```scala\n      structType.isColumnArrayOfStruct(path)\n    ```\n\n4. Gets the nullability of a field from a text path\n\n    ```scala\n      structType.getFieldNullability(path)\n    ```\n\n5. 
Checks if a field specified by a path exists\n\n    ```scala\n      structType.fieldExists(path)\n    ```\n    \n6. Gets paths for all array fields in the schema\n\n    ```scala\n      structType.getAllArrayPaths()\n    ```\n    \n7. Gets the closest unique column name\n\n    ```scala\n      structType.getClosestUniqueName(desiredName)\n    ```\n\n8. Checks if a field is the only field in a struct\n\n    ```scala\n      structType.isOnlyField(columnName)\n    ```\n9. Checks if two StructTypes are equivalent\n\n    ```scala\n      structType.isEquivalent(other)\n    ```\n\n10. Returns a list of differences between one schema and the other\n\n    ```scala\n      structType.diffSchema(otherSchema, parent)\n    ```\n\n11. Checks if a field is of the specified type\n\n    ```scala\n      structType.isOfType[ArrayType](path)\n    ```\n12. Checks if a field is a subset of the specified type\n\n    ```scala\n      structType.isSubset(other)\n    ```\n    \n13. Returns a data selector that can be used to align the schema of a data frame.\n\n    ```scala\n      structType.getDataFrameSelector()\n    ```\n    \n### StructTypeArrayImplicits\n\n1. Gets the first array column's path out of the complete path\n\n    ```scala\n      structType.getFirstArrayPath(path)\n    ```\n   \n2. Gets all array columns' paths out of the complete path.\n\n    ```scala\n      structType.getAllArraysInPath(path)\n    ```\n   \n3. For a given list of field paths, determines the deepest common array path\n\n    ```scala\n      structType.getDeepestCommonArrayPath(fieldPaths)\n    ```\n\n4. For a field path, determines the deepest array path\n\n    ```scala\n      structType.getDeepestArrayPath(path)\n    ```\n   \n5. Checks if a field is an array that is not nested in another array\n\n    ```scala\n      structType.isNonNestedArray(path)\n    ```\n\n### DataFrameImplicits\n\n1. Changes the fields structure of the DataFrame to adhere to the provided schema or selector. 
Data types remain intact\n\n ```scala\n   dataFrame.alignSchema\n ```\n\n2. Persists this Dataset with the default storage level, avoiding the warning in case it has already been cached\n   before\n\n ```scala\n   dataFrame.cacheIfNotCachedYet()\n ```\n\n3. Gets the string representation of the data in the same format as `Dataset.show()` displays it\n\n ```scala\n   dataFrame.dataAsString()\n ```\n\n4. Adds a column to a DataFrame if it does not exist\n\n ```scala\n   dataFrame.withColumnIfDoesNotExist(path)\n ```\n\n5. Casts all `NullType` fields of the DataFrame to their corresponding types in targetSchema.\n\n ```scala\n   dataFrame.enforceTypeOnNullTypeFields(targetSchema)\n ```\n\n\n### Spark Version Guard\n\nA class which checks if the Spark job version is compatible with the Spark versions supported by the library\n\nDefault mode checking\n```scala\nSparkVersionGuard.fromDefaultSparkCompatibilitySettings.ensureSparkVersionCompatibility(SPARK_VERSION)\n```\n\nChecking for 2.X versions\n```scala\nSparkVersionGuard.fromSpark2XCompatibilitySettings.ensureSparkVersionCompatibility(SPARK_VERSION)\n```\n\nChecking for 3.X versions\n```scala\nSparkVersionGuard.fromSpark3XCompatibilitySettings.ensureSparkVersionCompatibility(SPARK_VERSION)\n```\n\n### OncePerSparkSession\n\nAbstract class to help attach/register UDFs and similar objects only once to a Spark session.\n\n\n_Usage:_ Extend this abstract class and implement the method `register`. On initialization the `register` method gets \nexecuted only if the class + Spark session combination is unique. \n\nThis way we ensure only a single registration per Spark session.\n\n### DataFrameImplicits\n_DataFrameImplicits_ provides methods for transformations on DataFrames  \n\n1. 
Gets the data of the DataFrame as a string, formatted the way the `show` function presents it.\n\n    ```scala\n          df.dataAsString() \n      \n          df.dataAsString(truncate)\n      \n          df.dataAsString(numRows, truncate)\n   \n          df.dataAsString(numRows, truncateNumber)\n      \n          df.dataAsString(numRows, truncate, vertical)\n    ```\n    \n2. Adds a column to a DataFrame if it does not exist. If it exists, it will apply the provided function\n    \n   ```scala\n      df.withColumnIfDoesNotExist((df: DataFrame, _) =\u003e df)(colName, colExpression)\n   ```\n\n3. Aligns the schema of a DataFrame to the selector for operations\n   where schema order might be important (e.g. hashing whole rows and using except)\n\n   ```scala\n      df.alignSchema(structType)\n   ```\n   \n   ```scala\n      df.alignSchema(listColumns)\n   ```\n\n## Functions\n\n1. Similarly to the `col` function, evaluates the column based on the provided column name. But here, it can be a full\npath even of nested fields. It also evaluates arrays and maps where the array index or map key is in brackets `[]`.\n\n   ```scala\n       def col_of_path(fullColName: String): Column\n   ```\n\n2. Provides a column of NULL values.\n\n   ```scala\n       def nul_coll(): Column\n   ```\n\n\n3. Provides a column of NULL values, but the actual type is per specification\n\n   ```scala\n       def nul_coll(dataType: DataType): Column\n   ```\n   \n## Error Handler\n\nA `trait` and a set of supporting classes and other traits to enable error channeling between libraries and \napplication during Spark data processing.\n\n1. It has an [implicit dataFrame](https://github.com/AbsaOSS/spark-commons/blob/113-Rename-ErrorHandling-to-ErrorHandler/spark-commons/src/main/scala/za/co/absa/spark/commons/errorhandler/DataFrameErrorHandlerImplicit.scala) for easier usage of the methods provided by the error handler trait.\n\n2. 
It provides four basic implementations\n   * [ErrorHandlerErrorMessageIntoArray](https://github.com/AbsaOSS/spark-commons/blob/113-Rename-ErrorHandling-to-ErrorHandler/spark-commons/src/main/scala/za/co/absa/spark/commons/errorhandler/implementations/ErrorHandlerErrorMessageIntoArray.scala) - An implementation of error handler trait that collects errors into columns of struct based on the `za.co.absa.spark.commons.errorhandler.ErrorMessage` case class.\n   * [ErrorHandlerFilteringErrorRows](https://github.com/AbsaOSS/spark-commons/blob/113-Rename-ErrorHandling-to-ErrorHandler/spark-commons/src/main/scala/za/co/absa/spark/commons/errorhandler/implementations/ErrorHandlerFilteringErrorRows.scala) - An implementation of error handler that implements the functionality of filtering rows that have some error (any of the error columns is not NULL).\n   * [ErrorHandlerIgnoringErrors](https://github.com/AbsaOSS/spark-commons/blob/113-Rename-ErrorHandling-to-ErrorHandler/spark-commons/src/main/scala/za/co/absa/spark/commons/errorhandler/implementations/ErrorHandlerIgnoringErrors.scala) - An implementation of error handler trait that ignores the errors detected during the dataFrame error aggregation\n   * [ErrorHandlerThrowingException](https://github.com/AbsaOSS/spark-commons/blob/113-Rename-ErrorHandling-to-ErrorHandler/spark-commons/src/main/scala/za/co/absa/spark/commons/errorhandler/implementations/ErrorHandlerThrowingException.scala) - An implementation of error handler trait that throws an exception when an error is detected.\n\n## Spark Commons Test\n\n### Usage:\n\n```scala\nclass MyTest extends SparkTestBase {\n}\n```\n\nBy default, it will instantiate a local Spark.\nThere is also the possibility to use it in yarn mode:\n\n```scala\nclass MyTest extends SparkTestBase {\noverride lazy val spark: SparkSession = initSpark(new YarnSparkConfiguration(confDir, distJarsDir))\n}\n```\n\n## How to generate Code coverage report\n```sbt\nsbt jacoco\n```\nCode coverage will 
be generated on path:\n```\n{project-root}/spark-commons/target/spark{spark_version}-jvm-{scala_version}/jacoco/report/html\n{project-root}/spark-commons-test/target/jvm-{scala_version}/jacoco/report/html\n```\n\n\n## How to Release\n\nPlease see [this file](RELEASE.md) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-commons","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fspark-commons","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-commons/lists"}