{"id":18810390,"url":"https://github.com/absaoss/spark-hats","last_synced_at":"2025-04-13T20:30:57.733Z","repository":{"id":41867999,"uuid":"231356164","full_name":"AbsaOSS/spark-hats","owner":"AbsaOSS","description":"Nested array transformation helper extensions for Apache Spark","archived":false,"fork":false,"pushed_at":"2023-08-04T08:53:31.000Z","size":155,"stargazers_count":35,"open_issues_count":6,"forks_count":4,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-04-12T07:05:55.641Z","etag":null,"topics":["arrays","nested-structures","scala","schema","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-02T10:12:04.000Z","updated_at":"2024-01-25T11:44:41.000Z","dependencies_parsed_at":"2022-08-11T19:40:46.117Z","dependency_job_id":null,"html_url":"https://github.com/AbsaOSS/spark-hats","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-hats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-hats/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-hats/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-hats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/spark-hats/tar.gz/refs/heads/master","host":{"name":"GitHub
","url":"https://github.com","kind":"github","repositories_count":223603274,"owners_count":17172073,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arrays","nested-structures","scala","schema","spark"],"created_at":"2024-11-07T23:20:03.679Z","updated_at":"2024-11-07T23:20:04.444Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-hats\n[![Build](https://github.com/AbsaOSS/spark-hats/workflows/Build/badge.svg)](https://github.com/AbsaOSS/spark-hats/actions)\n[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FAbsaOSS%2Fspark-hats.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2FAbsaOSS%2Fspark-hats?ref=badge_shield)\n\nSpark \"**H**elpers for **A**rray **T**ransformation**s**\"\n\nThis library extends Spark DataFrame API with helpers for transforming fields inside nested structures and arrays of\narbitrary levels of nesting.\n\n## Usage\n\nReference the library\n\n\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003eScala 2.11\u003c/th\u003e\u003cth\u003eScala 2.12\u003c/th\u003e\u003cth\u003eScala 2.13\u003c/th\u003e\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-hats_2.11\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-hats_2.11/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = 
\"https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-hats_2.12\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-hats_2.12/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-hats_2.13\"\u003e\u003cimg src = \"https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-hats_2.13/badge.svg\" alt=\"Maven Central\"\u003e\u003c/a\u003e\u003cbr\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n\u003cpre\u003egroupId: za.co.absa\u003cbr\u003eartifactId: spark-hats_2.11\u003cbr\u003eversion: 0.3.0\u003c/pre\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n\u003cpre\u003egroupId: za.co.absa\u003cbr\u003eartifactId: spark-hats_2.12\u003cbr\u003eversion: 0.3.0\u003c/pre\u003e\n\u003c/td\u003e\n\u003ctd\u003e\n\u003cpre\u003egroupId: za.co.absa\u003cbr\u003eartifactId: spark-hats_2.13\u003cbr\u003eversion: 0.3.0\u003c/pre\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\nPlease, use the table below to determine what version of spark-hats to use for Spark compatibility.\n\n| spark-hats version | Scala version | Spark version |\n|:------------------:|:-------------:|:-------------:|\n|       0.1.x        |  2.11, 2.12   |    2.4.3+     |\n|       0.2.x        |  2.11, 2.12   |    2.4.3+     |\n|       0.2.x        |     2.12      |    3.0.0+     |\n|       0.3.x        |     2.11      |    2.4.3+     |\n|       0.3.x        |  2.12, 2.13   |    3.2.1+     |\n\nTo use the extensions you need to add this import to your Spark application or shell:\n```scala\nimport za.co.absa.spark.hats.Extensions._\n```\n\n### How to generate Code coverage report\n```\nsbt ++{matrix.scala} jacoco -DSPARK_VERSION={matrix.spark}\n```\nCode coverage will be generated on 
path:\n```\n{project-root}/spark-hats/target/scala-{scala_version}/jacoco/report/html\n```\n\n\n## Motivation\n\nHere is a small example we will use to show you how `spark-hats` works. The important thing is that the dataframe\ncontains an array of struct fields.\n\n```scala\nscala\u003e df.printSchema()\nroot\n |-- id: long (nullable = true)\n |-- my_array: array (nullable = true)\n |    |-- element: struct (containsNull = true)\n |    |    |-- a: long (nullable = true)\n |    |    |-- b: string (nullable = true)\n       \nscala\u003e df.show(false)\n+---+------------------------------+\n|id |my_array                      |\n+---+------------------------------+\n|1  |[[1, foo]]                    |\n|2  |[[1, bar], [2, baz], [3, foz]]|\n+---+------------------------------+\n```\n\nNow, say, we want to add a field `c` as part of the struct alongside `a` and `b` from the example above. The\nexpression for `c` is `c = a + 1`.\n\nHere is the code you can use in Spark:\n```scala\n    val dfOut = df.select(col(\"id\"), transform(col(\"my_array\"), c =\u003e {\n      struct(c.getField(\"a\").as(\"a\"),\n        c.getField(\"b\").as(\"b\"),\n        (c.getField(\"a\") + 1).as(\"c\"))\n    }).as(\"my_array\"))\n\n```\n(To use `transform()` in the Scala API on Spark 2.4, you need to add [spark-hofs](https://github.com/AbsaOSS/spark-hofs) as a dependency; Spark 3.0 and later provide it in `org.apache.spark.sql.functions`.)\n\nHere is how it looks when using the `spark-hats` library: 
\n```scala\n    val dfOut = df.nestedMapColumn(\"my_array.a\", \"c\", a =\u003e a + 1)\n```\n\nBoth produce the following results:\n```scala\nscala\u003e dfOut.printSchema\nroot\n |-- id: long (nullable = true)\n |-- my_array: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- a: long (nullable = true)\n |    |    |-- b: string (nullable = true)\n |    |    |-- c: long (nullable = true)\n\nscala\u003e dfOut.show(false)\n+---+---------------------------------------+\n|id |my_array                               |\n+---+---------------------------------------+\n|1  |[[1, foo, 2]]                          |\n|2  |[[1, bar, 2], [2, baz, 3], [3, foz, 4]]|\n+---+---------------------------------------+\n```\n\nImagine what the code would look like with more levels of array nesting.\n\n## Methods\n\n### Add a column\nThe `nestedWithColumn` method allows adding new fields inside nested structures and arrays.\n\nThe add-column API comes in two flavors: basic and extended. The basic API is simpler to\nuse, but the expressions it expects can only reference columns at the root of the schema. 
Here is an example of the basic add\ncolumn API:\n\n```scala\nscala\u003e df.nestedWithColumn(\"my_array.c\", lit(\"hello\")).printSchema\nroot\n |-- id: long (nullable = true)\n |-- my_array: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- a: long (nullable = true)\n |    |    |-- b: string (nullable = true)\n |    |    |-- c: string (nullable = false)\n\nscala\u003e df.nestedWithColumn(\"my_array.c\", lit(\"hello\")).show(false)\n+---+---------------------------------------------------+\n|id |my_array                                           |\n+---+---------------------------------------------------+\n|1  |[[1, foo, hello]]                                  |\n|2  |[[1, bar, hello], [2, baz, hello], [3, foz, hello]]|\n+---+---------------------------------------------------+\n```\n\n### Add column (extended)\nThe extended API method `nestedWithColumnExtended` works similarly to the basic one but allows the caller to reference\nother array elements, possibly on different levels of nesting. The way it allows this is a little tricky.\nThe second parameter is changed from being a column to a *function that returns a column*. Moreover, this function has\nan argument which is a function itself, the `getField()` function. 
The `getField()` function can be used in the\ntransformation to reference other columns in the dataframe by their fully qualified name.\n\nIn the following example, a transformation adds a new field `my_array.c` to the dataframe by concatenating a root\nlevel column `id` with a nested field `my_array.b`:\n\n```scala\nscala\u003e val dfOut = df.nestedWithColumnExtended(\"my_array.c\", getField =\u003e\n         concat(getField(\"id\").cast(\"string\"), getField(\"my_array.b\"))\n       )\n\nscala\u003e dfOut.printSchema\nroot\n |-- id: long (nullable = true)\n |-- my_array: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- a: long (nullable = true)\n |    |    |-- b: string (nullable = true)\n |    |    |-- c: string (nullable = true)\n\nscala\u003e dfOut.show(false)\n+---+------------------------------------------------+\n|id |my_array                                        |\n+---+------------------------------------------------+\n|1  |[[1, foo, 1foo]]                                |\n|2  |[[1, bar, 2bar], [2, baz, 2baz], [3, foz, 2foz]]|\n+---+------------------------------------------------+\n```\n\n* **Note.** You can still use `col` to reference root level columns. But if a column is inside an array (like\n`my_array.b`), invoking `col(\"my_array.b\")` will reference the whole array, not an individual element. 
The `getField()`\nfunction that is passed to the transformation solves this by providing a generic way to address array elements on arbitrary\nlevels of nesting.\n\n* **Advanced Note.** If there are several arrays in the schema, `getField()` lets you reference elements of an array\nif that array is one of the parents of the output column.\n\n\n### Drop a column\nThe `nestedDropColumn` method allows dropping fields inside nested structures and arrays.\n\n\n```scala\nscala\u003e df.nestedDropColumn(\"my_array.b\").printSchema\nroot\n |-- id: long (nullable = true)\n |-- my_array: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- a: long (nullable = true)\n\nscala\u003e df.nestedDropColumn(\"my_array.b\").show(false)\n+---+---------------+\n|id |my_array       |\n+---+---------------+\n|1  |[[1]]          |\n|2  |[[1], [2], [3]]|\n+---+---------------+\n```\n\n### Map a column\n\nThe `nestedMapColumn` method applies a transformation to a nested field. If the input column is a primitive field, the\nmethod adds `outputColumnName` at the same level of nesting. 
If a struct column is expected, you can use the\n`.getField(...)` method to operate on its children.\n\nThe output column name can omit the full path because the field is created at the same level of nesting as the input column.\n\n```scala\nscala\u003e df.nestedMapColumn(inputColumnName = \"my_array.a\", outputColumnName = \"c\", expression = a =\u003e a + 1).printSchema\nroot\n |-- id: long (nullable = true)\n |-- my_array: array (nullable = true)\n |    |-- element: struct (containsNull = false)\n |    |    |-- a: long (nullable = true)\n |    |    |-- b: string (nullable = true)\n |    |    |-- c: long (nullable = true)\n\nscala\u003e df.nestedMapColumn(inputColumnName = \"my_array.a\", outputColumnName = \"c\", expression = a =\u003e a + 1).show(false)\n+---+---------------------------------------+\n|id |my_array                               |\n+---+---------------------------------------+\n|1  |[[1, foo, 2]]                          |\n|2  |[[1, bar, 2], [2, baz, 3], [3, foz, 4]]|\n+---+---------------------------------------+\n```\n\n## Other transformations\n\n### Unstruct\n\nSyntax: `df.nestedUnstruct(\"NestedStructColumnName\")`.\n\nFlattens one level of nesting when a struct is nested in another struct. 
For example:\n\n```scala\nscala\u003e df.printSchema\nroot\n|-- id: long (nullable = true)\n|-- my_array: array (nullable = true)\n|    |-- element: struct (containsNull = true)\n|    |    |-- a: long (nullable = true)\n|    |    |-- b: string (nullable = true)\n|    |    |-- c: struct (nullable = true)\n|    |    |    |-- nestedField1: string (nullable = true)\n|    |    |    |-- nestedField2: long (nullable = true)\n\nscala\u003e df.nestedUnstruct(\"my_array.c\").printSchema\nroot\n|-- id: long (nullable = true)\n|-- my_array: array (nullable = true)\n|    |-- element: struct (containsNull = true)\n|    |    |-- a: long (nullable = true)\n|    |    |-- b: string (nullable = true)\n|    |    |-- nestedField1: string (nullable = true)\n|    |    |-- nestedField2: long (nullable = true)\n```\n\nNote that the output schema doesn't have the `c` struct. All fields of `c` are now part of the parent struct. \n\n## Changelog\n- #### 0.3.0 released 3 August 2023.\n  - [#38](https://github.com/AbsaOSS/spark-hats/issues/38) Add scala 2.13 support.\n  - [#33](https://github.com/AbsaOSS/spark-hats/issues/33) Update spark test to 3.2.1.\n  - [#35](https://github.com/AbsaOSS/spark-hats/issues/35) Add code coverage support.\n\n- #### 0.2.2 released 8 March 2021.\n  - [#23](https://github.com/AbsaOSS/spark-hats/issues/23) Added `nestedUnstruct()` method that flattens one level of nesting for a given struct.\n\n- #### 0.2.1 released 21 January 2020.\n  - [#10](https://github.com/AbsaOSS/spark-hats/issues/10) Fixed error column aggregation when the input array is `null`.\n  \n- #### 0.2.0 released 16 January 2020.\n  - [#5](https://github.com/AbsaOSS/spark-hats/issues/5) Added the extended nested transformation API that allows referencing arbitrary columns.\n\n\n## License\n[![FOSSA 
Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2FAbsaOSS%2Fspark-hats.svg?type=large)](https://app.fossa.com/projects/git%2Bgithub.com%2FAbsaOSS%2Fspark-hats?ref=badge_large)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-hats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fspark-hats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-hats/lists"}