{"id":17753638,"url":"https://github.com/G-Research/spark-extension","last_synced_at":"2025-03-15T04:31:19.755Z","repository":{"id":36968961,"uuid":"243452117","full_name":"G-Research/spark-extension","owner":"G-Research","description":"A library that provides useful extensions to Apache Spark and PySpark.","archived":false,"fork":false,"pushed_at":"2024-09-13T14:00:32.000Z","size":794,"stargazers_count":193,"open_issues_count":17,"forks_count":26,"subscribers_count":21,"default_branch":"master","last_synced_at":"2024-09-22T15:11:49.340Z","etag":null,"topics":["gr-oss","java","pyspark","python","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/G-Research.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-02-27T06:55:16.000Z","updated_at":"2024-08-29T03:40:00.000Z","dependencies_parsed_at":"2024-05-31T07:48:30.264Z","dependency_job_id":"56d45774-9382-4c45-a51a-1705136f132c","html_url":"https://github.com/G-Research/spark-extension","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark
-extension/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/G-Research","download_url":"https://codeload.github.com/G-Research/spark-extension/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243685506,"owners_count":20330980,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gr-oss","java","pyspark","python","scala","spark"],"created_at":"2024-10-26T14:00:49.249Z","updated_at":"2025-03-15T04:31:19.747Z","avatar_url":"https://github.com/G-Research.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"# Spark Extension\n\nThis project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala and Python:\n\n**[Diff](DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between\ntwo datasets, i.e. which rows to _add_, _delete_ or _change_ to get from one dataset to the other.\n\n**[SortedGroups](GROUPS.md):** A `groupByKey` transformation that groups rows by a key while providing\na **sorted** iterator for each group. 
Similar to `Dataset.groupByKey.flatMapGroups`, but with order guarantees\nfor the iterator.\n\n**[Histogram](HISTOGRAM.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** A `histogram` transformation that computes the histogram DataFrame for a value column.\n\n**[Global Row Number](ROW_NUMBER.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** A `withRowNumbers` transformation that provides the global row number w.r.t.\nthe current order of the Dataset, or any given order. In contrast to the existing SQL function `row_number`, which\nrequires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.\n\n**[Partitioned Writing](PARTITIONING.md):** The `writePartitionedBy` action writes your `Dataset` partitioned and\nefficiently laid out with a single operation.\n\n**[Inspect Parquet files](PARQUET.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similarly to [parquet-tools](https://pypi.org/project/parquet-tools/)\nor [parquet-cli](https://pypi.org/project/parquet-cli/) by reading from a simple Spark data source.\nThis simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.\n\n**[Install Python packages into PySpark job](PYSPARK-DEPS.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** Install Python dependencies via PIP or Poetry programmatically into your running PySpark job (PySpark ≥ 3.1.0):\n\n```python\n# noinspection PyUnresolvedReferences\nfrom gresearch.spark import *\n\n# using PIP\nspark.install_pip_package(\"pandas==1.4.3\", \"pyarrow\")\nspark.install_pip_package(\"-r\", \"requirements.txt\")\n\n# using Poetry\nspark.install_poetry_project(\"../my-poetry-project/\", poetry_python=\"../venv-poetry/bin/python\")\n```\n\n**[Fluent method call](CONDITIONAL.md):** `T.call(transformation: T =\u003e R): R`: Turns a 
transformation `T =\u003e R`,\nthat is not part of `T` into a fluent method call on `T`. This allows writing fluent code like:\n\n```scala\nimport uk.co.gresearch._\n\ni.doThis()\n .doThat()\n .call(transformation)\n .doMore()\n```\n\n**[Fluent conditional method call](CONDITIONAL.md):** `T.when(condition: Boolean).call(transformation: T =\u003e T): T`:\nPerform a transformation fluently only if the given condition is true.\nThis allows writing fluent code like:\n\n```scala\nimport uk.co.gresearch._\n\ni.doThis()\n .doThat()\n .when(condition).call(transformation)\n .doMore()\n```\n\n**[Shortcut for groupBy.as](https://github.com/G-Research/spark-extension/pull/213#issue-2032837105)**: Calling `Dataset.groupBy(Column*).as[K, T]`\nshould be preferred over calling `Dataset.groupByKey(V =\u003e K)` whenever possible. The former allows Catalyst to exploit\nexisting partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.\nThis can have a significant performance penalty.\n\n\u003cdetails\u003e\n\u003csummary\u003eDetails:\u003c/summary\u003e\n\nThe new column-expression-based `groupByKey[K](Column*)` method makes it easier to group by a column expression key. 
Instead of\n\n    ds.groupBy($\"id\").as[Int, V]\n\nuse:\n\n    ds.groupByKey[Int]($\"id\")\n\u003c/details\u003e\n\n**Backticks:** `backticks(string: String, strings: String*): String`: Encloses the given column name with backticks (`` ` ``) when needed.\nThis is a handy way to ensure column names with special characters like dots (`.`) work with `col()` or `select()`.\n\n**Count null values:** `count_null(e: Column)`: an aggregation function like `count` that counts null values in column `e`.\nThis is equivalent to calling `count(when(e.isNull, lit(1)))`.\n\n**.Net DateTime.Ticks[\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** Convert .Net (C#, F#, Visual Basic) `DateTime.Ticks` into Spark timestamps, seconds and nanoseconds.\n\n\u003cdetails\u003e\n\u003csummary\u003eAvailable methods:\u003c/summary\u003e\n\n```scala\n// Scala\ndotNetTicksToTimestamp(Column): Column       // returns timestamp as TimestampType\ndotNetTicksToUnixEpoch(Column): Column       // returns Unix epoch seconds as DecimalType\ndotNetTicksToUnixEpochNanos(Column): Column  // returns Unix epoch nanoseconds as LongType\n```\n\nThe reverse is provided by (all return `LongType` .Net ticks):\n```scala\n// Scala\ntimestampToDotNetTicks(Column): Column\nunixEpochToDotNetTicks(Column): Column\nunixEpochNanosToDotNetTicks(Column): Column\n```\n\nThese methods are also available in Python:\n```python\n# Python\ndotnet_ticks_to_timestamp(column_or_name)         # returns timestamp as TimestampType\ndotnet_ticks_to_unix_epoch(column_or_name)        # returns Unix epoch seconds as DecimalType\ndotnet_ticks_to_unix_epoch_nanos(column_or_name)  # returns Unix epoch nanoseconds as LongType\n\ntimestamp_to_dotnet_ticks(column_or_name)\nunix_epoch_to_dotnet_ticks(column_or_name)\nunix_epoch_nanos_to_dotnet_ticks(column_or_name)\n```\n\u003c/details\u003e\n\n**Spark temporary directory[\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server)**: Create a temporary directory that will be 
removed on Spark application shutdown.\n\n\u003cdetails\u003e\n\u003csummary\u003eExamples:\u003c/summary\u003e\n\nScala:\n```scala\nimport uk.co.gresearch.spark.createTemporaryDir\n\nval dir = createTemporaryDir(\"prefix\")\n```\n\nPython:\n```python\n# noinspection PyUnresolvedReferences\nfrom gresearch.spark import *\n\ndir = spark.create_temporary_dir(\"prefix\")\n```\n\u003c/details\u003e\n\n**Spark job description[\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** Set Spark job description for all Spark jobs within a context.\n\n\u003cdetails\u003e\n\u003csummary\u003eExamples:\u003c/summary\u003e\n\n```scala\nimport uk.co.gresearch.spark._\n\nimplicit val session: SparkSession = spark\n\nwithJobDescription(\"parquet file\") {\n  val df = spark.read.parquet(\"data.parquet\")\n  val count = appendJobDescription(\"count\") {\n    df.count\n  }\n  appendJobDescription(\"write\") {\n    df.write.csv(\"data.csv\")\n  }\n}\n```\n\n| Without job description  | With job description |\n|:---:|:---:|\n| ![](without-job-description.png \"Spark job without description in UI\") | ![](with-job-description.png \"Spark job with description in UI\") |\n\nNote that setting a description in one thread while calling the action (e.g. 
`.count`) in a different thread\ndoes not work, unless the different thread is spawned from the current thread _after_ the description has been set.\n\nWorking example with parallel collections:\n\n```scala\nimport java.util.concurrent.ForkJoinPool\nimport scala.collection.parallel.CollectionConverters.seqIsParallelizable\nimport scala.collection.parallel.ForkJoinTaskSupport\n\nval files = Seq(\"data1.csv\", \"data2.csv\").par\n\nval counts = withJobDescription(\"Counting rows\") {\n  // new thread pool required to spawn new threads from this thread\n  // so that the job description is actually used\n  files.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool())\n  files.map(filename =\u003e spark.read.csv(filename).count).sum\n}(spark)\n```\n\u003c/details\u003e\n\n## Using Spark Extension\n\nThe `spark-extension` package is available for all Spark 3.2, 3.3, 3.4 and 3.5 versions. Some earlier Spark versions may also be supported.\nThe package version has the following semantics: `spark-extension_{SCALA_COMPAT_VERSION}-{VERSION}-{SPARK_COMPAT_VERSION}`:\n\n- `SCALA_COMPAT_VERSION`: Scala binary compatibility (minor) version. Available are `2.12` and `2.13`.\n- `SPARK_COMPAT_VERSION`: Apache Spark binary compatibility (minor) version. Available are `3.2`, `3.3`, `3.4` and `3.5`.\n- `VERSION`: The package version, e.g. 
`2.10.0`.\n\n### SBT\n\nAdd this line to your `build.sbt` file:\n\n```sbt\nlibraryDependencies += \"uk.co.gresearch.spark\" %% \"spark-extension\" % \"2.13.0-3.5\"\n```\n\n### Maven\n\nAdd this dependency to your `pom.xml` file:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003euk.co.gresearch.spark\u003c/groupId\u003e\n  \u003cartifactId\u003espark-extension_2.12\u003c/artifactId\u003e\n  \u003cversion\u003e2.13.0-3.5\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Gradle\n\nAdd this dependency to your `build.gradle` file:\n\n```groovy\ndependencies {\n    implementation \"uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5\"\n}\n```\n\n### Spark Submit\n\nSubmit your Spark app with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```shell script\nspark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5 [jar]\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.\n\n### Spark Shell\n\nLaunch a Spark Shell with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```shell script\nspark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark Shell version.\n\n### Python\n\n#### PySpark API\n\nStart a PySpark session with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.jars.packages\", \"uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5\") \\\n    .getOrCreate()\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.\n\n#### PySpark REPL\n\nLaunch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```shell script\npyspark --packages 
uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.\n\n#### PySpark `spark-submit`\n\nRun your Python scripts that use PySpark via `spark-submit`:\n\n```shell script\nspark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5 [script.py]\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.\n\n#### PyPi package (local Spark cluster only)\n\nYou may want to install the `pyspark-extension` python package from PyPi into your development environment.\nThis provides you code completion, typing and test capabilities during your development phase.\n\nRunning your Python application on a Spark cluster will still require one of the above ways\nto add the Scala package to the Spark environment.\n\n```shell script\npip install pyspark-extension==2.13.0.3.5\n```\n\nNote: Pick the right Spark version (here 3.5) depending on your PySpark version.\n\n### Your favorite Data Science notebook\n\nThere are plenty of [Data Science notebooks](https://datasciencenotebook.org/) around. To use this library,\nadd **a jar dependency** to your notebook using these **Maven coordinates**:\n\n    uk.co.gresearch.spark:spark-extension_2.12:2.13.0-3.5\n\nOr [download the jar](https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension) and place it\non a filesystem where it is accessible by the notebook, and reference that jar file directly.\n\nCheck the documentation of your favorite notebook to learn how to add jars to your Spark environment.\n\n## Known issues\n### Spark Connect Server\n\nMost features are not supported **in Python** in conjunction with a [Spark Connect server](https://spark.apache.org/docs/latest/spark-connect-overview.html).\nThis also holds for Databricks Runtime environment 13.x and above. 
Details can be found [in this blog](https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/).\n\nCalling any of those features when connected to a Spark Connect server will raise this error:\n\n    This feature is not supported for Spark Connect.\n\nUse a classic connection to a Spark cluster instead.\n\n## Build\n\nYou can build this project against different versions of Spark and Scala.\n\n### Switch Spark and Scala version\n\nIf you want to build for a Spark or Scala version different from what is defined in the `pom.xml` file, then run\n\n```shell script\nsh set-version.sh [SPARK-VERSION] [SCALA-VERSION]\n```\n\nFor example, switch to Spark 3.5.0 and Scala 2.13.8 by running `sh set-version.sh 3.5.0 2.13.8`.\n\n### Build the Scala project\n\nThen execute `mvn package` to create a jar from the sources. It can be found in `target/`.\n\n## Testing\n\nRun the Scala tests via `mvn test`.\n\n### Setup Python environment\n\nIn order to run the Python tests, set up a Python environment as follows (replace `[SCALA-COMPAT-VERSION]` and `[SPARK-COMPAT-VERSION]` with the respective values):\n\n```shell script\nvirtualenv -p python3 venv\nsource venv/bin/activate\npip install -r python/requirements-[SPARK-COMPAT-VERSION]_[SCALA-COMPAT-VERSION].txt\npip install pytest\n```\n\n### Run Python tests\n\nRun the Python tests via `env PYTHONPATH=python:python/test python -m pytest python/test`.\n\nNote: you first have to [build the Scala sources](#build-the-scala-project).\n\n### Build Python package\n\nRun the following sequence of commands in the project root directory:\n\n```shell script\nmkdir -p python/pyspark/jars/\ncp -v target/spark-extension_*-*.jar python/pyspark/jars/\npip install build\n```\n\nThen execute `python -m build python/` to create a whl from the sources. 
It can be found in `python/dist/`.\n\n## Publications\n\n- ***Guaranteeing in-partition order for partitioned-writing in Apache Spark**, Enrico Minack, 20/01/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/guaranteeing-in-partition-order-for-partitioned-writing-in-apache-spark/\n- ***Un-pivot, sorted groups and many bug fixes: Celebrating the first Spark 3.4 release**, Enrico Minack, 21/03/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/un-pivot-sorted-groups-and-many-bug-fixes-celebrating-the-first-spark-3-4-release/\n- ***A PySpark bug makes co-grouping with window function partition-key-order-sensitive**, Enrico Minack, 29/03/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/a-pyspark-bug-makes-co-grouping-with-window-function-partition-key-order-sensitive/\n- ***Spark’s groupByKey should be avoided – and here’s why**, Enrico Minack, 13/06/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/sparks-groupbykey-should-be-avoided-and-heres-why/\n- ***Inspecting Parquet files with Spark**, Enrico Minack, 28/07/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/parquet-files-know-your-scaling-limits/\n- ***Enhancing Spark’s UI with Job Descriptions**, Enrico Minack, 12/12/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/enhancing-sparks-ui-with-job-descriptions/\n- ***PySpark apps with dependencies: Managing Python dependencies in code**, Enrico Minack, 24/01/2024*:\u003cbr/\u003ehttps://www.gresearch.com/news/pyspark-apps-with-dependencies-managing-python-dependencies-in-code/\n- ***Observing Spark Aggregates: Cheap Metrics from Datasets**, Enrico Minack, 06/02/2024*:\u003cbr/\u003ehttps://www.gresearch.com/news/observing-spark-aggregates-cheap-metrics-from-datasets-2/\n\n## Security\n\nPlease see our [security policy](https://github.com/G-Research/spark-extension/blob/master/SECURITY.md) for details on reporting security 
vulnerabilities.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FG-Research%2Fspark-extension","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FG-Research%2Fspark-extension","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FG-Research%2Fspark-extension/lists"}