{"id":20749140,"url":"https://github.com/g-research/spark-extension","last_synced_at":"2025-10-06T21:54:44.017Z","repository":{"id":36968961,"uuid":"243452117","full_name":"G-Research/spark-extension","owner":"G-Research","description":"A library that provides useful extensions to Apache Spark and PySpark.","archived":false,"fork":false,"pushed_at":"2025-07-22T11:29:33.000Z","size":1214,"stargazers_count":229,"open_issues_count":9,"forks_count":28,"subscribers_count":17,"default_branch":"master","last_synced_at":"2025-09-04T16:52:18.003Z","etag":null,"topics":["gr-oss","java","pyspark","python","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/G-Research.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-02-27T06:55:16.000Z","updated_at":"2025-08-13T15:04:06.000Z","dependencies_parsed_at":"2024-05-31T07:48:30.264Z","dependency_job_id":"56d45774-9382-4c45-a51a-1705136f132c","html_url":"https://github.com/G-Research/spark-extension","commit_stats":null,"previous_names":[],"tags_count":25,"template":false,"template_full_name":null,"purl":"pkg:github/G-Research/spark-extension","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension/releases","manifests_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/G-Research","download_url":"https://codeload.github.com/G-Research/spark-extension/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/G-Research%2Fspark-extension/sbom","scorecard":{"id":53706,"data":{"date":"2025-08-11","repo":{"name":"github.com/G-Research/spark-extension","commit":"6acecd09f922d6e56eea6ffab05ab9c426cfb295"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":4.7,"checks":[{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":10,"reason":"30 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 10","details":null,"documentation":{"short":"Determines if the project is \"actively 
maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Security-Policy","score":10,"reason":"security policy file detected","details":["Info: security policy file detected: SECURITY.md:1","Info: Found linked content: SECURITY.md:1","Info: Found disclosure, vulnerability, and/or timelines in security policy: SECURITY.md:1","Info: Found text in security policy: SECURITY.md:1"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: jobLevel 'contents' permission set to 'write': .github/workflows/prepare-release.yml:163","Info: found token with 'none' permissions: .github/workflows/publish-release.yml:1","Info: found token with 'none' permissions: .github/workflows/publish-snapshot.yml:1","Info: found token with 'none' permissions: .github/workflows/publish-snapshot.yml:1","Warn: jobLevel 'checks' permission set to 'write': .github/workflows/test-results.yml:16","Warn: no topLevel permission defined: .github/workflows/build-jvm.yml:1","Warn: no topLevel permission defined: .github/workflows/build-python.yml:1","Warn: no topLevel permission defined: .github/workflows/build-snapshots.yml:1","Warn: no topLevel permission defined: .github/workflows/check.yml:1","Warn: no topLevel permission defined: .github/workflows/ci.yml:1","Warn: topLevel 'actions' permission set to 
'write': .github/workflows/clear-caches.yaml:7","Warn: no topLevel permission defined: .github/workflows/prepare-release.yml:1","Warn: no topLevel permission defined: .github/workflows/prime-caches.yml:1","Warn: no topLevel permission defined: .github/workflows/publish-release.yml:1","Warn: no topLevel permission defined: .github/workflows/publish-snapshot.yml:1","Warn: no topLevel permission defined: .github/workflows/test-jvm.yml:1","Warn: no topLevel permission defined: .github/workflows/test-python.yml:1","Warn: no topLevel permission defined: .github/workflows/test-release.yml:1","Info: found token with 'none' permissions: .github/workflows/test-results.yml:1","Warn: no topLevel permission defined: .github/workflows/test-snapshots.yml:1"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Pinned-Dependencies","score":1,"reason":"dependency not pinned by hash detected -- score normalized to 1","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/build-jvm.yml:85: update your workflow using 
https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/build-jvm.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/build-python.yml:54: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/build-python.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/build-snapshots.yml:69: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/build-snapshots.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/check.yml:106: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/check.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/check.yml:13: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/check.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/check.yml:18: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/check.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/check.yml:46: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/check.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/ci.yml:19: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/ci.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/clear-caches.yaml:14: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/clear-caches.yaml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/prepare-release.yml:21: update your workflow using 
https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/prepare-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/prepare-release.yml:48: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/prepare-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/prepare-release.yml:63: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/prepare-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/prepare-release.yml:167: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/prepare-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/prime-caches.yml:138: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/prime-caches.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-release.yml:40: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-release.yml:96: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-release.yml:114: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-release.yml:151: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned 
by hash: .github/workflows/publish-release.yml:159: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-release.yml:165: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/publish-release.yml:187: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-release.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-snapshot.yml:22: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-snapshot.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-snapshot.yml:68: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-snapshot.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-snapshot.yml:84: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-snapshot.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-snapshot.yml:90: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-snapshot.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish-snapshot.yml:115: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/publish-snapshot.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test-jvm.yml:86: update your workflow using 
https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/test-jvm.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test-python.yml:94: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/test-python.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test-release.yml:92: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/test-release.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/test-results.yml:29: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/test-results.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/test-snapshots.yml:69: update your workflow using https://app.stepsecurity.io/secureworkflow/G-Research/spark-extension/test-snapshots.yml/master?enable=pin","Warn: containerImage not pinned by hash: examples/python-deps/Dockerfile:1: pin your Docker image by updating apache/spark:3.5.0 to apache/spark:3.5.0@sha256:0ed5154e6b32ac3af1272d4d65e9f65b13afcfe80b41ad10bd059bcd6317863c","Warn: pipCommand not pinned by hash: build-whl.sh:12","Warn: pipCommand not pinned by hash: release.sh:121","Info:   3 out of  32 GitHub-owned GitHubAction dependencies pinned","Info:   2 out of   4 third-party GitHubAction dependencies pinned","Info:   0 out of   1 containerImage dependencies pinned","Info:   0 out of   2 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release 
artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":3,"reason":"branch protection is not maximal on development and all release branches","details":["Info: 'allow deletion' disabled on branch 'master'","Info: 'force pushes' disabled on branch 'master'","Warn: 'branch protection settings apply to administrators' is disabled on branch 'master'","Warn: could not determine whether codeowners review is allowed","Warn: no status checks found to merge onto branch 'master'","Warn: PRs are not required to make changes on branch 'master'; or we don't have data to detect it.If you think it might be the latter, make sure to run Scorecard with a PAT or use Repo Rules (that are always public) instead of Branch Protection settings"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Packaging","score":10,"reason":"packaging workflow detected","details":["Info: Project packages its releases by way of GitHub Actions.: .github/workflows/publish-snapshot.yml:37"],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Vulnerabilities","score":0,"reason":"35 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-4jrv-ppp4-jm57","Warn: Project is vulnerable to: GHSA-5mg8-w23w-74h3","Warn: Project is vulnerable to: GHSA-7g45-4rm6-3mm3","Warn: Project is vulnerable to: GHSA-mvr2-9pj6-7w5j","Warn: Project is vulnerable to: GHSA-4gg5-vx3j-xwc7","Warn: Project is vulnerable to: GHSA-735f-pc8j-v9w8","Warn: Project is 
vulnerable to: GHSA-77rm-9x9h-xj3g","Warn: Project is vulnerable to: GHSA-g5ww-5jh7-63cx","Warn: Project is vulnerable to: GHSA-h4h5-3hr4-j3g2","Warn: Project is vulnerable to: GHSA-wrvw-hg22-4m67","Warn: Project is vulnerable to: GHSA-78wr-2p64-hpwj","Warn: Project is vulnerable to: GHSA-j288-q9x7-2f5v","Warn: Project is vulnerable to: GHSA-973x-65j7-xcf4","Warn: Project is vulnerable to: GHSA-5jpm-x58v-624v","Warn: Project is vulnerable to: GHSA-xpw8-rcwv-8f8p","Warn: Project is vulnerable to: GHSA-389x-839f-4rhx","Warn: Project is vulnerable to: GHSA-xq3w-v528-46rv","Warn: Project is vulnerable to: GHSA-4g8c-wm8x-jfhw","Warn: Project is vulnerable to: GHSA-r7pg-v2c8-mfg3","Warn: Project is vulnerable to: GHSA-rhrv-645h-fjfh","Warn: Project is vulnerable to: GHSA-4265-ccf5-phj5","Warn: Project is vulnerable to: GHSA-4g9r-vxhx-9pgx","Warn: Project is vulnerable to: GHSA-cgwf-w82q-5jrr","Warn: Project is vulnerable to: GHSA-rcjc-c4pj-xxrp","Warn: Project is vulnerable to: GHSA-c476-j253-5rgq","Warn: Project is vulnerable to: GHSA-p953-3j66-hg45","Warn: Project is vulnerable to: GHSA-2jc4-r94c-rp7h","Warn: Project is vulnerable to: GHSA-g2fg-mr77-6vrm","Warn: Project is vulnerable to: GHSA-rj7p-rfgp-852x","Warn: Project is vulnerable to: GHSA-7286-pgfv-vxvh","Warn: Project is vulnerable to: GHSA-r978-9m6m-6gm6","Warn: Project is vulnerable to: GHSA-c27h-mcmw-48hv","Warn: Project is vulnerable to: GHSA-r6j9-8759-g62w","Warn: Project is vulnerable to: GHSA-8qv5-68g4-248j","Warn: Project is vulnerable to: GHSA-55g7-9cwv-5qfv"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 28 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-15T00:13:41.489Z","repository_id":36968961,"created_at":"2025-08-15T00:13:41.490Z","updated_at":"2025-08-15T00:13:41.490Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278686638,"owners_count":26028325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gr-oss","java","pyspark","python","scala","spark"],"created_at":"2024-11-17T08:21:14.514Z","updated_at":"2025-10-06T21:54:44.011Z","avatar_url":"https://github.com/G-Research.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spark Extension\n\nThis project provides extensions to the [Apache Spark project](https://spark.apache.org/) in Scala and Python:\n\n**[Diff](DIFF.md):** A `diff` transformation and application for `Dataset`s that computes the differences between\ntwo datasets, i.e. which rows to _add_, _delete_ or _change_ to get from one dataset to the other.\n\n**[SortedGroups](GROUPS.md):** A `groupByKey` transformation that groups rows by a key while providing\na **sorted** iterator for each group. 
Similar to `Dataset.groupByKey.flatMapGroups`, but with order guarantees\nfor the iterator.\n\n**[Histogram](HISTOGRAM.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** A `histogram` transformation that computes the histogram DataFrame for a value column.\n\n**[Global Row Number](ROW_NUMBER.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** A `withRowNumbers` transformation that provides the global row number w.r.t.\nthe current order of the Dataset, or any given order. In contrast to the existing SQL function `row_number`, which\nrequires a window spec, this transformation provides the row number across the entire Dataset without scaling problems.\n\n**[Partitioned Writing](PARTITIONING.md):** The `writePartitionedBy` action writes your `Dataset` partitioned and\nefficiently laid out with a single operation.\n\n**[Inspect Parquet files](PARQUET.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** The structure of Parquet files (the metadata, not the data stored in Parquet) can be inspected similar to [parquet-tools](https://pypi.org/project/parquet-tools/)\nor [parquet-cli](https://pypi.org/project/parquet-cli/) by reading from a simple Spark data source.\nThis simplifies identifying why some Parquet files cannot be split by Spark into scalable partitions.\n\n**[Install Python packages into PySpark job](PYSPARK-DEPS.md) [\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** Install Python dependencies via PIP or Poetry programmatically into your running PySpark job (PySpark ≥ 3.1.0):\n\n```python\n# noinspection PyUnresolvedReferences\nfrom gresearch.spark import *\n\n# using PIP\nspark.install_pip_package(\"pandas==1.4.3\", \"pyarrow\")\nspark.install_pip_package(\"-r\", \"requirements.txt\")\n\n# using Poetry\nspark.install_poetry_project(\"../my-poetry-project/\", poetry_python=\"../venv-poetry/bin/python\")\n```\n\n**[Fluent method call](CONDITIONAL.md):** `T.call(transformation: T =\u003e R): R`: Turns a 
transformation `T =\u003e R`,\nthat is not part of `T` into a fluent method call on `T`. This allows writing fluent code like:\n\n```scala\nimport uk.co.gresearch._\n\ni.doThis()\n .doThat()\n .call(transformation)\n .doMore()\n```\n\n**[Fluent conditional method call](CONDITIONAL.md):** `T.when(condition: Boolean).call(transformation: T =\u003e T): T`:\nPerform a transformation fluently only if the given condition is true.\nThis allows writing fluent code like:\n\n```scala\nimport uk.co.gresearch._\n\ni.doThis()\n .doThat()\n .when(condition).call(transformation)\n .doMore()\n```\n\n**[Shortcut for groupBy.as](https://github.com/G-Research/spark-extension/pull/213#issue-2032837105)**: Calling `Dataset.groupBy(Column*).as[K, T]`\nshould be preferred over calling `Dataset.groupByKey(V =\u003e K)` whenever possible. The former allows Catalyst to exploit\nexisting partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.\nThis can have a significant performance penalty.\n\n\u003cdetails\u003e\n\u003csummary\u003eDetails:\u003c/summary\u003e\n\nThe new column-expression-based `groupByKey[K](Column*)` method makes it easier to group by a column expression key. 
Instead of\n\n    ds.groupBy($\"id\").as[Int, V]\n\nuse:\n\n    ds.groupByKey[Int]($\"id\")\n\u003c/details\u003e\n\n**Backticks:** `backticks(string: String, strings: String*): String`: Encloses the given column name with backticks (`` ` ``) when needed.\nThis is a handy way to ensure column names with special characters like dots (`.`) work with `col()` or `select()`.\n\n**Count null values:** `count_null(e: Column)`: an aggregation function like `count` that counts null values in column `e`.\nThis is equivalent to calling `count(when(e.isNull, lit(1)))`.\n\n**.Net DateTime.Ticks[\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** Convert .Net (C#, F#, Visual Basic) `DateTime.Ticks` into Spark timestamps, seconds and nanoseconds.\n\n\u003cdetails\u003e\n\u003csummary\u003eAvailable methods:\u003c/summary\u003e\n\n```scala\n// Scala\ndotNetTicksToTimestamp(Column): Column       // returns timestamp as TimestampType\ndotNetTicksToUnixEpoch(Column): Column       // returns Unix epoch seconds as DecimalType\ndotNetTicksToUnixEpochNanos(Column): Column  // returns Unix epoch nanoseconds as LongType\n```\n\nThe reverse is provided by (all return `LongType` .Net ticks):\n```scala\n// Scala\ntimestampToDotNetTicks(Column): Column\nunixEpochToDotNetTicks(Column): Column\nunixEpochNanosToDotNetTicks(Column): Column\n```\n\nThese methods are also available in Python:\n```python\n# Python\ndotnet_ticks_to_timestamp(column_or_name)         # returns timestamp as TimestampType\ndotnet_ticks_to_unix_epoch(column_or_name)        # returns Unix epoch seconds as DecimalType\ndotnet_ticks_to_unix_epoch_nanos(column_or_name)  # returns Unix epoch nanoseconds as LongType\n\ntimestamp_to_dotnet_ticks(column_or_name)\nunix_epoch_to_dotnet_ticks(column_or_name)\nunix_epoch_nanos_to_dotnet_ticks(column_or_name)\n```\n\u003c/details\u003e\n\n**Spark temporary directory[\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server)**: Create a temporary directory that will be 
removed on Spark application shutdown.\n\n\u003cdetails\u003e\n\u003csummary\u003eExamples:\u003c/summary\u003e\n\nScala:\n```scala\nimport uk.co.gresearch.spark.createTemporaryDir\n\nval dir = createTemporaryDir(\"prefix\")\n```\n\nPython:\n```python\n# noinspection PyUnresolvedReferences\nfrom gresearch.spark import *\n\ndir = spark.create_temporary_dir(\"prefix\")\n```\n\u003c/details\u003e\n\n**Spark job description[\u003csup\u003e[*]\u003c/sup\u003e](#spark-connect-server):** Set Spark job description for all Spark jobs within a context.\n\n\u003cdetails\u003e\n\u003csummary\u003eExamples:\u003c/summary\u003e\n\n```scala\nimport uk.co.gresearch.spark._\n\nimplicit val session: SparkSession = spark\n\nwithJobDescription(\"parquet file\") {\n  val df = spark.read.parquet(\"data.parquet\")\n  val count = appendJobDescription(\"count\") {\n    df.count\n  }\n  appendJobDescription(\"write\") {\n    df.write.csv(\"data.csv\")\n  }\n}\n```\n\n| Without job description  | With job description |\n|:---:|:---:|\n| ![](without-job-description.png \"Spark job without description in UI\") | ![](with-job-description.png \"Spark job with description in UI\") |\n\nNote that setting a description in one thread while calling the action (e.g. 
`.count`) in a different thread\ndoes not work, unless the different thread is spawned from the current thread _after_ the description has been set.\n\nWorking example with parallel collections:\n\n```scala\nimport java.util.concurrent.ForkJoinPool\nimport scala.collection.parallel.CollectionConverters.seqIsParallelizable\nimport scala.collection.parallel.ForkJoinTaskSupport\n\nval files = Seq(\"data1.csv\", \"data2.csv\").par\n\nval counts = withJobDescription(\"Counting rows\") {\n  // new thread pool required to spawn new threads from this thread\n  // so that the job description is actually used\n  files.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool())\n  files.map(filename =\u003e spark.read.csv(filename).count).sum\n}(spark)\n```\n\u003c/details\u003e\n\n## Using Spark Extension\n\nThe `spark-extension` package is available for all Spark 3.2, 3.3, 3.4 and 3.5 versions. Some earlier Spark versions may also be supported.\nThe package version has the following semantics: `spark-extension_{SCALA_COMPAT_VERSION}-{VERSION}-{SPARK_COMPAT_VERSION}`:\n\n- `SCALA_COMPAT_VERSION`: Scala binary compatibility (minor) version. Available are `2.12` and `2.13`.\n- `SPARK_COMPAT_VERSION`: Apache Spark binary compatibility (minor) version. Available are `3.2`, `3.3`, `3.4` and `3.5`.\n- `VERSION`: The package version, e.g. 
`2.10.0`.\n\n### SBT\n\nAdd this line to your `build.sbt` file:\n\n```sbt\nlibraryDependencies += \"uk.co.gresearch.spark\" %% \"spark-extension\" % \"2.14.0-3.5\"\n```\n\n### Maven\n\nAdd this dependency to your `pom.xml` file:\n\n```xml\n\u003cdependency\u003e\n  \u003cgroupId\u003euk.co.gresearch.spark\u003c/groupId\u003e\n  \u003cartifactId\u003espark-extension_2.12\u003c/artifactId\u003e\n  \u003cversion\u003e2.14.0-3.5\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Gradle\n\nAdd this dependency to your `build.gradle` file:\n\n```groovy\ndependencies {\n    implementation \"uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5\"\n}\n```\n\n### Spark Submit\n\nSubmit your Spark app with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```shell script\nspark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5 [jar]\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.\n\n### Spark Shell\n\nLaunch a Spark Shell with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```shell script\nspark-shell --packages uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark Shell version.\n\n### Python\n\n#### PySpark API\n\nStart a PySpark session with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```python\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession \\\n    .builder \\\n    .config(\"spark.jars.packages\", \"uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5\") \\\n    .getOrCreate()\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.\n\n#### PySpark REPL\n\nLaunch the Python Spark REPL with the Spark Extension dependency (version ≥1.1.0) as follows:\n\n```shell script\npyspark --packages 
uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your PySpark version.\n\n#### PySpark `spark-submit`\n\nRun your Python scripts that use PySpark via `spark-submit`:\n\n```shell script\nspark-submit --packages uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5 [script.py]\n```\n\nNote: Pick the right Scala version (here 2.12) and Spark version (here 3.5) depending on your Spark version.\n\n#### PyPi package (local Spark cluster only)\n\nYou may want to install the `pyspark-extension` python package from PyPi into your development environment.\nThis provides you code completion, typing and test capabilities during your development phase.\n\nRunning your Python application on a Spark cluster will still require one of the above ways\nto add the Scala package to the Spark environment.\n\n```shell script\npip install pyspark-extension==2.14.0.3.5\n```\n\nNote: Pick the right Spark version (here 3.5) depending on your PySpark version.\n\n### Your favorite Data Science notebook\n\nThere are plenty of [Data Science notebooks](https://datasciencenotebook.org/) around. To use this library,\nadd **a jar dependency** to your notebook using these **Maven coordinates**:\n\n    uk.co.gresearch.spark:spark-extension_2.12:2.14.0-3.5\n\nOr [download the jar](https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension) and place it\non a filesystem where it is accessible by the notebook, and reference that jar file directly.\n\nCheck the documentation of your favorite notebook to learn how to add jars to your Spark environment.\n\n## Known issues\n### Spark Connect Server\n\nMost features are not supported **in Python** in conjunction with a [Spark Connect server](https://spark.apache.org/docs/latest/spark-connect-overview.html).\nThis also holds for Databricks Runtime environment 13.x and above. 
Details can be found [in this blog](https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/).\n\nCalling any of those features when connected to a Spark Connect server will raise this error:\n\n    This feature is not supported for Spark Connect.\n\nUse a classic connection to a Spark cluster instead.\n\n## Build\n\nYou can build this project against different versions of Spark and Scala.\n\n### Switch Spark and Scala version\n\nTo build for a Spark or Scala version that differs from the one defined in the `pom.xml` file, run:\n\n```shell script\nsh set-version.sh [SPARK-VERSION] [SCALA-VERSION]\n```\n\nFor example, switch to Spark 3.5.0 and Scala 2.13.8 by running `sh set-version.sh 3.5.0 2.13.8`.\n\n### Build the Scala project\n\nExecute `mvn package` to create a jar from the sources. The jar can be found in `target/`.\n\n## Testing\n\nRun the Scala tests via `mvn test`.\n\n### Setup Python environment\n\nTo run the Python tests, set up a Python environment as follows:\n\n```shell script\nvirtualenv -p python3 venv\nsource venv/bin/activate\npip install python/[test]\n```\n\n### Run Python tests\n\nRun the Python tests via `env PYTHONPATH=python/test python -m pytest python/test`.\n\n### Build Python package\n\nRun the following commands in the project root directory to build a wheel (`.whl`) from the sources:\n\n```shell script\npip install build\npython -m build python/\n```\n\nThe wheel can be found in `python/dist/`.\n\n## Publications\n\n- ***Guaranteeing in-partition order for partitioned-writing in Apache Spark**, Enrico Minack, 20/01/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/guaranteeing-in-partition-order-for-partitioned-writing-in-apache-spark/\n- ***Un-pivot, sorted groups and many bug fixes: Celebrating the first Spark 3.4 release**, Enrico Minack, 
21/03/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/un-pivot-sorted-groups-and-many-bug-fixes-celebrating-the-first-spark-3-4-release/\n- ***A PySpark bug makes co-grouping with window function partition-key-order-sensitive**, Enrico Minack, 29/03/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/a-pyspark-bug-makes-co-grouping-with-window-function-partition-key-order-sensitive/\n- ***Spark’s groupByKey should be avoided – and here’s why**, Enrico Minack, 13/06/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/sparks-groupbykey-should-be-avoided-and-heres-why/\n- ***Inspecting Parquet files with Spark**, Enrico Minack, 28/07/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/parquet-files-know-your-scaling-limits/\n- ***Enhancing Spark’s UI with Job Descriptions**, Enrico Minack, 12/12/2023*:\u003cbr/\u003ehttps://www.gresearch.com/blog/article/enhancing-sparks-ui-with-job-descriptions/\n- ***PySpark apps with dependencies: Managing Python dependencies in code**, Enrico Minack, 24/01/2024*:\u003cbr/\u003ehttps://www.gresearch.com/news/pyspark-apps-with-dependencies-managing-python-dependencies-in-code/\n- ***Observing Spark Aggregates: Cheap Metrics from Datasets**, Enrico Minack, 06/02/2024*:\u003cbr/\u003ehttps://www.gresearch.com/news/observing-spark-aggregates-cheap-metrics-from-datasets-2/\n\n## Security\n\nPlease see our [security policy](https://github.com/G-Research/spark-extension/blob/master/SECURITY.md) for details on reporting security vulnerabilities.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg-research%2Fspark-extension","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fg-research%2Fspark-extension","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fg-research%2Fspark-extension/lists"}