Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/cwienberg/spark-sorting-helpers
Helper library for using secondary sorting in Spark RDD and Dataset operations
scala spark
- Host: GitHub
- URL: https://github.com/cwienberg/spark-sorting-helpers
- Owner: cwienberg
- License: mit
- Created: 2020-10-17T03:14:09.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-03-20T15:40:11.000Z (11 months ago)
- Last Synced: 2024-03-20T16:56:30.278Z (11 months ago)
- Topics: scala, spark
- Language: Scala
- Homepage:
- Size: 1.22 MB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
# Spark Sorting Helpers
[![build status](https://github.com/cwienberg/spark-sorting-helpers/actions/workflows/release.yml/badge.svg)](https://github.com/cwienberg/spark-sorting-helpers/actions/workflows/release.yml) [![codecov](https://codecov.io/gh/cwienberg/spark-sorting-helpers/branch/main/graph/badge.svg?token=IC5NUTYXHI)](https://codecov.io/gh/cwienberg/spark-sorting-helpers) [![Sonatype Nexus (Snapshots)](https://img.shields.io/nexus/r/https/s01.oss.sonatype.org/net.gonzberg/spark-sorting-helpers_2.12.svg)](https://s01.oss.sonatype.org/content/repositories/releases/net/gonzberg/spark-sorting-helpers_2.12/) [![Sonatype Nexus (Snapshots)](https://img.shields.io/nexus/s/https/s01.oss.sonatype.org/net.gonzberg/spark-sorting-helpers_2.12.svg)](https://s01.oss.sonatype.org/content/repositories/snapshots/net/gonzberg/spark-sorting-helpers_2.12/)
Spark Sorting Helpers is a library of convenience functions for leveraging the secondary sort functionality of Spark partitioning. Secondary sorting allows an RDD or Dataset to be partitioned by a key while sorting the values, pushing that sort into the underlying shuffle machinery. This provides an efficient way to sort values within a partition if one is already conducting a shuffle operation anyway (e.g. a join or groupBy).
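To make the idea concrete, here is a minimal sketch of secondary sorting written against plain Spark, without this library. It is not necessarily how the library implements it. The names `SecondarySortSketch`, `KeyOnlyPartitioner`, and `sortValuesWithinKeys` are hypothetical. The sketch moves the value into the shuffle key, partitions on the original key only, and lets Spark's `repartitionAndSortWithinPartitions` sort by `(key, value)` during the shuffle.

```scala
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SecondarySortSketch {

  // Hypothetical partitioner: routes a composite (key, value) record by its key only,
  // so every value for a given key lands in the same partition.
  class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
    private val delegate = new HashPartitioner(partitions)
    override def numPartitions: Int = partitions
    override def getPartition(compositeKey: Any): Int = compositeKey match {
      case (key, _) => delegate.getPartition(key)
    }
  }

  // Moves the value into the shuffle key so Spark sorts by (key, value)
  // as part of the shuffle, then restores the original pair shape.
  def sortValuesWithinKeys(rdd: RDD[(String, Int)], partitions: Int): RDD[(String, Int)] =
    rdd
      .map { case (key, value) => ((key, value), ()) }
      .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions))
      .map { case ((key, value), _) => (key, value) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("secondary-sort-sketch")
      .getOrCreate()
    try {
      val input = spark.sparkContext.parallelize(Seq("b" -> 3, "a" -> 2, "b" -> 1, "a" -> 1))
      val sorted = sortValuesWithinKeys(input, partitions = 2).collect()
      // Within each key, values now appear in ascending order.
      assert(sorted.filter(_._1 == "a").map(_._2).toSeq == Seq(1, 2))
      assert(sorted.filter(_._1 == "b").map(_._2).toSeq == Seq(1, 3))
    } finally spark.stop()
  }
}
```

The sketch guarantees ordering only within each partition. The order in which different keys appear depends on how the partitioner hashes them.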
## Usage
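A minimal `build.sbt` sketch for adding the library as a dependency, assuming the Maven coordinates implied by the badge links above (`net.gonzberg`, `spark-sorting-helpers`). The version string is a placeholder rather than a real release number, so substitute a published version.

```scala
// build.sbt (sketch): coordinates inferred from the badge URLs above.
// "<version>" is a placeholder, not an actual release.
libraryDependencies += "net.gonzberg" %% "spark-sorting-helpers" % "<version>"
```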
This library uses the extension methods pattern to add methods to RDDs or Datasets of pairs. You can import the implicits with:
```scala
import net.gonzberg.spark.sorting.implicits._
```

You can then call additional functions on certain RDDs or Datasets, e.g.
```scala
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, Int)] = ???
val groupedRDD: RDD[(String, Iterable[Int])] = rdd.sortedGroupByKey
// Each group's values arrive already sorted
groupedRDD.foreach { case (_, group) => assert(group.toSeq == group.toSeq.sorted) }
```

## Supported Versions
This library attempts to support Scala `2.11`, `2.12`, and `2.13`. Since there is not a single version of Spark which supports all three of those Scala versions, this library is built against different versions of Spark depending on the Scala version.

| Scala | Spark |
| ----- | ----- |
| 2.11 | 2.4.8 |
| 2.12 | 3.3.2 |
| 2.13 | 3.3.2 |

Other combinations of versions may also work, but these are the ones for which the tests run automatically. We will likely drop `2.11` support in a later release, depending on when it becomes too difficult to support.
## Documentation
Scaladocs are available [here](https://cwienberg.github.io/spark-sorting-helpers/).
## Development
This package is built using `sbt`. You can run the tests with `sbt test`. You can lint with `sbt scalafmt`. You can prefix a command with `+` to cross-build (e.g. `sbt +test`), though you'll need Java 8 (as opposed to Java 11) to cross-build to Scala 2.11.