Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alonsodomin/sbt-spark
Simple SBT plugin to configure Spark applications
- Host: GitHub
- URL: https://github.com/alonsodomin/sbt-spark
- Owner: alonsodomin
- License: mit
- Created: 2017-01-28T11:45:00.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2023-05-24T12:57:21.000Z (over 1 year ago)
- Last Synced: 2023-08-03T06:56:37.978Z (over 1 year ago)
- Topics: boilerplate, sbt, scala, spark
- Language: Scala
- Homepage:
- Size: 90.8 KB
- Stars: 24
- Watchers: 3
- Forks: 4
- Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SBT Spark
[![Build Status](https://travis-ci.org/alonsodomin/sbt-spark.svg?branch=master)](https://travis-ci.org/alonsodomin/sbt-spark)
[![Latest version](https://index.scala-lang.org/alonsodomin/sbt-spark/sbt-spark/latest.svg?color=green)](https://index.scala-lang.org/alonsodomin/sbt-spark)

This is a simple plugin that adds all the boilerplate you need to configure a Spark application in SBT, so you do not have to.

## Getting started
Add the following line to your `project/plugins.sbt` file:
```scala
addSbtPlugin("com.github.alonsodomin" % "sbt-spark" % "x.y.z")
```

Then enable the plugin in your `build.sbt` file:
```scala
enablePlugins(SparkPlugin)
```

Write your Spark app:
```scala
import org.apache.spark._

object SimpleSparkApp {
  def main(args: Array[String]): Unit = {
    // Local single-threaded master so the example runs without a cluster
    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("Simple Spark Application")
      .set("spark.logConf", "false")

    val sc = new SparkContext(conf)
    val count = sc.parallelize(Seq("Hello", "from", "Spark"), 1).count()
    println(s"Count result: $count")

    sc.stop()
  }
}
```

And run it!
```bash
sbt run
```

If you want to package it so you can run it on your Spark cluster, use the `assembly` command:
```bash
sbt assembly
```

Your application package should now be found under the `target` folder of your local copy. And now you're all set, ready to start writing your awesome Spark application!
## Migrating to 0.4.0
In versions prior to 0.4.0 the plugin was enabled by default as soon as the plugin dependency was added to the project.
This did not play well with multi-module setups, since it led to the annoyance of having to disable it explicitly in
all the modules that did not require it, including the root project.

Starting at 0.4.0, the plugin needs to be enabled explicitly. This means adding one single line to single-module projects
(as stated in the _Getting Started_ section) and allows users of multi-module setups to choose which modules require
the Spark features and which don't, as in the sketch below.
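For example, a minimal multi-module `build.sbt` could enable the plugin only in the module that actually needs Spark (the module names `core` and `jobs` here are purely illustrative):

```scala
// build.sbt (sketch) -- module names `core` and `jobs` are hypothetical

// A plain Scala module: no Spark boilerplate added here
lazy val core = (project in file("core"))

// Only this module gets the Spark setup (provided Spark deps, assembly defaults, run task)
lazy val jobs = (project in file("jobs"))
  .dependsOn(core)
  .enablePlugins(SparkPlugin)
```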
## Usage

### Choosing the Spark version
By default the plugin will use Spark `2.4.3`. If you want to use a different version just put the following in your `build.sbt`:
```scala
sparkVersion := "1.6.3"
```

### Adding Spark components (modules) to my build
By default the plugin will only put `spark-core` on your classpath. If you want to use any additional Spark module, just
use the following syntax in your `build.sbt` file:

```scala
sparkComponents += "sql"
```

or
```scala
sparkComponents ++= Seq("sql", "mllib")
```

This is equivalent to having the following in your `build.sbt` file:
```scala
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion.value % Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion.value % Provided,
"org.apache.spark" %% "spark-mllib" % sparkVersion.value % Compile
)
```

As you can see, the plugin will also handle the dependency scope properly, meaning that the `sql` component will be
put in the `Provided` scope whilst the `mllib` one will be packaged with your app.

### Exclude Spark transitive dependencies
Transitive dependencies brought by any of the Spark modules added to the project via the `sparkComponents` setting can be excluded using the `sparkExclusionRules` setting:
```scala
sparkExclusionRules += ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")
```

That setting applies to every Spark component you have configured, removing the boilerplate of having to add similar exclusion rules to each of your Spark dependencies, as sketched below.
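For reference, this is roughly the hand-written equivalent the setting saves you from (a sketch, assuming only the `core` and `sql` components are configured; the same rule would have to be repeated for every Spark dependency):

```scala
// Manual equivalent (sketch): the same ExclusionRule attached to each Spark dependency
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % sparkVersion.value % Provided)
    .excludeAll(ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12")),
  ("org.apache.spark" %% "spark-sql" % sparkVersion.value % Provided)
    .excludeAll(ExclusionRule(organization = "org.slf4j", name = "slf4j-log4j12"))
)
```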
### Override scope for individual components
Using the `sparkComponentScope` key you can configure the actual dependency scope of each of the Spark modules:
```scala
sparkComponentScope += ("sql" -> Compile)
```

That will put the `spark-sql` module in the `Compile` scope (and therefore make it part of the final assembly JAR).
### Publishing artifacts
Although it is not usually recommended to publish assembly JARs to repositories like Nexus or Artifactory, this is often considered acceptable for Spark applications, since they are usually not the type of JAR file that others will depend upon.
This plugin allows you to do so using either the `publishLocal` or `publish` SBT tasks, without any additional configuration. You should see an artifact with a `spark` classifier published to your repository. If you are interested in customizing this, you can use the following:
```scala
sparkClassifier := "myclassifier"
```

That will generate an assembly JAR file with the `myclassifier` suffix.
### Customizing your application package
`sbt-spark` uses `sbt-assembly` with some sensible defaults. To get a package that you can deploy in your Spark cluster,
just run `sbt assembly` from the command line (as stated before).

If you need to customize your package, refer to [`sbt-assembly`'s website](https://github.com/sbt/sbt-assembly); all of its
configuration keys are available as if you were using the plugin yourself.
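For instance, overriding the merge strategy in your `build.sbt` could look like the following sketch (the specific cases shown are illustrative; adapt them to the conflicts your build actually hits):

```scala
// Sketch: tweak sbt-assembly's merge strategy; the keys come from sbt-assembly, which sbt-spark already pulls in
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat  // keep service registrations intact
  case other =>
    // fall back to the default strategy for everything else
    val previous = (assembly / assemblyMergeStrategy).value
    previous(other)
}
```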
## FAQ

### What does this plugin do?
Very little, in fact. Spark applications all share the same setup boilerplate:
* Add Spark dependencies and place them in the `provided` scope.
* Configure `sbt-assembly` to package your application in an uberjar, which in itself means setting up the proper merge strategy and removing Scala's stdlib from the final artifact.
* Re-configure SBT's `run` task so you can still run your app locally with the correct classpath.

It's a PITA to repeat this all over again every time you want to start a brand new Spark application, so `sbt-spark` does it for you. Simple as that. The sketch below shows roughly what that manual boilerplate looks like.
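A hedged sketch of the manual setup, not the plugin's actual implementation; it covers the `Provided` dependency and the run-task fix (the well-known recipe from the `sbt-assembly` docs), with the assembly tweaks (merge strategy, leaving Scala's stdlib out of the uberjar) coming on top:

```scala
// build.sbt without sbt-spark (sketch)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" % Provided

// Let `sbt run` see Provided dependencies on the local classpath
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```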
### How is this different from `sbt-spark-packages`?
Well, it's not really that different; this could even be considered a slimmed-down version of the same "utility", just
catering to a different audience.

`sbt-spark-packages` is meant to be used by developers who want to write extensions on top of Spark, _packages_ that other
Spark applications can use, so it's very focused on giving you a good starting point plus a platform to distribute your packages
to other users.

`sbt-spark` is meant to be a boilerplate-free starting point for writing Spark applications. Its main audience is Spark developers
who write _end of the world_ Spark applications. `sbt-spark` could also be useful to support tooling around writing Spark applications
(e.g. test harness libraries) which still requires the same starting point but does not fit into the _Spark package_ concept.

### I'm a library author targeting different versions of Spark, can this plugin support "cross Spark compiling"?
Not yet, but it might be doable. It's also questionable whether this is the right project for it.
## MIT License
Copyright 2017 Antonio Alonso Dominguez
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.