## Spark and JVM Remote Execution ##

This project essentially allows you to replicate a local JVM process on one or more remote hosts using only SSH. You can then execute any local code on the remote hosts with full STDOUT/STDERR streaming back to the local process.

**This is extremely handy for Spark jobs, because it allows running Spark jobs from your local IDE on a remote cluster just like any other application.**
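For a quick taste, here is a minimal sketch of what that looks like (the hostname is a placeholder, and the `GridConfig`/`GridExecutor.future` calls are the same ones used in the Examples section below):

```scala
import com.hindog.grid._

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object RemoteHelloExample extends App {
  // Single remote node reachable via password-less SSH (replace the host)
  val config = GridConfig("demo", RemoteNodeConfig("server1.example.com", "server1"))

  // The block runs in a replicated JVM on the remote host; its output
  // streams back to the local STDOUT
  val fut: Future[Unit] = GridExecutor.future(config) {
    println("Hello from " + java.net.InetAddress.getLocalHost.getHostName)
  }

  Await.result(fut, 1.minute)
}
```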

#### Features ####

* Zero-deployment remote JVM execution. Automatically replicates the local classpath to the remote target(s) while caching JARs on the remote host for faster execution on repeated runs.
* Support for remote Spark/Hadoop execution from the IDE for fast, iterative development and feedback (i.e. `spark-submit` or `hadoop` on a Hadoop gateway box, without manually uploading jars).
* Implements `ExecutorService` to support submitting `Runnable` and/or `Callable[T]` to the grid nodes.
* Contains hooks for Scala `Future[T]` to allow for transparent grid execution by wrapping the `GridExecutor` in a Scala `ExecutionContext`.
* By default, the library will bind remote STDOUT/STDERR to local STDOUT/STDERR and optionally STDIN can be bound as well.
* Can be integrated with [JClouds](https://jclouds.apache.org/) to provision grids on-the-fly.
* Open source under the Apache 2.0 License.
* Support for "IDE-pimped" `spark-shell` that gives you full power of the IDE's completion/import/copy-paste support while interacting with a Spark shell running remotely on the cluster! (See [SparkShellExample.scala](https://github.com/hindog/grid-executor/blob/master/grid-executor-examples/src/main/scala/com/hindog/grid/examples/SparkShellExample.scala) for instructions)

#### Import ####

Using SBT:

```scala
libraryDependencies += "com.hindog.grid" %% "grid-executor-core" % "2.0.7"
```

Using Maven:

```xml
<dependency>
    <groupId>com.hindog.grid</groupId>
    <artifactId>grid-executor-core_2.11</artifactId>
    <version>2.0.7</version>
</dependency>
```

Package Import:

```scala
import com.hindog.grid._
```

#### Configuration ####

Configuration is provided via the `GridConfig` builder methods, a properties file, or both. All property keys are prefixed with `grid.`.

```

# Configures your remote username (if different from your local username)
grid.remote\:account.=

# Adds JVM arg to ALL remote executions
grid.jvm\:xx\:permgen=-XX:MaxPermSize=768M

# Sets GridKit's "remote-runtime:jar-cache" property to determine where to store jars remotely (here we override the default [/tmp/nanocloud] to the user's [~/.jar-cache] on the remote box)
grid.remote-runtime\:jar-cache=.jar-cache

# Sets GridKit's "node:config-trace" property to dump ViEngine config on startup
grid.node\:config-trace=true

# Adds JVM args specific to remote executions on 'myGrid' nodes
grid.jvm\:xx\:mx.myGrid=-Xmx8g
grid.jvm\:exec-command.myGrid=/path/to/java

# User override to enable remote debug server that client can connect to
#grid.jvm\:xx\:debug.myGrid.ahiniker=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5004

# User override to enable remote debug client that will connect to our IDE on startup (replace the IP with your laptop's IP)
#grid.jvm\:xx\:debug.myGrid.ahiniker=-agentlib:jdwp=transport=dt_socket,server=n,address=10.170.1.45:5004,suspend=y
```
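The same settings can be supplied programmatically. A minimal sketch, explicitly loading a `grid.properties` classpath resource (the load-it-yourself approach is an assumption here; the chained `withPropertyOverrides` calls mirror the examples below):

```scala
import java.util.Properties

import com.hindog.grid._

object ConfiguredGridExample {
  // Explicitly load overrides from a `grid.properties` classpath resource
  val fileProps = new Properties()
  fileProps.load(getClass.getResourceAsStream("/grid.properties"))

  // Builder settings first, then system property overrides, then file overrides
  val config: GridConfig =
    GridConfig("myGrid", RemoteNodeConfig("server1.example.com", "server1"))
      .withPropertyOverrides(System.getProperties)
      .withPropertyOverrides(fileProps)
}
```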

#### Examples ####

All of the examples below assume the following imports and base trait, which provide some default grid definitions. The trait also references a `grid.properties` classpath resource containing any configuration properties as outlined above.

NOTE: the grid definitions refer to `server1.example.com` and `server2.example.com`; replace these with hostnames configured on your network. Password-less SSH must be configured for each host.

```scala
package com.hindog.grid.examples

import java.lang.management.ManagementFactory
import java.util.concurrent.Callable

import com.hindog.grid.GridConfigurable.Hook
import com.hindog.grid._

import scala.concurrent.duration._
import scala.concurrent._

trait GridExampleApp extends App {

  // Configuration properties loaded from the `grid.properties` classpath resource
  val properties = {
    val props = new java.util.Properties()
    props.load(getClass.getResourceAsStream("/grid.properties"))
    props
  }

  def message(msg: String = "Hello!") = s"$msg [thread: " + Thread.currentThread().getId + " within process: " + ManagementFactory.getRuntimeMXBean().getName() + "]"

  val configOneRemote: GridConfig = GridConfig(
    "myGrid",
    RemoteNodeConfig("server1.example.com", "server1") // host + alias
  ).withPropertyOverrides(System.getProperties).withPropertyOverrides(properties)

  val configTwoRemote: GridConfig = GridConfig(
    "myGrid",
    RemoteNodeConfig("server1.example.com", "server1"), // host + alias
    RemoteNodeConfig("server2.example.com", "server2")  // host + alias
  ).withPropertyOverrides(System.getProperties).withPropertyOverrides(properties)
}
```

#### Scala ExecutionContext / Future Example ####

Demonstrates how to use `GridExecutor` with Scala's `Future[T]` natively, without any GridExecutor-specific code. Parallel collections are not supported.

```scala
object GridExecutorScalaFutureExample extends GridExampleApp {

  val remoteNodeConfig = configOneRemote.nodes.head

  val remoteNode1 = remoteNodeConfig.withName(remoteNodeConfig.name + "-1")
  val remoteNode2 = remoteNodeConfig.withName(remoteNodeConfig.name + "-2")
  val localNode1  = LocalNodeConfig("local-1")
  val localNode2  = LocalNodeConfig("local-2")

  // Set our config to use 2 remote and 2 local execution slots (4 total)
  val config2 = configOneRemote.withNodes(remoteNode1, remoteNode2, localNode1, localNode2)

  // Create an implicit ExecutionContext to execute against
  implicit val ec = ExecutionContext.fromExecutorService(GridExecutor(config2))

  // No references to GridExecutor are present in the following code:
  // Scala's Future runs the tasks in parallel on the remote JVMs.
  // Thread.sleep simulates real work, so the total time should show that
  // parallel execution completed the work faster than sequential execution.
  val start = System.currentTimeMillis()

  val futures = (0 to 20).map(i => Future {
    println(message(s"executing task $i"))
    Thread.sleep(1000)
    s"result $i"
  })

  val results = Await.result(Future.sequence(futures), Duration.Inf)
  println(s"results = $results")
  println("total time: " + (System.currentTimeMillis() - start) + "ms")
  ec.shutdown()
}
```

#### Single-use Example ####

Demonstrates how to initialize a grid whose life-cycle is scoped to a single task.

NOTE: the overhead of instantiating the cloud is incurred on each invocation (as part of the future), but once the jars have synced, subsequent invocations will have reduced overhead.

```scala
object GridExecutorSingleFutureExample extends GridExampleApp {

  import scala.collection.JavaConverters._
  import scala.concurrent.ExecutionContext.Implicits.global

  // Run a single task on the remote node and return its environment variables
  val fut = GridExecutor.future(configOneRemote) {
    println(message())
    System.getenv().asScala.toSeq.sortBy(_._1)
  }

  Await.result(fut, Duration.Inf).foreach(kv => println(kv._1 + "=" + kv._2))
}
```

#### Multi-use Example ####

Demonstrates how to initialize a grid whose life-cycle is scoped to the supplied block (`thunk`). Can be used to submit multiple tasks in an ad-hoc fashion.

```scala
object GridExecutorScopedMultiUseExample extends GridExampleApp {

  // Submit 2 tasks and print their results
  GridExecutor.withInstance(configTwoRemote) { executor =>
    val fut1 = executor.submit(new Callable[String] {
      override def call(): String = {
        println("started task A")
        Thread.sleep(5000)
        message("result A")
      }
    })

    val fut2 = executor.submit(new Callable[String] {
      override def call(): String = {
        println("started task B")
        Thread.sleep(5000)
        message("result B")
      }
    })

    println("Future 1 result: " + fut1.get())
    println("Future 2 result: " + fut2.get())
  }
}
```
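Because `GridExecutor` implements `ExecutorService` (see Features), the standard JDK bulk-submission API also works against the grid. A minimal sketch, assuming the `configTwoRemote` definition from the example base trait (`invokeAll` is plain `java.util.concurrent` API, not a GridExecutor-specific method):

```scala
import java.util.concurrent.Callable

import com.hindog.grid._

import scala.collection.JavaConverters._

object GridExecutorInvokeAllExample extends GridExampleApp {

  GridExecutor.withInstance(configTwoRemote) { executor =>
    // Build a batch of Callables and submit them all at once
    val tasks = (1 to 4).map { i =>
      new Callable[String] {
        override def call(): String = message(s"result $i")
      }
    }

    // invokeAll blocks until every task in the batch has completed
    val results = executor.invokeAll(tasks.asJava)
    results.asScala.foreach(f => println(f.get()))
  }
}
```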

#### Startup/Shutdown Hooks Example ####

Startup/shutdown hooks allow arbitrary code to be registered for execution as part of each node's startup or shutdown sequence.

```scala
object GridExecutorScopedWithInitializationExample extends Logging {

  var globalValue: String = "default value"

  def main(args: Array[String]) = {

    import scala.concurrent.ExecutionContext.Implicits.global

    // Define an initialization process by using 'addStartupHook(new Hook("name") {...})'.
    // Below we add a hook that sets 'globalValue' on the remote box on startup.
    val baseConfig: GridConfig = GridConfig(
      "myGrid",
      RemoteNodeConfig("server1.example.com", "server1")
    ).withPropertyOverrides(System.getProperties)

    val config = baseConfig.addStartupHook(new Hook("my init hook") {
      override def run(): Unit = {
        // modify our global variable
        println("running initialization...")
        println(GridExecutorScopedWithInitializationExample.globalValue)
        GridExecutorScopedWithInitializationExample.globalValue = "initialized value"
      }
    }).addShutdownHook(new Hook("my shutdown hook") {
      override def run(): Unit = {
        info("running delay")
        Thread.sleep(1000)
      }
    })

    val fut = GridExecutor.future(config) {
      // should return the value set via our initialization hook
      GridExecutorScopedWithInitializationExample.globalValue
    }

    // should reflect the init'ed value
    println("remote globalValue: " + Await.result(fut, Duration.Inf))
    // local value should be unchanged
    println("local globalValue: " + globalValue)
  }
}
```

#### Local Fork Example ####

Demonstrates how to configure a local node that can be used for running code in a forked JVM.

```scala
object GridExecutorLocalForkExample extends App {

  import scala.concurrent.ExecutionContext.Implicits.global

  println("host jvm: " + ManagementFactory.getRuntimeMXBean.getName)

  // Two forked local JVMs with different heap sizes
  val config1: GridConfig = GridConfig.localFork("fork 1").withMaxHeap("20m").withMinHeap("20m")
  val config2: GridConfig = GridConfig.localFork("fork 2").withMaxHeap("40m").withMinHeap("40m")

  val fut1: Future[Unit] = GridExecutor.future(config1) {
    println("forked jvm 1: " + ManagementFactory.getRuntimeMXBean.getName)
    println("total memory 1: " + Runtime.getRuntime.totalMemory())
    Thread.sleep(5000)
  }

  val fut2: Future[Unit] = GridExecutor.future(config2) {
    println("forked jvm 2: " + ManagementFactory.getRuntimeMXBean.getName)
    println("total memory 2: " + Runtime.getRuntime.totalMemory())
    Thread.sleep(5000)
  }

  Await.result(Future.sequence(Seq(fut1, fut2)), 10.seconds)
}
```

### Gotchas ###

#### Spark / Hadoop Dependencies ####
For remote Spark/Hadoop execution, if your `App` class contains method signatures that reference classes from `provided` cluster jars, execution will fail unless those libraries are configured for `compile` scope. Another work-around is to remove all traces of such classes from your `App` class's method/field signatures and delegate to a separate class containing your job's logic from within the body of the `run` method (method bodies aren't validated by the JVM on startup); a sketch of this pattern follows. This will be addressed in an upcoming `2.0` release.
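
A minimal sketch of that delegation pattern (the class names and `SparkSession` usage are illustrative, not part of this library; the point is that no `provided` classes appear in the entry point's signatures):

```scala
import org.apache.spark.sql.SparkSession

// Entry point: no `provided` classes (e.g. Spark) appear in any of its
// method or field signatures, so it loads before cluster jars are available
object MyJobApp extends App {
  // The Spark-touching code is only reached at runtime, inside a method
  // body, which the JVM does not verify until it actually executes
  new JobLogic().run(args)
}

// All Spark references are isolated in this class
class JobLogic {
  def run(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("my-job").getOrCreate()
    try {
      println("row count: " + spark.range(1000).count())
    } finally {
      spark.stop()
    }
  }
}
```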

#### Auth Errors ####
If you experience a `JSchAuthCancelException` or similar when running, it is most likely because your SSH key is not of the required minimum length (2048 bits). Try generating a new key that is at least 2048 bits in length.
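
For example, a fresh 4096-bit RSA key can be generated and installed on a target host with the standard OpenSSH tools (the host and comment are placeholders):

```
ssh-keygen -t rsa -b 4096 -C "you@example.com"
ssh-copy-id user@server1.example.com
```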

### TODO ###

#### Spark Shell ####
Tutorial, video, or animated GIF showing how to configure the IDE-pimped shell.

#### TypeSafe Config Support ####
Upcoming `2.x` release will have an overhauled configuration process that allows for nested/inherited grid configs. This will minimize the effort required for configuration while also providing good flexibility for per-grid, per-host, per-job, per-user configuration options, etc.

#### Tutorials / Documentation ####
***Coming Soon***