Experimental Spark master and worker with `spark-shell` and `pyspark`.

https://github.com/mendhak/docker-spark-experimental
A Docker container providing the `spark-shell` and `pyspark` shells. Based on Ubuntu 16.04, with Spark 2.3.1 and Hadoop 2.7.
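For orientation, here is a minimal sketch of how such an image might be assembled. This is illustrative only, not the repo's actual Dockerfile; the package choices, install path, and environment variables are assumptions.

```dockerfile
# Hypothetical sketch -- not this repo's actual Dockerfile.
FROM ubuntu:16.04

# Spark needs a JRE; headless is enough for the master, worker, and shells.
RUN apt-get update && apt-get install -y --no-install-recommends \
        openjdk-8-jre-headless curl ca-certificates python3 \
    && rm -rf /var/lib/apt/lists/*

# Fetch the Spark 2.3.1 build bundled with Hadoop 2.7 libraries.
RUN curl -fsSL https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz \
        | tar -xz -C /opt \
    && ln -s /opt/spark-2.3.1-bin-hadoop2.7 /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PYSPARK_PYTHON=python3
ENV PATH="$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH"
```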

### Prepare containers

First, build the image:

```bash
docker build -t spark .
```

In one terminal, start the master and worker:

```bash
docker-compose up
```

- Browse to the master's web UI: http://127.0.0.1:8080/
- Browse to the worker's web UI: http://127.0.0.1:8081/
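The service names and ports above come from the repo's `docker-compose.yml`. For reference, a minimal Compose file along these lines would produce the same layout; the service names and ports match the rest of this README, but the commands and the `/opt/spark` layout are assumptions, not the repo's actual file:

```yaml
# Hypothetical sketch -- not this repo's actual docker-compose.yml.
# Service names and ports are taken from this README; paths are assumed.
version: "2"
services:
  spark-master:
    image: spark
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master
    ports:
      - "8080:8080"   # master web UI
      - "7077:7077"   # cluster port the shells connect to
  spark-worker:
    image: spark
    command: /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
    depends_on:
      - spark-master
    ports:
      - "8081:8081"   # worker web UI
```

Because Compose puts both services on one network, the worker (and the shells below) can reach the master by its service name, `spark-master`.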

### Try spark-shell

In another terminal, open a bash shell inside the master container:

```bash
docker-compose exec spark-master bash
```

From that shell, start a new Spark shell, pointing it at the master:

```bash
spark-shell --master spark://spark-master:7077
```

Try this Monte Carlo estimate of π to see it working: it samples random points in the unit square, and the fraction that lands inside the quarter circle approaches π/4.

```scala
val NUM_SAMPLES = 100000000
// Count how many random points in the unit square fall inside the quarter circle.
val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
  val x = math.random
  val y = math.random
  x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
```

While it runs, watch the job progress on the master's web UI.

Exit the Scala shell using `:quit`.

### Try pyspark

In the master's bash, start a new pyspark shell, again specifying the master:

```bash
pyspark --master spark://spark-master:7077
```

Try the same π estimate in Python to see it working:

```python
import random

num_samples = 100000000

def inside(p):
    # Is a random point in the unit square inside the quarter circle?
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()

# 4.0 keeps the division from truncating to zero under Python 2.
pi = 4.0 * count / num_samples
print(pi)

sc.stop()
```

While it runs, watch the job progress on the master's web UI.

Exit the pyspark shell using `quit()`.