Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Playing with Apache Spark
- Host: GitHub
- URL: https://github.com/zoltan-nz/learning-spark
- Owner: zoltan-nz
- Created: 2018-06-09T12:02:41.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-06-17T11:10:59.000Z (over 6 years ago)
- Last Synced: 2024-11-21T09:06:44.783Z (3 months ago)
- Topics: apache-spark, java, map-reduce, spark
- Language: Java
- Homepage:
- Size: 105 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Playing with Apache Spark
## First Step: Running Spark development environment
Start with the Quick Start guide from the official documentation: https://spark.apache.org/docs/latest/quick-start.html
Let's implement the Self-Contained Application example.
Source: https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
1. Create a new Maven project.
2. Add the code from the documentation.
3. Save a huge sample text in your `resources` folder.

**Issue #1**: The code from the documentation will not compile without a small change: the lambda passed to `filter` has to be cast to `(FilterFunction<String>)`.
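Why the cast is needed: `Dataset.filter` is overloaded for both Java's `FilterFunction` and Scala's `Function1`, so a bare lambda is ambiguous to the compiler. A minimal, Spark-free sketch of the same situation (the interfaces and `filter` overloads below are stand-ins of my own, not Spark's real classes):

```java
public class OverloadDemo {

    // Stand-ins for Spark's FilterFunction<T> and scala.Function1<T, Object>:
    // two functional interfaces that the same lambda shape can satisfy.
    interface FilterFunction<T> { boolean call(T value); }
    interface Function1<T> { Object apply(T value); }

    // Overloads mirroring Dataset.filter(FilterFunction) / Dataset.filter(Function1).
    static <T> String filter(FilterFunction<T> f) { return "FilterFunction"; }
    static <T> String filter(Function1<T> f) { return "Function1"; }

    public static void main(String[] args) {
        // filter(s -> s.contains("a")) would not compile here: the lambda
        // matches both overloads. The cast picks one, exactly as in Spark:
        String chosen = filter((FilterFunction<String>) s -> s.contains("a"));
        System.out.println(chosen); // prints "FilterFunction"
    }
}
```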
**Issue #2**: It is much easier to debug your code if you run Spark in local mode.
* Allowed Master URLs: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
`.config("spark.master", "local")`
The corrected code:

```java
package nz.zoltan;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

/**
 * Self-contained Spark application from the Quick Start guide.
 */
public class App {

    public static void main(String[] args) {
        String logFile = "SimpleApp/src/main/resources/sample-text/toldi.txt"; // Should be some file on your system

        SparkSession spark = SparkSession
                .builder()
                .appName("Simple Application")
                .config("spark.master", "local")
                .getOrCreate();

        Dataset<String> logData = spark.read().textFile(logFile).cache();

        long numAs = logData.filter((FilterFunction<String>) s -> s.contains("a")).count();
        long numBs = logData.filter((FilterFunction<String>) s -> s.contains("b")).count();
        long numToldis = logData.filter((FilterFunction<String>) s -> s.contains("Toldi")).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs + ", lines with Toldi: " + numToldis);

        spark.stop();
    }
}
```

More details about single developer mode: https://stackoverflow.com/questions/38008330/spark-error-a-master-url-must-be-set-in-your-configuration-when-submitting-a
**Master URLs**: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
## Second Step: Using RDD
Read this: https://spark.apache.org/docs/latest/rdd-programming-guide.html
More details in [`RDDApp`](RDDApp/src/main/java/nz/zoltan/RDDApp.java)
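As a Spark-free reference for what the pipeline computes, here is the same word count written with plain Java streams (the class and method names below are mine, not from `RDDApp`); the Spark version distributes exactly these steps across partitions:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {

    // Same shape as the RDD pipeline: flatMap words out of lines,
    // count per word (reduceByKey), swap to (count, word), sort by count.
    static List<Map.Entry<Long, String>> countAndSort(List<String> lines) {
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))                // flatMap
                .collect(Collectors.groupingBy(w -> w, Collectors.counting())); // reduceByKey
        return counts.entrySet().stream()
                .map(e -> Map.entry(e.getValue(), e.getKey()))                   // Tuple2::swap
                .sorted(Map.Entry.comparingByKey())                              // sortByKey
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        countAndSort(List.of("a b a", "b a"))
                .forEach(t -> System.out.println(t.getKey() + ": " + t.getValue()));
        // prints "2: b" then "3: a"
    }
}
```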
Counting and sorting words in a huge text file:
```java
JavaRDD<String> lines = sc.textFile(BIG_TEXT_FILE_LOCATION).cache();

JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
JavaPairRDD<String, Integer> wordsWithOne = words.mapToPair(word -> new Tuple2<>(word, 1));
JavaPairRDD<String, Integer> wordsWithCount = wordsWithOne.reduceByKey((a, b) -> a + b);
JavaPairRDD<Integer, String> countsWithWord = wordsWithCount.mapToPair(Tuple2::swap);
JavaPairRDD<Integer, String> sortedCounts = countsWithWord.sortByKey();

sortedCounts.collect().forEach(tuple -> System.out.println(tuple._2 + ": " + tuple._1));
```

## Notes on Maven
* Using a multi-module Maven structure. More information about building a multi-module Maven project: https://books.sonatype.com/mvnex-book/reference/multimodule.html
* Add the `exec-maven-plugin` to run the app.

## Using Docker
**Creating a Dockerfile**
I created a lightweight Java and Maven container.
Inspirations:
* Java installation based on this [Dockerfile](https://github.com/docker-library/openjdk/blob/dd54ae37bc44d19ecb5be702d36d664fed2c68e4/8/jdk/alpine/Dockerfile)
* Maven installation based on this [Dockerfile](https://github.com/Zeika/alpine-maven/blob/master/jdk8/Dockerfile)
* Docker-maven [Dockerfile](https://github.com/carlossg/docker-maven)

```shell
$ docker build -t learning-spark .
$ docker run learning-spark:latest mvn -pl SimpleApp exec:java
```