https://github.com/uber/remoteshuffleservice
Remote shuffle service for Apache Spark to store shuffle data on remote servers.
- Host: GitHub
- URL: https://github.com/uber/remoteshuffleservice
- Owner: uber
- License: other
- Created: 2020-08-20T17:41:23.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-09-29T12:15:57.000Z (over 2 years ago)
- Last Synced: 2024-05-09T07:59:23.677Z (over 1 year ago)
- Language: Java
- Size: 1.47 MB
- Stars: 318
- Watchers: 19
- Forks: 98
- Open Issues: 31
Metadata Files:
- Readme: README.md
- License: LICENSE
# Uber Remote Shuffle Service (RSS)
Uber Remote Shuffle Service provides the capability for Apache Spark applications to store shuffle data
on remote servers. See more details on Spark community document:
[[SPARK-25299][DISCUSSION] Improving Spark Shuffle Reliability](https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit?ts=5e3c57b8).
Please contact us (**remoteshuffleservice@googlegroups.com**) with any questions or feedback.
## Supported Spark Version
- The **master** branch supports **Spark 2.4.x**. The **spark30** branch supports **Spark 3.0.x**.
## How to Build
Make sure JDK 8+ and Maven are installed on your machine.
### Build RSS Server
- Run:
```
mvn clean package -Pserver -DskipTests
```
This command creates the **remote-shuffle-service-xxx-server.jar** file for the RSS server, e.g. target/remote-shuffle-service-0.0.9-server.jar.
### Build RSS Client
- Run:
```
mvn clean package -Pclient -DskipTests
```
This command creates the **remote-shuffle-service-xxx-client.jar** file for the RSS client, e.g. target/remote-shuffle-service-0.0.9-client.jar.
## How to Run
### Step 1: Run RSS Server
- Pick a server in your environment, e.g. `server1`, and run the RSS server jar file (**remote-shuffle-service-xxx-server.jar**) as a Java application, for example:
```
java -Dlog4j.configuration=log4j-rss-prod.properties -cp target/remote-shuffle-service-0.0.9-server.jar com.uber.rss.StreamServer -port 12222 -serviceRegistry standalone -dataCenter dc1
```
### Step 2: Run Spark application with RSS Client
- Upload the client jar file (**remote-shuffle-service-xxx-client.jar**) to your HDFS, e.g. `hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar`
- Add configuration like the following to your Spark application (adjust the values for your environment):
```
spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.shuffle.rss.serviceRegistry.type=standalone
spark.shuffle.rss.serviceRegistry.server=server1:12222
spark.shuffle.rss.dataCenter=dc1
```
- Run your Spark application
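As an alternative to setting these values in a properties file, the same settings can be passed on the `spark-submit` command line; a sketch, where the application class `com.example.MyApp` and jar `my-app.jar` are placeholders for your own application:

```shell
spark-submit \
  --conf spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar \
  --conf spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager \
  --conf spark.shuffle.rss.serviceRegistry.type=standalone \
  --conf spark.shuffle.rss.serviceRegistry.server=server1:12222 \
  --conf spark.shuffle.rss.dataCenter=dc1 \
  --class com.example.MyApp my-app.jar
```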
## Run with High Availability
Remote Shuffle Service can use an [Apache ZooKeeper](https://zookeeper.apache.org/) cluster to register live service
instances. Spark applications then look up active Remote Shuffle Service instances in ZooKeeper. In this configuration,
ZooKeeper serves as a **Service Registry** for Remote Shuffle Service, and the following parameters must be added when
starting the RSS server and the Spark application.
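Conceptually, a service registry maps a data center to its set of live RSS instances: servers register themselves on startup, and clients look up the live set. A minimal in-memory sketch of that idea (class and method names are hypothetical, not the actual RSS API):

```java
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a service registry: servers register a host:port
// under their data center; clients look up live instances for that data
// center. The real RSS registry backs this with ZooKeeper (or standalone mode).
public class ServiceRegistrySketch {
    private final Map<String, Set<String>> instancesByDataCenter = new ConcurrentHashMap<>();

    // Called by a server on startup, e.g. register("dc1", "server1:12222").
    public void register(String dataCenter, String hostPort) {
        instancesByDataCenter
            .computeIfAbsent(dataCenter, dc -> ConcurrentHashMap.newKeySet())
            .add(hostPort);
    }

    // Called by a client to discover live instances in its data center.
    public List<String> lookup(String dataCenter) {
        return new ArrayList<>(
            instancesByDataCenter.getOrDefault(dataCenter, Collections.emptySet()));
    }

    public static void main(String[] args) {
        ServiceRegistrySketch registry = new ServiceRegistrySketch();
        registry.register("dc1", "server1:12222"); // mirrors the example above
        System.out.println(registry.lookup("dc1"));
    }
}
```

In the real deployment, ZooKeeper plays the role of this map, which is what lets servers and Spark applications discover each other without a hard-coded server list.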
### Step 1: Run RSS Server with ZooKeeper as service registry
- Assume there is a ZooKeeper server `zkServer1`. Pick a server in your environment, e.g. `server1`, and run the RSS server jar file (**remote-shuffle-service-xxx-server.jar**) as a Java application on `server1`, for example:
```
java -Dlog4j.configuration=log4j-rss-prod.properties -cp target/remote-shuffle-service-0.0.9-server.jar com.uber.rss.StreamServer -port 12222 -serviceRegistry zookeeper -zooKeeperServers zkServer1:2181 -dataCenter dc1
```
### Step 2: Run Spark application with RSS Client and ZooKeeper service registry
- Upload the client jar file (**remote-shuffle-service-xxx-client.jar**) to your HDFS, e.g. `hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar`
- Add configuration like the following to your Spark application (adjust the values for your environment):
```
spark.jars=hdfs:///file/path/remote-shuffle-service-0.0.9-client.jar
spark.executor.extraClassPath=remote-shuffle-service-0.0.9-client.jar
spark.shuffle.manager=org.apache.spark.shuffle.RssShuffleManager
spark.shuffle.rss.serviceRegistry.type=zookeeper
spark.shuffle.rss.serviceRegistry.zookeeper.servers=zkServer1:2181
spark.shuffle.rss.dataCenter=dc1
```
- Run your Spark application