# Spark-Proxy
[![Field Badge](https://img.shields.io/badge/Distributed%20Computing-pink.svg)](mailto:wangweirao16@gmail.com)
[![Field Badge](https://img.shields.io/badge/Volunteer%20Computing-purple.svg)](mailto:wangweirao16@gmail.com)

Spark-Proxy supports push-based calculation for `Spark` jobs via AOP. It intercepts task-launch messages in Spark's driver and sends them to an external `Dispatcher` that has been registered in the registry center (currently a simple Redis setup). The `Dispatcher` maintains connections with external `Worker`s, and tasks are ultimately executed on those external `Worker`s. This is only a demo for now and has much room for improvement.
### ALWAYS WORK IN PROGRESS : )

## System Design

*(System design diagram)*

## Code Structure

**Agent**

`Agent/src/main/java/fdu/daslab/SparkClientAspect.java`

**Remote Dispatcher**

`Dispatcher/src/main/java/org/apache/spark/java/dispatcher/Dispatcher.java`

**Remote Worker**

`Worker/src/main/java/org/apache/spark/worker/Worker.java`


## Quick Start

> Please make sure the version of the Spark dependency (Agent/pom.xml) is consistent with your Spark cluster.
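
One quick, unofficial sanity check is to compare the version reported by your cluster's `spark-submit` with the Spark version declared in `Agent/pom.xml` (the exact tag name in the pom is assumed here):

```shell
# Version reported by the cluster's Spark installation
$SPARK_HOME/bin/spark-submit --version

# Inspect the Spark version declared in the Agent module
grep -i "spark" Agent/pom.xml | grep -i "version"
```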

**Package:**

```shell
mvn clean scala:compile compile package
```
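
If the build succeeds, the artifacts used in the following steps (paths taken from the commands below) should be present:

```shell
ls common/target/common-1.0-SNAPSHOT.jar \
   Agent/target/Agent-1.0-SNAPSHOT.jar \
   Dispatcher/target/Dispatcher-1.0-SNAPSHOT-jar-with-dependencies.jar \
   Worker/target/Worker-1.0-SNAPSHOT-jar-with-dependencies.jar
```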

**Add Spark configuration to support AOP:**

1. Download `aspectjweaver-1.9.6.jar` from the Maven repository
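
A minimal way to fetch it from Maven Central (assuming the standard repository layout):

```shell
wget https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.6/aspectjweaver-1.9.6.jar
```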

2. Edit `$SPARK_HOME/conf/spark-env.sh` (Standalone Mode)

```shell
export SPARK_SUBMIT_OPTS="-javaagent:{DOWNLOAD_JAR_PATH}/aspectjweaver-1.9.6.jar"
```

3. Edit `$SPARK_HOME/conf/spark-defaults.conf` (Yarn Mode)

```shell
spark.executor.extraJavaOptions "-javaagent:{DOWNLOAD_JAR_PATH}/aspectjweaver-1.9.6.jar"
spark.driver.extraJavaOptions "-javaagent:{DOWNLOAD_JAR_PATH}/aspectjweaver-1.9.6.jar"
```

4. Move the AOP jars to Spark's jar directory:

```shell
mv common/target/common-1.0-SNAPSHOT.jar $SPARK_HOME/jars
mv Agent/target/Agent-1.0-SNAPSHOT.jar $SPARK_HOME/jars
# Download jedis-4.3.1.jar from https://repo1.maven.org/maven2/redis/clients/jedis/4.3.1/jedis-4.3.1.jar first
mv jedis-4.3.1.jar $SPARK_HOME/jars
```

5. Move `Agent/src/main/resources/common.properties` to the Spark conf directory.
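
For example, assuming the default `$SPARK_HOME/conf` layout:

```shell
cp Agent/src/main/resources/common.properties $SPARK_HOME/conf/
```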


**Create a home directory for the external Spark service (Dispatcher and Worker)**

1. Assume the home directory is `~/external-spark`

2. Create the environment variable: `export EXTERNAL_SPARK_HOME=~/external-spark`

3. Run the following commands to populate the home directory.

```shell
mkdir -p ${EXTERNAL_SPARK_HOME}/dispatcher-jars ${EXTERNAL_SPARK_HOME}/worker-jars
cp -r sbin/ $EXTERNAL_SPARK_HOME/sbin
cp -r conf/ $EXTERNAL_SPARK_HOME/conf
mv Dispatcher/target/Dispatcher-1.0-SNAPSHOT-jar-with-dependencies.jar $EXTERNAL_SPARK_HOME/dispatcher-jars
mv Worker/target/Worker-1.0-SNAPSHOT-jar-with-dependencies.jar $EXTERNAL_SPARK_HOME/worker-jars
mv {DOWNLOAD_JAR_PATH}/aspectjweaver-1.9.6.jar $EXTERNAL_SPARK_HOME/dispatcher-jars
```
The resulting directory tree will be:

```text
|-- conf
| |-- common.properties
| `-- hosts
|-- dispatcher-jars
| |-- aspectjweaver-1.9.6.jar
| `-- Dispatcher-1.0-SNAPSHOT-jar-with-dependencies.jar
|-- sbin
| |-- external-spark-class.sh
| |-- external-spark-daemon.sh
| |-- start-all.sh
| |-- start-dispatcher.sh
| |-- start-worker.sh
| |-- stop-all.sh
| |-- stop-dispatcher.sh
| `-- stop-worker.sh
|-- tmp
| |-- external-spark-workflow-org.apache.spark.java.dispatcher.Dispatcher.log
| |-- external-spark-workflow-org.apache.spark.java.dispatcher.Dispatcher.pid
| |-- external-spark-workflow-org.apache.spark.worker.Worker.log
| `-- external-spark-workflow-org.apache.spark.worker.Worker.pid
`-- worker-jars
    `-- Worker-1.0-SNAPSHOT-jar-with-dependencies.jar
```


**common.properties**

| Property Name | Default Value | Meaning |
|--|--|--|
|reschedule.dst.executor|external|The default value re-schedules every task to external Workers. Set this to `internal` if you do not want to re-schedule.|
|redis.host|(none)|Redis instance host, used to connect to Redis (the registry center).|
|redis.password|(none)|Redis instance password, used to connect to Redis (the registry center).|
|host.selector|RANDOM|The strategy for selecting a Worker; only random selection is currently supported.|
|dispatcher.port|(none)|The port the Dispatcher listens on.|
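
A minimal `common.properties` might look like the following; the Redis host, password, and Dispatcher port values below are placeholders, not project defaults:

```text
reschedule.dst.executor=external
redis.host=<your-redis-host>
redis.password=<your-redis-password>
host.selector=RANDOM
dispatcher.port=7078
```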

**Environment Variables**

| Variable Name | Meaning |
|--|--|
|EXTERNAL_SPARK_HOME|The HOME of external Spark.|
|EXTERNAL_SPARK_CONF_DIR|Alternate conf dir. Default is ${EXTERNAL_SPARK_HOME}/conf.|
|EXTERNAL_SPARK_PID_DIR|The directory where pid files are stored. Default is ${EXTERNAL_SPARK_HOME}/tmp.|
|EXTERNAL_SPARK_LOG_MAX_FILES|Maximum number of log files that external Spark daemons rotate. Default is 5.|
|EXTERNAL_SPARK_DISPATCHER_JAR_DIR|Dispatcher jar path. Default is ${EXTERNAL_SPARK_HOME}/dispatcher-jars.|
|EXTERNAL_SPARK_WORKER_JAR_DIR|Worker jar path. Default is ${EXTERNAL_SPARK_HOME}/worker-jars.|
|EXTERNAL_APPLICATION_JAR_DIR|Application jar path, used to load the running application's jar. Default is null.|
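
For example, a typical shell setup using the defaults from the table above might be:

```shell
export EXTERNAL_SPARK_HOME=~/external-spark
export EXTERNAL_SPARK_CONF_DIR=$EXTERNAL_SPARK_HOME/conf
export EXTERNAL_SPARK_PID_DIR=$EXTERNAL_SPARK_HOME/tmp
export EXTERNAL_SPARK_LOG_MAX_FILES=5
```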

**Host File (conf/hosts)**
```text
[dispatcher]
10.176.24.58

[worker]
10.176.24.59
10.176.24.60
```

**Submit application:**

1. Standalone Mode

- Configure `reschedule.dst.executor` in `common.properties`, which decides whether tasks are re-scheduled to internal or external executors, then move the file to `$SPARK_HOME/conf`

- Replace `jarDir` in `TaskRunner` with the example JAR path on the external executor (auto-fetching will be supported later).

- Launch the external Spark service: `bash sbin/start-all.sh`

- Submit the Spark application:

```shell
spark-submit --class org.apache.spark.examples.SparkPi --master spark://analysis-5:7077 $SPARK_HOME/examples/jars/spark-examples_2.12-3.1.2.jar 10
```

2. Yarn Mode

- Move `common.properties` to `HADOOP_CONF_DIR`. This step is currently required because some configuration is missing on that path; a way to avoid it will be worked out later.

- Support Remote Shuffle Service [(apache/incubator-celeborn)](https://github.com/apache/incubator-celeborn):
  - Follow the Celeborn documentation to deploy a Celeborn cluster.
  - Move the Celeborn client jar to `$SPARK_HOME/jars`, and also to the external worker nodes.
  - Update `Dispatcher/pom.xml` and set your own Celeborn client jar path.
  - Add the Celeborn-related Spark configuration by referring to the Celeborn documentation (a rough sketch follows after this list).

- Other steps are the same as in standalone mode.
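
The Spark-side Celeborn settings below are only a rough, hypothetical sketch: the exact key names and shuffle manager class differ between Celeborn versions, so follow the Celeborn documentation for your release.

```shell
# Hypothetical example only -- verify these keys against the Celeborn docs for your version
spark.shuffle.manager            org.apache.spark.shuffle.celeborn.RssShuffleManager
spark.celeborn.master.endpoints  <celeborn-master-host>:9097
spark.shuffle.service.enabled    false
```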


## Future Plan (unordered)

- [x] Add a launch script or Launcher to automate startup.

- [ ] Try to figure out a better scheduling strategy, which has to support task graph generation (research work).

- [x] Maintain a separate Spark config for each Spark application: set different configs and create a different `SparkEnv` in the Worker.

- [x] Worker selector.

- [ ] Support auto-fetching JAR files.

- [ ] Validate correctness of shuffle tasks (security issue).

- [ ] Synchronize re-dispatch info with Driver.

- [ ] External Workers can become aware of and register with a new driver automatically.

- [ ] Support whole life-cycle management of external executors (start, stop, listening).

- [ ] Support `IndirectTaskResult`.

- [ ] Support metrics report.

- [ ] Package the external Worker, and support dynamically editing properties such as the application and RSS jar paths.