Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gerashegalov/rapids-shell
Utility to run/debug Spark RAPIDS in REPL
- Host: GitHub
- URL: https://github.com/gerashegalov/rapids-shell
- Owner: gerashegalov
- Created: 2021-02-22T02:26:47.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2023-11-06T13:39:33.000Z (about 1 year ago)
- Last Synced: 2024-04-29T22:23:11.491Z (9 months ago)
- Topics: rapids, repl, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 115 KB
- Stars: 7
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# rapids-shell
This repo started as a wrapper around Spark REPLs for easier use with the Spark RAPIDS plugin. Lately I have been putting more effort into maintaining [standalone Jupyter notebooks][1] that can be started without the wrapper script and are particularly easy to open in VSCode with the Jupyter extension.

[1]: ./src/jupyter
# Original Utility
A utility to start a RAPIDS-enabled Spark shell with access to unit test resources from https://github.com/NVIDIA/spark-rapids.
Before running the examples, make sure to at least execute `mvn package` in your local spark-rapids repo if you are not using binaries; a setup sketch follows below.
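
A minimal sketch of that source-build setup, assuming the plugin repo is cloned into your home directory (paths are illustrative, not prescribed by the script):

```bash
# Hypothetical layout: clone and build the plugin so rapids.sh can find the jars
git clone https://github.com/NVIDIA/spark-rapids ~/spark-rapids
cd ~/spark-rapids && mvn package
# Point the wrapper at the local build (see the environment variables below)
export SPARK_RAPIDS_HOME=~/spark-rapids
```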

## Command line options

See `rapids.sh --help` for up-to-date information:
```
Usage: rapids.sh [OPTION]
Options:
--debug
enable bash tracing
-h, --help
prints this message
-l4j=LOG4J_CONF_FILE, --log4j-file=LOG4J_CONF_FILE
LOG4J_CONF_FILE location of a custom log4j config for local mode
-nsys, --nsys-profile
run with Nsights profile
-m=MASTER, --master=MASTER
specify MASTER for spark command, default is local[-cluster], see --num-local-execs
-n, --dry-run
generates and prints the spark submit command without executing
-nle=N, --num-local-execs=N
specify the number of local executors to use, default is 2. If > 1 use pseudo-distributed
local-cluster, otherwise local[*]
-uecp, --use-extra-classpath
use extraClassPath instead of --jars to add RAPIDS jars to spark-submit (default)
-uj, --use-jars
use --jars instead of extraClassPath to add RAPIDS jars to spark-submit
--ucx-shim=spark<3xy>
Spark buildver to populate shim-dependent package name of RapidsShuffleManager.
Will be replaced by a Boolean option
-cmd=CMD, --spark-command=CMD
specify one of spark-submit (default), spark-shell, pyspark, jupyter, jupyter-lab
-dopts=EOPTS, --driver-opts=EOPTS
pass EOPTS as --driver-java-options
-eopts=EOPTS, --executor-opts=EOPTS
pass EOPTS as spark.executor.extraJavaOptions
--gpu-fraction=GPU_FRACTION
GPU share per executor JVM unless local or local-cluster mode, see spark.rapids.memory.gpu.allocFraction
```
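
For instance, a hypothetical invocation built only from the flags above would print the generated `spark-submit` command for a two-executor `local-cluster` shell without running it:

```bash
# Dry run: show the command for a 2-executor local-cluster spark-shell,
# shipping the RAPIDS jars via --jars instead of extraClassPath
rapids.sh --dry-run --num-local-execs=2 --use-jars --spark-command=spark-shell
```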

## Environment variables

- `SPARK_RAPIDS_HOME` - the path either to the local repo or to the location used for downloading the [binaries](https://nvidia.github.io/spark-rapids/docs/download.html)
- `SPARK_HOME` - the path either to the local Spark repo or to the root of a binary distro
- `SPARK_CMD` - one of `spark-shell`, `spark-submit` (default), `pyspark`, `jupyter`, `jupyter-lab`
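
For example, when using downloaded binaries rather than a source build, the variables might be set along these lines (versions and paths are illustrative, not prescribed by the script):

```bash
# Hypothetical: RAPIDS jars downloaded per the link above, plus a binary Spark distro
export SPARK_RAPIDS_HOME=~/dist/spark-rapids       # location of the downloaded jar(s)
export SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2
export SPARK_CMD=spark-shell
rapids.sh
```
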
## Examples
Use Spark RAPIDS in a Jupyter notebook
```bash
SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 SPARK_CMD=jupyter[-lab] rapids.sh
```

Run in pseudo-distributed `local-cluster` mode
```bash
NUM_LOCAL_EXECS=2 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh
```

Allow attaching a Java debugger to the driver JVM
```bash
JDBSTR=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 SPARK_HOME=~/spark-3.1.1-bin-hadoop3.2 rapids.sh
```
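
With the driver JVM listening for JDWP connections on port 5005 as in the agent string above, a debugger can then be attached; a minimal sketch using the JDK's command-line debugger (an IDE remote-debug configuration pointed at the same port works just as well):

```bash
# Attach jdb to the driver JVM listening on localhost:5005
jdb -attach 5005
```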

## Running Spark RAPIDS ScalaTests in `spark-shell` once started

Single test suite
```scala
scala> run(new com.nvidia.spark.rapids.InsertPartition311Suite)
InsertPartition311Suite:
...
```

Single test case
```scala
scala> run(new com.nvidia.spark.rapids.HashAggregatesSuite, "sum(floats) group by more_floats 2 partitions")
HashAggregatesSuite:
...
```

## Using integration test datagens

In pyspark-based drivers one can use the data generators from spark-rapids/integration-tests or run whole pytests.
Add `rapids.py` as an IPython startup file, e.g. on *NIX:
```bash
cp src/python/rapids.py ~/.ipython/profile_default/startup/
```

### Datagen

```python
# A non-nullable struct of two bounded ints to use as the key
key_data_gen = StructGen([
    ('a', IntegerGen(min_val=0, max_val=4)),
    ('b', IntegerGen(min_val=5, max_val=9)),
], nullable=False)
val_data_gen = IntegerGen()
# Build a two-column DataFrame from the generators above
df = two_col_df(spark, key_data_gen, val_data_gen)...
```

### Pytest

```python
# Run the spark-rapids integration test(s) matching this name
runpytest('test_struct_count_distinct')
```