Small shim to manage Spark in a more convenient way
https://github.com/matz-e/sparkmanager
- Host: GitHub
- URL: https://github.com/matz-e/sparkmanager
- Owner: matz-e
- License: mit
- Created: 2018-02-22T14:25:48.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-03-16T10:34:16.000Z (over 4 years ago)
- Last Synced: 2024-10-11T08:22:31.917Z (about 1 month ago)
- Topics: apache-spark
- Language: Python
- Homepage:
- Size: 45.9 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.rst
- License: LICENSE
README
Spark Management Consolidated
=============================

A small module that loads as a singleton class object to manage Spark-related
things.

Installation
------------

Directly via ``pip`` on the command line, in a `virtualenv`:

.. code:: shell

    pip install https://github.com/matz-e/sparkmanager/tarball/master

or for the current user:

.. code:: shell

    pip install --user https://github.com/matz-e/sparkmanager/tarball/master
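
To confirm that the package is importable afterwards (a quick sanity check,
not part of the project documentation):

.. code:: shell

    python -c "import sparkmanager"
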
Usage
-----

The module itself acts as a mediator to Spark:

.. code:: python

    import sparkmanager as sm

    # Create a new application
    sm.create("My fancy name",
              [("spark.executor.cores", 4), ("spark.executor.memory", "8g")])

    data = sm.spark.range(5)
    # Will show up in the UI with the name "broadcasting some data"
    with sm.jobgroup("broadcasting some data"):
        data = sm.broadcast(data.collect())

The Spark session can be accessed via ``sm.spark``, the Spark context via
``sm.sc``. Both attributes are instantiated once the ``create`` method is
called, with the option to call unambiguous methods from both directly via
the :py:class:`SparkManager` object:

.. code:: python

    # The following two calls are equivalent
    c = sm.parallelize(range(5))
    d = sm.sc.parallelize(range(5))
    assert c.collect() == d.collect()
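
Since ``sm.spark`` is the underlying Spark session, ordinary PySpark calls can
also be made on it directly; a minimal sketch (``createDataFrame`` and
``show`` are plain PySpark, not part of this module):

.. code:: python

    # Build and inspect a small DataFrame through the managed session
    df = sm.spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
    df.show()
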
Cluster support scripts
-----------------------

.. note::

    Scripts to run on the cluster are still somewhat experimental and should
    be used with caution!

Environment setup
~~~~~~~~~~~~~~~~~

To create a self-contained Spark environment, the script provided in
``examples/env.sh`` can be used. It is currently tuned to the requirements of
the `bbpviz` cluster. A usage example:

.. code:: shell

    SPARK_ROOT=/path/to/my/spark/installation SM_WORKDIR=/path/to/a/work/directory examples/env.sh

The working directory will contain:

* A Python virtual environment
* A basic Spark configuration pointing to directories within the working
  directory
* An environment script to establish the setup

To use the resulting working environment:

.. code:: shell

    . /path/to/a/work/directory/env.sh
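
As a quick sanity check after sourcing the script (assuming the virtual
environment created by ``examples/env.sh`` provides ``pyspark``):

.. code:: shell

    python -c "import pyspark; print(pyspark.__version__)"
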
Spark deployment on allocations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Within a cluster allocation, the script ``sm_cluster`` can be used to start a
Spark cluster. The script is installed automatically by `pip`. To use it,
either pass a working directory containing an environment, or specify the
working directory and an environment script separately:

.. code:: shell

    sm_cluster startup $WORKDIR
    sm_cluster startup $WORKDIR /path/to/some/env.sh

Similarly, to stop a cluster (not necessary with slurm):

.. code:: shell

    sm_cluster shutdown $WORKDIR
    sm_cluster shutdown $WORKDIR /path/to/some/env.sh

Spark applications can then connect to a master found via:

.. code:: shell

    cat $WORKDIR/spark_master
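
For example, the master URL can be read from that file and handed to
``spark-submit`` (``my_app.py`` is a hypothetical application of your own):

.. code:: shell

    MASTER=$(cat $WORKDIR/spark_master)
    spark-submit --master "$MASTER" my_app.py
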
TL;DR on BlueBrain 5
~~~~~~~~~~~~~~~~~~~~

Set up a Spark environment in your current shell, and point `WORKDIR` to a
shared directory. `SPARK_HOME` needs to be in your environment and point to
your Spark installation. By default, only a file with the Spark master and
the cluster launch script will be copied to `WORKDIR`. Then submit a
cluster with:

.. code:: shell

    sbatch -A proj16 -t 24:00:00 -N4 --exclusive -C nvme $(which sm_cluster) startup $WORKDIR
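
Once the allocation is running, the file with the Spark master described
above appears in `WORKDIR`; a sketch of checking on the job with standard
slurm tooling:

.. code:: shell

    squeue -u $USER            # is the allocation running yet?
    cat $WORKDIR/spark_master  # master URL for Spark applications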