
## Spark-submit

[![PyPI version](https://badge.fury.io/py/spark-submit.svg)](https://badge.fury.io/py/spark-submit)
[![Downloads](https://static.pepy.tech/personalized-badge/spark-submit?period=month&units=international_system&left_color=grey&right_color=green&left_text=total%20downloads)](https://pepy.tech/project/spark-submit)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/spark-submit)](https://pypi.org/project/spark-submit/)
[![](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Code style: blue](https://img.shields.io/badge/code%20style-blue-blue.svg)](https://blue.readthedocs.io/)
[![License](https://img.shields.io/badge/License-MIT-blue)](#license "Go to license section")
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/PApostol/spark-submit/issues)

#### TL;DR: Python manager for spark-submit jobs

### Description
This package allows for submission and management of Spark jobs in Python scripts via [Apache Spark's](https://spark.apache.org/) `spark-submit` functionality.

### Installation
The easiest way to install is using `pip`:

`pip install spark-submit`

To install from source:
```
git clone https://github.com/PApostol/spark-submit.git
cd spark-submit
python setup.py install
```
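If `python setup.py install` is not supported by your setuptools version, installing the cloned source with `pip` should work as well (an alternative sketch, not part of the original instructions):

```
pip install .
```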

For usage details check `help(spark_submit)`.
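For example, from a Python shell (a minimal sketch; `SparkJob` is the main class documented below):

```
import spark_submit

help(spark_submit)           # package-level documentation
help(spark_submit.SparkJob)  # arguments and methods of the SparkJob class
```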

### Usage Examples
Spark arguments can either be provided as keyword arguments or as an unpacked dictionary.

##### Simple example:
```
from spark_submit import SparkJob

app = SparkJob('/path/some_file.py', master='local', name='simple-test')
app.submit()

print(app.get_state())
```
##### Another example:
```
from spark_submit import SparkJob

spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark-submit-app',
    'class': 'main.Class',
    'executor_memory': '2G',
    'executor_cores': '1',
    'total_executor_cores': '2',
    'verbose': True,
    'conf': ["spark.foo.bar='baz'", "spark.x.y='z'"],
    'main_file_args': '--foo arg1 --bar arg2'
}

app = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)
print(app.get_submit_cmd(multiline=True))

# poll state in the background every x seconds with `poll_time=x`
app.submit(use_env_vars=True,
           extra_env_vars={'PYTHONPATH': '/some/path/'},
           poll_time=10
           )

print(app.get_state()) # 'SUBMITTED'

while not app.concluded:
    # do other stuff...
    print(app.get_state()) # 'RUNNING'

print(app.get_state()) # 'FINISHED'
```

#### Examples of `spark-submit` to `spark_args` dictionary:
##### A `client` example:
```
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:7077 \
--name spark_job_client \
--total-executor-cores 8 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--py-files /some/utils.zip \
--files /some/file.json \
/path/to/pyspark/file.py --data /path/to/data.csv
```
##### becomes
```
spark_args = {
    'master': 'spark://some.spark.master:7077',
    'name': 'spark_job_client',
    'total_executor_cores': '8',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'py_files': '/some/utils.zip',
    'files': '/some/file.json',
    'main_file_args': '--data /path/to/data.csv'
}
main_file = '/path/to/pyspark/file.py'
app = SparkJob(main_file, **spark_args)
```
##### A `cluster` example:
```
~/spark_home/bin/spark-submit \
--master spark://some.spark.master:6066 \
--deploy-mode cluster \
--name spark_job_cluster \
--jars "s3a://mybucket/some/file.jar" \
--conf "spark.some.conf=foo" \
--conf "spark.some.other.conf=bar" \
--total-executor-cores 16 \
--executor-cores 4 \
--executor-memory 4G \
--driver-memory 2G \
--class my.main.Class \
--verbose \
s3a://mybucket/file.jar "positional_arg1" "positional_arg2"
```
##### becomes
```
spark_args = {
    'master': 'spark://some.spark.master:6066',
    'deploy_mode': 'cluster',
    'name': 'spark_job_cluster',
    'jars': 's3a://mybucket/some/file.jar',
    'conf': ["spark.some.conf='foo'", "spark.some.other.conf='bar'"], # note the use of quotes
    'total_executor_cores': '16',
    'executor_cores': '4',
    'executor_memory': '4G',
    'driver_memory': '2G',
    'class': 'my.main.Class',
    'verbose': True,
    'main_file_args': '"positional_arg1" "positional_arg2"'
}
main_file = 's3a://mybucket/file.jar'
app = SparkJob(main_file, **spark_args)
```

#### Testing

After cloning the repo, you can run some simple tests against Spark in local mode.

First install the additional test requirements: `pip install -r tests/requirements.txt`

`pytest tests/`

`python tests/run_integration_test.py`

#### Additional methods

`spark_submit.system_info()`: Collects Spark related system information, such as versions of spark-submit, Scala, Java, PySpark, Python and OS

`spark_submit.SparkJob.kill()`: Kills the running Spark job (cluster mode only)

`spark_submit.SparkJob.get_code()`: Gets the spark-submit return code

`spark_submit.SparkJob.get_output()`: Gets the spark-submit stdout

`spark_submit.SparkJob.get_id()`: Gets the spark-submit submission ID
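
Taken together, these helpers cover basic inspection and error handling. Below is a minimal sketch using only the names listed above (the file path and checks are illustrative, not part of the library's documentation):

```
from spark_submit import SparkJob, system_info

print(system_info())  # versions of spark-submit, Scala, Java, PySpark, Python and OS

app = SparkJob('/path/some_file.py', master='local', name='inspection-test')
app.submit()

print(app.get_id())          # submission ID reported by spark-submit
if app.get_code() != 0:      # non-zero return code suggests the job failed
    print(app.get_output())  # full spark-submit stdout for debugging
```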

### License

Released under [MIT](/LICENSE) by [@PApostol](https://github.com/PApostol).

- You can freely modify and reuse.
- The original license must be included with copies of this software.
- Please link back to this repo if you use a significant portion of the source code.