{"id":29193557,"url":"https://github.com/papostol/spark-submit","last_synced_at":"2025-07-02T03:01:23.912Z","repository":{"id":44951971,"uuid":"417857988","full_name":"PApostol/spark-submit","owner":"PApostol","description":"Python manager for spark-submit jobs","archived":false,"fork":false,"pushed_at":"2024-01-06T11:04:34.000Z","size":55,"stargazers_count":10,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-25T22:40:50.175Z","etag":null,"topics":["apache","spark","submit"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PApostol.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-10-16T14:49:10.000Z","updated_at":"2025-06-18T07:43:50.000Z","dependencies_parsed_at":"2024-01-06T11:40:43.330Z","dependency_job_id":"5674db2d-e541-4d42-805b-76925892c3b6","html_url":"https://github.com/PApostol/spark-submit","commit_stats":{"total_commits":49,"total_committers":2,"mean_commits":24.5,"dds":"0.020408163265306145","last_synced_commit":"4f21449adf18a328e7303f8f6d7e512315818cd2"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/PApostol/spark-submit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PApostol%2Fspark-submit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PApostol%2Fspark-submit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PApostol%2Fspark-submit/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PApostol%2Fspark-submit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PApostol","download_url":"https://codeload.github.com/PApostol/spark-submit/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PApostol%2Fspark-submit/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263066557,"owners_count":23408387,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","spark","submit"],"created_at":"2025-07-02T03:01:04.430Z","updated_at":"2025-07-02T03:01:23.719Z","avatar_url":"https://github.com/PApostol.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Spark-submit\n\n[![PyPI version](https://badge.fury.io/py/spark-submit.svg)](https://badge.fury.io/py/spark-submit)\n[![Downloads](https://static.pepy.tech/personalized-badge/spark-submit?period=month\u0026units=international_system\u0026left_color=grey\u0026right_color=green\u0026left_text=total%20downloads)](https://pepy.tech/project/spark-submit)\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/spark-submit)](https://pypi.org/project/spark-submit/)\n[![](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![Code style: blue](https://img.shields.io/badge/code%20style-blue-blue.svg)](https://blue.readthedocs.io/)\n[![License](https://img.shields.io/badge/License-MIT-blue)](#license \"Go to license section\")\n[![contributions 
welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/PApostol/spark-submit/issues)\n\n#### TL;DR: Python manager for spark-submit jobs\n\n### Description\nThis package allows for submission and management of Spark jobs in Python scripts via [Apache Spark's](https://spark.apache.org/) `spark-submit` functionality.\n\n### Installation\nThe easiest way to install is using `pip`:\n\n`pip install spark-submit`\n\nTo install from source:\n```\ngit clone https://github.com/PApostol/spark-submit.git\ncd spark-submit\npython setup.py install\n```\n\nFor usage details check `help(spark_submit)`.\n\n### Usage Examples\nSpark arguments can either be provided as keyword arguments or as an unpacked dictionary.\n\n##### Simple example:\n```\nfrom spark_submit import SparkJob\n\napp = SparkJob('/path/some_file.py', master='local', name='simple-test')\napp.submit()\n\nprint(app.get_state())\n```\n##### Another example:\n```\nfrom spark_submit import SparkJob\n\nspark_args = {\n    'master': 'spark://some.spark.master:6066',\n    'deploy_mode': 'cluster',\n    'name': 'spark-submit-app',\n    'class': 'main.Class',\n    'executor_memory': '2G',\n    'executor_cores': '1',\n    'total_executor_cores': '2',\n    'verbose': True,\n    'conf': [\"spark.foo.bar='baz'\", \"spark.x.y='z'\"],\n    'main_file_args': '--foo arg1 --bar arg2'\n    }\n\napp = SparkJob('s3a://bucket/path/some_file.jar', **spark_args)\nprint(app.get_submit_cmd(multiline=True))\n\n# poll state in the background every x seconds with `poll_time=x`\napp.submit(use_env_vars=True,\n           extra_env_vars={'PYTHONPATH': '/some/path/'},\n           poll_time=10\n           )\n\nprint(app.get_state()) # 'SUBMITTED'\n\nwhile not app.concluded:\n    # do other stuff...\n    print(app.get_state()) # 'RUNNING'\n\nprint(app.get_state()) # 'FINISHED'\n```\n\n#### Examples of `spark-submit` to `spark_args` dictionary:\n##### A `client` 
example:\n```\n~/spark_home/bin/spark-submit \\\n--master spark://some.spark.master:7077 \\\n--name spark-submit-job \\\n--total-executor-cores 8 \\\n--executor-cores 4 \\\n--executor-memory 4G \\\n--driver-memory 2G \\\n--py-files /some/utils.zip \\\n--files /some/file.json \\\n/path/to/pyspark/file.py --data /path/to/data.csv\n```\n##### becomes\n```\nspark_args = {\n    'master': 'spark://some.spark.master:7077',\n    'name': 'spark-submit-job',\n    'total_executor_cores': '8',\n    'executor_cores': '4',\n    'executor_memory': '4G',\n    'driver_memory': '2G',\n    'py_files': '/some/utils.zip',\n    'files': '/some/file.json',\n    'main_file_args': '--data /path/to/data.csv'\n    }\nmain_file = '/path/to/pyspark/file.py'\napp = SparkJob(main_file, **spark_args)\n```\n##### A `cluster` example:\n```\n~/spark_home/bin/spark-submit \\\n--master spark://some.spark.master:6066 \\\n--deploy-mode cluster \\\n--name spark_job_cluster \\\n--jars \"s3a://mybucket/some/file.jar\" \\\n--conf \"spark.some.conf=foo\" \\\n--conf \"spark.some.other.conf=bar\" \\\n--total-executor-cores 16 \\\n--executor-cores 4 \\\n--executor-memory 4G \\\n--driver-memory 2G \\\n--class my.main.Class \\\n--verbose \\\ns3a://mybucket/file.jar \"positional_arg1\" \"positional_arg2\"\n```\n##### becomes\n```\nspark_args = {\n    'master': 'spark://some.spark.master:6066',\n    'deploy_mode': 'cluster',\n    'name': 'spark_job_cluster',\n    'jars': 's3a://mybucket/some/file.jar',\n    'conf': [\"spark.some.conf='foo'\", \"spark.some.other.conf='bar'\"], # note the use of quotes\n    'total_executor_cores': '16',\n    'executor_cores': '4',\n    'executor_memory': '4G',\n    'driver_memory': '2G',\n    'class': 'my.main.Class',\n    'verbose': True,\n    'main_file_args': '\"positional_arg1\" \"positional_arg2\"'\n    }\nmain_file = 's3a://mybucket/file.jar'\napp = SparkJob(main_file, **spark_args)\n```\n\n#### Testing\n\nYou can do some simple testing with local mode Spark after cloning the 
repo.\n\nInstall the additional test requirements before running the tests: `pip install -r tests/requirements.txt`\n\n`pytest tests/`\n\n`python tests/run_integration_test.py`\n\n\n#### Additional methods\n\n`spark_submit.system_info()`: Collects Spark-related system information, such as versions of spark-submit, Scala, Java, PySpark, Python and OS\n\n`spark_submit.SparkJob.kill()`: Kills the running Spark job (cluster mode only)\n\n`spark_submit.SparkJob.get_code()`: Gets the spark-submit return code\n\n`spark_submit.SparkJob.get_output()`: Gets the spark-submit stdout\n\n`spark_submit.SparkJob.get_id()`: Gets the spark-submit submission ID\n\n\n### License\n\nReleased under [MIT](/LICENSE) by [@PApostol](https://github.com/PApostol).\n\n- You can freely modify and reuse.\n- The original license must be included with copies of this software.\n- Please link back to this repo if you use a significant portion of the source code.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpapostol%2Fspark-submit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpapostol%2Fspark-submit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpapostol%2Fspark-submit/lists"}