{"id":18367670,"url":"https://github.com/jwplayer/sparksteps","last_synced_at":"2025-04-06T16:32:47.468Z","repository":{"id":50155696,"uuid":"60855386","full_name":"jwplayer/sparksteps","owner":"jwplayer","description":":star: CLI tool to launch Spark jobs on AWS EMR","archived":false,"fork":false,"pushed_at":"2023-10-18T01:49:23.000Z","size":221,"stargazers_count":67,"open_issues_count":2,"forks_count":12,"subscribers_count":23,"default_branch":"master","last_synced_at":"2025-03-22T03:51:13.035Z","etag":null,"topics":["aws","aws-emr","python","spark"],"latest_commit_sha":null,"homepage":"http://spark-steps.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jwplayer.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-06-10T14:51:13.000Z","updated_at":"2024-01-25T10:15:59.000Z","dependencies_parsed_at":"2024-11-05T23:36:50.255Z","dependency_job_id":null,"html_url":"https://github.com/jwplayer/sparksteps","commit_stats":{"total_commits":97,"total_committers":9,"mean_commits":"10.777777777777779","dds":0.6597938144329897,"last_synced_commit":"8809ab42f22017aee9945bce8f7b3f3b70674bf8"},"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwplayer%2Fsparksteps","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwplayer%2Fsparksteps/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwplayer%2Fsparksteps/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwplayer%2Fsparksteps/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jwplayer","download_url":"https://codeload.github.com/jwplayer/sparksteps/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247513051,"owners_count":20950979,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-emr","python","spark"],"created_at":"2024-11-05T23:22:51.602Z","updated_at":"2025-04-06T16:32:47.176Z","avatar_url":"https://github.com/jwplayer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Spark Steps\n===========\n\n.. image:: https://github.com/jwplayer/sparksteps/workflows/Tests/badge.svg?branch=master\n    :target: https://github.com/jwplayer/sparksteps/actions?query=workflow%3ATests+branch%3Amaster\n    :alt: Build Status\n\n.. image:: https://readthedocs.org/projects/spark-steps/badge/?version=latest\n    :target: http://spark-steps.readthedocs.io/en/latest/?badge=latest\n    :alt: Documentation Status\n\nSparkSteps allows you to configure your EMR cluster and upload your\nspark script and its dependencies via AWS S3. 
All you need to do is\ndefine an S3 bucket.\n\nInstall\n-------\n\n::\n\n    pip install sparksteps\n\nCLI Options\n-----------\n\n::\n\n    Prompt parameters:\n      app                           main spark script for submit spark (required)\n      app-args:                     arguments passed to main spark script\n      app-list:                     Space delimited list of applications to be installed on the EMR cluster (Default: Hadoop Spark)\n      aws-region:                   AWS region name\n      bid-price:                    specify bid price for task nodes\n      bootstrap-script:             include a bootstrap script (s3 path)\n      cluster-id:                   job flow id of existing cluster to submit to\n      debug:                        allow debugging of cluster\n      defaults:                     cluster configurations of the form \"\u003cclassification1\u003e key1=val1 key2=val2 ...\"\n      dynamic-pricing-master:       use spot pricing for the master nodes.\n      dynamic-pricing-core:         use spot pricing for the core nodes.\n      dynamic-pricing-task:         use spot pricing for the task nodes.\n      ebs-volume-size-core:         size of the EBS volume to attach to core nodes in GiB.\n      ebs-volume-type-core:         type of the EBS volume to attach to core nodes (supported: [standard, gp2, io1]).\n      ebs-volumes-per-core:         the number of EBS volumes to attach per core node.\n      ebs-optimized-core:           whether to use EBS optimized volumes for core nodes.\n      ebs-volume-size-task:         size of the EBS volume to attach to task nodes in GiB.\n      ebs-volume-type-task:         type of the EBS volume to attach to task nodes.\n      ebs-volumes-per-task:         the number of EBS volumes to attach per task node.\n      ebs-optimized-task:           whether to use EBS optimized volumes for task nodes.\n      ec2-key:                      name of the Amazon EC2 key pair\n      ec2-subnet-id:                
Amazon VPC subnet id\n      help (-h):                    argparse help\n      jobflow-role:                 Amazon EC2 instance profile name to use (Default: EMR_EC2_DefaultRole)\n      service-role:                 AWS IAM service role to use for EMR (Default: EMR_DefaultRole)\n      keep-alive:                   whether to keep the EMR cluster alive when there are no steps\n      log-level (-l):               logging level (default=INFO)\n      instance-type-master:         instance type of the master host (default='m4.large')\n      instance-type-core:           instance type of the core nodes, must be set when num-core \u003e 0\n      instance-type-task:           instance type of the task nodes, must be set when num-task \u003e 0\n      maximize-resource-allocation: sets the maximizeResourceAllocation property for the cluster to true when supplied.\n      name:                         specify cluster name\n      num-core:                     number of core nodes\n      num-task:                     number of task nodes\n      release-label:                EMR release label\n      s3-bucket:                    name of s3 bucket to upload spark file (required)\n      s3-path:                      path within s3-bucket to use when writing assets\n      s3-dist-cp:                   s3-dist-cp step after spark job is done\n      submit-args:                  arguments passed to spark-submit\n      tags:                         EMR cluster tags of the form \"key1=value1 key2=value2\"\n      uploads:                      files to upload to /home/hadoop/ in master instance\n      wait:                         poll until all steps are complete (or error)\n\nExample\n-------\n\n::\n\n      AWS_S3_BUCKET=\u003cinsert-s3-bucket\u003e\n      cd sparksteps/\n      sparksteps examples/episodes.py \\\n        --s3-bucket $AWS_S3_BUCKET \\\n        --aws-region us-east-1 \\\n        --release-label emr-4.7.0 \\\n        --uploads examples/lib examples/episodes.avro \\\n     
   --submit-args=\"--deploy-mode client --jars /home/hadoop/lib/spark-avro_2.10-2.0.2-custom.jar\" \\\n        --app-args=\"--input /home/hadoop/episodes.avro\" \\\n        --tags Application=\"Spark Steps\" \\\n        --debug\n\nThe above example creates an EMR cluster of 1 node with default instance\ntype *m4.large*, uploads the pyspark script episodes.py and its\ndependencies to the specified S3 bucket and copies the file from S3 to\nthe cluster. Each operation is defined as an EMR \"step\" that you can\nmonitor in EMR. The final step is to run the spark application with\nsubmit args that include a custom spark-avro package and app args\n\"--input\".\n\nRun Spark Job on Existing Cluster\n---------------------------------\n\nYou can use the option ``--cluster-id`` to upload and run the Spark job\non an existing cluster. This is especially helpful for debugging.\n\nDynamic Pricing\n---------------\n\nUse CLI option ``--dynamic-pricing-\u003cinstance-type\u003e`` to allow sparksteps to dynamically\ndetermine the best bid price for EMR instances within a certain instance group.\n\nCurrently the algorithm looks back at spot history over the last 12\nhours and calculates ``min(0.8 * on_demand_price, 1.2 * max_spot_price)`` to\ndetermine bid price. That said, if the current spot price is over 80% of\nthe on-demand cost, then on-demand instances are used to be\nconservative.\n\n\nTesting\n-------\n\n::\n\n    make test\n\nBlog\n----\nRead more about sparksteps in our blog post here:\nhttps://www.jwplayer.com/blog/sparksteps/\n\nLicense\n-------\n\nApache License 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjwplayer%2Fsparksteps","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjwplayer%2Fsparksteps","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjwplayer%2Fsparksteps/lists"}