{"id":15050318,"url":"https://github.com/mozilla/emr-bootstrap-spark","last_synced_at":"2025-10-04T13:30:47.298Z","repository":{"id":28681768,"uuid":"32201628","full_name":"mozilla/emr-bootstrap-spark","owner":"mozilla","description":"AWS bootstrap scripts for Mozilla's flavoured Spark setup.","archived":true,"fork":false,"pushed_at":"2020-02-13T21:05:30.000Z","size":932,"stargazers_count":47,"open_issues_count":0,"forks_count":19,"subscribers_count":31,"default_branch":"master","last_synced_at":"2025-01-16T16:25:51.055Z","etag":null,"topics":["aws","jupyter","spark","zeppelin"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mozilla.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-14T08:01:33.000Z","updated_at":"2024-10-11T11:51:48.000Z","dependencies_parsed_at":"2022-08-02T12:12:23.243Z","dependency_job_id":null,"html_url":"https://github.com/mozilla/emr-bootstrap-spark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Femr-bootstrap-spark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Femr-bootstrap-spark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Femr-bootstrap-spark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Femr-bootstrap-spark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mozilla","download_url":"https://codeload.github.com/mozilla/emr-bootstrap-spark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235257273,"owners_count":18961140,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","jupyter","spark","zeppelin"],"created_at":"2024-09-24T21:25:40.627Z","updated_at":"2025-10-04T13:30:41.984Z","avatar_url":"https://github.com/mozilla.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"emr-bootstrap-spark\n===================\n\nThis package contains the AWS bootstrap scripts for Mozilla's flavoured Spark setup.\nThe deployed scripts in S3 are referenced by\n[ATMO clusters](https://github.com/mozilla/telemetry-analysis-service) and\n[Airflow jobs](https://github.com/mozilla/telemetry-airflow).\n\n## Interactive job\n```bash\nexport SPARK_PROFILE=telemetry-spark-cloudformation-TelemetrySparkInstanceProfile-1SATUBVEXG7E3\nexport SPARK_BUCKET=telemetry-spark-emr-2\nexport KEY_NAME=20161025-dataops-dev\naws emr create-cluster \\\n  --region us-west-2 \\\n  --name SparkCluster \\\n  --instance-type c3.4xlarge \\\n  --instance-count 1 \\\n  --service-role EMR_DefaultRole \\\n  --ec2-attributes KeyName=${KEY_NAME},InstanceProfile=${SPARK_PROFILE} \\\n  --release-label emr-5.2.1 \\\n  --applications Name=Spark Name=Hive Name=Zeppelin \\\n  --bootstrap-actions Path=s3://${SPARK_BUCKET}/bootstrap/telemetry.sh \\\n  --configurations https://s3-us-west-2.amazonaws.com/${SPARK_BUCKET}/configuration/configuration.json \\\n  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=TERMINATE_JOB_FLOW,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=\\[\"s3://${SPARK_BUCKET}/steps/zeppelin/zeppelin.sh\"\\]\n```\n\n## Batch job\n```bash\n# Also export the vars from the 'interactive' section above.\nexport DATA_BUCKET=telemetry-public-analysis-2 # Or use the private bucket.\nexport CODE_BUCKET=telemetry-analysis-code-2\naws emr create-cluster \\\n  --region us-west-2 \\\n  --name SparkCluster \\\n  --instance-type c3.4xlarge \\\n  --instance-count 1 \\\n  --service-role EMR_DefaultRole \\\n  --ec2-attributes KeyName=${KEY_NAME},InstanceProfile=${SPARK_PROFILE} \\\n  --release-label emr-5.2.1 \\\n  --applications Name=Spark Name=Hive \\\n  --bootstrap-actions Path=s3://${SPARK_BUCKET}/bootstrap/telemetry.sh \\\n  --configurations https://s3-us-west-2.amazonaws.com/${SPARK_BUCKET}/configuration/configuration.json \\\n  --auto-terminate \\\n  --steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=TERMINATE_JOB_FLOW,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=\\[\"s3://${SPARK_BUCKET}/steps/batch.sh\",\"--job-name\",\"foo\",\"--notebook\",\"s3://${CODE_BUCKET}/jobs/foo/Telemetry Hello World.ipynb\",\"--data-bucket\",\"${DATA_BUCKET}\"\\]\n```\n\n## Deploy to AWS via ansible\n\nTo deploy to the staging location:\n\n```bash\nansible-playbook ansible/deploy.yml -e '@ansible/envs/stage.yml' -i ansible/inventory\n```\n\nOnce deployed, you can see the effects in action by launching a cluster via\n[ATMO stage](https://atmo.stage.mozaws.net/).\n\nTo deploy for production clusters:\n\n```bash\nansible-playbook ansible/deploy.yml -e '@ansible/envs/production.yml' -i ansible/inventory\n```\n\nThe Spark Jupyter notebook configuration is hosted at `https://s3-us-west-2.amazonaws.com/telemetry-spark-emr-2/credentials/jupyter_notebook_config.py`. At the moment, this is only needed for the GitHub Gist export option in the Jupyter notebook. The credentials it contains are managed under the [Mozilla GitHub account](https://github.com/mozilla/) by :whd. This file **should not be made public**.\n\n\n## Contributing to `emr-bootstrap-spark`\n\nYou may set up a development environment to test and verify modifications applied to this repository.\n\n### Install prerequisite packages\n```\npip install ansible boto boto3\n```\n\n### Create and bootstrap the development environment\n* Define a new ansible environment in `env/dev-\u003cusername\u003e.yml`\n    * Set `spark_emr_bucket` to a unique bucket e.g. `telemetry-spark-emr-2-dev-\u003cusername\u003e`\n    * Set `stack_name` to a unique name e.g. `telemetry-spark-cloudformation-dev-\u003cusername\u003e`\n* Recursively copy assets from `staging` to `dev`\n    * `aws s3 cp --recursive s3://telemetry-spark-emr-2-stage s3://telemetry-spark-emr-2-dev-\u003cusername\u003e`\n* Deploy to AWS using `ansible-playbook` on the new environment\n* Launch a new instance using the appropriate `SPARK_PROFILE` and `SPARK_BUCKET` keys\n    * Set `SPARK_PROFILE` to the cloudformation instance profile\n        * This can be found as an output on the cloudformation dashboard\n        * Alternatively:\n            ```\n               aws cloudformation describe-stacks --stack-name telemetry-spark-cloudformation-dev-\u003cusername\u003e |\n               jq '.Stacks[0].Outputs[0].OutputValue'\n            ```\n    * Set `SPARK_BUCKET` to `spark_emr_bucket` value in `env/dev-\u003cusername\u003e.yml`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmozilla%2Femr-bootstrap-spark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmozilla%2Femr-bootstrap-spark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmozilla%2Femr-bootstrap-spark/lists"}