{"id":15119390,"url":"https://github.com/Sanofi-Public/emrflow","last_synced_at":"2025-09-28T02:30:54.297Z","repository":{"id":236321681,"uuid":"788533558","full_name":"Sanofi-Public/emrflow","owner":"Sanofi-Public","description":"EMRFlow is designed to simplify the process of running PySpark jobs on Amazon EMR (Elastic Map Reduce).","archived":false,"fork":false,"pushed_at":"2024-10-02T14:31:33.000Z","size":348,"stargazers_count":14,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-11T04:35:57.193Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sanofi-Public.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-18T15:47:11.000Z","updated_at":"2024-10-02T14:31:42.000Z","dependencies_parsed_at":"2024-04-26T16:06:30.526Z","dependency_job_id":null,"html_url":"https://github.com/Sanofi-Public/emrflow","commit_stats":null,"previous_names":["sanofi-public/emrflow"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2Femrflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2Femrflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2Femrflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sanofi-Public%2Femrflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/Sanofi-Public","download_url":"https://codeload.github.com/Sanofi-Public/emrflow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234479906,"owners_count":18840180,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-26T01:54:57.490Z","updated_at":"2025-09-28T02:30:53.939Z","avatar_url":"https://github.com/Sanofi-Public.png","language":"Python","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"readme":"# EMRFlow :cyclone:\n\n\n\u003cspan style=\"color:purple;\"\u003e**EMRFlow** \u003c/span\u003e is designed to simplify the process of running PySpark jobs on [Amazon EMR](https://aws.amazon.com/emr/) (Elastic Map Reduce). 
It abstracts the complexities of interacting with EMR APIs and provides an intuitive command-line interface and Python library to effortlessly submit, monitor, and list your EMR PySpark jobs.\n\n\u003cspan style=\"color:purple;\"\u003e**EMRFlow** \u003c/span\u003e serves as both a library and a command-line tool.\n\nTo install `EMRFlow`, please run:\n\n```bash\npip install emrflow\n```\n## Configuration\n\nCreate an `emr_serverless_config.json` file containing the following details and store it in your home directory:\n```json\n{\n    \"application_id\": \"\",\n    \"job_role\": \"\",\n    \"region\": \"\"\n}\n```\n\n## Usage\nPlease read the [GETTING STARTED](GETTING_STARTED.md) guide to integrate \u003cspan style=\"color:purple;\"\u003e**EMRFlow** \u003c/span\u003e into your project.\n\n\u003cspan style=\"color:purple;\"\u003e**EMRFlow** \u003c/span\u003e offers several commands to manage your PySpark jobs. Let's explore some key functionalities:\n\n\n### Help\n```bash\nemrflow serverless --help\n```\n![Serverless Options](images/emr-serverless-help.png)\n\n\n### Package Dependencies\n\nYou will need to package dependencies before running an EMR job if your code requires external libraries to be installed or uses local imports from your code base. 
See Scenario 2-4 in [GETTING STARTED](GETTING_STARTED.md).\n```bash\nemrflow serverless package-dependencies --help\n```\n![Serverless Options](images/emr-serverless-package-dependencies-help.png)\n\n\n\n\n### Submit PySpark Job\n```bash\nemrflow serverless run --help\n```\n![Serverless Options](images/emr-serverless-run-help.png)\n```bash\nemrflow serverless run \\\n        --job-name \"\u003cjob-name\u003e\" \\\n        --entry-point \"\u003clocation-of-main-python-file\u003e\" \\\n        --spark-submit-parameters \" --conf spark.executor.cores=8 \\\n                                    --conf spark.executor.memory=32g \\\n                                    --conf spark.driver.cores=8 \\\n                                    --conf spark.driver.memory=32g \\\n                                    --conf spark.dynamicAllocation.maxExecutors=100\" \\\n        --s3-code-uri \"s3://\u003cemr-s3-path\u003e\" \\\n        --s3-logs-uri \"s3://\u003cemr-s3-path\u003e/logs\" \\\n        --execution-timeout 0 \\\n        --ping-duration 60 \\\n        --wait \\\n        --show-output\n```\n\n\n### List Previous Runs\n```bash\nemrflow serverless list-job-runs --help\n```\n\n### Get Logs of Previous Runs\n```bash\nemrflow serverless get-logs --help\n```\n![Serverless Options](images/emr-serverless-logs-help.png)\n\n\n\n## Use EMRFlow as an API\n```Python\nimport os\nfrom emrflow import emr_serverless\n\n# initialize connection\nemr_serverless.init_connection()\n\n# submit job to EMR Serverless\nemr_job_id = emr_serverless.run(\n    job_name=\"\u003cjob-name\u003e\",\n    entry_point=\"\u003clocation-of-main-python-file\u003e\",\n    spark_submit_parameters=\"--conf spark.executor.cores=8 \\\n                            --conf spark.executor.memory=32g \\\n                            --conf spark.driver.cores=8 \\\n                            --conf spark.driver.memory=32g \\\n                            --conf spark.dynamicAllocation.maxExecutors=100\",\n    wait=True,\n    
show_output=True,\n    s3_code_uri=\"s3://\u003cemr-s3-path\u003e\",\n    s3_logs_uri=\"s3://\u003cemr-s3-path\u003e/logs\",\n    execution_timeout=0,\n    ping_duration=60,\n    tags=[\"env:dev\"],\n)\nprint(emr_job_id)\n```\n\n\n**And so much more.......!!!**\n\n\n## Contributing\n\nWe welcome contributions to EMRFlow. Please open an issue discussing the change you would like to see. Create a feature branch to work on that issue and open a Pull Request once it is ready for review.\n\n### Code style\n\nWe use [black](https://black.readthedocs.io/en/stable/) as a code formatter. The easiest way to ensure your commits are always formatted with the correct version of `black` is to use [pre-commit](https://pre-commit.com/): install it and then run `pre-commit install` once in your local copy of the repo.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSanofi-Public%2Femrflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FSanofi-Public%2Femrflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FSanofi-Public%2Femrflow/lists"}