{"id":15208881,"url":"https://github.com/haabiy/emrrunner","last_synced_at":"2026-02-18T18:01:24.338Z","repository":{"id":227931787,"uuid":"772302252","full_name":"Haabiy/EMRRunner","owner":"Haabiy","description":"A powerful CLI tool for managing Python-based jobs on Amazon EMR clusters.","archived":false,"fork":false,"pushed_at":"2025-04-05T08:22:27.000Z","size":12900,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T08:27:33.536Z","etag":null,"topics":["cloud-computing","distributed-systems","emr","flask","software-engineering"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/emrrunner/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Haabiy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-14T23:26:11.000Z","updated_at":"2025-04-05T08:22:30.000Z","dependencies_parsed_at":"2024-03-15T21:38:03.025Z","dependency_job_id":"5aa49c8d-f23e-4860-a0af-f5369869b2bf","html_url":"https://github.com/Haabiy/EMRRunner","commit_stats":{"total_commits":41,"total_committers":1,"mean_commits":41.0,"dds":0.0,"last_synced_commit":"b0b5f1b5212add41e5825f33997e3ef1c5e6772c"},"previous_names":["haabiy/emrrunner"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haabiy%2FEMRRunner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haabiy%2FEMRRunner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/Haabiy%2FEMRRunner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Haabiy%2FEMRRunner/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Haabiy","download_url":"https://codeload.github.com/Haabiy/EMRRunner/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249060715,"owners_count":21206383,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloud-computing","distributed-systems","emr","flask","software-engineering"],"created_at":"2024-09-28T07:03:00.814Z","updated_at":"2026-02-18T18:01:24.332Z","avatar_url":"https://github.com/Haabiy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# EMRRunner\n\n![Python](https://img.shields.io/badge/Python-3776AB?style=for-the-badge\u0026logo=python\u0026logoColor=white) \n![Amazon EMR](https://img.shields.io/badge/Amazon%20EMR-FF9900?style=for-the-badge\u0026logo=amazon-aws\u0026logoColor=white)\n![Flask](https://img.shields.io/badge/Flask-000000?style=for-the-badge\u0026logo=flask\u0026logoColor=white)\n![AWS](https://img.shields.io/badge/AWS-232F3E?style=for-the-badge\u0026logo=amazon-aws\u0026logoColor=white)\n\nA powerful command-line tool for managing and deploying Python-based (e.g., PySpark) data pipeline jobs on Amazon EMR clusters.\n\n## 🚀 Features\n\n- Command-line interface for quick job submission\n- Basic POST API for fast job submission\n- Support for both client and cluster deploy modes\n\n## 📋 Prerequisites\n\n- Python 3.9+\n- AWS Account with EMR access\n- 
Configured AWS credentials\n- Active EMR cluster\n\n## 🛠️ Installation\n\n### From PyPI\n```bash\npip install emrrunner\n```\n\n### From Source\n```bash\n# Clone the repository\ngit clone https://github.com/Haabiy/EMRRunner.git \u0026\u0026 cd EMRRunner\n\n# Create and activate virtual environment\npython -m venv venv\nsource venv/bin/activate\n\n# Install the package\npip install -e .\n```\n\n## ⚙️ Configuration\n\n### AWS Configuration\n\nCreate a `.env` file in the project root with your AWS configuration, or export these variables in your terminal before running:\n```bash\nexport AWS_ACCESS_KEY_ID=\"your_access_key\"\nexport AWS_SECRET_ACCESS_KEY=\"your_secret_key\"\nexport AWS_REGION=\"your_region\"\nexport EMR_CLUSTER_ID=\"your_cluster_id\"\nexport S3_PATH=\"s3://your-bucket/path\" # Path to your jobs (the directory containing your job_package.zip file); see `S3 Job Structure` below\n```\n\nAlternatively, instead of exporting these variables in each terminal session, you can set them permanently by adding them to your `~/.zshrc` file:\n1. Open your `~/.zshrc` file:\n   ```bash\n   nano ~/.zshrc\n   ```\n2. Add the following lines at the end of the file (replace the placeholders with your own AWS credentials):\n   ```bash\n   export AWS_ACCESS_KEY_ID=\"your_access_key\"\n   export AWS_SECRET_ACCESS_KEY=\"your_secret_key\"\n   export AWS_REGION=\"your_region\"\n   export EMR_CLUSTER_ID=\"your_cluster_id\"\n   export S3_PATH=\"s3://your-bucket/path\"\n   ```\n3. Save and exit the file (in nano, `Ctrl + X`, then `Y` and `Enter`).\n4. 
To apply the changes immediately, run:\n   ```bash\n   source ~/.zshrc\n   ```\n\nNow, you won’t have to export the variables manually in each session, and they’ll be available whenever you open a new terminal session.\n\n--- \n\n\n### Bootstrap Actions\nTo set up the EMR cluster with the required dependencies, create a bootstrap script (e.g., `bootstrap.sh`):\n\n```bash\n#!/bin/bash -xe\n\n# Example structure of a bootstrap script\n# Create and activate virtual environment\npython3 -m venv /home/hadoop/myenv\nsource /home/hadoop/myenv/bin/activate\n\n# Install system dependencies\nsudo yum install python3-pip -y\nsudo yum install -y [your-system-packages]\n\n# Install Python packages\npip3 install [your-required-packages]\n\ndeactivate\n```\n\nFor example:\n```bash\n#!/bin/bash -xe\n\n# Create and activate a virtual environment\npython3 -m venv /home/hadoop/myenv\nsource /home/hadoop/myenv/bin/activate\n\n# Install pip for Python 3.x\nsudo yum install python3-pip -y\n\n# Install required packages\npip3 install \\\n    pyspark==3.5.5\n\ndeactivate\n```\n\nUpload the bootstrap script to S3 and reference it in your EMR cluster configuration.\n\n## 📁 Project Structure\n\n```\nEMRRunner/\n├── Dockerfile\n├── LICENSE.md\n├── README.md\n├── app/\n│   ├── __init__.py\n│   ├── cli.py              # Command-line interface\n│   ├── config.py           # Configuration management\n│   ├── emr_client.py       # EMR interaction logic\n│   ├── emr_job_api.py      # Flask API endpoints\n│   ├── run_api.py          # API server runner\n│   └── schema.py           # Request/Response schemas\n├── bootstrap/\n│   └── bootstrap.sh        # EMR bootstrap script\n├── tests/\n│   ├── __init__.py\n│   ├── test_config.py\n│   ├── test_emr_job_api.py\n│   └── test_schema.py\n├── pyproject.toml\n├── requirements.txt\n└── setup.py\n```\n\n## 📦 S3 Job Structure\n\nThe `S3_PATH` in your configuration should point to a bucket with the following structure:\n\n```\ns3://your-bucket/\n├── jobs/\n│   ├── job1/\n│  
 │   └── job_package.zip  # Name the archive `job_package.zip`, name your main script `main.py`, and include shared functions and utilities.\n│   └── job2/\n│       └── job_package.zip  # Name the archive `job_package.zip`, name your main script `main.py`, and include shared functions and utilities.\n```\n\n### Job Script (`main.py`)\n\nYour job script should include the necessary logic for executing the tasks in your data pipeline, using functions from your dependencies.\n\nExample of `main.py`:\n\n```python\nfrom dependencies import clean, transform, sink  # Import your core job functions\n\ndef main():\n    # Step 1: Clean the data\n    clean()\n\n    # Step 2: Transform the data\n    transform()\n\n    # Step 3: Sink (store) the processed data\n    sink()\n\nif __name__ == \"__main__\":\n    main()  # Execute the main function when the script is run\n```\n\n\n## 💻 Usage\n\n### Command Line Interface\n\nStart a job in client mode:\n```bash\nemrrunner start --job job1\n```\n\nStart a job in cluster mode:\n```bash\nemrrunner start --job job1 --deploy-mode cluster\n```\n\n### API Endpoints\n\nStart a job via API in client mode (default):\n```bash\ncurl -X POST http://localhost:8000/emrrunner/start \\\n     -H \"Content-Type: application/json\" \\\n     -d '{\"job\": \"job1\"}'\n```\n\nStart a job via API in cluster mode:\n```bash\ncurl -X POST http://localhost:8000/emrrunner/start \\\n     -H \"Content-Type: application/json\" \\\n     -d '{\"job\": \"job1\", \"deploy_mode\": \"cluster\"}'\n```\n\n## 🔧 Development\n\nTo contribute to EMRRunner:\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Submit a pull request\n\n## 💡 Best Practices\n\n1. **Bootstrap Actions**\n   - Keep bootstrap scripts modular\n   - Version control your dependencies\n   - Use specific package versions\n   - Test bootstrap scripts locally when possible\n   - Store bootstrap scripts in S3 with versioning enabled\n\n2. 
**Job Dependencies**\n   - Maintain a requirements.txt for each job\n   - Document system-level dependencies\n   - Test dependencies in a clean environment\n\n3. **Job Organization**\n   - Follow the standard structure for jobs\n   - Use clear naming conventions\n   - Document all functions and modules\n\n## 📝 License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.\n\n## 👥 Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## 🐛 Bug Reports\n\nIf you discover any bugs, please create an issue on GitHub with:\n- Any details about your local setup that might be helpful in troubleshooting\n- Detailed steps to reproduce the bug\n\n---\n\nBuilt with ❤️ using Python and AWS EMR","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaabiy%2Femrrunner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhaabiy%2Femrrunner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhaabiy%2Femrrunner/lists"}