# EMRRunner



A powerful command-line tool for managing and deploying Python-based (e.g., PySpark) data pipeline jobs on Amazon EMR clusters.
## Features
- Command-line interface for quick job submission
- Basic POST API for fast job submission
- Support for both client and cluster deploy modes

## Prerequisites
- Python 3.9+
- AWS Account with EMR access
- Configured AWS credentials
- Active EMR cluster

## Installation
### From PyPI
```bash
pip install emrrunner
```

### From Source
```bash
# Clone the repository
git clone https://github.com/Haabiy/EMRRunner.git && cd EMRRunner

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate

# Install the package
pip install -e .
```

## Configuration
### AWS Configuration
Create a `.env` file in the project root with your AWS configuration or export these variables in your terminal before running:
```bash
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path" # Path to your jobs (the directory containing your job_package.zip files); see "S3 Job Structure" below
```
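If you prefer the `.env` route mentioned above, the same values go in a `.env` file in the project root as plain `KEY=value` pairs (placeholder values shown):
```bash
# .env (placeholder values); keep this file out of version control
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=your_region
EMR_CLUSTER_ID=your_cluster_id
S3_PATH=s3://your-bucket/path
```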
A better approach: instead of exporting these variables in each terminal session, you can add them permanently to your shell environment by editing your `~/.zshrc` file:
1. Open your `~/.zshrc` file:
```bash
nano ~/.zshrc
```
2. Add the following lines at the end of the file (replace with your own AWS credentials):
```bash
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path"
```
3. Save and exit the file (`Ctrl + X`).
4. To apply the changes immediately, run:
```bash
source ~/.zshrc
```

Now you won't have to export the variables manually; they'll be available whenever you open a new terminal session.
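To verify that the configuration is visible before submitting anything, a quick sanity check with boto3 can describe the target cluster. This is a minimal sketch, not part of EMRRunner, and assumes `boto3` is installed:
```python
# sanity_check.py: a minimal sketch, not part of EMRRunner
import os
import boto3

# boto3 also picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
emr = boto3.client("emr", region_name=os.environ["AWS_REGION"])

# Fails loudly if the cluster ID is wrong or the credentials lack EMR access
cluster = emr.describe_cluster(ClusterId=os.environ["EMR_CLUSTER_ID"])["Cluster"]
print(cluster["Name"], cluster["Status"]["State"])
```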
---
### Bootstrap Actions
For EMR cluster setup with required dependencies, create a bootstrap script (e.g., `bootstrap.sh`):
```bash
#!/bin/bash -xe

# Example structure of a bootstrap script

# Create and activate virtual environment
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate

# Install system dependencies
sudo yum install python3-pip -y
sudo yum install -y [your-system-packages]

# Install Python packages
pip3 install [your-required-packages]

deactivate
```

For example:
```bash
#!/bin/bash -xe

# Create and activate a virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate

# Install pip for Python 3.x
sudo yum install python3-pip -y

# Install required packages
pip3 install \
    pyspark==3.5.5

deactivate
```

Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
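For example, with the AWS CLI (bucket name and key are placeholders):
```bash
# Upload the bootstrap script; reference this S3 URI when creating the cluster
aws s3 cp bootstrap.sh s3://your-bucket/bootstrap/bootstrap.sh
```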
## Project Structure
```
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py              # Command-line interface
│   ├── config.py           # Configuration management
│   ├── emr_client.py       # EMR interaction logic
│   ├── emr_job_api.py      # Flask API endpoints
│   ├── run_api.py          # API server runner
│   └── schema.py           # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh        # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
```

## S3 Job Structure
The `S3_PATH` in your configuration should point to a bucket with the following structure:
```
s3://your-bucket/
└── jobs/
    ├── job1/
    │   └── job_package.zip
    └── job2/
        └── job_package.zip
```

Each job directory must contain a `job_package.zip` that bundles your shared functions and utilities, with the main script named `main.py`.

### Job Script (`main.py`)
Your job script should include the necessary logic for executing the tasks in your data pipeline, using functions from your dependencies.
Example of `main.py`:
```python
from dependencies import clean, transform, sink  # Import your core job functions

def main():
    # Step 1: Clean the data
    clean()

    # Step 2: Transform the data
    transform()

    # Step 3: Sink (store) the processed data
    sink()

if __name__ == "__main__":
    main()  # Execute the main function when the script is run
```
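One way to produce this layout is to zip the job script with its dependencies and copy the archive into the job's S3 directory. A sketch with the AWS CLI; the bucket and prefix are placeholders and should match the structure above:
```bash
# Bundle main.py and its dependencies, then upload to the job's S3 directory
zip -r job_package.zip main.py dependencies/
aws s3 cp job_package.zip s3://your-bucket/jobs/job1/job_package.zip
```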
## Usage
### Command Line Interface
Start a job in client mode:
```bash
emrrunner start --job job1
```

Start a job in cluster mode:
```bash
emrrunner start --job job1 --deploy-mode cluster
```

### API Endpoints
Start a job via API in client mode (default):
```bash
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1"}'
```

Start a job via API in cluster mode:
```bash
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1", "deploy_mode": "cluster"}'
```
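The same call can be made from Python with the `requests` library; a minimal sketch, assuming `requests` is installed and the EMRRunner API server is running locally:
```python
import requests

# Submit job1 in cluster mode via the EMRRunner API
resp = requests.post(
    "http://localhost:8000/emrrunner/start",
    json={"job": "job1", "deploy_mode": "cluster"},
)
print(resp.status_code, resp.text)
```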
## Development
To contribute to EMRRunner:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request

## Best Practices
1. **Bootstrap Actions**
   - Keep bootstrap scripts modular
   - Version control your dependencies
   - Use specific package versions
   - Test bootstrap scripts locally when possible
   - Store bootstrap scripts in S3 with versioning enabled (see the snippet after this list)
2. **Job Dependencies**
   - Maintain a requirements.txt for each job
   - Document system-level dependencies
   - Test dependencies in a clean environment
3. **Job Organization**
   - Follow the standard structure for jobs
   - Use clear naming conventions
   - Document all functions and modules
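For the bucket-versioning point in item 1, enabling versioning is a single AWS CLI call (bucket name is a placeholder):
```bash
# Keep prior revisions of bootstrap scripts recoverable
aws s3api put-bucket-versioning --bucket your-bucket \
    --versioning-configuration Status=Enabled
```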
## License
This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Bug Reports
If you discover any bugs, please create an issue on GitHub with:
- Any details about your local setup that might be helpful in troubleshooting
- Detailed steps to reproduce the bug

---
Built with ❤️ using Python and AWS EMR