https://github.com/haabiy/emrrunner
A powerful CLI tool for managing Python-based jobs on Amazon EMR clusters.
# EMRRunner

A powerful command-line tool for managing and deploying Python-based (e.g., PySpark) data pipeline jobs on Amazon EMR clusters.
## Features
- Command-line interface for quick job submission
- Basic POST API for fast job submission
- Support for both client and cluster deploy modes
## Prerequisites
- Python 3.9+
- AWS Account with EMR access
- Configured AWS credentials
- Active EMR cluster
## Installation
### From PyPI
```bash
pip install emrrunner
```
### From Source
```bash
# Clone the repository
git clone https://github.com/Haabiy/EMRRunner.git && cd EMRRunner
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate
# Install the package
pip install -e .
```
## Configuration
### AWS Configuration
Create a `.env` file in the project root with your AWS configuration or export these variables in your terminal before running:
```bash
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path" # Path to your jobs (the directory containing your job_package.zip); see "S3 Job Structure" below
```
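If you use the `.env` file instead, the same variables go in without the `export` keyword, assuming the standard `KEY=value` dotenv syntax (all values below are placeholders):

```bash
# .env (project root); values are placeholders
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=your_region
EMR_CLUSTER_ID=your_cluster_id
S3_PATH=s3://your-bucket/path
```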
A better approach: instead of exporting these variables in each terminal session, add them permanently by editing your `~/.zshrc` file (or `~/.bashrc` if you use bash):
1. Open your `~/.zshrc` file:
```bash
nano ~/.zshrc
```
2. Add the following lines at the end of the file (replace with your own AWS credentials):
```bash
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_REGION="your_region"
export EMR_CLUSTER_ID="your_cluster_id"
export S3_PATH="s3://your-bucket/path"
```
3. Save and exit the file (`Ctrl + X`).
4. To apply the changes immediately, run:
```bash
source ~/.zshrc
```
Now you won't have to export the variables manually in each session; they'll be available whenever you open a new terminal.
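However you set them, the variables must be present in the environment when EMRRunner runs. A minimal sketch of the kind of pre-flight check you can do yourself before submitting a job (the `load_config` helper is hypothetical, not part of EMRRunner's API):

```python
import os

# Variable names from the configuration section above
REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "EMR_CLUSTER_ID",
    "S3_PATH",
]

def load_config():
    """Return the required variables, failing fast if any are missing."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast here gives a clearer error than a boto3 authentication failure halfway through job submission.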
---
### Bootstrap Actions
For EMR cluster setup with required dependencies, create a bootstrap script (e.g., `bootstrap.sh`):
```bash
#!/bin/bash -xe
# Example structure of a bootstrap script
# Create and activate virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install system dependencies
sudo yum install python3-pip -y
sudo yum install -y [your-system-packages]
# Install Python packages
pip3 install [your-required-packages]
deactivate
```
For example:
```bash
#!/bin/bash -xe
# Create and activate a virtual environment
python3 -m venv /home/hadoop/myenv
source /home/hadoop/myenv/bin/activate
# Install pip for Python 3.x
sudo yum install python3-pip -y
# Install required packages
pip3 install \
    pyspark==3.5.5

deactivate
```
Upload the bootstrap script to S3 and reference it in your EMR cluster configuration.
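As a sketch, uploading the script and wiring it up with the AWS CLI might look like this (bucket name, release label, and instance options are placeholders; adjust to your setup):

```bash
# Upload the script (a versioned bucket is recommended)
aws s3 cp bootstrap.sh s3://your-bucket/bootstrap/bootstrap.sh

# Reference it when creating the cluster
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-7.1.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://your-bucket/bootstrap/bootstrap.sh
```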
## Project Structure
```
EMRRunner/
├── Dockerfile
├── LICENSE.md
├── README.md
├── app/
│   ├── __init__.py
│   ├── cli.py            # Command-line interface
│   ├── config.py         # Configuration management
│   ├── emr_client.py     # EMR interaction logic
│   ├── emr_job_api.py    # Flask API endpoints
│   ├── run_api.py        # API server runner
│   └── schema.py         # Request/Response schemas
├── bootstrap/
│   └── bootstrap.sh      # EMR bootstrap script
├── tests/
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_emr_job_api.py
│   └── test_schema.py
├── pyproject.toml
├── requirements.txt
└── setup.py
```
## S3 Job Structure
The `S3_PATH` in your configuration should point to a bucket with the following structure:
```
s3://your-bucket/
└── jobs/
    ├── job1/
    │   └── job_package.zip   # Shared functions and utilities; main script must be named main.py
    └── job2/
        └── job_package.zip   # Same convention: main.py inside, archive named job_package.zip
```
### Job Script (`main.py`)
Your job script should include the necessary logic for executing the tasks in your data pipeline, using functions from your dependencies.
Example of `main.py`:
```python
from dependencies import clean, transform, sink  # Import your core job functions

def main():
    # Step 1: Clean the data
    clean()
    # Step 2: Transform the data
    transform()
    # Step 3: Sink (store) the processed data
    sink()

if __name__ == "__main__":
    main()  # Execute the main function when the script is run
```
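To produce the `job_package.zip` expected above, one possible packaging flow (job name and paths are placeholders; match the layout from the S3 Job Structure section):

```bash
# From inside the job directory containing main.py and its helper modules
zip -r job_package.zip main.py dependencies/

# Upload to the job's folder
aws s3 cp job_package.zip s3://your-bucket/jobs/job1/job_package.zip
```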
## Usage
### Command Line Interface
Start a job in client mode:
```bash
emrrunner start --job job1
```
Start a job in cluster mode:
```bash
emrrunner start --job job1 --deploy-mode cluster
```
### API Endpoints
Start a job via API in client mode (default):
```bash
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1"}'
```
Start a job via API in cluster mode:
```bash
curl -X POST http://localhost:8000/emrrunner/start \
-H "Content-Type: application/json" \
-d '{"job": "job1", "deploy_mode": "cluster"}'
```
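The same calls can be made from Python. A minimal sketch using only the standard library (the endpoint path and JSON body come from the curl examples above; `base_url` is a placeholder for wherever the API is running):

```python
import json
import urllib.request

def build_start_request(job: str, deploy_mode: str = "client",
                        base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request for the /emrrunner/start endpoint."""
    payload = json.dumps({"job": job, "deploy_mode": deploy_mode}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/emrrunner/start",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def start_job(job: str, deploy_mode: str = "client") -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_start_request(job, deploy_mode)) as resp:
        return json.loads(resp.read())
```

Usage: `start_job("job1", "cluster")` mirrors the second curl example.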
## Development
To contribute to EMRRunner:
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Submit a pull request
## Best Practices
1. **Bootstrap Actions**
- Keep bootstrap scripts modular
- Version control your dependencies
- Use specific package versions
- Test bootstrap scripts locally when possible
- Store bootstrap scripts in S3 with versioning enabled
2. **Job Dependencies**
- Maintain a requirements.txt for each job
- Document system-level dependencies
- Test dependencies in a clean environment
3. **Job Organization**
- Follow the standard structure for jobs
- Use clear naming conventions
- Document all functions and modules
## License
This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Bug Reports
If you discover any bugs, please create an issue on GitHub with:
- Any details about your local setup that might be helpful in troubleshooting
- Detailed steps to reproduce the bug
---
Built with ❤️ using Python and AWS EMR