Sample Implementation of Batch Processing on Amazon Web Services (AWS)
https://github.com/danilop/samplebatchprocessing
- Host: GitHub
- URL: https://github.com/danilop/samplebatchprocessing
- Owner: danilop
- Created: 2013-07-09T13:16:15.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2016-02-18T07:47:31.000Z (almost 9 years ago)
- Last Synced: 2024-04-15T04:54:12.403Z (9 months ago)
- Language: Shell
- Homepage: http://danilop.github.io/SampleBatchProcessing
- Size: 46.9 KB
- Stars: 50
- Watchers: 5
- Forks: 8
- Open Issues: 1
- Metadata Files:
  - Readme: README.md
# Sample Implementation of Batch Processing on Amazon Web Services (AWS)
This is a Sample Implementation for the [AWS Reference Architecture for Batch Processing](http://aws.amazon.com/architecture/).
It is implemented in Python, using [boto](http://aws.amazon.com/sdkforpython/) and the new [AWS Command Line Interface (CLI)](http://aws.amazon.com/cli/).
Two tools are provided:
* SendJobs.py - to upload files from a (local) directory to S3 and put "job" requests to process those files as messages in an SQS queue
* GetJobs.py - to get "job" messages from an SQS queue and upload the outcome of the processing to S3

The setup leverages [EC2](http://aws.amazon.com/ec2/) [Auto Scaling](http://aws.amazon.com/autoscaling/) to have a group of instances that is empty (i.e. no instance is running) when there are no "job" requests in the SQS queue and grows when needed.
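At its core, a "job" submission boils down to an S3 upload plus an SQS message that points at it. As a rough sketch of the flow behind SendJobs.py (the actual message format and script arguments may differ; the bucket name, queue URL and file names below are placeholders), the same thing can be expressed with plain AWS CLI calls:

    # Sketch only: upload one input file and enqueue a "job" message referencing it.
    # <bucket name> and <queue url> are placeholders; SendJobs.py may use a different message format.
    aws s3api put-object --bucket <bucket name> --key input/file1.dat --body file1.dat
    aws sqs send-message --queue-url <queue url> \
        --message-body '{"input": "input/file1.dat", "output": "output/file1.dat"}'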
## Tutorial
### Install AWS CLI
The new [AWS Command Line Interface (CLI) tool](http://aws.amazon.com/cli/)
is Python based, so you can install it using "pip"

    pip install awscli

or using "easy_install"

    easy_install awscli
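A quick way to confirm the installation worked (the reported version will depend on what you installed):

    # Print the installed AWS CLI version.
    aws --version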
Before using the AWS CLI, you first need to specify your AWS account credentials and default AWS region as described
[here](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html).

The awscli package includes a very useful command completion feature,
e.g. to enable tab completion for bash use the built-in command complete (note that this is not persistent across shell sessions or reboots):

    complete -C aws_completer aws
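As one possible way to do both (the values below are placeholders, and recent CLI versions also provide the interactive "aws configure" command), you can export the credentials and default region as environment variables and make the completion setting persistent by appending it to your shell profile:

    # Placeholders: substitute your own credentials and preferred region.
    export AWS_ACCESS_KEY_ID=<access key id>
    export AWS_SECRET_ACCESS_KEY=<secret access key>
    export AWS_DEFAULT_REGION=<region>

    # Make tab completion available in every new bash session.
    echo 'complete -C aws_completer aws' >> ~/.bashrc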
### Create an S3 Bucket to host input and output files
You can create a bucket from the [S3 web console](http://console.aws.amazon.com/s3/) or using the CLI:

    aws s3api create-bucket --bucket <bucket name> \
        --create-bucket-configuration '{ "location_constraint": "<region>" }'

### Create an SQS Queue to centralize "job" requests
You can create a queue from the [SQS web console](http://console.aws.amazon.com/sqs/) or using the CLI:
The "VisibilityTimeout" is expressed in seconds and should be larger than the maximun processing time required for a "job".
It can eventually be increased for a single "job", but that is not part of this implementation.aws sqs create-queue --queue-name --attributes VisibilityTimeout=60
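Later steps refer to this queue both by URL (to send and receive messages) and by name (the CloudWatch alarms watch its ApproximateNumberOfMessagesVisible metric). If you want to look up the URL and check the attribute from the CLI (queue name and URL below are placeholders):

    # Look up the queue URL from its name, then read the number of visible messages.
    aws sqs get-queue-url --queue-name <queue name>
    aws sqs get-queue-attributes --queue-url <queue url> \
        --attribute-names ApproximateNumberOfMessagesVisible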
### Create an IAM Role to delegate access to processing instances
From the [IAM web console](http://console.aws.amazon.com/iam/) -> Roles -> Create Role ->
Write a role name. Under "AWS Service Roles" select "Amazon EC2".
Select a "Custom Policy", write a policy name and see the "role.json" file
for a sample role giving access to an S3 bucket and an SQS queue.
You should replace "AWS Account", "S3 Bucket Name" and "SQS Queue Name" in the policy with yours.
Write down the Instance Profile ARN from the Summary tab; you'll need it later.
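If you prefer to fetch the ARN from the CLI instead of the console (the instance profile created together with the role usually has the same name as the role, but check yours; the name below is a placeholder):

    # Print the ARN of the instance profile associated with the role.
    aws iam get-instance-profile --instance-profile-name <instance profile name> \
        --query 'InstanceProfile.Arn' --output text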
### Create Auto Scaling Launch Configuration

For this sample I'm using a default Amazon Linux EBS-backed AMI; you can find the AMI ID [here](http://aws.amazon.com/amazon-linux-ami).
The user data script provided automatically configures and runs multiple parallel "GetJobs.py" scripts per node to get "jobs" from the queue and process them, uploading the final result back to S3. You will probably need to edit the "user-data.sh" file before launching the following command.
Alternatively you can create your own AMI that starts one or more parallel "GetJobs.py" scripts at boot.

    aws autoscaling create-launch-configuration --launch-configuration-name asl-batch \
        --image-id <AMI ID> --instance-type <instance type> \
        --iam-instance-profile <Instance Profile ARN> --user-data "`cat user-data.sh`"

If you want to be able to log in to the instances launched by Auto Scaling, you can add the following parameters to the previous command:

    --key-name <key pair name> --security-groups <security group>
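Purely as an illustration of the kind of script user-data.sh needs to be (this is not the file shipped with the repository; the paths, worker count and GetJobs.py arguments are placeholders to adapt to your setup):

    #!/bin/bash
    # Hypothetical user data sketch: fetch the worker script and start N copies in the background.
    # Amazon Linux AMIs ship with Python, boto and the AWS CLI; install whatever your AMI is missing here.
    cd /home/ec2-user
    aws s3 cp s3://<bucket name>/GetJobs.py GetJobs.py   # or bake the script into your own AMI
    chmod +x GetJobs.py
    WORKERS=4                                            # tune to the instance type
    for i in $(seq 1 $WORKERS); do
        ./GetJobs.py <your bucket/queue arguments> &     # run ./GetJobs.py without options to see the expected arguments
    done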
### Create Auto Scaling Group

    aws autoscaling create-auto-scaling-group --auto-scaling-group-name asg-batch \
        --launch-configuration-name asl-batch --min-size 0 \
        --max-size <maximum number of instances> \
        --availability-zones <list of Availability Zones> \
        --default-cooldown 300
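To double check that the group was created with the expected size limits and Availability Zones:

    # Show the configuration of the newly created Auto Scaling group.
    aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names asg-batch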
### Create Auto Scaling "Up" Policy

    aws autoscaling put-scaling-policy --auto-scaling-group-name asg-batch --policy-name ash-batch-upscale-policy \
        --scaling-adjustment <number of instances to start> \
        --adjustment-type ExactCapacity

Write down the "PolicyARN" returned by the command; you need it in the next step to set up the alarm.
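If you'd rather capture the ARN in a shell variable than copy it by hand (assuming a CLI version that supports the --query option; running put-scaling-policy again with the same name simply updates the policy), a possible variant is:

    # Create (or update) the policy and keep only the returned PolicyARN.
    UP_POLICY_ARN=$(aws autoscaling put-scaling-policy --auto-scaling-group-name asg-batch \
        --policy-name ash-batch-upscale-policy --scaling-adjustment <number of instances to start> \
        --adjustment-type ExactCapacity --query PolicyARN --output text)
    echo $UP_POLICY_ARN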
### Create CloudWatch Alarm to trigger "Up" scaling Policy

    aws cloudwatch put-metric-alarm --alarm-name StartBatchProcessing --metric-name ApproximateNumberOfMessagesVisible \
        --namespace "AWS/SQS" --statistic Average --period 60 --evaluation-periods 2 --threshold 1 \
        --comparison-operator GreaterThanOrEqualToThreshold --dimensions Name=QueueName,Value=batch-queue \
        --alarm-actions <"Up" PolicyARN>

### Create Auto Scaling "Down" Policy

    aws autoscaling put-scaling-policy --auto-scaling-group-name asg-batch --policy-name ash-batch-downscale-policy \
        --scaling-adjustment 0 --adjustment-type ExactCapacity

Write down the "PolicyARN" returned by the command; you need it in the next step to set up the alarm.
### Create CloudWatch Alarm to trigger "Down" scaling Policy

    aws cloudwatch put-metric-alarm --alarm-name StopBatchProcessing --metric-name ApproximateNumberOfMessagesVisible \
        --namespace "AWS/SQS" --statistic Average --period 60 --evaluation-periods 2 --threshold 0 \
        --comparison-operator LessThanOrEqualToThreshold --dimensions Name=QueueName,Value=batch-queue \
        --alarm-actions <"Down" PolicyARN>
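To verify that both alarms exist and see their current state (INSUFFICIENT_DATA is normal right after creation):

    # List the two alarms together with their state.
    aws cloudwatch describe-alarms --alarm-names StartBatchProcessing StopBatchProcessing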
### Send the jobs uploading files from a directory

The directory can be local or on an EC2 instance.

    ./SendJobs.py input/ output/

To get help, run the tool without options:

    ./SendJobs.py

After a few minutes the first CloudWatch Alarm should trigger the "Up" scaling Policy
to start EC2 Instances configured to consume "jobs" from the SQS queue.
When all "jobs" are processed and the SQS queue is "empty", the second CloudWatch Alarm should trigger
the "Down" scaling Policy to shut down and terminate the EC2 Instances.
You should find the output of the processing in the S3 bucket under the "output/" prefix.
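While the batch is running, a couple of read-only commands let you watch the scaling activity and the results as they appear (the bucket name is a placeholder):

    # Scaling actions taken by the group (instances launched and terminated).
    aws autoscaling describe-scaling-activities --auto-scaling-group-name asg-batch
    # Results uploaded so far.
    aws s3 ls s3://<bucket name>/output/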
### Change the Launch Configuration of an Auto Scaling Group

If you later need to change the Launch Configuration, create a new one and update the Auto Scaling Group, e.g.

    aws autoscaling update-auto-scaling-group --launch-configuration-name asl-batch-v2 \
        --auto-scaling-group-name asg-batch