https://github.com/cbuschka/aws-scatter-gather
Scatter gather with AWS lambda
aws fork-join lambda map-reduce python scatter-gather step-functions terraform
- Host: GitHub
- URL: https://github.com/cbuschka/aws-scatter-gather
- Owner: cbuschka
- License: apache-2.0
- Created: 2020-06-09T20:08:30.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-10-04T16:05:23.000Z (about 5 years ago)
- Last Synced: 2025-03-05T16:17:31.912Z (8 months ago)
- Topics: aws, fork-join, lambda, map-reduce, python, scatter-gather, step-functions, terraform
- Language: Python
- Homepage:
- Size: 592 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: license.txt
# Scatter gather with AWS lambda
## The challenge
Implement batch processing on AWS:
* scatter: Split up a single file of records to be processed (the file is uploaded to S3)
* process: Process the records with as much parallelism as possible
* gather: Detect when processing is complete and aggregate a result summary report in S3
## Prerequisites
* python 3.8
* GNU make
* docker
* [awscli](https://docs.aws.amazon.com/de_de/cli/latest/userguide/cli-chap-install.html)
* [tfvm](https://github.com/cbuschka/tfvm) or terraform
* [cw >= v3.3.0](https://github.com/lucagrulla/cw)
## Usage locally
### Start localstack, deploy and run benchmark
```
make clean start_localstack deploy benchmark report
```
### Stop and cleanup
```
make stop_localstack clean
```
## Usage on aws
All resources are prefixed with ${USER}- by default, where ${USER} is your current user name. Pass
SCOPE=mycustomprefix- to make to override this default.
### Build, package, deploy, run benchmark, report on measurements
```
make ENV=aws clean deploy_resources deploy_service benchmark report
```
### Undeploy
```
make ENV=aws destroy
```
### Variants
The task has been implemented in several variants:
* s3-sqs-lambda-sync (with boto3 blocking I/O)
* s3-sqs-lambda-async (with aioboto3 async I/O)
* s3-sqs-lambda-async-chunked (with aioboto3 async I/O, records packed into chunks)
* s3-sqs-lambda-dynamodb (with aioboto3 async I/O, records stored in DynamoDB)
* s3-notification-sqs-lambda (with aioboto3 async I/O, records stored in S3 in chunks, functions invoked by S3 notifications through SQS queues)
### More/alternative variants
* sfn?
* glue?
* emr (spark)?
* s3 athena?
* s3 batch
* single fat vm
### Results
[(Data)](./measurements.csv)
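To compare variants from the raw measurements, a per-group summary can be computed with the Python standard library. This is a minimal sketch, not part of the repository: the function name `summarize` is made up for illustration, and the column names are passed in as parameters because the actual header of `measurements.csv` is not described here.

```python
import csv
import statistics
from collections import defaultdict


def summarize(csv_file, group_column, value_column):
    """Return {group: (count, mean, median)} for one numeric column.

    group_column and value_column are placeholders; pass the names
    that actually appear in the header row of the CSV file.
    """
    values_by_group = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Collect the numeric value of each row under its group key.
        values_by_group[row[group_column]].append(float(row[value_column]))
    return {
        group: (len(values), statistics.mean(values), statistics.median(values))
        for group, values in values_by_group.items()
    }
```

Call it with an open file object for `measurements.csv` and the two column names that identify the variant and the measured value in that file.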
### Repository structure
* [infra](./infra) - Resources and service infrastructure
* [src](./src) - Service sources
* [tests](./tests) - Service tests
* [benchmark](./benchmark) - Benchmark sources
### Documentation
* [Learnings](./doc/learnings.md)
* [Links](./doc/links.md)
* [Troubleshooting](./doc/troubleshooting.md)
## License
Copyright 2020 by Cornelius Buschka. All rights reserved.

[Apache License 2.0](./license.txt)