https://github.com/localstack-samples/serverless-data-processing-pipeline
API Gateway -> Lambda -> Kinesis Stream -> Lambda -> DynamoDB -> DynamoDB stream -> Lambda -> Postgres
- Host: GitHub
- URL: https://github.com/localstack-samples/serverless-data-processing-pipeline
- Owner: localstack-samples
- License: MIT
- Created: 2024-05-17T12:33:00.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-23T13:32:45.000Z (about 1 year ago)
- Last Synced: 2025-03-24T18:06:30.748Z (9 months ago)
- Language: Go
- Size: 43 KB
- Stars: 2
- Watchers: 16
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# serverless-data-processing-pipeline
This is a sample CDK app that creates an *API Gateway -> Lambda -> Kinesis Stream -> Lambda -> DynamoDB -> DynamoDB stream -> Lambda -> CloudWatch Metrics* chain, and then benchmarks the time it takes to complete this loop on an M1 Max chip.
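The Lambda sources live in this repository (see `LAMBDA_SRC_PATH` below). Purely as an illustration of the first hop, a minimal upstream handler that forwards the API Gateway payload to the Kinesis stream might look like the sketch below; the `Record` struct, the `KINESIS_STREAM_NAME` environment variable, and the error handling are assumptions, not the repo's actual code.

```go
package main

import (
	"context"
	"encoding/json"
	"os"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kinesis"
)

// Record mirrors the payload used in the sample request further below (id, message, timestamp).
type Record struct {
	ID        string `json:"id"`
	Message   string `json:"message"`
	Timestamp int64  `json:"timestamp"`
}

func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	var rec Record
	if err := json.Unmarshal([]byte(req.Body), &rec); err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 400, Body: `{"message":"bad request"}`}, nil
	}

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 500}, err
	}
	client := kinesis.NewFromConfig(cfg)

	// Forward the raw body to the Kinesis stream; the stream name is assumed to be
	// injected via an environment variable set by the CDK stack.
	_, err = client.PutRecord(ctx, &kinesis.PutRecordInput{
		StreamName:   aws.String(os.Getenv("KINESIS_STREAM_NAME")),
		PartitionKey: aws.String(rec.ID),
		Data:         []byte(req.Body),
	})
	if err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 500}, err
	}
	return events.APIGatewayProxyResponse{StatusCode: 200, Body: `{"message":"success"}`}, nil
}

func main() {
	lambda.Start(handler)
}
```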
## Prerequisites
The following dependencies need to be available on your machine:
1. [Go](https://go.dev/doc/install).
1. [LocalStack CLI](https://docs.localstack.cloud/getting-started/installation/).
1. [CDK CLI](https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html).
1. [Watchman](https://facebook.github.io/watchman/docs/install).
1. [jq](https://jqlang.github.io/jq/download/).
1. [k6](https://k6.io/docs/get-started/installation/).
## Commands
* `localstack start` start LocalStack with the Docker executor
* `cdk bootstrap` bootstrap cdk stack onto AWS/LocalStack
* `cdk deploy` deploy this stack to your default AWS account/region
* `cdk diff` compare deployed stack with current state
* `cdk synth` emit the synthesized CloudFormation template
* `go test` run unit tests
* `watchman [upstream|midstream|downstream]` watch and hot-reload Lambda functions
* `wait_requests <output.json>` wait until all requests are processed and export the latencies to the given file
## Configuration
* `USE_LOCALSTACK` set to `true` if the stack is deployed to LocalStack
* `HOT_DEPLOY` set to `true` if the hot-reloading feature is to be enabled
* `LAMBDA_DIST_PATH` directory where binaries for the hot-reloading feature are stored (optional)
* `LAMBDA_SRC_PATH` directory where the source of the Lambda functions is located
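These flags are plain environment variables. As a minimal sketch of how they could be read on the Go side (the struct and the fallback logic are illustrative, not the repository's actual configuration code):

```go
package main

import (
	"fmt"
	"os"
)

// pipelineConfig collects the environment flags described above.
// Field names are illustrative; the real stack may structure this differently.
type pipelineConfig struct {
	UseLocalStack bool   // USE_LOCALSTACK
	HotDeploy     bool   // HOT_DEPLOY
	LambdaDist    string // LAMBDA_DIST_PATH (optional)
	LambdaSrc     string // LAMBDA_SRC_PATH
}

func loadConfig() pipelineConfig {
	return pipelineConfig{
		UseLocalStack: os.Getenv("USE_LOCALSTACK") == "true",
		HotDeploy:     os.Getenv("HOT_DEPLOY") == "true",
		LambdaDist:    os.Getenv("LAMBDA_DIST_PATH"),
		LambdaSrc:     os.Getenv("LAMBDA_SRC_PATH"),
	}
}

func main() {
	cfg := loadConfig()
	if cfg.HotDeploy && cfg.LambdaDist == "" {
		// Assumed fallback: if hot-reloading is enabled but no dist directory is set,
		// reuse the source path.
		cfg.LambdaDist = cfg.LambdaSrc
	}
	fmt.Printf("%+v\n", cfg)
}
```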
## Deploy
On LocalStack:
```bash
export PROVIDER_OVERRIDE_CLOUDWATCH=v1
export LAMBDA_EVENT_SOURCE_MAPPING=v2
localstack start -d
export USE_LOCALSTACK=true
export HOT_DEPLOY=true
cdklocal bootstrap
cdklocal deploy --require-approval=never
```
On AWS:
```bash
cdk bootstrap --profile aws
cdk deploy --require-approval=never --profile aws
```
## Sample Run
After deploying the stack, retrieve the method's endpoint by inspecting the CfnOutput values, as in the following example:
```sh
localstack@macintosh serverless-data-processing-pipeline % USE_LOCALSTACK=true HOT_DEPLOY=true cdklocal deploy --require-approval=never
✨ Synthesis time: 3.72s
ServerlessDataProcessingPipelineStack: start: Building dd5711540f04e06aa955d7f4862fc04e8cdea464cb590dae91ed2976bb78098e:current_account-current_region
ServerlessDataProcessingPipelineStack: success: Built dd5711540f04e06aa955d7f4862fc04e8cdea464cb590dae91ed2976bb78098e:current_account-current_region
ServerlessDataProcessingPipelineStack: start: Building 4c4836f6c768f4500c058ac6a02f2090830a58eb1a0e58d59a5c7ffadf208861:current_account-current_region
ServerlessDataProcessingPipelineStack: success: Built 4c4836f6c768f4500c058ac6a02f2090830a58eb1a0e58d59a5c7ffadf208861:current_account-current_region
ServerlessDataProcessingPipelineStack: start: Publishing dd5711540f04e06aa955d7f4862fc04e8cdea464cb590dae91ed2976bb78098e:current_account-current_region
ServerlessDataProcessingPipelineStack: start: Publishing 4c4836f6c768f4500c058ac6a02f2090830a58eb1a0e58d59a5c7ffadf208861:current_account-current_region
ServerlessDataProcessingPipelineStack: success: Published 4c4836f6c768f4500c058ac6a02f2090830a58eb1a0e58d59a5c7ffadf208861:current_account-current_region
ServerlessDataProcessingPipelineStack: success: Published dd5711540f04e06aa955d7f4862fc04e8cdea464cb590dae91ed2976bb78098e:current_account-current_region
ServerlessDataProcessingPipelineStack: deploying... [1/1]
ServerlessDataProcessingPipelineStack: creating CloudFormation changeset...
✅ ServerlessDataProcessingPipelineStack
✨ Deployment time: 30.69s
Outputs:
ServerlessDataProcessingPipelineStack.ApiEndpoint4F160690 = https://tsyeuri986.execute-api.localhost.localstack.cloud:4566/prod/
ServerlessDataProcessingPipelineStack.ApiGatewayMethodEndpoint = https://tsyeuri986.execute-api.localhost.localstack.cloud:4566/prod/
ServerlessDataProcessingPipelineStack.DynamoDBTableName = ServerlessDataProcessingPipeline-DynamoDBTable59784FC0-072648f2
ServerlessDataProcessingPipelineStack.Environment = LocalStack
ServerlessDataProcessingPipelineStack.KinesisStreamName = KinesisStream
Stack ARN:
arn:aws:cloudformation:us-east-1:000000000000:stack/ServerlessDataProcessingPipelineStack/68a8d688
✨ Total time: 34.4s
localstack@macintosh serverless-data-processing-pipeline % export APIGW_ENDPOINT="https://tsyeuri986.execute-api.localhost.localstack.cloud:4566/prod/"
```
Followed by a sample request:
```sh
localstack@macintosh serverless-data-processing-pipeline % timestamp=$(awk 'BEGIN {srand(); print srand()}')
localstack@macintosh serverless-data-processing-pipeline % curl -XPOST -H "Content-Type: application/json" $APIGW_ENDPOINT -d "$(jq -n --arg ts "$timestamp" '{id: "1", message: "Hello World", timestamp: $ts | tonumber}')" -i
HTTP/2 200
content-type: application/json
content-length: 21
date: Fri, 31 May 2024 18:07:54 GMT
server: hypercorn-h2
{"message":"success"}
```
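The same request can also be issued from Go instead of curl; a small, self-contained sketch that reads the endpoint from the `APIGW_ENDPOINT` variable exported above:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	endpoint := os.Getenv("APIGW_ENDPOINT")

	// Same payload as the curl example: id, message, and a Unix timestamp.
	payload, _ := json.Marshal(map[string]any{
		"id":        "1",
		"message":   "Hello World",
		"timestamp": time.Now().Unix(),
	})

	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(payload))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```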
## Stress Test
```sh
$ k6 run -e APIGW_ENDPOINT=$APIGW_ENDPOINT loadtest.js
execution: local
script: loadtest.js
output: -
scenarios: (100.00%) 1 scenario, 10 max VUs, 1m30s max duration (incl. graceful stop):
* default: 10 looping VUs for 1m0s (gracefulStop: 30s)
✓ status was 200
✓ transaction time OK
checks.........................: 100.00% ✓ 3432 ✗ 0
data_received..................: 272 kB 4.5 kB/s
data_sent......................: 235 kB 3.9 kB/s
http_req_blocked...............: avg=484.25µs min=0s med=1µs max=87.94ms p(90)=1µs p(95)=1µs
http_req_connecting............: avg=4.97µs min=0s med=0s max=940µs p(90)=0s p(95)=0s
http_req_duration..............: avg=350.92ms min=203.55ms med=339.86ms max=725.44ms p(90)=406.8ms p(95)=488.96ms
{ expected_response:true }...: avg=350.92ms min=203.55ms med=339.86ms max=725.44ms p(90)=406.8ms p(95)=488.96ms
http_req_failed................: 0.00% ✓ 0 ✗ 1658
http_req_receiving.............: avg=41.04ms min=31.15ms med=40.83ms max=55.17ms p(90)=42.03ms p(95)=42.94ms
http_req_sending...............: avg=64.99µs min=12µs med=42µs max=2.06ms p(90)=114.5µs p(95)=150µs
http_req_tls_handshaking.......: avg=213.15µs min=0s med=0s max=41.6ms p(90)=0s p(95)=0s
http_req_waiting...............: avg=309.81ms min=162.9ms med=298.6ms max=689.61ms p(90)=366.07ms p(95)=441.45ms
http_reqs......................: 1658 28.289592/s
iteration_duration.............: avg=351.62ms min=225.77ms med=340.5ms max=725.59ms p(90)=406.92ms p(95)=489.08ms
iterations.....................: 1658 28.289592/s
vus............................: 10 min=10 max=10
vus_max........................: 10 min=10 max=10
running (1m00.7s), 00/10 VUs, 1658 complete and 0 interrupted iterations
default ✓ [======================================] 10 VUs 1m0s
```
Then wait until all requests have been processed by the `midstream` and `downstream` Lambda functions, and save the timestamps that record how long each request took to flow through the entire pipeline.
```sh
$ ./wait_requests timestamps.json
Monitoring CloudWatch metrics for new datapoints...
No new datapoints added. Exiting.
Exporting CloudWatch metrics to timestamps.json...
```
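`wait_requests` is a script shipped with this repository. The sketch below is only a rough Go illustration of the idea it implements (poll a CloudWatch metric until the datapoint count stops growing); the namespace and metric name are placeholders, and the export step is omitted, so this is not the script's actual behavior.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// countDatapoints sums the SampleCount statistic for a metric over the last hour.
func countDatapoints(ctx context.Context, cw *cloudwatch.Client, namespace, metric string) (float64, error) {
	out, err := cw.GetMetricStatistics(ctx, &cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String(namespace),
		MetricName: aws.String(metric),
		StartTime:  aws.Time(time.Now().Add(-1 * time.Hour)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int32(3600),
		Statistics: []types.Statistic{types.StatisticSampleCount},
	})
	if err != nil {
		return 0, err
	}
	var total float64
	for _, dp := range out.Datapoints {
		total += aws.ToFloat64(dp.SampleCount)
	}
	return total, nil
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	cw := cloudwatch.NewFromConfig(cfg)

	// Placeholder names; the actual namespace/metric are defined by the downstream Lambda.
	namespace, metric := "Pipeline", "EndToEndLatency"

	fmt.Println("Monitoring CloudWatch metrics for new datapoints...")
	prev := -1.0
	for {
		n, err := countDatapoints(ctx, cw, namespace, metric)
		if err != nil {
			panic(err)
		}
		if n == prev {
			fmt.Println("No new datapoints added. Exiting.")
			return
		}
		prev = n
		time.Sleep(10 * time.Second)
	}
}
```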
## Compute Results
We want to see how long it takes (percentiles, averages, etc.) for a request to run through the entire serverless pipeline while the VUs (virtual users) hit it with a continuous stream of requests for a full minute.
```python
import pandas as pd
# Load the data from the JSON file into a pandas DataFrame
data = pd.read_json('timestamps.json')
# Calculate the desired statistics
stats = data.describe(percentiles=[.90, .95, .99])
# Print the statistics
print(stats)
```
The output of that would be:
```text
count 93.000000
mean 40.905096
std 16.211245
min 6.285714
50% 47.187500
90% 56.971429
95% 57.614691
99% 58.549990
max 58.939394
```
The `count` row tells us the experiment produced datapoints over 93 seconds, but our `k6` test only ran for 60 seconds, so some backlogging occurred. Given the `1658` total inbound requests, LocalStack managed to process about 17.8 requests/s; in other words, the pipeline completed roughly 17.8 runs per second. With 10 virtual users, that works out to about 1.78 requests/s per VU.
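For reference, the throughput arithmetic above is just the request count from the `k6` summary divided by the number of seconds with datapoints; a tiny Go snippet reproducing it:

```go
package main

import "fmt"

func main() {
	const totalRequests = 1658.0    // http_reqs from the k6 summary above
	const datapointSeconds = 93.0   // count of per-second datapoints in timestamps.json
	const vus = 10.0                // virtual users in the load test

	throughput := totalRequests / datapointSeconds
	fmt.Printf("pipeline throughput: %.1f req/s (%.2f req/s per VU)\n", throughput, throughput/vus)
}
```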