Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alexcasalboni/serverless-data-pipeline-sam
Serverless Data Pipeline powered by Kinesis Firehose, API Gateway, Lambda, S3, and Athena
amazon-web-services aws aws-lambda aws-s3 aws-sam cloudformation data-pipeline
- Host: GitHub
- URL: https://github.com/alexcasalboni/serverless-data-pipeline-sam
- Owner: alexcasalboni
- License: apache-2.0
- Created: 2017-11-07T22:33:29.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2018-10-22T23:27:40.000Z (about 6 years ago)
- Last Synced: 2024-10-13T09:07:21.870Z (about 1 month ago)
- Topics: amazon-web-services, aws, aws-lambda, aws-s3, aws-sam, cloudformation, data-pipeline
- Language: Python
- Size: 14.6 KB
- Stars: 86
- Watchers: 8
- Forks: 31
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Serverless Data Pipeline - Powered by AWS SAM
Serverless Data Pipeline built with Amazon API Gateway, AWS Lambda, Amazon Kinesis Firehose, Amazon S3, and Amazon Athena.
## How to deploy the stack
See `scripts/deploy.sh` (customize your deployment bucket and stack name).
## How to ingest new records via API
See `scripts/track.sh` (customize your stack name).
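As a rough Python counterpart to a tracking call, the sketch below builds an HTTP POST request carrying one JSON record. It does not reproduce `scripts/track.sh`: the helper name `build_track_request` is hypothetical, and the URL is a placeholder that you would replace with the `TrackURL` output of your stack.

```python
import json
import urllib.request


def build_track_request(track_url, record):
    """Build (but do not send) a POST request carrying one JSON record."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        track_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL (assumption): use the TrackURL output of your own stack.
request = build_track_request(
    "https://example.invalid/track",
    {"name": "John", "action": "charge", "value": 100},
)

# Uncomment to actually submit the record:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it makes it easy to inspect the payload before anything reaches the public endpoint.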
## What kind of queries can I run on the dataset?
It depends on the data that you collect and on the virtual tables that you define on Athena and Glue.
The file `queries.sql` contains a few sample queries that you can run with the default schema (e.g. `{"name": "John", "action": "charge", "value": 100}`).
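To give a feel for the kind of aggregation the default schema allows, here is a minimal sketch that assembles an Athena SQL string in Python. The query is illustrative and is not taken from `queries.sql`; the helper name and the table name `events` are hypothetical, and the query assumes that `value` is defined as a numeric column.

```python
def build_total_by_name_query(table_name, action="charge"):
    """Return an Athena SQL string that sums "value" per name for one action."""
    # Table names cannot be passed as bound parameters, so validate them.
    if not table_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table_name!r}")
    # Escape single quotes in the string literal.
    safe_action = action.replace("'", "''")
    return (
        f'SELECT name, SUM("value") AS total\n'
        f"FROM {table_name}\n"
        f"WHERE action = '{safe_action}'\n"
        f"GROUP BY name\n"
        f"ORDER BY total DESC"
    )


# "events" is a placeholder: use the table name configured for your stack.
query = build_total_by_name_query("events")
print(query)
```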
## Resources list
This stack will create the following resources:
* An **API Gateway endpoint** that you can use to `track` events by submitting any JSON data via the HTTP POST method
* A **Kinesis Firehose Delivery Stream** that will buffer, optionally compress, and write each record into S3
* A **Lambda Function** to process/manipulate/clean/skip records before they get written into S3
* An **S3 Bucket** that will contain all the collected data
* Three **Athena Named Queries** to get started quickly with serverless queries
* An **IAM Role and Policy** for API Gateway
* An **IAM Role and Policy** for Kinesis Firehose
## Parameters
* **ApiStageName**: The API Gateway Stage name (e.g. dev, prod, etc.)
* **FirehoseS3Prefix**: The S3 Key prefix for Kinesis Firehose
* **FirehoseCompressionFormat**: The compression format used by Kinesis Firehose
* **FirehoseBufferingInterval**: How long Firehose will wait before writing a new batch into S3
* **FirehoseBufferingSize**: The maximum batch size in MB
* **LambdaTimeout**: Lambda's max execution time in seconds
* **LambdaMemorySize**: The Lambda Function's memory size in MB
* **AthenaDatabaseName**: The Athena database name
* **AthenaTableName**: The Athena table name
## Outputs
* **TrackURL**: The public URL to submit new records
* **BucketName**: The bucket that will store your data
* **FunctionName**: The Lambda Function that will process/validate records
## Gotchas
* The architecture is 100% serverless (no hourly costs, no servers to manage)
* The API Gateway endpoint is publicly accessible (i.e. any browser or anonymous website user can potentially submit new records/events)
* You can customize the template to enable encryption at rest for Kinesis Firehose
* You can configure Kinesis Firehose's buffering (see Parameters above)
* Athena's Named Queries cannot be updated (you need to create a new query with a different logical name)
* Make sure the S3 bucket is empty when you delete the stack
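The record-processing Lambda Function from the resources list receives batches of base64-encoded records from Kinesis Firehose and must return one result per `recordId`, marked `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below shows that general contract and is not the function shipped in this repository; the validation rule (keep only JSON objects with a string `name`) is a hypothetical example.

```python
import base64
import json


def handler(event, context):
    """Clean valid records and drop the rest, one result per incoming record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            payload = None  # not base64-encoded JSON

        # Hypothetical validation rule: keep only objects with a string "name".
        if not isinstance(payload, dict) or not isinstance(payload.get("name"), str):
            output.append({
                "recordId": record["recordId"],
                "result": "Dropped",
                "data": record["data"],
            })
            continue

        payload["name"] = payload["name"].strip()
        # Newline-terminate each record so S3 objects hold one JSON document per line.
        cleaned = (json.dumps(payload) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(cleaned).decode("utf-8"),
        })
    return {"records": output}
```

Writing one JSON document per line matters for the query side, since Athena's JSON SerDes expect each record on its own line.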