https://github.com/gakas14/aws-serverless-data-lake
This workshop builds a serverless data lake architecture using Amazon Kinesis Data Firehose for streaming data ingestion, AWS Glue for data integration (ETL and catalog management), Amazon S3 for data lake storage, and Amazon Athena for SQL big data analytics.
- Host: GitHub
- URL: https://github.com/gakas14/aws-serverless-data-lake
- Owner: gakas14
- Created: 2022-11-18T07:26:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-11-23T07:49:36.000Z (over 2 years ago)
- Last Synced: 2024-12-31T12:34:33.589Z (4 months ago)
- Topics: athena, aws, data-lake, etl, glue-catalog, glue-etl, kinesis-firehose, kinesis-stream, s3, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 56.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AWS-Serverless-Data-Lake
To demonstrate the power of data lake architectures, this workshop ingests streaming data from the Kinesis Data Generator (KDG) into Amazon S3 and then builds a big data processing pipeline, without servers or clusters, that is ready to handle huge amounts of data. The dataset is an open dataset from the AWS Open Data Registry called GDELT; it is roughly 170 GB in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. Finally, I used the larger public dataset, with more tables, to observe the various AWS services working together through Amazon Athena.

1. Create a CloudFormation stack by uploading the template file (serverlessDataLakeDay.json); a deployment sketch follows below.
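As a minimal sketch of step 1, the stack can also be created programmatically with boto3. The stack name, region, and IAM capability flag below are assumptions, not values taken from the workshop.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # assumed region

# Read the workshop template downloaded from this repository
with open("serverlessDataLakeDay.json") as f:
    template_body = f.read()

# Stack name is hypothetical; CAPABILITY_NAMED_IAM is required only if the
# template creates named IAM resources, which is assumed here.
cfn.create_stack(
    StackName="serverless-data-lake-day",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```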
2. Create a Kinesis Data Firehose delivery stream to ingest data into your data lake.
3. Install the Kinesis Data Generator tool (KDG) and use it to send test records (see the sketch below).
- Monitor the Firehose delivery stream.
- Amazon Kinesis Data Firehose writes the data to Amazon S3.
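If you want to test the delivery stream without KDG, a single record can be pushed with boto3. The stream name and record fields below are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # assumed region

# A synthetic record similar to what KDG would generate; schema is hypothetical
record = {"sensorId": 42, "currentTemperature": 21, "status": "OK"}

firehose.put_record(
    DeliveryStreamName="serverless-datalake-stream",  # assumed stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```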
4. Catalog your data with AWS Glue.
- Create a crawler to automatically discover the schema of your data in S3 (see the sketch below).
- Create a database and a table, then edit the metadata schema.
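A crawler equivalent to the console steps above can be defined with boto3. The crawler name, IAM role, catalog database, and S3 path are all assumptions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Crawler that scans the raw zone of the lake and populates the Data Catalog
glue.create_crawler(
    Name="datalake-raw-crawler",              # hypothetical name
    Role="AWSGlueServiceRole-datalake",       # assumed IAM role with S3 access
    DatabaseName="serverless_datalake",       # assumed catalog database
    Targets={"S3Targets": [{"Path": "s3://your-data-lake-bucket/raw/"}]},
)
glue.start_crawler(Name="datalake-raw-crawler")
```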
5. Create a transformation job with AWS Glue Studio (sketched below).
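Glue Studio generates a PySpark script behind the visual editor. The sketch below shows the general shape of such a job; the database, table, field mappings, and output path are assumptions rather than values from the workshop.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler registered (names are hypothetical)
source = glue_context.create_dynamic_frame.from_catalog(
    database="serverless_datalake", table_name="raw")

# Basic transformation: rename fields while keeping their types
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("sensorid", "int", "sensor_id", "int"),
        ("currenttemperature", "int", "temperature_c", "int"),
    ],
)

# Write the result back to S3 as Parquet (bucket/prefix are assumptions)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://your-data-lake-bucket/transformed/"},
    format="parquet",
)

job.commit()
```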
6. Run SQL analytics on a large-scale open dataset using Amazon Athena.
- Create a database:

```sql
CREATE DATABASE gdelt;
```
- Create a metadata table for the GDELT events data:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
`globaleventid` INT,
`day` INT,
`monthyear` INT,
`year` INT,
`fractiondate` FLOAT,
`actor1code` string,
`actor1name` string,
`actor1countrycode` string,
`actor1knowngroupcode` string,
`actor1ethniccode` string,
`actor1religion1code` string,
`actor1religion2code` string,
`actor1type1code` string,
`actor1type2code` string,
`actor1type3code` string,
`actor2code` string,
`actor2name` string,
`actor2countrycode` string,
`actor2knowngroupcode` string,
`actor2ethniccode` string,
`actor2religion1code` string,
`actor2religion2code` string,
`actor2type1code` string,
`actor2type2code` string,
`actor2type3code` string,
`isrootevent` BOOLEAN,
`eventcode` string,
`eventbasecode` string,
`eventrootcode` string,
`quadclass` INT,
`goldsteinscale` FLOAT,
`nummentions` INT,
`numsources` INT,
`numarticles` INT,
`avgtone` FLOAT,
`actor1geo_type` INT,
`actor1geo_fullname` string,
`actor1geo_countrycode` string,
`actor1geo_adm1code` string,
`actor1geo_lat` FLOAT,
`actor1geo_long` FLOAT,
`actor1geo_featureid` INT,
`actor2geo_type` INT,
`actor2geo_fullname` string,
`actor2geo_countrycode` string,
`actor2geo_adm1code` string,
`actor2geo_lat` FLOAT,
`actor2geo_long` FLOAT,
`actor2geo_featureid` INT,
`actiongeo_type` INT,
`actiongeo_fullname` string,
`actiongeo_countrycode` string,
`actiongeo_adm1code` string,
`actiongeo_lat` FLOAT,
`actiongeo_long` FLOAT,
`actiongeo_featureid` INT,
`dateadded` INT,
`sourceurl` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 'serialization.format' = '\t',
'field.delim' = '\t')
LOCATION 's3://gdelt-open-data/events/';
```
- Create metadata tables for the GDELT lookup data.
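Once the tables exist, queries can be submitted from Python as well as from the Athena console. The results bucket below is an assumption, and the aggregate query is just an illustrative example over gdelt.events.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Illustrative aggregate: count GDELT events per year
response = athena.start_query_execution(
    QueryString=(
        'SELECT "year", COUNT(*) AS event_count '
        'FROM gdelt.events GROUP BY "year" ORDER BY "year"'
    ),
    QueryExecutionContext={"Database": "gdelt"},
    ResultConfiguration={
        "OutputLocation": "s3://your-athena-results-bucket/"  # assumed bucket
    },
)
print("Query started:", response["QueryExecutionId"])
```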
This workshop is based on an AWS Workshop Studio lab, linked below:
https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US