# AWS-Serverless-Data-Lake

To demonstrate the power of data lake architectures, in this workshop I built a big data processing pipeline with no servers or clusters to manage. I ingested streaming data from the Kinesis Data Generator (KDG) into Amazon S3 and created an AWS Glue transform job to perform basic transformations on the S3 source data. Finally, to observe the various AWS services working together, I ran SQL analytics with Amazon Athena against a larger public dataset: GDELT, from the AWS Open Data Registry, which is over 170 GB in size and comprises thousands of uncompressed CSV files.

1. Create a CloudFormation stack and upload the template file (serverlessDataLakeDay.json)
2. Create a Kinesis Firehose delivery stream to ingest data into your data lake
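The delivery stream from step 2 can be sketched as the parameters you would pass to `boto3.client("firehose").create_delivery_stream(**params)`. The stream name, bucket, role ARN, prefix, and buffering values below are illustrative placeholders, not the workshop's exact settings:

```python
def firehose_params(stream_name, bucket_arn, role_arn):
    """Build a Direct PUT -> S3 delivery stream configuration (sketch).

    Pass the result to boto3.client("firehose").create_delivery_stream(**params).
    """
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # the KDG writes directly to the stream
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "Prefix": "raw/",  # landing prefix in the data lake bucket
            "BufferingHints": {
                "SizeInMBs": 5,           # flush after 5 MB ...
                "IntervalInSeconds": 60,  # ... or after 60 seconds
            },
            "CompressionFormat": "UNCOMPRESSED",
        },
    }

# Placeholder names and ARNs -- substitute the ones from your account.
params = firehose_params(
    "sdld-stream",
    "arn:aws:s3:::my-datalake-bucket",
    "arn:aws:iam::123456789012:role/firehose-delivery-role",
)
```

Buffering hints control how often Firehose writes objects to S3; smaller values mean more, smaller objects.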


3. Install the Kinesis Data Generator Tool (KDG)
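KDG record templates use faker.js-style `{{...}}` expressions to generate each field. A minimal illustrative template (the field names here are examples, not the workshop's schema):

```json
{
  "sensorId": {{random.number(50)}},
  "currentTemperature": {{random.number({"min": 10, "max": 150})}},
  "status": "{{random.arrayElement(["OK", "FAIL", "WARN"])}}"
}
```

Each record the KDG sends to the Firehose delivery stream is rendered from this template with fresh random values.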

Monitoring for the Firehose Delivery Stream

Amazon Kinesis Firehose writes data to Amazon S3

4. Cataloging your Data with AWS Glue
- Create a crawler to automatically discover the schema of your data in S3


- Create a database and a table, then edit the metadata schema
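The crawler above can also be sketched as the parameters for `boto3.client("glue").create_crawler(**params)`. The crawler name, role, database, and S3 path are placeholders for illustration:

```python
def crawler_params(name, role_arn, database, s3_path):
    """Build a Glue crawler configuration that infers the schema of S3 data (sketch).

    Pass the result to boto3.client("glue").create_crawler(**params).
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database the discovered tables go into
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",  # refresh schema on re-crawl
            "DeleteBehavior": "LOG",
        },
    }

# Placeholder names -- substitute your own role, database, and bucket.
params = crawler_params(
    "sdld-crawler",
    "arn:aws:iam::123456789012:role/glue-crawler-role",
    "sdld_db",
    "s3://my-datalake-bucket/raw/",
)
```

After the crawler runs, the discovered table appears in the Glue Data Catalog, where its schema can be edited by hand as in the step above.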


5. Create a Transformation Job with Glue Studio
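Glue Studio builds the transformation job visually and generates its script, but the resulting job can be sketched as the parameters for `boto3.client("glue").create_job(**params)`. Names, ARNs, worker sizing, and the script location are illustrative placeholders:

```python
def glue_job_params(name, role_arn, script_location):
    """Build a Glue ETL job definition (sketch).

    Pass the result to boto3.client("glue").create_job(**params).
    """
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",              # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",   # 1 DPU per worker
        "NumberOfWorkers": 2,
    }

# Placeholder values -- Glue Studio fills these in when you save the job.
params = glue_job_params(
    "sdld-transform-job",
    "arn:aws:iam::123456789012:role/glue-job-role",
    "s3://my-datalake-bucket/scripts/transform.py",
)
```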

6. SQL Analytics on a Large-Scale Open Dataset using Amazon Athena

- Create a database

```sql
CREATE DATABASE gdelt;
```

- Create Metadata Table for GDELT EVENTS Data

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
  `globaleventid` INT,
  `day` INT,
  `monthyear` INT,
  `year` INT,
  `fractiondate` FLOAT,
  `actor1code` string,
  `actor1name` string,
  `actor1countrycode` string,
  `actor1knowngroupcode` string,
  `actor1ethniccode` string,
  `actor1religion1code` string,
  `actor1religion2code` string,
  `actor1type1code` string,
  `actor1type2code` string,
  `actor1type3code` string,
  `actor2code` string,
  `actor2name` string,
  `actor2countrycode` string,
  `actor2knowngroupcode` string,
  `actor2ethniccode` string,
  `actor2religion1code` string,
  `actor2religion2code` string,
  `actor2type1code` string,
  `actor2type2code` string,
  `actor2type3code` string,
  `isrootevent` BOOLEAN,
  `eventcode` string,
  `eventbasecode` string,
  `eventrootcode` string,
  `quadclass` INT,
  `goldsteinscale` FLOAT,
  `nummentions` INT,
  `numsources` INT,
  `numarticles` INT,
  `avgtone` FLOAT,
  `actor1geo_type` INT,
  `actor1geo_fullname` string,
  `actor1geo_countrycode` string,
  `actor1geo_adm1code` string,
  `actor1geo_lat` FLOAT,
  `actor1geo_long` FLOAT,
  `actor1geo_featureid` INT,
  `actor2geo_type` INT,
  `actor2geo_fullname` string,
  `actor2geo_countrycode` string,
  `actor2geo_adm1code` string,
  `actor2geo_lat` FLOAT,
  `actor2geo_long` FLOAT,
  `actor2geo_featureid` INT,
  `actiongeo_type` INT,
  `actiongeo_fullname` string,
  `actiongeo_countrycode` string,
  `actiongeo_adm1code` string,
  `actiongeo_lat` FLOAT,
  `actiongeo_long` FLOAT,
  `actiongeo_featureid` INT,
  `dateadded` INT,
  `sourceurl` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '\t',
  'field.delim' = '\t'
)
LOCATION 's3://gdelt-open-data/events/';
```
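With the table defined, Athena can query the raw CSV files in place. An illustrative query (the column names come from the DDL above; the choice of year and aggregation is just an example):

```sql
-- Top 10 most-mentioned primary actors in a given year
SELECT actor1name,
       SUM(nummentions) AS total_mentions
FROM gdelt.events
WHERE year = 2022
  AND actor1name IS NOT NULL
GROUP BY actor1name
ORDER BY total_mentions DESC
LIMIT 10;
```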

- Create Metadata Table for GDELT Lookup Tables
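The lookup tables follow the same pattern as the events table: a tab-delimited external table over an S3 prefix. A sketch for one lookup table; the S3 location and column names here are placeholders, so use the ones given in the workshop:

```sql
-- Illustrative lookup-table DDL (placeholder path and columns)
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.eventcodes (
  `eventcode` string,
  `description` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = '\t')
LOCATION 's3://your-bucket/gdelt/eventcodes/';
```

Joining events against these lookup tables turns the numeric and coded columns into human-readable descriptions.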

- Example output (query results screenshots are in the repository)

This workshop is based on the AWS Workshop Studio lab linked below.
https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US