https://github.com/gakas14/aws-serverless-data-lake
This workshop builds a serverless data lake architecture using Amazon Kinesis Data Firehose for streaming data ingestion, AWS Glue for data integration (ETL and catalog management), Amazon S3 for data lake storage, and Amazon Athena for SQL big data analytics.
- Host: GitHub
- URL: https://github.com/gakas14/aws-serverless-data-lake
- Owner: gakas14
- Created: 2022-11-18T07:26:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-11-23T07:49:36.000Z (over 2 years ago)
- Last Synced: 2024-12-31T12:34:33.589Z (4 months ago)
- Topics: athena, aws, data-lake, etl, glue-catalog, glue-etl, kinesis-firehose, kinesis-stream, s3, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 56.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# AWS-Serverless-Data-Lake
To demonstrate the power of data lake architectures, this workshop ingests streaming data from the Kinesis Data Generator (KDG) into Amazon S3 and then builds a big data processing pipeline, without servers or clusters, that is ready to handle huge amounts of data. The dataset is an open dataset from the AWS Open Data Registry called GDELT; it is roughly 170 GB in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. Finally, I used the larger public dataset, with more tables, to observe the various AWS services working together through Amazon Athena.

1. Create a CloudFormation stack by uploading the template file (serverlessDataLakeDay.json); a deployment sketch follows below.
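As a minimal sketch of step 1, the stack can also be created programmatically with boto3. The stack name, region, and IAM capability flag below are assumptions, not values taken from the workshop.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")  # assumed region

# Read the workshop template downloaded from this repository
with open("serverlessDataLakeDay.json") as f:
    template_body = f.read()

# Stack name is hypothetical; CAPABILITY_NAMED_IAM is required only if the
# template creates named IAM resources, which is assumed here.
cfn.create_stack(
    StackName="serverless-data-lake-day",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```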
2. Create a Kinesis Data Firehose delivery stream to ingest data into your data lake.
3. Install the Kinesis Data Generator tool (KDG) and use it to send test records (see the sketch below).
- Monitor the Firehose delivery stream.
- Amazon Kinesis Data Firehose writes the data to Amazon S3.
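If you want to test the delivery stream without KDG, a single record can be pushed with boto3. The stream name and record fields below are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")  # assumed region

# A synthetic record similar to what KDG would generate; schema is hypothetical
record = {"sensorId": 42, "currentTemperature": 21, "status": "OK"}

firehose.put_record(
    DeliveryStreamName="serverless-datalake-stream",  # assumed stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```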
4. Catalog your data with AWS Glue.
- Create a crawler to automatically discover the schema of your data in S3 (see the sketch below).
- Create a database and a table, then edit the metadata schema.
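A crawler equivalent to the console steps above can be defined with boto3. The crawler name, IAM role, catalog database, and S3 path are all assumptions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # assumed region

# Crawler that scans the raw zone of the lake and populates the Data Catalog
glue.create_crawler(
    Name="datalake-raw-crawler",              # hypothetical name
    Role="AWSGlueServiceRole-datalake",       # assumed IAM role with S3 access
    DatabaseName="serverless_datalake",       # assumed catalog database
    Targets={"S3Targets": [{"Path": "s3://your-data-lake-bucket/raw/"}]},
)
glue.start_crawler(Name="datalake-raw-crawler")
```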
5. Create a transformation job with AWS Glue Studio (sketched below).
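Glue Studio generates a PySpark script behind the visual editor. The sketch below shows the general shape of such a job; the database, table, field mappings, and output path are assumptions rather than values from the workshop.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler registered (names are hypothetical)
source = glue_context.create_dynamic_frame.from_catalog(
    database="serverless_datalake", table_name="raw")

# Basic transformation: rename fields while keeping their types
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("sensorid", "int", "sensor_id", "int"),
        ("currenttemperature", "int", "temperature_c", "int"),
    ],
)

# Write the result back to S3 as Parquet (bucket/prefix are assumptions)
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://your-data-lake-bucket/transformed/"},
    format="parquet",
)

job.commit()
```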
6. Run SQL analytics on a large-scale open dataset using Amazon Athena.
- Create a database:

```sql
CREATE DATABASE gdelt;
```
- Create a metadata table for the GDELT events data:
```sql
CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
`globaleventid` INT,
`day` INT,
`monthyear` INT,
`year` INT,
`fractiondate` FLOAT,
`actor1code` string,
`actor1name` string,
`actor1countrycode` string,
`actor1knowngroupcode` string,
`actor1ethniccode` string,
`actor1religion1code` string,
`actor1religion2code` string,
`actor1type1code` string,
`actor1type2code` string,
`actor1type3code` string,
`actor2code` string,
`actor2name` string,
`actor2countrycode` string,
`actor2knowngroupcode` string,
`actor2ethniccode` string,
`actor2religion1code` string,
`actor2religion2code` string,
`actor2type1code` string,
`actor2type2code` string,
`actor2type3code` string,
`isrootevent` BOOLEAN,
`eventcode` string,
`eventbasecode` string,
`eventrootcode` string,
`quadclass` INT,
`goldsteinscale` FLOAT,
`nummentions` INT,
`numsources` INT,
`numarticles` INT,
`avgtone` FLOAT,
`actor1geo_type` INT,
`actor1geo_fullname` string,
`actor1geo_countrycode` string,
`actor1geo_adm1code` string,
`actor1geo_lat` FLOAT,
`actor1geo_long` FLOAT,
`actor1geo_featureid` INT,
`actor2geo_type` INT,
`actor2geo_fullname` string,
`actor2geo_countrycode` string,
`actor2geo_adm1code` string,
`actor2geo_lat` FLOAT,
`actor2geo_long` FLOAT,
`actor2geo_featureid` INT,
`actiongeo_type` INT,
`actiongeo_fullname` string,
`actiongeo_countrycode` string,
`actiongeo_adm1code` string,
`actiongeo_lat` FLOAT,
`actiongeo_long` FLOAT,
`actiongeo_featureid` INT,
`dateadded` INT,
`sourceurl` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 'serialization.format' = '\t',
'field.delim' = '\t')
LOCATION 's3://gdelt-open-data/events/';
```
- Create metadata tables for the GDELT lookup data.
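Once the tables exist, queries can be submitted from Python as well as from the Athena console. The results bucket below is an assumption, and the aggregate query is just an illustrative example over gdelt.events.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # assumed region

# Illustrative aggregate: count GDELT events per year
response = athena.start_query_execution(
    QueryString=(
        'SELECT "year", COUNT(*) AS event_count '
        'FROM gdelt.events GROUP BY "year" ORDER BY "year"'
    ),
    QueryExecutionContext={"Database": "gdelt"},
    ResultConfiguration={
        "OutputLocation": "s3://your-athena-results-bucket/"  # assumed bucket
    },
)
print("Query started:", response["QueryExecutionId"])
```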
This workshop is based on an AWS Workshop Studio lab, linked below:
https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US