{"id":18909494,"url":"https://github.com/gakas14/aws-serverless-data-lake","last_synced_at":"2026-04-26T20:32:15.189Z","repository":{"id":63404783,"uuid":"567620585","full_name":"gakas14/AWS-Serverless-Data-Lake","owner":"gakas14","description":"This workshop is to build a serverless data lake architecture using Amazon Kinesis Firehose for streaming data ingestion, AWS Glue for Data Integration (ETL, Catalogue Management), Amazon S3 for data lake storage, Amazon Athena for SQL big data analytics.","archived":false,"fork":false,"pushed_at":"2022-11-23T07:49:36.000Z","size":58,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-01T02:12:05.643Z","etag":null,"topics":["athena","aws","data-lake","etl","glue-catalog","glue-etl","kinesis-firehose","kinesis-stream","s3","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gakas14.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-18T07:26:01.000Z","updated_at":"2022-11-18T08:41:04.000Z","dependencies_parsed_at":"2023-01-23T18:00:15.095Z","dependency_job_id":null,"html_url":"https://github.com/gakas14/AWS-Serverless-Data-Lake","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gakas14/AWS-Serverless-Data-Lake","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gakas14%2FAWS-Serverless-Data-Lake","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gakas14%2FAWS-Serverless-Data-Lake/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/gakas14%2FAWS-Serverless-Data-Lake/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gakas14%2FAWS-Serverless-Data-Lake/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gakas14","download_url":"https://codeload.github.com/gakas14/AWS-Serverless-Data-Lake/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gakas14%2FAWS-Serverless-Data-Lake/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32312276,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-26T19:15:34.056Z","status":"ssl_error","status_checked_at":"2026-04-26T19:15:15.467Z","response_time":129,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["athena","aws","data-lake","etl","glue-catalog","glue-etl","kinesis-firehose","kinesis-stream","s3","sql"],"created_at":"2024-11-08T09:34:01.002Z","updated_at":"2026-04-26T20:32:15.169Z","avatar_url":"https://github.com/gakas14.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS-Serverless-Data-Lake\n\nTo demonstrate the power of data lake architectures, \nIn this workshop, I ingested streaming data from the Kinesis Data Generator (KDG) into Amazon S3. 
I then created a big data processing pipeline without servers or clusters that is ready to process huge amounts of data. The dataset, GDELT, is an open dataset in the AWS Open Data Registry; it is roughly 170 GB or more in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. Finally, I used the larger public dataset, which has more tables, to observe the various AWS services working together through Amazon Athena.\n\n\n1.  Create a CloudFormation template and upload this file (serverlessDataLakeDay.json)\n2. Create a Kinesis Firehose Delivery Stream to ingest data into your Data Lake\n\n\u003cimg width=\"1269\" alt=\"Screen Shot 2022-11-18 at 3 45 07 PM\" src=\"https://user-images.githubusercontent.com/74584964/202648990-a0383525-6e2d-42e4-918c-b2a677cc1635.png\"\u003e\n\n3. Install the Kinesis Data Generator Tool (KDG)\n\u003cimg width=\"1168\" alt=\"Screen Shot 2022-11-18 at 3 45 57 PM\" src=\"https://user-images.githubusercontent.com/74584964/202649439-55ef1789-1a42-49b4-bcce-46ad8e052c4e.png\"\u003e\n\n Monitoring for the Firehose Delivery Stream\n\u003cimg width=\"891\" alt=\"Screen Shot 2022-11-18 at 3 46 15 PM\" src=\"https://user-images.githubusercontent.com/74584964/202649517-8c0485c7-e9a5-43cb-bc98-e1795fe229a8.png\"\u003e\n\nAmazon Kinesis Firehose writes data to Amazon S3\n\u003cimg width=\"1251\" alt=\"Screen Shot 2022-11-18 at 3 47 52 PM\" src=\"https://user-images.githubusercontent.com/74584964/202649610-8281f5da-9da0-4840-b980-0a4896f703bf.png\"\u003e\n\n4. Cataloging your Data with AWS Glue\n  - Create a crawler to auto-discover the schema of your data in S3\n\n\u003cimg width=\"1267\" alt=\"Screen Shot 2022-11-18 at 3 55 17 PM\" src=\"https://user-images.githubusercontent.com/74584964/202650222-5b5f9847-78a9-445e-b103-e31ff1009ae5.png\"\u003e\n  \n   - Create a database and a table, then edit the metadata schema\n      \n   \n5. 
Create a Transformation Job with Glue Studio\n  \u003cimg width=\"1251\" alt=\"Screen Shot 2022-11-18 at 4 00 28 PM\" src=\"https://user-images.githubusercontent.com/74584964/202651242-cd1bcb29-ba09-4b4b-8212-2986ba239059.png\"\u003e\n  \n  \u003cimg width=\"1201\" alt=\"Screen Shot 2022-11-18 at 4 01 24 PM\" src=\"https://user-images.githubusercontent.com/74584964/202651251-8d081150-cacd-4f26-bc5d-58bef371a146.png\"\u003e\n  \n6. SQL analytics on a Large Scale Open Dataset using Amazon Athena\n\n - Create a database\n    CREATE DATABASE gdelt;\n \n - Create Metadata Table for GDELT EVENTS Data\n  CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (\n        `globaleventid` INT,\n        `day` INT,\n        `monthyear` INT,\n        `year` INT,\n        `fractiondate` FLOAT,\n        `actor1code` string,\n        `actor1name` string,\n        `actor1countrycode` string,\n        `actor1knowngroupcode` string,\n        `actor1ethniccode` string,\n        `actor1religion1code` string,\n        `actor1religion2code` string,\n        `actor1type1code` string,\n        `actor1type2code` string,\n        `actor1type3code` string,\n        `actor2code` string,\n        `actor2name` string,\n        `actor2countrycode` string,\n        `actor2knowngroupcode` string,\n        `actor2ethniccode` string,\n        `actor2religion1code` string,\n        `actor2religion2code` string,\n        `actor2type1code` string,\n        `actor2type2code` string,\n        `actor2type3code` string,\n        `isrootevent` BOOLEAN,\n        `eventcode` string,\n        `eventbasecode` string,\n        `eventrootcode` string,\n        `quadclass` INT,\n        `goldsteinscale` FLOAT,\n        `nummentions` INT,\n        `numsources` INT,\n        `numarticles` INT,\n        `avgtone` FLOAT,\n        `actor1geo_type` INT,\n        `actor1geo_fullname` string,\n        `actor1geo_countrycode` string,\n        `actor1geo_adm1code` string,\n        `actor1geo_lat` FLOAT,\n        `actor1geo_long` 
FLOAT,\n        `actor1geo_featureid` INT,\n        `actor2geo_type` INT,\n        `actor2geo_fullname` string,\n        `actor2geo_countrycode` string,\n        `actor2geo_adm1code` string,\n        `actor2geo_lat` FLOAT,\n        `actor2geo_long` FLOAT,\n        `actor2geo_featureid` INT,\n        `actiongeo_type` INT,\n        `actiongeo_fullname` string,\n        `actiongeo_countrycode` string,\n        `actiongeo_adm1code` string,\n        `actiongeo_lat` FLOAT,\n        `actiongeo_long` FLOAT,\n        `actiongeo_featureid` INT,\n        `dateadded` INT,\n        `sourceurl` string\n)\nROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'\nWITH SERDEPROPERTIES ( 'serialization.format' = '\\t',\n            'field.delim' = '\\t')\nLOCATION 's3://gdelt-open-data/events/';\n  \n  - Create Metadata Table for GDELT Lookup Tables\n  \u003cimg width=\"924\" alt=\"Screen Shot 2022-11-18 at 4 05 48 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652225-6dd12a15-4ae4-4d60-9d1f-2e12dba2259f.png\"\u003e\n  \n  \u003cimg width=\"915\" alt=\"Screen Shot 2022-11-18 at 4 05 54 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652239-99ca12c9-9d15-49d0-9584-b366ac024670.png\"\u003e\n\n\u003cimg width=\"917\" alt=\"Screen Shot 2022-11-18 at 4 05 59 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652254-b178f6e3-5f2d-4705-ac0d-21c8f3b396f2.png\"\u003e\n\n\u003cimg width=\"916\" alt=\"Screen Shot 2022-11-18 at 4 06 05 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652271-e8e32ed8-f3ca-423c-b236-c49384c4e81c.png\"\u003e\n\n  - Example output:\n  \u003cimg width=\"785\" alt=\"Screen Shot 2022-11-18 at 4 08 06 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652569-bd806688-93e5-4e43-bb80-f748418641a0.png\"\u003e\n\n\u003cimg width=\"789\" alt=\"Screen Shot 2022-11-18 at 4 08 14 PM\" 
src=\"https://user-images.githubusercontent.com/74584964/202652586-f02fa64b-11f5-4aa2-852d-185108d1ad11.png\"\u003e\n\n\n\u003cimg width=\"691\" alt=\"Screen Shot 2022-11-18 at 4 08 24 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652595-1d25b22f-cb17-468b-b5f0-d07cac5c1057.png\"\u003e\n\n\n\u003cimg width=\"698\" alt=\"Screen Shot 2022-11-18 at 4 08 32 PM\" src=\"https://user-images.githubusercontent.com/74584964/202652621-790b106f-94e2-495c-83c4-68f02d609a11.png\"\u003e\n\n\nThis workshop is based on an AWS Workshop Studio workshop; the link is below.\nhttps://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgakas14%2Faws-serverless-data-lake","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgakas14%2Faws-serverless-data-lake","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgakas14%2Faws-serverless-data-lake/lists"}