Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dacort/modern-data-lake-storage-layers
Jupyter notebooks and AWS CloudFormation template to show how Hudi, Iceberg, and Delta Lake work
amazon-emr apache-hudi apache-iceberg aws delta-lake hudi iceberg
- Host: GitHub
- URL: https://github.com/dacort/modern-data-lake-storage-layers
- Owner: dacort
- License: cc0-1.0
- Created: 2022-02-02T18:36:48.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2022-07-13T05:01:43.000Z (over 2 years ago)
- Last Synced: 2023-04-11T15:32:43.199Z (almost 2 years ago)
- Topics: amazon-emr, apache-hudi, apache-iceberg, aws, delta-lake, hudi, iceberg
- Language: Jupyter Notebook
- Homepage: https://dacort.dev/posts/modern-data-lake-storage-layers/
- Size: 262 KB
- Stars: 33
- Watchers: 1
- Forks: 21
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Modern Data Lake Storage Layers
This repository contains supporting assets for my research in modern Data Lake storage layers like Apache Hudi, Apache Iceberg, and Delta Lake.
Specifically, there's a [CloudFormation template](cloudformation/emr-studio-cluster.cfn.yaml) that creates an EMR cluster and EMR Studio with the necessary requirements, plus Jupyter notebooks with the example walkthroughs.
You can view the corresponding [blog post](https://dacort.dev/posts/modern-data-lake-storage-layers/) and [video](https://www.youtube.com/watch?v=fryfx0Zg7KA).
## Pre-requisites
You'll need an AWS account in which you have administrator privileges and the ability to deploy a CloudFormation template. The template **will** create an EMR cluster and an S3 bucket that incur charges, so be sure to either shut down the cluster when you're done or delete the CloudFormation stack. To delete the CloudFormation stack, you'll need to:
- Manually delete any EMR Studio Workspaces you created
- Manually empty the S3 bucket created by CloudFormation
- Manually delete the VPC created by CloudFormation due to auto-created rules
## Overview
The included CloudFormation template creates a new VPC and EMR Cluster for you to be able to run the notebooks. An EMR Studio is also created and you can find the Studio URL in the `Outputs` tab of your CloudFormation Stack.
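If you'd rather not click through the console, the stack outputs can also be read programmatically. Here's a minimal sketch built on boto3's CloudFormation `describe_stacks` call. The stack name `modern-data-lakes` is a placeholder for whatever you named your stack, and since the exact output key holding the Studio URL isn't listed here, the helper just returns every output.

```python
def stack_outputs(cfn_client, stack_name):
    """Return a CloudFormation stack's outputs as a {key: value} dict.

    `cfn_client` is expected to be a boto3 CloudFormation client,
    e.g. boto3.client("cloudformation").
    """
    response = cfn_client.describe_stacks(StackName=stack_name)
    # describe_stacks returns a list; looking up by name yields one stack.
    outputs = response["Stacks"][0].get("Outputs", [])
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Example usage (assumes AWS credentials are configured and that the stack
# was named "modern-data-lakes", which is only a placeholder):
#
#   import boto3
#   outputs = stack_outputs(boto3.client("cloudformation"), "modern-data-lakes")
#   for key, value in outputs.items():
#       print(f"{key}: {value}")
```

Passing the client in, rather than creating it inside the function, keeps the helper easy to try against a stub before pointing it at a real account.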
Once the stack is done creating, you'll need to navigate to EMR Studio and create a new workspace attached to the "data-lakes" cluster.
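Before creating the workspace, it can help to confirm that the "data-lakes" cluster is actually up. This is a rough sketch using boto3's EMR `list_clusters` paginator. The function name `find_cluster` is my own, and it assumes only one active cluster carries that name in the current region.

```python
# EMR cluster states that mean the cluster has not been terminated.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def find_cluster(emr_client, name="data-lakes"):
    """Return (cluster_id, state) for the first active cluster named `name`.

    `emr_client` is expected to be a boto3 EMR client,
    e.g. boto3.client("emr"). Returns None if no active cluster matches.
    """
    paginator = emr_client.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=ACTIVE_STATES):
        for cluster in page.get("Clusters", []):
            if cluster["Name"] == name:
                return cluster["Id"], cluster["Status"]["State"]
    return None
```

A state of `WAITING` or `RUNNING` means the cluster is ready to have a workspace attached; `STARTING` or `BOOTSTRAPPING` means it needs a few more minutes.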
Inside the workspace, you can either upload each notebook individually from the [notebooks/](notebooks/) folder or connect to this repository by using the "Git" icon on the left-hand side.
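When you're finished, the "empty the S3 bucket" cleanup step from the Pre-requisites section can be scripted. The sketch below uses boto3's `list_objects_v2` paginator and `delete_objects`. It assumes the bucket is not versioned (a versioned bucket would also need its object versions removed), and the bucket name in the usage comment is a placeholder. Deleting the EMR Studio Workspaces and the VPC is left as a manual step.

```python
def empty_bucket(s3_client, bucket_name):
    """Delete every object in an unversioned S3 bucket; return the count.

    `s3_client` is expected to be a boto3 S3 client,
    e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    deleted = 0
    for page in paginator.paginate(Bucket=bucket_name):
        # Each page holds at most 1000 keys, which is also the
        # per-request limit for delete_objects.
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3_client.delete_objects(Bucket=bucket_name, Delete={"Objects": keys})
            deleted += len(keys)
    return deleted


# Example usage (the bucket name is a placeholder; this permanently
# deletes data, so double-check the name first):
#
#   import boto3
#   count = empty_bucket(boto3.client("s3"), "my-stack-bucket-name")
#   print(f"Deleted {count} objects")
```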