https://github.com/bluishglc/apache-hudi-core-conceptions

A set of notebooks to explore and explain core conceptions of Apache Hudi, such as file layouts, file sizing, compaction, clustering and so on.

Host: GitHub
URL: https://github.com/bluishglc/apache-hudi-core-conceptions
Owner: bluishglc
Created: 2023-03-18T01:13:23.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2023-08-22T03:11:37.000Z (about 1 year ago)
Last Synced: 2024-08-02T14:07:57.849Z (3 months ago)
Language: Jupyter Notebook
Size: 207 KB
Stars: 10
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Apache Hudi Core Conceptions

A set of notebooks that explore and explain the core concepts of Apache Hudi, such as file layouts, file sizing, compaction, and clustering.

① The notebooks work with a public dataset, amazon-reviews-pds, located at s3://amazon-reviews-pds. It is accessible from AWS global regions. If you are in an AWS China region or do not use AWS, you can download it to local storage with S3 client tools.

② The notebooks run on Amazon EMR Studio, a managed notebook service for Amazon EMR. If you do not have an AWS account, you can adapt the notebooks to any notebook environment that supports a Spark kernel.

③ The recommended Spark cluster configuration is at least 32 vCores and 120 GB of memory. The master node must have more than 100 GB of free disk space.

---

Update Notes

@2023-08-22: The public dataset "amazon-reviews-pds" at s3://amazon-reviews-pds was recently closed. You can download the raw data from [https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/), but its format and schema differ from the original Parquet files on s3://amazon-reviews-pds, so you need to clean and reformat the raw data yourself.
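
As a starting point for that cleanup, here is a minimal sketch of mapping one raw JSON review line to a row that resembles the original Parquet schema. It is not part of the notebooks. The function name `to_pds_like_row` is hypothetical, and the field names on both sides (`overall`, `reviewerID`, `asin`, `unixReviewTime`, `vote` in the raw data; `customer_id`, `product_id`, `star_rating` and so on in the target) are assumptions based on the commonly published schemas, so check them against the files you actually download. Only a subset of the original columns is covered.

```python
import json
from datetime import datetime, timezone


def to_pds_like_row(raw_line: str, product_category: str) -> dict:
    """Map one raw JSON review line to a dict with amazon-reviews-pds-like columns.

    All field names are assumptions; adjust them to the real files.
    """
    rec = json.loads(raw_line)

    # The raw data is assumed to carry a Unix timestamp; the target uses a date.
    review_date = None
    ts = rec.get("unixReviewTime")
    if ts is not None:
        review_date = (
            datetime.fromtimestamp(int(ts), tz=timezone.utc).date().isoformat()
        )

    # Vote counts are assumed to be strings such as "1,234" and may be absent.
    votes = str(rec.get("vote") or "0").replace(",", "")

    overall = rec.get("overall")
    return {
        "customer_id": rec.get("reviewerID"),
        "product_id": rec.get("asin"),
        "star_rating": int(overall) if overall is not None else None,
        "helpful_votes": int(votes),
        "verified_purchase": "Y" if rec.get("verified") else "N",
        "review_headline": rec.get("summary"),
        "review_body": rec.get("reviewText"),
        "review_date": review_date,
        "year": int(review_date[:4]) if review_date else None,
        # The raw files are assumed to be split by category, so the category
        # is passed in by the caller rather than read from the record.
        "product_category": product_category,
    }
```

The same mapping could be applied per line in a Spark job and the result written out as Parquet, partitioned by `product_category`, before running the notebooks against it.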