https://github.com/bluishglc/apache-hudi-core-conceptions
A set of notebooks to explore and explain core conceptions of Apache Hudi, such as file layouts, file sizing, compaction, clustering and so on.
- Host: GitHub
- URL: https://github.com/bluishglc/apache-hudi-core-conceptions
- Owner: bluishglc
- Created: 2023-03-18T01:13:23.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-08-22T03:11:37.000Z (about 1 year ago)
- Last Synced: 2024-08-02T14:07:57.849Z (3 months ago)
- Language: Jupyter Notebook
- Size: 207 KB
- Stars: 10
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Apache Hudi Core Conceptions
A set of notebooks that explore and explain core concepts of Apache Hudi, such as file layouts, file sizing, compaction, and clustering.
① The notebooks use a public dataset, amazon-reviews-pds, located at s3://amazon-reviews-pds. It is accessible from AWS global regions; users in AWS China regions or outside AWS can download it locally with an S3 client tool.
② The notebooks run on Amazon EMR Studio, a managed notebook service for Amazon EMR. If you do not have an AWS account, you can adapt the notebooks to any notebook environment that supports a Spark kernel.
③ The recommended Spark cluster configuration is 32 vCores and 120 GB of memory or higher; the master node must have at least 100 GB of free disk space.
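Before running the notebooks, it can help to compare a cluster against the recommendation in ③. The following is a minimal sketch using only the Python standard library; the function and constant names are hypothetical and not part of the notebooks, and the vCore and memory figures must be supplied by you (for example, from the EMR console).

```python
import shutil

# Thresholds copied from item ③ above; the constant names are illustrative.
MIN_VCORES = 32
MIN_MEMORY_GB = 120
MIN_MASTER_FREE_DISK_GB = 100


def check_cluster(vcores, memory_gb, master_free_disk_gb):
    """Return a list of shortfalls; an empty list means all thresholds are met."""
    problems = []
    if vcores < MIN_VCORES:
        problems.append(f"vCores: {vcores} < {MIN_VCORES}")
    if memory_gb < MIN_MEMORY_GB:
        problems.append(f"memory: {memory_gb} GB < {MIN_MEMORY_GB} GB")
    if master_free_disk_gb < MIN_MASTER_FREE_DISK_GB:
        problems.append(
            f"master free disk: {master_free_disk_gb:.0f} GB "
            f"< {MIN_MASTER_FREE_DISK_GB} GB"
        )
    return problems


def free_disk_gb(path="."):
    """Free space in GB (10**9 bytes) on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 10**9
```

Run on the master node, `check_cluster(32, 120, free_disk_gb("/"))` returns an empty list when the disk requirement is satisfied.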
---
Update Notes
@2023-08-22: The public dataset "amazon-reviews-pds" at s3://amazon-reviews-pds was recently closed. You can download the raw data from [https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/), but its format and schema differ from the original Parquet files on s3://amazon-reviews-pds, so you need to clean and reformat the raw data yourself.
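As a starting point for that cleanup, the sketch below maps one raw JSON-lines record to a row shaped like the original Parquet schema. It is an assumption-heavy illustration, not the author's method: the raw field names (`reviewerID`, `asin`, `overall`, `vote`, `verified`, `summary`, `reviewText`, `unixReviewTime`) and the target column names (`customer_id`, `product_id`, `star_rating`, and so on) are written from memory of the two datasets and should be checked against the files you actually download. The function name `to_pds_like` is hypothetical.

```python
import json
from datetime import datetime, timezone


def to_pds_like(raw_line, category):
    """Map one raw JSON review line to a dict with pds-style column names.

    All field names on both sides are assumptions; verify them first.
    """
    raw = json.loads(raw_line)

    # Assumed: vote counts are strings that may contain thousands separators.
    votes = str(raw.get("vote", "0")).replace(",", "")

    # Assumed: the review time is a Unix timestamp in seconds.
    ts = raw.get("unixReviewTime")
    review_date = (
        datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        if ts is not None
        else None
    )

    return {
        "customer_id": raw.get("reviewerID"),
        "product_id": raw.get("asin"),
        "star_rating": int(raw["overall"]) if "overall" in raw else None,
        "helpful_votes": int(votes),
        "verified_purchase": "Y" if raw.get("verified") else "N",
        "review_headline": raw.get("summary"),
        "review_body": raw.get("reviewText"),
        "review_date": review_date,
        "year": int(review_date[:4]) if review_date else None,
        "product_category": category,
    }
```

The category is passed in by the caller because it is assumed here to come from the name of the downloaded file rather than from the record itself. Columns of the original schema that have no obvious counterpart in the raw data are left out rather than guessed.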