https://github.com/databricks/iceberg-aws-ext
- Host: GitHub
- URL: https://github.com/databricks/iceberg-aws-ext
- Owner: databricks
- Created: 2022-11-02T02:31:13.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-11-03T19:40:51.000Z (over 3 years ago)
- Last Synced: 2025-04-09T16:11:53.378Z (12 months ago)
- Language: Java
- Size: 72.3 KB
- Stars: 3
- Watchers: 8
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Iceberg S3FileIO with corruption check
This is a modified version of S3FileIO from Iceberg 1.0.0 that runs a corruption check on Parquet files before
uploading them to S3. The actual check lives in `ParquetUtil.java`, which simply reads and decompresses all of the
data in the file. (Aside from package renaming, the only file modified from Iceberg is `S3OutputStream.java`.)
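The real check decompresses every page through Iceberg's Parquet reader, which needs the full Parquet dependency tree. As a simplified, self-contained illustration of the same idea (validate locally before the bytes ever reach S3), the sketch below only verifies the `PAR1` magic bytes at the head and tail of the file. The class and method names here are hypothetical, not from this repo, and this is a far weaker check than the actual one: it catches truncated or zero-length files, not corrupted pages.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ParquetMagicCheck {
    // Every valid Parquet file starts and ends with these four bytes.
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file begins and ends with the Parquet magic bytes. */
    public static boolean looksLikeValidParquet(String path) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            if (file.length() < 2L * MAGIC.length) {
                return false; // too short to hold both header and footer magic
            }
            byte[] head = new byte[MAGIC.length];
            byte[] tail = new byte[MAGIC.length];
            file.readFully(head);
            file.seek(file.length() - MAGIC.length);
            file.readFully(tail);
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }
}
```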
There are a couple of limitations. First, multipart upload is disabled, so the entire file is written
locally before upload. Second, the check relies on the file having a `.parquet` extension, to avoid
verifying metadata or other non-Parquet files.
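The extension gate can be as simple as a suffix check. A minimal sketch (the class and method names are hypothetical, not taken from this repo):

```java
public class UploadGate {
    /**
     * Only files with a .parquet extension get the corruption check;
     * metadata JSON, manifests, and other non-Parquet files pass through
     * untouched. Case-sensitive, matching the common lowercase convention.
     */
    public static boolean shouldVerify(String location) {
        return location != null && location.endsWith(".parquet");
    }
}
```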
To try it out, include this jar on your classpath and set your catalog's FileIO implementation via
a property, e.g. `spark.sql.catalog.tabular.io-impl=io.tabular.iceberg.aws.s3.S3FileIO`.
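In a Spark job this might look like the following. Only the `io-impl` property comes from this README; the jar name, the catalog name `tabular`, the `SparkCatalog` wiring, and the job script are placeholders for illustration.

```shell
# Hypothetical invocation: adjust the jar path and catalog name for your setup.
spark-submit \
  --jars iceberg-aws-ext.jar \
  --conf spark.sql.catalog.tabular=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.tabular.io-impl=io.tabular.iceberg.aws.s3.S3FileIO \
  your_job.py
```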
Disclaimer: this was very lightly tested and not at scale, so use it at your own risk. It is meant
more as an example of one way to add a corruption check.