https://github.com/ev2900/iceberg_update_metadata_script
Python script that will update S3 file paths in Iceberg metadata files (metadata.json + AVRO)
- Host: GitHub
- URL: https://github.com/ev2900/iceberg_update_metadata_script
- Owner: ev2900
- Created: 2024-10-11T13:16:24.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-03-18T03:23:57.000Z (7 months ago)
- Last Synced: 2025-03-27T05:33:41.569Z (7 months ago)
- Topics: apache-iceberg, aws, aws-glue, glue, iceberg, python
- Language: Python
- Homepage:
- Size: 725 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Python script to update S3 file paths in Iceberg metadata
When you create an Apache Iceberg table on S3, the table consists of both data files and metadata files. If you physically copy the files that make up an Iceberg table to another S3 bucket, the metadata files need to be updated.
The metadata files (metadata.json files and AVRO files) contain fields that reference the S3 paths of the AVRO and data files. When you copy the files that make up an Iceberg table to another S3 bucket, those S3 path references still point to the old S3 bucket the files were copied from.
For example, say I have an Iceberg table in S3 bucket A. I copy the data files and metadata files from bucket A to bucket B. The metadata.json and AVRO files still contain references to S3 bucket A. We need to update these references to bucket B, since the Iceberg table is now stored in S3 bucket B.
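To illustrate the idea, here is a minimal sketch of the rewrite applied to a metadata.json document. The fragment below is a hypothetical, heavily trimmed example (real Iceberg metadata files contain many more fields), and the bucket names are made up, but the core operation is the same: replace every reference to the old bucket with the new one.

```python
import json

# Hypothetical, trimmed metadata.json fragment. Real Iceberg metadata
# files contain many more fields, but the S3 references look like this.
metadata = {
    "location": "s3://bucket-a/iceberg/db/table",
    "snapshots": [
        {"manifest-list": "s3://bucket-a/iceberg/db/table/metadata/snap-1.avro"}
    ],
}

# Rewrite every S3 reference from the old bucket to the new one by
# round-tripping the document through a plain string replacement.
text = json.dumps(metadata)
updated = json.loads(text.replace("s3://bucket-a/", "s3://bucket-b/"))

print(updated["location"])  # s3://bucket-b/iceberg/db/table
```

The same substitution has to be applied to the AVRO manifest files as well, which is what the script in this repository automates.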
After we update the S3 references, we can optionally [register](https://github.com/ev2900/Iceberg_Glue_register_table) the updated metadata.json as a new Glue Data Catalog entry. An example of using the `register_table` command with AWS Glue is available in the [Iceberg_Glue_register_table](https://github.com/ev2900/Iceberg_Glue_register_table) repository.
## Example using Glue python shell job
Launch the CloudFormation stack below to deploy a Glue python shell script that can be used to update the metadata.json and AVRO files.
[Launch CloudFormation stack](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=iceberg-update-metadata&templateURL=https://sharkech-public.s3.amazonaws.com/misc-public/iceberg_update_metadata_script.yaml)
After you deploy the CloudFormation stack, you need to update a section of the Python script. Navigate to the [Glue console](https://us-east-1.console.aws.amazon.com/glue/home), click **ETL jobs**, select the **Update Iceberg Metadata** job, then open the **Actions** drop-down and click **Edit jobs**.
In the Glue Python script, you need to configure four Python variables.
```
# Adjust the values of these variables before running the script
s3_bucket_name_w_metadata_to_update = '' # ex. register-iceberg-2ut1suuihxyq
folder_path_to_metadata = '' # ex. iceberg/iceberg.db/sampledataicebergtable/metadata/
old_s3_bucket_name_or_path = '' # ex. glue-iceberg-from-jars-s3bucket-2ut1suuihxyq
new_s3_bucket_name_or_path = '' # ex. register-iceberg-2ut1suuihxyq
```
After updating these variables, click **Save** and then **Run**.
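The core of the job is straightforward string rewriting. The sketch below is a minimal, hypothetical illustration (the entry structure mimics what an AVRO manifest might look like once decoded into Python dicts, e.g. with an Avro library; the bucket names and `rewrite_path` helper are made up for this example, not taken from the actual script):

```python
def rewrite_path(path: str, old: str, new: str) -> str:
    """Swap the old bucket/prefix for the new one in a single S3 path."""
    return path.replace(old, new)

# Hypothetical manifest entries, shaped like decoded AVRO manifest
# records: each one carries a data file's full S3 path.
entries = [
    {"data_file": {"file_path": "s3://old-bucket/table/data/part-0.parquet"}},
    {"data_file": {"file_path": "s3://old-bucket/table/data/part-1.parquet"}},
]

# Apply the rewrite in place to every path reference.
for entry in entries:
    entry["data_file"]["file_path"] = rewrite_path(
        entry["data_file"]["file_path"], "old-bucket", "new-bucket"
    )
```

In the actual Glue job, the four variables above tell the script which bucket and folder to scan and which old/new bucket names to substitute when it rewrites each metadata.json and AVRO file.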
If you are running this script with the intent of using the [register_table](https://github.com/ev2900/Iceberg_Glue_register_table) command, note that the Python script outputs the path of the latest metadata.json file for the Iceberg table. This path can be passed directly to the [register_table](https://github.com/ev2900/Iceberg_Glue_register_table) command.
To find this output, open the CloudWatch output logs for the Glue job run. If you navigate to the end of the log stream, you will see a log message with the file path you can use with [register_table](https://github.com/ev2900/Iceberg_Glue_register_table).
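Iceberg names its metadata files with an increasing numeric prefix (e.g. `00003-<uuid>.metadata.json`), so "latest" can be determined by that version number. The helper below is a minimal sketch of that selection, not the actual function used by the script; the key names in the example are made up:

```python
import re

def latest_metadata_file(keys):
    """Pick the highest-versioned *.metadata.json key from a list of
    S3 object keys. Iceberg names these files like
    00003-<uuid>.metadata.json, so the numeric prefix gives the order."""
    meta = [k for k in keys if k.endswith(".metadata.json")]

    def version(key):
        m = re.search(r"(\d+)-[^/]*\.metadata\.json$", key)
        return int(m.group(1)) if m else -1

    return max(meta, key=version)

# Hypothetical object keys under the table's metadata/ folder.
keys = [
    "metadata/00001-aaa.metadata.json",
    "metadata/00003-ccc.metadata.json",
    "metadata/00002-bbb.metadata.json",
    "metadata/snap-123.avro",
]
print(latest_metadata_file(keys))  # metadata/00003-ccc.metadata.json
```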