https://github.com/ev2900/iceberg_update_metadata_script
Python script that will update S3 file paths in Iceberg metadata files (metadata.json + AVRO)
- Host: GitHub
- URL: https://github.com/ev2900/iceberg_update_metadata_script
- Owner: ev2900
- Created: 2024-10-11T13:16:24.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-03-18T03:23:57.000Z (7 months ago)
- Last Synced: 2025-03-27T05:33:41.569Z (7 months ago)
- Topics: apache-iceberg, aws, aws-glue, glue, iceberg, python
- Language: Python
- Homepage:
- Size: 725 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Python script to update S3 file paths in Iceberg metadata
When you create an Apache Iceberg table on S3, the table consists of both data files and metadata files. If you physically copy the files that make up an Iceberg table to another S3 bucket, the metadata files need to be updated.
The metadata files (metadata.json files and AVRO files) contain fields that reference the S3 paths of the AVRO and data files. When you copy the files that make up an Iceberg table to another S3 bucket, those S3 path references still point to the old S3 bucket the files were copied from.
For example, say I have an Iceberg table in S3 bucket A. I copy the data files and metadata files from bucket A to bucket B. The metadata.json and AVRO files still contain references to S3 bucket A. We need to update these references to bucket B, since the Iceberg table is now stored in S3 bucket B.
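To illustrate the idea, here is a minimal sketch of the rewrite applied to a metadata.json document. The fragment below is a hypothetical, heavily trimmed example (real Iceberg metadata files contain many more fields), and the bucket names are made up, but the core operation is the same: replace every reference to the old bucket with the new one.

```python
import json

# Hypothetical, trimmed metadata.json fragment. Real Iceberg metadata
# files contain many more fields, but the S3 references look like this.
metadata = {
    "location": "s3://bucket-a/iceberg/db/table",
    "snapshots": [
        {"manifest-list": "s3://bucket-a/iceberg/db/table/metadata/snap-1.avro"}
    ],
}

# Rewrite every S3 reference from the old bucket to the new one by
# round-tripping the document through a plain string replacement.
text = json.dumps(metadata)
updated = json.loads(text.replace("s3://bucket-a/", "s3://bucket-b/"))

print(updated["location"])  # s3://bucket-b/iceberg/db/table
```

The same substitution has to be applied to the AVRO manifest files as well, which is what the script in this repository automates.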
After we update the S3 references, we can optionally [register](https://github.com/ev2900/Iceberg_Glue_register_table) the updated metadata.json as a new Glue Data Catalog entry. An example of using the `register_table` command with AWS Glue is available in the [Iceberg_Glue_register_table](https://github.com/ev2900/Iceberg_Glue_register_table) repository.
## Example using Glue python shell job
Launch the CloudFormation stack below to deploy a Glue python shell script that can be used to update the metadata.json and AVRO files.
[Launch CloudFormation stack](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=iceberg-update-metadata&templateURL=https://sharkech-public.s3.amazonaws.com/misc-public/iceberg_update_metadata_script.yaml)
After you deploy the CloudFormation stack, you need to update a section of the Python script. Navigate to the [Glue console](https://us-east-1.console.aws.amazon.com/glue/home), click **ETL jobs**, select the **Update Iceberg Metadata** job, then open the **Actions** drop-down and click **Edit jobs**.
In the Glue Python script, you need to configure four Python variables.
```
# Adjust the values of these variables before running the script
s3_bucket_name_w_metadata_to_update = '' # ex. register-iceberg-2ut1suuihxyq
folder_path_to_metadata = '' # ex. iceberg/iceberg.db/sampledataicebergtable/metadata/
old_s3_bucket_name_or_path = '' # ex. glue-iceberg-from-jars-s3bucket-2ut1suuihxyq
new_s3_bucket_name_or_path = '' # ex. register-iceberg-2ut1suuihxyq
```
After updating these variables, click **Save** and then **Run**.
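The core of the job is straightforward string rewriting. The sketch below is a minimal, hypothetical illustration (the entry structure mimics what an AVRO manifest might look like once decoded into Python dicts, e.g. with an Avro library; the bucket names and `rewrite_path` helper are made up for this example, not taken from the actual script):

```python
def rewrite_path(path: str, old: str, new: str) -> str:
    """Swap the old bucket/prefix for the new one in a single S3 path."""
    return path.replace(old, new)

# Hypothetical manifest entries, shaped like decoded AVRO manifest
# records: each one carries a data file's full S3 path.
entries = [
    {"data_file": {"file_path": "s3://old-bucket/table/data/part-0.parquet"}},
    {"data_file": {"file_path": "s3://old-bucket/table/data/part-1.parquet"}},
]

# Apply the rewrite in place to every path reference.
for entry in entries:
    entry["data_file"]["file_path"] = rewrite_path(
        entry["data_file"]["file_path"], "old-bucket", "new-bucket"
    )
```

In the actual Glue job, the four variables above tell the script which bucket and folder to scan and which old/new bucket names to substitute when it rewrites each metadata.json and AVRO file.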
If you are running this script with the intent of using the [register_table](https://github.com/ev2900/Iceberg_Glue_register_table) command, note that the Python script outputs the path of the latest metadata.json file for the Iceberg table. This path can be passed directly to the [register_table](https://github.com/ev2900/Iceberg_Glue_register_table) command.
To find this output, open the CloudWatch output logs for the Glue job run. If you navigate to the end of the log stream, you will see a log message with the file path you can use with [register_table](https://github.com/ev2900/Iceberg_Glue_register_table).
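Iceberg names its metadata files with an increasing numeric prefix (e.g. `00003-<uuid>.metadata.json`), so "latest" can be determined by that version number. The helper below is a minimal sketch of that selection, not the actual function used by the script; the key names in the example are made up:

```python
import re

def latest_metadata_file(keys):
    """Pick the highest-versioned *.metadata.json key from a list of
    S3 object keys. Iceberg names these files like
    00003-<uuid>.metadata.json, so the numeric prefix gives the order."""
    meta = [k for k in keys if k.endswith(".metadata.json")]

    def version(key):
        m = re.search(r"(\d+)-[^/]*\.metadata\.json$", key)
        return int(m.group(1)) if m else -1

    return max(meta, key=version)

# Hypothetical object keys under the table's metadata/ folder.
keys = [
    "metadata/00001-aaa.metadata.json",
    "metadata/00003-ccc.metadata.json",
    "metadata/00002-bbb.metadata.json",
    "metadata/snap-123.avro",
]
print(latest_metadata_file(keys))  # metadata/00003-ccc.metadata.json
```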