{"id":17601055,"url":"https://github.com/ev2900/iceberg_update_metadata_script","last_synced_at":"2025-04-13T14:34:04.336Z","repository":{"id":258043100,"uuid":"871198566","full_name":"ev2900/Iceberg_update_metadata_script","owner":"ev2900","description":"Python script that will update S3 file paths in Iceberg metadata files (metadata.json + AVRO)","archived":false,"fork":false,"pushed_at":"2025-03-18T03:23:57.000Z","size":742,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-27T05:33:41.569Z","etag":null,"topics":["apache-iceberg","aws","aws-glue","glue","iceberg","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ev2900.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-11T13:16:24.000Z","updated_at":"2025-03-18T03:24:02.000Z","dependencies_parsed_at":"2025-01-03T11:37:51.216Z","dependency_job_id":"3a94ee25-f0d8-4f62-962e-4eaedcb69262","html_url":"https://github.com/ev2900/Iceberg_update_metadata_script","commit_stats":{"total_commits":42,"total_committers":1,"mean_commits":42.0,"dds":0.0,"last_synced_commit":"eb238bcde1eb2d14cf26bf13bc6bcacd96c039f3"},"previous_names":["ev2900/iceberg_update_metadata_script"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ev2900%2FIceberg_update_metadata_script","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ev2900%2FIceberg_upd
ate_metadata_script/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ev2900%2FIceberg_update_metadata_script/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ev2900%2FIceberg_update_metadata_script/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ev2900","download_url":"https://codeload.github.com/ev2900/Iceberg_update_metadata_script/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248728634,"owners_count":21152260,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-iceberg","aws","aws-glue","glue","iceberg","python"],"created_at":"2024-10-22T12:08:20.043Z","updated_at":"2025-04-13T14:34:04.324Z","avatar_url":"https://github.com/ev2900.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python script to update S3 file paths in Iceberg metadata\n\n\u003cimg width=\"275\" alt=\"map-user\" src=\"https://img.shields.io/badge/cloudformation template deployments-69-blue\"\u003e \u003cimg width=\"85\" alt=\"map-user\" src=\"https://img.shields.io/badge/views-958-green\"\u003e \u003cimg width=\"125\" alt=\"map-user\" src=\"https://img.shields.io/badge/unique visits-336-green\"\u003e\n\nWhen you create an Apache Iceberg table on S3 the Iceberg table has both data files and metadata files. 
If you physically copy the files that make up an Iceberg table to another S3 bucket, the metadata files need to be updated.\n\nThe metadata files (metadata.json files and AVRO files) have fields that reference the S3 paths of the AVRO and data files. When you copy the files that make up an Iceberg table to another S3 bucket, the S3 path references still point to the old S3 bucket the files were copied from.\n\nFor example, I have an Iceberg table in S3 bucket A. I copy the data files and metadata files from bucket A to bucket B. The metadata.json files and AVRO files still contain references to S3 bucket A. We need to update these to bucket B, since this Iceberg table was copied to and is now stored in S3 bucket B.\n\nAfter we update the S3 references we can optionally [register](https://github.com/ev2900/Iceberg_Glue_register_table) the updated metadata.json as a new Glue data catalog entry. An example of using the ```register_table``` command with AWS Glue is available in the [Iceberg_Glue_register_table](https://github.com/ev2900/Iceberg_Glue_register_table) repository.\n\n## Example using Glue python shell job\n\nLaunch the CloudFormation stack below to deploy a Glue python shell script that can be used to update the metadata.json and AVRO files.\n\n[![Launch CloudFormation Stack](https://sharkech-public.s3.amazonaws.com/misc-public/cloudformation-launch-stack.png)](https://console.aws.amazon.com/cloudformation/home#/stacks/new?stackName=iceberg-update-metadata\u0026templateURL=https://sharkech-public.s3.amazonaws.com/misc-public/iceberg_update_metadata_script.yaml)\n\nAfter you deploy the CloudFormation stack, you need to update a section of the Python script. 
Navigate to the [Glue console](https://us-east-1.console.aws.amazon.com/glue/home), click on **ETL jobs**, select the **Update Iceberg Metadata** job, then click on the **Actions** drop-down and choose **Edit jobs**.\n\n\u003cimg width=\"800\" alt=\"quick_setup\" src=\"https://github.com/ev2900/Iceberg_update_metadata_script/blob/main/README/glue_console.png\"\u003e\n\nIn the Glue Python script, you need to configure 4 Python variables.\n\n```\n# Adjust the values of these variables before running the script\ns3_bucket_name_w_metadata_to_update = '\u003cs3 bucket name that has the Iceberg metadata that you want to update\u003e' # ex. register-iceberg-2ut1suuihxyq\nfolder_path_to_metadata = '\u003cpath to the Iceberg metadata folder in the ^ bucket\u003e' # ex. iceberg/iceberg.db/sampledataicebergtable/metadata/\nold_s3_bucket_name_or_path = '\u003cname of S3 bucket or the S3 file path that you want to replace in the Iceberg metadata\u003e' # ex. glue-iceberg-from-jars-s3bucket-2ut1suuihxyq\nnew_s3_bucket_name_or_path = '\u003cwhen you find an instance of ^ what you want to replace it with IE. the name of the S3 bucket or file path the metadata was moved to\u003e' # ex. register-iceberg-2ut1suuihxyq\n```\n\nAfter updating these variables, click **Save** and then **Run**.\n\n\u003cimg width=\"800\" alt=\"quick_setup\" src=\"https://github.com/ev2900/Iceberg_update_metadata_script/blob/main/README/save_run.png\"\u003e\n\nIf you are running this script and updating the S3 references in the metadata.json and AVRO files with the intent of using the [register_table](https://github.com/ev2900/Iceberg_Glue_register_table) command, note that the Python script outputs the path of the latest metadata.json file for the Iceberg table. 
This path can be passed directly to the [register_table](https://github.com/ev2900/Iceberg_Glue_register_table) command.\n\nTo find this output, open the CloudWatch output logs for the Glue job run.\n\n\u003cimg width=\"800\" alt=\"quick_setup\" src=\"https://github.com/ev2900/Iceberg_update_metadata_script/blob/main/README/output_logs_1.png\"\u003e\n\nIf you navigate to the end of the log stream, you will see a log message that provides the file path you can use with [register_table](https://github.com/ev2900/Iceberg_Glue_register_table).\n\n\u003cimg width=\"800\" alt=\"quick_setup\" src=\"https://github.com/ev2900/Iceberg_update_metadata_script/blob/main/README/register_path_logs.png\"\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fev2900%2Ficeberg_update_metadata_script","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fev2900%2Ficeberg_update_metadata_script","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fev2900%2Ficeberg_update_metadata_script/lists"}