{"id":14982353,"url":"https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer","last_synced_at":"2025-10-29T12:31:31.972Z","repository":{"id":239328108,"uuid":"788116267","full_name":"aws-samples/monitoring-apache-iceberg-table-metadata-layer","owner":"aws-samples","description":"Sample code to collect Apache Iceberg metrics for table monitoring","archived":false,"fork":false,"pushed_at":"2024-08-18T09:15:53.000Z","size":806,"stargazers_count":23,"open_issues_count":1,"forks_count":4,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-02-02T01:31:44.954Z","etag":null,"topics":["apache-iceberg","apache-spark","aws","aws-cloudwatch","aws-glue","aws-lambda","data-quality","monitoring","pyiceberg","sam-cli"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit-0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aws-samples.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-17T20:04:08.000Z","updated_at":"2024-12-31T17:24:12.000Z","dependencies_parsed_at":"2024-09-24T09:00:47.165Z","dependency_job_id":null,"html_url":"https://github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer","commit_stats":{"total_commits":6,"total_committers":3,"mean_commits":2.0,"dds":"0.33333333333333337","last_synced_commit":"088cb00af7a0825d2b717f5e26b3c2793628a5e5"},"previous_names":["aws-samples/monitoring-apache-iceberg-table-metadata-layer"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/aws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aws-samples","download_url":"https://codeload.github.com/aws-samples/monitoring-apache-iceberg-table-metadata-layer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825716,"owners_count":19537114,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-iceberg","apache-spark","aws","aws-cloudwatch","aws-glue","aws-lambda","data-quality","monitoring","pyiceberg","sam-cli"],"created_at":"2024-09-24T14:05:14.982Z","updated_at":"2025-10-29T12:31:31.567Z","avatar_url":"https://github.com/aws-samples.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Monitoring Apache Iceberg Table metadata layer using AWS Lambda, AWS Glue and AWS CloudWatch\n\nThis repository provides you with a sample solution that collects metrics of existing Apache Iceberg tables managed in your Amazon S3 and catalogued to AWS Glue Data Catalog. 
The solution consists of an AWS Lambda deployment package that collects metrics and submits them to AWS CloudWatch. The repository also includes a helper script for deploying a CloudWatch monitoring dashboard to visualize the collected metrics.\n\n### Table of Contents\n- [Solution Tenets](#solution-tenets)\n- [Technical implementation](#technical-implementation)\n- [Metrics collected](#metrics-collected)\n- [Setup](#setup)\n    - [Prerequisites](#prerequisites)\n    - [Build and Deploy](#build-and-deploy)\n    - [Test Locally](#test-locally)\n- [Dependencies](#dependencies)\n- [Clean Up](#clean-up)\n- [Security](#security)\n- [License](#license)\n\n### Solution Tenets\n* The solution is designed to provide time-series metrics for Apache Iceberg tables, so that you can monitor them over time and recognize trends and anomalies.\n* The solution is designed to be lightweight: it collects metrics exclusively from the Apache Iceberg metadata layer without scanning the data layer, and hence without the need for heavy compute capacity.\n* In the future we aim to reduce the dependency on AWS Glue in favor of AWS Lambda compute, once the required features are available in the [PyIceberg](https://py.iceberg.apache.org) library.\n\n### Technical implementation\n\n![Architectural diagram of the solution](assets/arch.png)\n\n* An Amazon EventBridge rule triggers AWS Lambda on every *Glue Data Catalog Table State Change* event. 
This event is emitted every time a transaction is committed to an Apache Iceberg table.\n* The triggered AWS Lambda code aggregates information retrieved from the metadata tables to create [metrics](#metrics-collected) and submits them to Amazon CloudWatch.\n* The AWS Lambda code uses the `pyiceberg` library and [AWS Glue interactive Sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-overview.html) with minimal compute to read the `snapshots`, `partitions` and `files` Apache Iceberg metadata tables with Apache Spark.\n\n\n### Metrics collected\n*Snapshot metrics*\n* snapshot.total_data_files\n* snapshot.added_data_files\n* snapshot.deleted_data_files\n* snapshot.total_delete_files\n* snapshot.added_records\n* snapshot.deleted_records\n* snapshot.added_files_size\n* snapshot.removed_files_size\n* snapshot.added_position_deletes\n\n*Partitions aggregated metrics*\n* partitions.avg_record_count\n* partitions.max_record_count\n* partitions.min_record_count\n* partitions.deviation_record_count\n* partitions.skew_record_count\n* partitions.avg_file_count\n* partitions.max_file_count\n* partitions.min_file_count\n* partitions.deviation_file_count\n* partitions.skew_file_count\n\n*Per-partition metrics*\n* partitions.file_count\n* partitions.record_count\n\n*Files aggregated metrics*\n* files.avg_record_count\n* files.max_record_count\n* files.min_record_count\n* files.avg_file_size\n* files.max_file_size\n* files.min_file_size\n\n## Setup\n\n### Prerequisites\n\n#### Install Docker\n\nThis solution uses Docker as a dependency for AWS SAM CLI.\nTo install Docker, follow the official Docker documentation.\nhttps://docs.docker.com/get-docker/\n\n#### Install SAM CLI\n\nThis solution uses AWS SAM CLI to build, test and deploy the AWS Lambda code that collects the Iceberg table metrics and submits them to AWS CloudWatch.\n\nTo install AWS SAM CLI, follow the AWS documentation. 
\\\nhttps://docs.aws.amazon.com/serverless-application-model/latest/developerguide/install-sam-cli.html\n\n\n#### Configuring IAM permissions for AWS Glue\n\n- [Step 1: Create an IAM policy for the AWS Glue service](https://docs.aws.amazon.com/glue/latest/dg/create-service-policy.html)\n- [Step 2: Create an IAM role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html)\n\n### Build and Deploy\n\n\u003e ! Important - The guidance below uses AWS Serverless Application Model (SAM) for easier packaging and deployment of AWS Lambda. However, if you use your own packaging tool or want to deploy AWS Lambda manually, you can explore the following files:\n\u003e - template.yaml\n\u003e - lambda/requirements.txt\n\u003e - lambda/app.py\n\n#### 1. Build AWS Lambda using AWS SAM CLI\n\nOnce you've installed [Docker](#install-docker) and [SAM CLI](#install-sam-cli), you are ready to build the AWS Lambda. Open your terminal and run the command below.\n\n```bash\nsam build --use-container\n```\n\n#### 2. Deploy AWS Lambda using AWS SAM CLI\n\nOnce the build is finished, you can deploy your AWS Lambda. SAM will upload the packaged code and deploy the AWS Lambda resource using AWS CloudFormation. Run the command below in your terminal.\n\n```bash\nsam deploy --guided\n```\n\n##### Parameters\n\n- `CWNamespace` - The CloudWatch namespace (a container for CloudWatch metrics) to publish the metrics to.\n- `GlueServiceRole` - The ARN of the AWS Glue role you created [earlier](#configuring-iam-permissions-for-aws-glue).\n- `Warehouse` - Required catalog property that determines the root path of the data warehouse on S3. This can be any path in your S3 bucket. Not critical for the solution.\n\n\n#### 3. Configure EventBridge Trigger\n\nIn this section you will configure an EventBridge rule that triggers the Lambda function on every transaction commit to an Apache Iceberg table.\nThe default rule listens to the `Glue Data Catalog Table State Change` event from all tables in the Glue Data Catalog. 
The Lambda code skips non-Iceberg tables.\nIf you want to scope the trigger to specific Iceberg tables instead of collecting metrics from all of them, uncomment `glue_table_names = [\"\u003c\u003cREPLACE TABLE 1\u003e\u003e\", \"\u003c\u003cREPLACE TABLE 2\u003e\u003e\"]` and add the relevant table names.\n\n```python\nimport boto3\nimport json\n\n# Initialize boto3 clients\nsession = boto3.Session(region_name='\u003c\u003cSET CORRECT AWS REGION\u003e\u003e')\nlambda_client = session.client('lambda')\nevents_client = session.client('events')\n\n# Parameters\nlambda_function_arn = '\u003c\u003cREPLACE WITH LAMBDA FUNCTION ARN\u003e\u003e'\nglue_table_names = None\n# glue_table_names = [\"\u003c\u003cREPLACE TABLE 1\u003e\u003e\", \"\u003c\u003cREPLACE TABLE 2\u003e\u003e\"]\n\n# Create EventBridge Rule\nevent_pattern = {\n    \"source\": [\"aws.glue\"],\n    \"detail-type\": [\"Glue Data Catalog Table State Change\"]\n}\n\n# Restrict the rule to the listed tables, if any were provided\nif glue_table_names:\n    event_pattern[\"detail\"] = {\n        \"tableName\": glue_table_names\n    }\nevent_pattern_dump = json.dumps(event_pattern)\nrule_response = events_client.put_rule(\n    Name='IcebergTablesUpdateRule',\n    EventPattern=event_pattern_dump,\n    State='ENABLED'\n)\n# Add Lambda as a target to the EventBridge Rule\nevents_client.put_targets(\n    Rule='IcebergTablesUpdateRule',\n    Targets=[\n        {\n            'Id': '1',\n            'Arn': lambda_function_arn\n        }\n    ]\n)\nprint(f\"Pattern updated = {event_pattern_dump}\")\n```\n\n#### 4. (Optional) Create CloudWatch Dashboard\nOnce your Iceberg table metrics are submitted to CloudWatch, you can start using them to monitor your tables and create alarms. CloudWatch also lets you visualize metrics using CloudWatch Dashboards.\n\n`assets/cloudwatch-dashboard.template.json` is a sample CloudWatch dashboard configuration that uses a fraction of the submitted metrics and combines them with AWS Glue native metrics for Apache Iceberg. 
\nThe template uses Jinja2, so you can generate your own dashboard by providing your parameters.\n\n\n![CloudWatch Dashboard Screenshot](assets/cw-dashboard-screenshot.png)\n\nRun the script below to generate your own CloudWatch dashboard configuration.\nReplace the input values with the relevant [parameters](#parameters) from the previous sections.\n\n```python\nimport json\nfrom jinja2 import Template\n\ndef render_json_template(template_path, data):\n    # Read the Jinja2 template, render it with the given data and parse the result as JSON\n    with open(template_path, 'r') as file:\n        template_text = file.read()\n\n    template = Template(template_text)\n    rendered_json = template.render(data)\n    json_data = json.loads(rendered_json)\n    return json_data\n\n# Data to fill in the template\ndata = {\n    \"CW_NAMESPACE\": \"\u003c\u003cREPLACE\u003e\u003e\",\n    \"REGION\": \"\u003c\u003cREPLACE\u003e\u003e\",\n    \"DBNAME\": \"\u003c\u003cREPLACE\u003e\u003e\",\n    \"TABLENAME\": \"\u003c\u003cREPLACE\u003e\u003e\"\n}\n\n# Path to cloudwatch template file\ntemplate_path = 'assets/cloudwatch-dashboard.template.json'\nrendered_data = render_json_template(template_path, data)\noutput_path = 'assets/cloudwatch-dashboard.rendered.json'\n\nwith open(output_path, 'w') as file:\n    json.dump(rendered_data, file, indent=4)\n\nprint(f\"Your dashboard configuration was successfully generated at {output_path}\")\n```\n\nNow follow these steps to create a CloudWatch dashboard from the rendered JSON.\n\n1. Sign in to the AWS Management Console and navigate to the CloudWatch service.\n2. In the navigation pane on the left, click \"Dashboards\".\n3. Click \"Create Dashboard\" and give it a name. \n4. If the widget configuration popup appears, click \"Cancel\".\n5. Click the \"Actions\" dropdown menu in the top right corner of the dashboard and select \"View/edit source\".\nThis will open a new tab with the source JSON for the dashboard. You can then paste the rendered JSON into the dashboard source to create a custom dashboard resource.\n6. Click \"Update\".\n7. 
The new dashboard is expected to be empty at first. Once your AWS Lambda function generates metrics, they will appear here.\n\n### Test Locally\n\nYou can test the code locally using SAM CLI.\nEnsure you have configured the [right AWS permissions](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) to call CloudWatch and AWS Glue.\n\n```bash\nsam local invoke IcebergMetricsLambda --env-vars .env.local.json\n```\n\n`.env.local.json` - The JSON file that contains values for the Lambda function's environment variables. The Lambda code depends on the environment variables that you pass in the deploy section. You need to create this file and include the relevant [parameters](#parameters) before calling `sam local invoke`.\n\n\n## Dependencies\n\nPyIceberg is a Python implementation for accessing Iceberg tables, without the need of a JVM. \\\nhttps://py.iceberg.apache.org\n\nAWS Serverless Application Model (AWS SAM) \\\nhttps://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html\n\nDocker \\\nhttps://docs.docker.com/get-docker/\n\n## Clean Up\n\n1. Delete the AWS Lambda function by running `sam delete`.\n2. Delete the CloudWatch dashboard.\n3. Delete the EventBridge rule.\n\n## Security\n\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n\n## License\n\nThis library is licensed under the MIT-0 License. See the LICENSE file.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faws-samples%2Fmonitoring-apache-iceberg-table-metadata-layer/lists"}