{"id":19705536,"url":"https://github.com/treeverse/lakefs-hooks","last_synced_at":"2025-10-23T16:57:52.717Z","repository":{"id":45037584,"uuid":"340130126","full_name":"treeverse/lakeFS-hooks","owner":"treeverse","description":"a simple lakeFS webhook for pre-commit and pre-merge validation of data objects","archived":false,"fork":false,"pushed_at":"2023-11-09T23:07:11.000Z","size":55,"stargazers_count":11,"open_issues_count":1,"forks_count":0,"subscribers_count":9,"default_branch":"main","last_synced_at":"2024-04-16T10:58:31.205Z","etag":null,"topics":["data-engineering","data-lake","lakefs"],"latest_commit_sha":null,"homepage":"https://www.lakefs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/treeverse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-02-18T17:45:10.000Z","updated_at":"2024-03-16T15:52:51.000Z","dependencies_parsed_at":"2023-02-09T02:45:15.581Z","dependency_job_id":null,"html_url":"https://github.com/treeverse/lakeFS-hooks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2FlakeFS-hooks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2FlakeFS-hooks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2FlakeFS-hooks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2FlakeFS-hooks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/treeverse","download_url":"https://codeload.github.com/treeverse/l
akeFS-hooks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224178267,"owners_count":17268852,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-lake","lakefs"],"created_at":"2024-11-11T21:28:48.858Z","updated_at":"2025-10-23T16:57:47.686Z","avatar_url":"https://github.com/treeverse.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lakeFS-hooks\n\nThis repository provides a set of simple [lakeFS](https://www.lakefs.io/) webhooks for pre-commit and pre-merge validation of data objects.\n\nBy setting these rules, a lakeFS-based data lake can ensure that production branches only ever contain valid, quality data, while still allowing others to experiment with untested versions on isolated branches.\n\n## Table of Contents\n\n- [What's included](#whats-included)\n- [Installation](#installation)\n- [Building a Docker image](#building-a-docker-image)\n- [Running a Server Locally](#running-a-server-locally)\n- [Running the Docker Container](#running-the-docker-container)\n- [Support](#support)\n- [Community](#community)\n- [Licensing](#licensing)\n\n\n## What's included\n\nThis project contains a few basic building blocks that should make building custom lakeFS pre-merge/pre-commit hooks easier:\n\n1. A very terse lakeFS Python client that provides basic reading and diffing functions\n1. 
A simple, naive, read-only PyArrow FileSystem implementation for reading data objects from lakeFS.\n   This allows using PyArrow to read Parquet, ORC and other formats - to inspect their metadata or to construct queryable tables for testing and validation\n1. A set of reusable webhooks that could be used for common CI requirements (see below)\n1. A Dockerfile to containerize a webhook server for deployment\n\n### Included Webhooks\n\n#### File Format Validator\n\nThis webhook checks new files to ensure they are in one of a set of allowed data formats. It can be scoped to a certain prefix.\n\nExample usage as a pre-merge hook in lakeFS:\n\n```yaml\n---\nname: ParquetOnlyInProduction\ndescription: This webhook ensures that only parquet files are written under production/\non:\n  pre-merge:\n    branches:\n      - master\nhooks:\n  - id: production_format_validator\n    type: webhook\n    description: Validate file formats\n    properties:\n      url: \"http://\u003chost:port\u003e/webhooks/format\"\n      query_params:\n        allow: [\"parquet\", \"delta_lake\"]\n        prefix: production/\n```\n\n#### Basic File Schema Validator\n\nThis webhook reads new Parquet and ORC files to ensure they don't contain any column names (or name prefixes) from a block list.\nThis is useful when we want to avoid accidental PII exposure.\n\nExample usage as a pre-merge hook in lakeFS:\n\n```yaml\n---\nname: NoUserColumnsUnderPub\ndescription: \u003e-\n  This webhook ensures that files with columns \n  beginning with \"user_\" can't be written to public/ \non:\n  pre-merge:\n    branches:\n      - master\nhooks:\n  - id: pub_prevent_user_columns\n    type: webhook\n    description: Ensure no user_* columns under public/\n    properties:\n      url: \"http://\u003chost:port\u003e/webhooks/schema\"\n      query_params:\n        disallow: [\"user_\", \"private_\"]\n        prefix: public/\n```\n\n#### Partition Dirty Checker\n\nIn certain cases, we want to ensure partitions (or 
directories) are completely immutable.\nThis means we allow writing to a directory only if:\n   - we overwrite all the files in it\n   - we add files but also delete all previous content\n\nIn this case, if files were added or replaced, but some previous content remains, we consider it \"dirty\" and fail the commit.\nThis check is smart enough to disregard empty files. \n\nExample usage as a pre-commit hook in lakeFS:\n\n```yaml\n---\nname: NoDirtyPartitionsInProduction\ndescription: Check all partitions remain immutable under tables/hive/\non:\n  pre-commit:\n    branches:\n      - \"*\"\nhooks:\n  - id: hive_ensure_immutable\n    type: webhook\n    description: Check all hive partitions are either fully written or fully replaced\n    properties:\n      url: \"http://\u003chost:port\u003e/webhooks/dirty_check\"\n      query_params:\n        prefix: tables/hive/\n```\n\n#### Commit Metadata Validator\n\nIn production, we want to ensure commits carry enough metadata to be useful for lineage and traceability.\n\nExample usage as a pre-commit hook in lakeFS:\n\n```yaml\n---\nname: EnsureProductionCommitMetadata\ndescription: \u003e-\n  Check commits that write to production/ that \n  they contain a set of mandatory metadata fields.\n  These fields must not be empty.\non:\n  pre-commit:\n    branches:\n      - \"*\"\nhooks:\n  - id: production_ensure_commit_metadata\n    type: webhook\n    description: Check all commits that write to production/ for mandatory metadata fields\n    properties:\n      url: \"http://\u003chost:port\u003e/webhooks/commit_metadata\"\n      query_params:\n        prefix: production/\n        fields: [airflow_dag_run_url, job_git_commit, update_sla, sources]\n```\n\n\n## Installation\n\nTo get started, clone this repo locally, as you might also want to modify it to your needs:\n\n```sh\n$ git clone https://github.com/treeverse/lakeFS-hooks.git\n$ cd lakeFS-hooks/\n# edit server.py\n```\n\n## Building a Docker image\n\nTo build a docker image, 
run the following command:\n\n```sh\n$ docker build -t lakefs-hooks:latest .\n# optionally, tag it and push it to a repository for deployment\n```\n\n## Running a server locally\n\n```sh\n# You should probably be using something like virtualenv/pipenv\n$ pip install -r requirements.txt\n$ export LAKEFS_SERVER_ADDRESS=\"http://lakefs.example.com\"\n$ export LAKEFS_ACCESS_KEY_ID=\"\u003caccess key ID of a lakeFS user\u003e\"\n$ export LAKEFS_SECRET_ACCESS_KEY=\"\u003csecret access key for the given key ID\u003e\"\n$ flask run\n```\n\nYou can now test it by passing an example pre-merge event using cURL:\n\n```sh\ncurl -v -XPOST -H 'Content-Type: application/json' \\\n  -d'{\n       \"event_type\": \"pre-merge\",\n       \"event_time\": \"2021-02-17T11:04:18Z\",\n       \"action_name\": \"test action\",\n       \"hook_id\": \"hook_id\",\n       \"repository_id\": \"my-lakefs-repository\",\n       \"branch_id\": \"main\",\n       \"source_ref\": \"220158b4b316e536e024aaaaf76b2377a6c71dfd6b974ca3a49354a9bdd0dbc3\",\n       \"commit_message\": \"a commit message\",\n       \"committer\": \"user1\"\n  }' 'http://localhost:5000/webhooks/schema'\n```\n\n\n## Running the Docker Container\n\nSee [Building a Docker Image](#building-a-docker-image) above for build instructions.\n\nTo run the resulting image using the Docker command line:\n\n```shell\n$ docker run \\\n    -e LAKEFS_SERVER_ADDRESS='http://lakefs.example.com' \\\n    -e LAKEFS_ACCESS_KEY_ID='\u003caccess key ID of a lakeFS user\u003e' \\\n    -e LAKEFS_SECRET_ACCESS_KEY='\u003csecret access key for the given key ID\u003e' \\\n    -p 5000:5000 \\\n    lakefs-hooks\n```\n\n## Support\n\nPlease [open an issue](https://github.com/treeverse/lakeFS-Flask-Webhook/issues/new) for support or contributions.\n\nFor more information on [lakeFS](https://www.lakefs.io/), please see the [official lakeFS documentation](https://docs.lakefs.io/).\n\n## Community\n\nStay up to date and get lakeFS support via:\n\n- 
[Slack](https://join.slack.com/t/lakefs/shared_invite/zt-ks1fwp0w-bgD9PIekW86WF25nE_8_tw) (to get help from our team and other users).\n- [Mastodon](https://data-folks.masto.host/@lakeFS) (follow for open conversation on mastodon data-folks and updates)\n- [Twitter](https://twitter.com/lakeFS) (follow for updates and news)\n- [YouTube](https://www.youtube.com/channel/UCZiDUd28ex47BTLuehb1qSA) (learn from video tutorials)\n- [Contact us](https://lakefs.io/contact-us/) (for anything)\n\n## Licensing\n\nlakeFS-hooks is completely free and open source and licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Flakefs-hooks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftreeverse%2Flakefs-hooks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Flakefs-hooks/lists"}