{"id":19705545,"url":"https://github.com/treeverse/cloud-sample-repo-hooks","last_synced_at":"2026-04-15T00:32:17.070Z","repository":{"id":143537216,"uuid":"614224699","full_name":"treeverse/cloud-sample-repo-hooks","owner":"treeverse","description":null,"archived":false,"fork":false,"pushed_at":"2023-04-19T07:44:30.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-02-27T17:31:35.921Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/treeverse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-15T06:36:55.000Z","updated_at":"2023-03-15T06:38:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"597db7c8-09d8-4409-9d9c-8590bf396426","html_url":"https://github.com/treeverse/cloud-sample-repo-hooks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/treeverse/cloud-sample-repo-hooks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fcloud-sample-repo-hooks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fcloud-sample-repo-hooks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fcloud-sample-repo-hooks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fcloud-sample-repo-hooks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/treeverse","download_url":"https://codeload.github.com/treeverse/cloud-sample-repo-hooks/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Fcloud-sample-repo-hooks/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31821509,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T18:05:02.291Z","status":"ssl_error","status_checked_at":"2026-04-14T18:05:01.765Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T21:28:52.045Z","updated_at":"2026-04-15T00:32:17.042Z","avatar_url":"https://github.com/treeverse.png","language":"Lua","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Welcome to lakeFS Cloud Sample Repository\n\n## What is lakeFS?\nlakeFS is an open source data version control system for data lakes.\nIt enables zero copy Dev / Test isolated environments, continuous quality validation, atomic rollback on bad data, reproducibility, and more.\n\n## Introduction\nWelcome to lakeFS sample-repo!\nWe've included step-by-step instructions, [pre-loaded data](#data-sets-examples) sets and [hooks](https://docs.lakefs.io/hooks/overview.html) to get familiar with lakeFS [versioning model](https://docs.lakefs.io/understand/model.html) and its capabilities.\n\nWe'll start by going over [lakeFS basic 
capabilities](#getting-started), such as creating a branch, uploading an object and committing that object.\n\nWe also included instructions on how to use [lakeFS Hooks](#diving-into-hooks), which demonstrate how to govern the data you merge into your main branch, for instance by making sure no PII is present on the main branch and that every commit to main includes certain metadata attributes.\n\n## Getting Started\n\n\u003e **_NOTE:_** The hooks example below can be done using either the CLI or the UI. If you'd like to use the CLI, make sure to have [lakectl](https://docs.lakefs.io/reference/commands.html#configuring-credentials-and-api-endpoint) and [spark s3a](https://docs.lakefs.io/integrations/spark.html#access-lakefs-using-the-s3a-gateway) configured correctly.\n\nWe'll start by covering lakeFS basics.\n\nFirst, let's create a branch:\n```sh\n# CLI\n$ lakectl branch create lakefs://sample-repo/my-branch -s lakefs://sample-repo/main\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Branches\" -\u003e click \"Create Branch\" -\u003e fill in \"my-branch\" for Branch Name -\u003e click \"Create\".\n```\n\nGreat! You've created your first branch, and you should now see it in the list of branches!\n\nNow let's try uploading an object to the `my-branch` branch. 
Choose or create a local file in any format to test the upload and use its path in place of `/path/to/some/file` below:\n\n```sh\n# CLI\n$ lakectl fs upload lakefs://sample-repo/my-branch/file -s /path/to/some/file\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Objects\" -\u003e pick \"my-branch\" from the branch drop down -\u003e click \"Upload Object\" -\u003e click \"Choose file\" and pick a file to upload -\u003e click \"Upload\".\n```\n\nNow that we've uploaded the file, you'll first see it in the staging area (uncommitted):\n```sh\n# CLI\nlakectl diff lakefs://sample-repo/my-branch\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Objects\" -\u003e pick \"my-branch\" from the branch drop down -\u003e click \"Uncommitted changes\".\n```\n\nLet's commit the file:\n```sh\n# CLI\nlakectl commit lakefs://sample-repo/my-branch --message \"Add test file\"\n\n# UI\nStill within the \"my-branch\" Uncommitted Changes -\u003e click \"Commit Changes\" -\u003e click \"Commit Changes\" once again.\n```\n\nLet's explore some data.\n\u003e **_NOTE:_** For this example we'll demonstrate how to query parquet files using `DuckDB` from within the UI.\n\n* Within the \"sample-repo\" repository, pick the \"main\" branch from the drop down\n* Click the \"world-cities-database-population\" directory, and the \"raw\" directory within that\n* Click the \"part-00000-tid-1091049596617008918-5f8b8e42-730c-4cc2-ba06-3e5f4a4acff6-22194-1-c000.snappy.parquet\" parquet file.\n\nNow you should see a standard SQL query displaying the parquet file as a table, with its columns.\n\nLet's try to get some insights from this parquet file by finding out how many people live in the biggest city in each country. 
Replace the SQL query with the one below and click \"Execute\":\n\n```sql\nSELECT\n  country_name_en,\n  max(population) AS biggest_city_pop\nFROM\n  read_parquet(lakefs_object('sample-repo', 'main', 'world-cities-database-population/raw/part-00000-tid-1091049596617008918-5f8b8e42-730c-4cc2-ba06-3e5f4a4acff6-22194-1-c000.snappy.parquet'))\nGROUP BY\n  country_name_en\nORDER BY\n  biggest_city_pop DESC\n```\n\nThat was cool, wasn't it?\n\n## Diving Into Hooks\n\nLet's start by trying our first hook, which ensures that certain metadata is present for any commit to the `main` branch.\n\nUpload a file (you can use the same one as above):\n\n```sh\n# CLI\n$ lakectl fs upload lakefs://sample-repo/main/test -s /path/to/some/file\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Upload object\" -\u003e click \"Choose file\" -\u003e pick a file from your filesystem -\u003e click \"Upload\".\n```\n\nNow that we've uploaded the file, you'll first see it in the staging area (uncommitted):\n```sh\n# CLI\nlakectl diff lakefs://sample-repo/main\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Uncommitted changes\".\n```\n\nGreat! Now let's try to commit that file:\n```sh\n# CLI\nlakectl commit lakefs://sample-repo/main --message \"Test Commit\"\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Uncommitted changes\" -\u003e click \"Commit Changes\" -\u003e click \"Commit Changes\".\n```\n\nOuch! We were caught in the act of trying to commit to `main` without the required attributes `owner` and `environment`! 
\n\n```sh\nBranch: lakefs://sample-repo/main\npre-commit hook aborted, run id '5kepqvj1nilti6cut9hg': 1 error occurred:\n\t* hook run id '0000_0000' failed on action 'pre commit metadata field check' hook 'check_commit_metadata': runtime error: [string \"lua\"]:7: missing mandatory metadata field: owner\n\n\n412 Precondition Failed\n```\n\nLet's retry that action, this time with the required attributes:\n\n```sh\n# CLI\nlakectl commit lakefs://sample-repo/main --message \"Test Commit\" --meta owner=\"John Doe\",environment=\"production\"\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Uncommitted changes\" -\u003e click \"Commit Changes\" -\u003e click \"+ Add Metadata field\" -\u003e insert key: \"owner\", value: \"John Doe\" -\u003e click \"+ Add Metadata field\" -\u003e insert key: \"environment\", value: \"production\" -\u003e click \"Commit Changes\".\n```\n\nCongrats! You've just made your first commit using the `pre-commit metadata validator` hook!\n\nLet's jump to a more advanced example.\n\nNow, we'll try to sneak some private information into the main branch!\n\nWe've created a sneaky branch, `emails`, for you to explore the second hook. It contains a dangerous file: a parquet file with an `email` column.\n\nLet's try to merge that branch into main:\n```sh\n# CLI\nlakectl merge lakefs://sample-repo/emails lakefs://sample-repo/main\n\n# UI\nWithin the \"sample-repo\" repository -\u003e click \"Compare\" tab -\u003e pick \"main\" as the base branch -\u003e pick \"emails\" as the compared-to branch -\u003e click \"Merge\".\n```\n\nYou should see the following output:\n```sh\nupdate branch main: pre-merge hook aborted, run id '5kepi1b1nilh6brjhmmg': 1 error occurred: * hook run id '0000_0000' failed on action 'pre merge PII check on main' hook 'check_blocked_pii_columns': runtime error: [string \"lua\"]:37: Column is not allowed: 'email': type: BYTE_ARRAY in path: tables/customers/dangerous.parquet : Error: 
update branch main: pre-merge hook aborted, run id '5kepi1b1nilh6brjhmmg': 1 error occurred: * hook run id '0000_0000' failed on action 'pre merge PII check on main' hook 'check_blocked_pii_columns': runtime error: [string \"lua\"]:37: Column is not allowed: 'email': type: BYTE_ARRAY in path: tables/customers/dangerous.parquet at tz.merge\n```\n\nPhew! We dodged a bullet here: no PII is present on our main branch.\n\nThat's all for our hooks demonstration. If you're interested in understanding more about hooks, [read our docs](https://docs.lakefs.io/hooks/).\n\n## Sample Data\n\nFor your convenience, we've created a first repository with some sample data:\n\n* [world-cities-database-population](https://www.kaggle.com/datasets/arslanali4343/world-cities-database-population-oct2022) - which contains information on cities and their populations (License: Database Contents License (DbCL) v1.0)\n\n* [nyc-tlc-trip-data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) - which contains information on New York City Yellow and Green taxi trip records.\n\nWe've also included a couple of hooks to help you get started:\n* [pre-commit metadata-validation hook](./_lakefs_actions/pre-commit-metadata-validation.yaml) - which verifies, on each commit to the `stage` and `main` branches, that the following metadata attributes are present: `owner` (free text) and `environment` (must be one of \"production\", \"staging\" or \"development\").\n* [pre-merge format-validation hook](./_lakefs_actions/pre-merge-format-validation.yaml) - which verifies, on each merge to the `main` branch, that PII (Personally Identifiable Information) columns are **missing** from the `tables/customers/` and `tables/orders/` locations.\n\n## Data Sets Examples\n\nAs mentioned above, we've included a couple of datasets for you to experience lakeFS with. Here are some examples to get you started:\n\n```sh\ntrips_df = 
spark.read.parquet(\"s3a://sample-repo/main/nyc-tlc-trip-data/yellow_tripdata_2022-11.parquet\")\n\ntrips_df.printSchema()\n\n# Register the DataFrame as a temporary view so it can be queried with SQL\ntrips_df.createOrReplaceTempView(\"yellow_trips\")\n\nquery = \"\"\"\nSELECT \n    VendorID,\n    tpep_pickup_datetime,\n    tpep_dropoff_datetime,\n    passenger_count,\n    trip_distance,\n    RatecodeID,\n    payment_type,\n    extra,\n    mta_tax,\n    tip_amount,\n    tolls_amount,\n    improvement_surcharge,\n    total_amount,\n    airport_fee\nFROM \n    yellow_trips\n\"\"\"\n\n# Create a new DataFrame based on the query\ncombo_df = spark.sql(query)\n\n# Register the new DataFrame so that we can do EDA\ncombo_df.createOrReplaceTempView(\"combo\")\n\n# Let's start by exploring the data\ncombo_df.select(\"total_amount\").describe().toPandas()\n\ncombo_df.select(\"passenger_count\").describe().toPandas()\n```\n\n## lakeFS Cheatsheet\n\n\u003e **_NOTE:_** All lakectl commands described below can also be performed using our WebUI.\n\n```sh\n# Create a branch\nlakectl branch create lakefs://my-repo/feature --source lakefs://my-repo/main\n\n# Read data via Spark into a DataFrame\ndata = spark.read.parquet(\"s3a://my-repo/feature/sample_data/release=v1.12/type=relation/20220411_183014_00011_baakr_1aee3559-eec4-4d3c-8895-4b36d965a431\").limit(20)\n\n# View the DataFrame\ndata.show()\n\n# Partition the data by the version column and write it to the `feature` branch\ndata.write.partitionBy(\"version\").parquet(\"s3a://my-repo/feature/sample_data/by_version\")\n\n# List files in the `feature` branch\nlakectl fs ls lakefs://my-repo/feature/sample_data/\n\n# Run a diff between two branches\nlakectl diff --two-way lakefs://my-repo/feature lakefs://my-repo/main\n```\n\nFor more information and subcommands, go to [our 
docs](https://docs.lakefs.io/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Fcloud-sample-repo-hooks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftreeverse%2Fcloud-sample-repo-hooks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Fcloud-sample-repo-hooks/lists"}