{"id":13707300,"url":"https://github.com/cloudandthings/cat-lakefs-demo","last_synced_at":"2025-06-30T11:03:50.344Z","repository":{"id":114903935,"uuid":"477668678","full_name":"cloudandthings/cat-lakefs-demo","owner":"cloudandthings","description":null,"archived":false,"fork":false,"pushed_at":"2023-03-06T14:35:35.000Z","size":22,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"develop","last_synced_at":"2025-02-22T06:41:24.827Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cloudandthings.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-04-04T11:22:22.000Z","updated_at":"2024-06-14T17:45:59.000Z","dependencies_parsed_at":"2024-01-14T20:34:33.193Z","dependency_job_id":null,"html_url":"https://github.com/cloudandthings/cat-lakefs-demo","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cloudandthings/cat-lakefs-demo","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudandthings%2Fcat-lakefs-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudandthings%2Fcat-lakefs-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudandthings%2Fcat-lakefs-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudandthings%2Fcat-lakefs-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cloudandthings","download_url":"https://codeload.g
ithub.com/cloudandthings/cat-lakefs-demo/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cloudandthings%2Fcat-lakefs-demo/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262762431,"owners_count":23360326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T22:01:26.642Z","updated_at":"2025-06-30T11:03:50.299Z","avatar_url":"https://github.com/cloudandthings.png","language":"HCL","funding_links":[],"categories":["HCL"],"sub_categories":[],"readme":"# Overview\n\nThis repo can be used to demonstrate basic use of LakeFS in AWS. \n\nIt provides a test AWS environment that can be spun up or down using Terraform, primarily this includes:\n+ LakeFS EC2 server\n+ LakeFS RDS database\n+ Various Glue jobs for testing / demonstrating features of LakeFS\n\nSee our Medium write-up here for more info:\nhttps://medium.com/cloudandthings/fun-with-lakefs-1237f73d4a1c\n\n\n# Usage\n## Configure Terraform\n\n1. Configure backend.\nCopy `backend_example.tf` to `backend.tf` and update it.\n\n2. Configure variables,\nCopy `terraform_example.tfvars` to `terraform.tfvars` and update it.\n\n### Create manual infrastructure\nThis repo does not create all required infrastructure, for example VPC and subnets. These would need to be created manually and then configured in `terraform.tfvars`. 
There's no reason for this other than I already had a VPC created, so this could be removed / automated in a future PR.\n\nAs you review `terraform.tfvars` it should be evident what is required to be created manually.\n\n## Create infrastructure\n\n- `terraform apply`\n\nTerraform will display the public IP address of the LakeFS Server.\n\nRemember to run `terraform destroy` later once you are done.\n\nThe LakeFS web UI will be accessible via the server EC2 instance public IP, which is output by Terraform. Connect on port 8000 using a web browser.\n\n## Set up LakeFS admin user\n\nThis step is only necessary when a new RDS instance is created.\n- Access lakefs via the LakeFS web UI, and you should be routed to the `setup` page.\n- Create an admin user on the LakeFS UI.\n- Download the LakeFS admin user credentials.\n\nAt this point it is recommended to take an RDS snapshot, for example `lakefs-init`. \n\n## Create a LakeFS repo\n\nUse the UI to create a repo and point it to a new bucket, or an existing bucket with a new prefix.\n\nRecommended settings:\n```\nBucket = s3://\u003cLAKEFS BUCKET\u003e/lakefs/\nRepo = main-repo\nBranch = main\n```\n\nTODO:\nCould be done using a lakefs client as follows:\n\n```\nlakectl repo create lakefs://source-data s3://treeverse-demo-lakefs-storage-production/user_m1eo6o342cajzqaa/source-data\n```\nOr probably using the API.\n\nFinally create another RDS snapshot, for example `lakefs-init-with-repo`\n\n## Configure snapshot (optional)\n\nWith the RDS snapshots, we can update `terraform.tfvars` with the following: \n- RDS snapshot ID: `rds_snapshot_id`\n\nIf we run `terraform destroy` to trash our environment, then next time LakeFS will be recreated with the RDS snapshot specified - preserving credentials and optionally the repo also. 
\n\nWe can re-run `terraform apply` to ensure changes take effect in any dependent resources.\n\n## Set up IAM user\n\nAn IAM user credential is used only when Glue reads from or writes to S3 directly using the LakeFS filesystem format.\n\nManually create an IAM user called `lakefs`, and attach a policy to the user which allows `s3:*` on the LakeFS data bucket created by Terraform.\n\n## Store secret in AWS Secrets Manager\n\nCreate a secret in AWS Secrets Manager called `lakefs` and populate the following key-value pairs:\n\n| Key | Value |\n| -- | -- |\n| iam-user-access-key | IAM user access key |\n| iam-user-secret-key | IAM user secret key |\n| lakefs-access-key | LakeFS admin user access key |\n| lakefs-secret-key | LakeFS admin user secret key |\n\nThe LakeFS Glue jobs illustrate 2 different modes of accessing LakeFS.\n\nIf LakeFS uses its built-in S3 gateway then data is sent via the gateway (i.e. via the LakeFS server) and therefore no IAM user permission is needed. However, the IAM role attached to the EC2 instance will then need S3 permissions.\n\n## Set up lakefs client\n\nWe could create a second EC2 instance and install LakeFS on it to test a CLI client. 
But we decided instead to focus our effort on tests using Spark / Glue.\n\n# Use LakeFS to ingest data\n\n## Ingesting data using Spark\n\nSpark can be used to read and write data to LakeFS.\n\nSee the scripts directory for some sample Glue jobs.\n\nThe Glue jobs each test 2 methods of accessing data: firstly via the Spark dataframe API to access data on the underlying storage (excluding catalog) and secondly using the Glue catalog to access the data.\n\nIn addition, the multiple Glue jobs fulfil different functions.\nSee the comment at the top of each one for details.\n\n## Ingesting data with the LakeFS ingest tool\n\nTODO - not tested.\nIngesting sample data from https://registry.opendata.aws/speedtest-global-performance/\n```\nlakectl ingest \\\n  --from s3://ookla-open-data/parquet/performance/type=mobile/ \\\n  --to lakefs://source-data/main/performance/type=mobile/\n\nlakectl commit lakefs://source-data/main -m 'source data loaded'\n```\n\nTODO - Glue catalogue\n\nAfter the data has been ingested and catalogued, we can consider setting up `lakectl metastore` to also use the Glue catalog.\n\n# Known issues\n\n- Various TODOs in the code.\n- This demo environment is insecure for various reasons including those listed below. Do not put any sensitive data on this LakeFS environment.\n- - The Postgres master user/password is used in the LakeFS configuration file instead of a lakefs-specific user. This could perhaps be fixed with the Terraform PostgreSQL provider.\n- - The EC2 instance is open to the world so that Glue can talk to it. We could in future look at Glue endpoints in a VPC.\n\n- The Objects view by default displays objects in the current workspace, which includes uncommitted objects. You can switch to see objects at the last commit, but this is unintuitive.\n\n- A LakeFS Terraform provider might be nice, perhaps for creating the LakeFS admin user credentials and storing them in AWS Secrets Manager automatically. 
There might be security issues with this though - would the credentials be stored in the TF state?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloudandthings%2Fcat-lakefs-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcloudandthings%2Fcat-lakefs-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcloudandthings%2Fcat-lakefs-demo/lists"}