{"id":19705544,"url":"https://github.com/treeverse/lakeview","last_synced_at":"2025-10-28T13:07:00.775Z","repository":{"id":40269749,"uuid":"302033316","full_name":"treeverse/lakeview","owner":"treeverse","description":"lakeview is a visibility tool for S3 based data lakes","archived":false,"fork":false,"pushed_at":"2023-05-23T00:12:49.000Z","size":232,"stargazers_count":30,"open_issues_count":3,"forks_count":4,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-04-16T10:58:31.187Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/treeverse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-07T12:44:29.000Z","updated_at":"2024-04-03T12:16:08.000Z","dependencies_parsed_at":"2022-09-20T20:46:27.661Z","dependency_job_id":null,"html_url":"https://github.com/treeverse/lakeview","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Flakeview","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Flakeview/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Flakeview/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/treeverse%2Flakeview/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/treeverse","download_url":"https://codeload.github.com/treeverse/lakeview/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224178298,"owners_count":17268862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T21:28:51.834Z","updated_at":"2025-10-28T13:07:00.708Z","avatar_url":"https://github.com/treeverse.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# lakeview\n\nlakeview is a visibility tool for AWS S3 based data lakes.\n\nThink of it as [ncdu](https://en.wikipedia.org/wiki/Ncdu), but for Petabyte-scale data, on S3.\n\nInstead of scanning billions of objects using the S3 API (which would require millions of API calls),\nlakeview uses [Athena](https://aws.amazon.com/athena/) to query [S3 Inventory Reports](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html).\n\n## What can it do?\n\n1. Aggregate the sizes of directories* in S3, allowing you to drill down and find what is taking up space.\n1. Compare sizes between different dates - see how directories size change over time between different inventory reports.\n1. _Planned but not yet implemented - _ find the largest duplicates in your directories.\n\n\n\\* _S3, being an object store and not a filesystem, doesn't really have a notion of directories, but its API supports so-called \"common prefixes\"._\n\nAll capabilities are provided in both a human consumable web interface and a machine consumable JSON report - feel free to plug them into your favorite monitoring tool.\n\n## What does it look like?\n\n#### Size report:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"du.png\"/\u003e\n\u003c/p\u003e\n\n\n#### Size diff:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"compare.png\"/\u003e\n\u003c/p\u003e\n\n\n## Quickstart\n\n1. Ensure you have an [S3 inventory set up](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html#storage-inventory-how-to-set-up) (preferably as Parquet or ORC)\n1. Verify the table is [registered in Athena](https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html#storage-inventory-athena-query)\n1. Run lakeview as a standalone Docker container:\n   \n   ```shell script\n   docker run -it -p 5000:5000 \\\n       -v $HOME/.aws:/home/lakeview/.aws \\\n       treeverse/lakeview \\\n           --table \u003cathena table name\u003e \\\n           --output-location \u003cs3 uri\u003e\n   ```\n   \n   note `\u003cathena table name\u003e` is the name you gave in step 2, and `\u003cs3 uri\u003e` is a location in S3 where Athena could store its results (e.g. `s3://my-bucket/athena/`)\n   \n1. Open [http://localhost:5000/](http://localhost:5000/) and start exploring\n\n## Using lakeview as an API\n\n\n### API endpoint: `/du`\n\nTo get results as JSON - add `Accept: application/json` to your request headers, or pass `json` as a query string parameter.\n\n#### Query Parameters: \n\n`prefix (default: \"\")` - return objects and directories[1] starting with the given prefix\n\n`delimiter (default: \"/\")` - use this character as delimiter to group objects under a common prefix\n\n`date` - date string corresponding to the inventory you'd like to query (YYYY-MM-DD-00-00) is S3's default structure\n\n`compare (optional)` - another date string. If present, lakeview will calculate a diff between the two reports for every common prefix and will sort the results based on the largest absolute diff\n\n#### Example\n\nRequest:\n\n```\nhttp://localhost:5000/du?prefix=\u0026delimiter=%2F\u0026date=2020-08-23-00-00\u0026compare=2020-08-22-00-00\u0026json\n```\n\nResponse:\n\n```json\n{\n  \"compare\": \"2020-08-22-00-00\",\n  \"date\": \"2020-08-23-00-00\",\n  \"delimiter\": \"/\",\n  \"prefix\": \"\",\n  \"response\": [\n    {\n      \"common_prefix\": \"users/\",\n      \"diff\": 3363690400953,\n      \"size_left\": 231203538669496,\n      \"size_right\": 231203538669496\n    },\n    {\n      \"common_prefix\": \"production/\",\n      \"diff\": 2737293183914,\n      \"size_left\": 6238586023266733,\n      \"size_right\": 6238586023266733\n    },\n    {\n      \"common_prefix\": \"staging/\",\n      \"diff\": 281953288549,\n      \"size_left\": 367219795944457,\n      \"size_right\": 367219795944457\n    },\n    ...\n  ]\n}\n\n```\n\n## Building and running locally\n\nClone the repo, and from the root directory run:\n\n```\n$ pip install -r requirements.txt\n```\n\nand run this:\n\n```\n$ python server.py \\\n      --table \u003cathena table name\u003e \\\n      --output-location \u003cs3 uri\u003e\n```\n\nFor a complete reference, run:\n\n```\n$ python server.py --help\n```\n\n## License\n\nlakeview is distributed under the Apache 2.0 license. See the included LICENSE file.\n\n\n## More information\n\nlakeview was originally built (with \u003c3) by [Treeverse](https://lakefs.io/).\n\nWe're actively developing [lakeFS](https://github.com/treeverse/lakeFS) as an open source tool that delivers resilience and manageability to object-storage based data lakes.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Flakeview","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftreeverse%2Flakeview","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftreeverse%2Flakeview/lists"}