{"id":15208834,"url":"https://github.com/durlachert/delta-lake-optimization","last_synced_at":"2026-01-25T02:02:32.399Z","repository":{"id":251908918,"uuid":"838811298","full_name":"durlachert/delta-lake-optimization","owner":"durlachert","description":"BA 2","archived":false,"fork":false,"pushed_at":"2024-08-10T11:52:37.000Z","size":8734,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-02T01:31:40.155Z","etag":null,"topics":["apache-hive","apache-spark","big-data","delta-lake","hdfs","jupyter-notebook"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/durlachert.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-06T11:44:54.000Z","updated_at":"2024-08-10T11:52:40.000Z","dependencies_parsed_at":"2024-09-29T06:15:14.402Z","dependency_job_id":null,"html_url":"https://github.com/durlachert/delta-lake-optimization","commit_stats":{"total_commits":18,"total_committers":2,"mean_commits":9.0,"dds":0.4444444444444444,"last_synced_commit":"840746c18adb24208d08f265417001280be15559"},"previous_names":["durlachert/delta-lake-optimization"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/durlachert%2Fdelta-lake-optimization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/durlachert%2Fdelta-lake-optimization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/du
rlachert%2Fdelta-lake-optimization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/durlachert%2Fdelta-lake-optimization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/durlachert","download_url":"https://codeload.github.com/durlachert/delta-lake-optimization/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238825708,"owners_count":19537112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-hive","apache-spark","big-data","delta-lake","hdfs","jupyter-notebook"],"created_at":"2024-09-28T07:02:15.004Z","updated_at":"2025-10-29T12:31:31.263Z","avatar_url":"https://github.com/durlachert.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Optimizing Delta Lake Lakehouse Tables for Improved Performance and Scalability\n\n\n**Author:** Thomas Durlacher  \n**Date:** 10 August 2024\n\n## 1. Introduction\n\nThis project investigated several Delta Lake optimization methods. The evaluation ran on a virtual Apache Spark cluster and provided insights into the methods' efficiency, scalability, and compatibility within a distributed big data processing environment. To keep the performance measurements comparable, the tables were created from a single generated dataset.\n\n\n## 2. Repository\n\nGitHub Repository: [https://github.com/durlachert/delta-lake-optimization](https://github.com/durlachert/delta-lake-optimization)\n\n## 3. 
Objectives\n\n- Evaluate the performance impact of each optimization technique.\n- Assess the scalability of the techniques as the size of the dataset increases.\n- Analyze query performance on complex analytical workloads.\n- Identify any notable advantages or limitations of each technique in the context of the project requirements.\n\n## 4. Tools and Technologies\n\n- Delta Lake\n- Apache Spark\n- PySpark\n- Scala\n- Apache Kafka\n- Apache Hive\n- MySQL\n- HDFS\n- Ubuntu\n- Jupyter Notebook\n- Apache Toree\n\n\n## 5. Methodology\n\n### Dataset Preparation:\n\nGenerate synthetic datasets of varying sizes to simulate real-world big data scenarios.\n\n### Cluster Setup:\n\nDeploy an Apache Spark cluster and the dependencies required for Delta Lake.\n\n### Data Ingestion:\n\nLoad the datasets into tables using the Delta Lake format.\n\n### Performance Metrics:\n\n- Measure the time taken for read and write operations.\n- Evaluate scalability by gradually increasing the dataset size.\n- Execute complex analytical queries and measure query performance.\n\n### Observations and Analysis:\n\nDocument any challenges encountered during setup and configuration. Compare and contrast the performance metrics obtained.\n\n## 6. Expected Outcomes\n\n- A detailed report highlighting the strengths and weaknesses of the Delta Lake optimization methods.\n- Insights into the performance characteristics of the optimization methods under varying workloads and dataset sizes.\n- Recommendations for selecting the appropriate optimization method based on specific use cases.\n- Descriptions and visual representations of the performance measurements.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdurlachert%2Fdelta-lake-optimization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdurlachert%2Fdelta-lake-optimization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdurlachert%2Fdelta-lake-optimization/lists"}