{"id":18810385,"url":"https://github.com/absaoss/spark-partition-sizing","last_synced_at":"2025-04-13T20:31:01.912Z","repository":{"id":41851255,"uuid":"424418460","full_name":"AbsaOSS/spark-partition-sizing","owner":"AbsaOSS","description":"Sizing partitions in Spark","archived":false,"fork":false,"pushed_at":"2023-05-05T18:10:26.000Z","size":113,"stargazers_count":9,"open_issues_count":5,"forks_count":2,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-04-12T07:05:56.155Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AbsaOSS.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null}},"created_at":"2021-11-04T00:08:41.000Z","updated_at":"2023-08-20T07:39:10.000Z","dependencies_parsed_at":"2023-02-10T09:01:28.949Z","dependency_job_id":null,"html_url":"https://github.com/AbsaOSS/spark-partition-sizing","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-partition-sizing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-partition-sizing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-partition-sizing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AbsaOSS%2Fspark-partition-sizing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AbsaOSS","download_url":"https://codeload.github.com/AbsaOSS/spark-partition-sizing/tar.gz/refs/heads/mas
ter","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223603268,"owners_count":17172072,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T23:20:02.712Z","updated_at":"2024-11-07T23:20:03.304Z","avatar_url":"https://github.com/AbsaOSS.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spark-partition-sizing\n\n[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)\n[![Build](https://github.com/AbsaOSS/spark-partition-sizing/workflows/Build/badge.svg)](https://github.com/AbsaOSS/spark-partition-sizing/actions)\n[![Release](https://github.com/AbsaOSS/spark-partition-sizing/actions/workflows/release.yml/badge.svg)](https://github.com/AbsaOSS/spark-partition-sizing/actions/workflows/release.yml)\n\nLibrary for controlling the size of partitions when writing using Spark.\n\n## Motivation\nSometimes partitions written by Spark are quite unequal (data skew), which makes further reading potentially problematic and processing inefficient.\n`spark-partition-sizing` aims to reduce this problem by providing a number of utilities for achieving a more balanced partitioning.\n\n## Usage\nThis library is built in a Spark-specific manner, so individual Spark versions are part of the library name. 
\n\n|              | spark-partition-sizing-spark2.4                                                                                                                                                                                                    | spark-partition-sizing-spark3.2                                                                                                                                                                                                  | spark-partition-sizing-spark3.3                                                                                                                                                                                                  |\n|--------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| _Scala 2.11_ | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark2.4_2.11/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark2.4_2.11)   |                                                                                                                                                                                                                                  |                                                                                                                          
                                                                                                        | \n| _Scala 2.12_ | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark2.4_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark2.4_2.12)   | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark3.2_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark3.2_2.12) | [![Maven Central](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark3.3_2.12/badge.svg)](https://maven-badges.herokuapp.com/maven-central/za.co.absa/spark-partition-sizing-spark3.3_2.12) | \n\n\n### SBT\n```scala\nlibraryDependencies += \"za.co.absa\" %% \"spark-partition-sizing-spark${sparkVersion}\" % \"X.Y.Z\"\n```\nso, e.g. \n```scala\nlibraryDependencies += \"za.co.absa\" %% \"spark-partition-sizing-spark2.4\" % \"X.Y.Z\"\nlibraryDependencies += \"za.co.absa\" %% \"spark-partition-sizing-spark3.2\" % \"X.Y.Z\"\nlibraryDependencies += \"za.co.absa\" %% \"spark-partition-sizing-spark3.3\" % \"X.Y.Z\"\n```\n\n### Maven\n```xml\n\u003cdependency\u003e\n   \u003cgroupId\u003eza.co.absa\u003c/groupId\u003e\n   \u003cartifactId\u003espark-partition-sizing-spark${sparkVersion}_${scalaVersion}\u003c/artifactId\u003e\n   \u003cversion\u003e${latest_version}\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### Building and testing\nTo build and test the package locally, run:\n```\nsbt clean test\n```\n\n### How to generate Code coverage report\n```sbt\nsbt ++{scala_version} jacoco\n```\nCode coverage will be generated on path:\n```\n{project-root}/target/scala-{scala_version}/jacoco/report/html\n```\n\n## Repartitioning\n\nThe goal of the `DataFramePartitioner` class is to offer new partitioning possibilities and other helping functions.\n\n### 
repartitionByPlanSize\n\nRepartitions the `DataFrame` so that the partition size, as estimated from the execution plan, falls between the provided _min_ and _max_ parameters.\nIf the current partition size is not within the specified range, the dataframe is repartitioned so that the \nblock size falls within the range.\n\n```scala\n    import za.co.absa.spark.partition.sizing.DataFramePartitioner.DataFrameFunctions\n    val minPartitionSizeInBytes = Some(1048576) // 1 MB\n    val maxPartitionSizeInBytes = Some(2097152) // 2 MB\n\n    // specifying a range of values\n    val repartitionedDfWithRange = df.repartitionByPlanSize(minPartitionSizeInBytes, maxPartitionSizeInBytes)\n    // specifying only a maximum partition size\n    val repartitionedDfWithMax = df.repartitionByPlanSize(None, maxPartitionSizeInBytes)\n    // specifying only a minimum partition size\n    val repartitionedDfWithMin = df.repartitionByPlanSize(minPartitionSizeInBytes, None)\n```\n\n### repartitionByRecordCount\n\nRepartitions the `DataFrame` so that each partition contains roughly the provided number of records.\n\n```scala\n    import za.co.absa.spark.partition.sizing.DataFramePartitioner.DataFrameFunctions\n    val targetNrOfRecordsPerPartition = 100\n    val repartitionedDf = df.repartitionByRecordCount(targetNrOfRecordsPerPartition)\n```\n\n### repartitionByDesiredSize\n\nSimilarly to `repartitionByPlanSize`, it estimates the total size of the dataframe and checks if that estimation is within \nthe specified _min_ and _max_ range. The difference is that other options for computing the estimated dataset size \nare available, as opposed to the plan size, which may not always give accurate results.\n The way of estimating the total size is provided through Sizers. These are the available sizers:\n\n#### FromDataframeSizer\n\nIt estimates the size of each row and then sums the values, thus giving a better estimate of the data size\n and not needing an extra parameter. 
Its main drawback is that it can be slow, since all the data is processed, and it\n  may fail on deeply nested rows if the computing resources are limited.\n\n```scala\n    import za.co.absa.spark.partition.sizing.DataFramePartitioner.DataFrameFunctions\n    import za.co.absa.spark.partition.sizing.sizer.FromDataframeSizer\n    val minPartitionSizeInBytes = Some(1048576) // 1 MB\n    val maxPartitionSizeInBytes = Some(2097152) // 2 MB\n    \n    val sizer = new FromDataframeSizer()\n\n    // specifying a range of values\n    val repartitionedDfWithRange = df.repartitionByDesiredSize(sizer)(minPartitionSizeInBytes, maxPartitionSizeInBytes)\n```\n\n#### FromSchemaSizer\n\nEstimates the row size based on a dataframe schema and the expected/typical field sizes. Its main advantage is that this approach is quite quick, since\n the data from the dataset will not be used and no action will run through the data. The accuracy of the estimation, however,\n  is not likely to be high, since nullability or complex structures may not be well estimated.\n  It needs an implicit parameter for computing the data sizes, which can be a custom user-provided one or the default `DefaultDataTypeSizes`.\n\n```scala\n    import za.co.absa.spark.partition.sizing.DataFramePartitioner.DataFrameFunctions\n    import za.co.absa.spark.partition.sizing.sizer.FromSchemaSizer\n    import za.co.absa.spark.partition.sizing.types.DataTypeSizes\n    import za.co.absa.spark.partition.sizing.types.DataTypeSizes.DefaultDataTypeSizes\n    val minPartitionSizeInBytes = Some(1048576) // 1 MB\n    val maxPartitionSizeInBytes = Some(2097152) // 2 MB\n    \n    implicit val defaultSizes: DataTypeSizes = DefaultDataTypeSizes\n    \n    val sizer = new FromSchemaSizer()\n\n    // specifying a range of values\n    val repartitionedDfWithRange = df.repartitionByDesiredSize(sizer)(minPartitionSizeInBytes, maxPartitionSizeInBytes)\n```\n\n#### FromSchemaWithSummariesSizer\n\nSimilarly to `FromSchemaSizer`, it uses the schema, 
but also takes the nullability into account by using the dataset summaries to \ncompute what percentage of each column is null. Therefore, it will apply a weight (the percentage of non-null values) to each column's estimation.\nThis approach is still fairly quick, but somewhat slower than `FromSchemaSizer`.\nIts main limitation is that it can only be used when all the columns of the dataset are primitive, non-nested values,\n a restriction that comes from the Spark summary statistics.\n\n```scala\n    import za.co.absa.spark.partition.sizing.DataFramePartitioner.DataFrameFunctions\n    import za.co.absa.spark.partition.sizing.sizer.FromSchemaWithSummariesSizer\n    import za.co.absa.spark.partition.sizing.types.DataTypeSizes\n    import za.co.absa.spark.partition.sizing.types.DataTypeSizes.DefaultDataTypeSizes\n    val minPartitionSizeInBytes = Some(1048576) // 1 MB\n    val maxPartitionSizeInBytes = Some(2097152) // 2 MB\n    \n    implicit val defaultSizes: DataTypeSizes = DefaultDataTypeSizes\n    \n    val sizer = new FromSchemaWithSummariesSizer()\n\n    // specifying a range of values\n    val repartitionedDfWithRange = df.repartitionByDesiredSize(sizer)(minPartitionSizeInBytes, maxPartitionSizeInBytes)\n```\n\n#### FromDataframeSampleSizer\n\nEstimates the data size based on a few random sample rows taken from the whole data. 
The number of samples can be specified by the user; otherwise the default is 1.\nThis approach has the advantage of being quicker, since taking samples is not as costly as going through all the data,\n but the total estimated size depends on the random samples, so a higher number of samples is likely to give a better estimate at a higher cost.\n\n```scala\n    import za.co.absa.spark.partition.sizing.DataFramePartitioner.DataFrameFunctions\n    import za.co.absa.spark.partition.sizing.sizer.FromDataframeSampleSizer\n\n    val numberOfSamples = 10\n    val minPartitionSizeInBytes = Some(1048576) // 1 MB\n    val maxPartitionSizeInBytes = Some(2097152) // 2 MB\n    \n    val sizer = new FromDataframeSampleSizer(numberOfSamples)\n\n    // specifying a range of values\n    val repartitionedDfWithRange = df.repartitionByDesiredSize(sizer)(minPartitionSizeInBytes, maxPartitionSizeInBytes)\n```\n\n## How to Release\n\nPlease see [this file](RELEASE.md) for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-partition-sizing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabsaoss%2Fspark-partition-sizing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabsaoss%2Fspark-partition-sizing/lists"}