{"id":16194748,"url":"https://github.com/semyonsinchenko/flake8-pyspark-with-column","last_synced_at":"2025-03-19T04:30:49.428Z","repository":{"id":257790557,"uuid":"861199435","full_name":"SemyonSinchenko/flake8-pyspark-with-column","owner":"SemyonSinchenko","description":"A flake8 plugin that detects of usage withColumn in a loop or inside reduce","archived":false,"fork":false,"pushed_at":"2025-01-15T10:17:35.000Z","size":173,"stargazers_count":27,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-17T03:41:46.501Z","etag":null,"topics":["flake8","flake8-plugin","flake8-plugins","lint","linter","linting","pyspark"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/flake8-pyspark-with-column/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SemyonSinchenko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-22T09:18:32.000Z","updated_at":"2025-03-04T20:47:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"87584e59-8c33-4b18-ae42-7b86d5b07db1","html_url":"https://github.com/SemyonSinchenko/flake8-pyspark-with-column","commit_stats":null,"previous_names":["semyonsinchenko/flake8-pyspark-with-column"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SemyonSinchenko%2Fflake8-pyspark-with-column","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SemyonSinchenko%2Fflake8-pyspark-with-column/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/SemyonSinchenko%2Fflake8-pyspark-with-column/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SemyonSinchenko%2Fflake8-pyspark-with-column/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SemyonSinchenko","download_url":"https://codeload.github.com/SemyonSinchenko/flake8-pyspark-with-column/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244357464,"owners_count":20440332,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["flake8","flake8-plugin","flake8-plugins","lint","linter","linting","pyspark"],"created_at":"2024-10-10T08:24:47.729Z","updated_at":"2025-03-19T04:30:49.423Z","avatar_url":"https://github.com/SemyonSinchenko.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Flake8-pyspark-with-column\n\n[![Upload Python Package](https://github.com/SemyonSinchenko/flake8-pyspark-with-column/actions/workflows/python-publish.yml/badge.svg)](https://github.com/SemyonSinchenko/flake8-pyspark-with-column/actions/workflows/python-publish.yml) ![PyPI - Downloads](https://img.shields.io/pypi/dm/flake8-pyspark-with-column)\n\n## Getting started\n\n```sh\npip install flake8-pyspark-with-column\nflake8 --select PSPRK001,PSPRK002,PSPRK003,PSPRK004\n```\n\nAlternatively, you can add the following `tox.ini` file to the root of your project:\n\n```\n[flake8]\nselect = \n    PSPRK001,\n    PSPRK002,\n    PSPRK003,\n    PSPRK004\n```\n\n## About\n\nA flake8 plugin that detects the use of `withColumn` in a loop or inside `reduce`. From the PySpark documentation on the `withColumn` method:\n\n\u003e This method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with multiple columns at once.\n\n### What happens under the hood?\n\nWhen you run a PySpark application, the following happens:\n\n1. Spark creates an `Unresolved Logical Plan`, which is the result of parsing SQL\n2. Spark analyzes this plan to create an `Analyzed Logical Plan`\n3. Spark applies optimization rules to create an `Optimized Logical Plan`\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://www.databricks.com/wp-content/uploads/2018/05/Catalyst-Optimizer-diagram.png\" alt=\"spark-flow\" width=\"800\" align=\"middle\"/\u003e\n\u003c/p\u003e\n\nWhat is the problem with `withColumn`? Each call adds a separate node to the unresolved plan, so calling `withColumn` 500 times creates an unresolved plan with 500 nodes. During analysis, Spark must visit each node to check that the column exists and has the right data type. After that, Spark starts applying optimization rules, but rules are applied once per plan recursively, so chaining 500 calls to `withColumn` requires 500 applications of the corresponding rule. 
All of that may significantly increase the time it takes to get from the `Unresolved Logical Plan` to the `Optimized Logical Plan`:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/SemyonSinchenko/flake8-pyspark-with-column/refs/heads/main/static/with_column_performance.png\" alt=\"benchmark\" width=\"600\" align=\"middle\"/\u003e\n\u003c/p\u003e\n\nOn the other hand, both `withColumns` and `select(*cols)` create only one node in the plan, no matter how many columns we want to add.\n\n## Rules\nThis plugin contains the following rules:\n\n- `PSPRK001`: Usage of withColumn in a loop detected\n- `PSPRK002`: Usage of withColumn inside reduce is detected\n- `PSPRK003`: Usage of withColumnRenamed in a loop detected\n- `PSPRK004`: Usage of withColumnRenamed inside reduce is detected\n\n### Examples\n\nLet's imagine we want to apply an ML model to our data, but the model expects double values and our table contains decimal values. The goal is to cast all `Decimal` columns to `Double`.\n\nImplementation with `withColumn` (bad example):\n\n```python\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col\nfrom pyspark.sql.types import DecimalType, DoubleType\n\ndef cast_to_double(df: DataFrame) -\u003e DataFrame:\n  for field in df.schema.fields:\n    if isinstance(field.dataType, DecimalType):\n      df = df.withColumn(field.name, col(field.name).cast(DoubleType()))\n  return df\n```\n\nImplementation without `withColumn` (good example):\n\n```python\nfrom pyspark.sql import DataFrame\nfrom pyspark.sql.functions import col\nfrom pyspark.sql.types import DecimalType, DoubleType\n\ndef cast_to_double(df: DataFrame) -\u003e DataFrame:\n  cols_to_select = []\n  for field in df.schema.fields:\n    if isinstance(field.dataType, DecimalType):\n      cols_to_select.append(col(field.name).cast(DoubleType()).alias(field.name))\n    else:\n      cols_to_select.append(col(field.name))\n  return df.select(*cols_to_select)\n```\n\n## Usage\n\n`flake8 %your-code-here%`\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/SemyonSinchenko/flake8-pyspark-with-column/refs/heads/main/static/usage.png\" alt=\"screenshot of how it works\" width=\"800\" 
align=\"middle\"/\u003e\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemyonsinchenko%2Fflake8-pyspark-with-column","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsemyonsinchenko%2Fflake8-pyspark-with-column","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsemyonsinchenko%2Fflake8-pyspark-with-column/lists"}