{"id":13471323,"url":"https://github.com/ericxiao251/spark-syntax","last_synced_at":"2025-03-26T13:31:00.746Z","repository":{"id":42163879,"uuid":"101570493","full_name":"ericxiao251/spark-syntax","owner":"ericxiao251","description":"This is a repo documenting the best practices in PySpark.","archived":false,"fork":false,"pushed_at":"2022-12-08T18:20:12.000Z","size":4931,"stargazers_count":460,"open_issues_count":10,"forks_count":76,"subscribers_count":15,"default_branch":"master","last_synced_at":"2024-10-30T02:59:29.577Z","etag":null,"topics":["best-practices","pyspark"],"latest_commit_sha":null,"homepage":"https://ericxiao251.github.io/spark-syntax/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ericxiao251.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-08-27T17:59:59.000Z","updated_at":"2024-10-27T09:50:00.000Z","dependencies_parsed_at":"2023-01-25T15:31:28.848Z","dependency_job_id":null,"html_url":"https://github.com/ericxiao251/spark-syntax","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericxiao251%2Fspark-syntax","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericxiao251%2Fspark-syntax/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericxiao251%2Fspark-syntax/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ericxiao251%2Fspark-syntax/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ericxiao251","download_url":"https://codeload.github.com/
ericxiao251/spark-syntax/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245662784,"owners_count":20652084,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["best-practices","pyspark"],"created_at":"2024-07-31T16:00:43.085Z","updated_at":"2025-03-26T13:31:00.436Z","avatar_url":"https://github.com/ericxiao251.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","What to hire for:"],"sub_categories":[],"readme":"# Spark-Syntax\n\nThis is a public repo documenting the \"best practices\" for writing PySpark code, based on what I have learnt from working with `PySpark` for 3 years. It mainly focuses on the `Spark DataFrames and SQL` library.\n\nYou can also visit ericxiao251.github.io/spark-syntax/ for an online book version.\n\n# Contributing/Topic Requests\n\nIf you notice any possible improvements (typos, spelling, grammar, etc.), feel free to create a PR and I'll review it 😁, you'll most likely be right.\n\nIf you have any topics that I could potentially go over, please create an **issue** and describe the topic. I'll try my best to address it 😁.\n\n# Acknowledgement\n\nHuge thanks to Levon for turning everything into a gitbook. 
You can follow his GitHub at https://github.com/tumregels.\n\n# Table of Contents:\n\n## Chapter 1 - Getting Started with Spark:\n* #### 1.1 - [Useful Material](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%201%20-%20Basics/Section%201%20-%20Useful%20Material.md)\n* #### 1.2 - [Creating your First DataFrame](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%201%20-%20Basics/Section%202%20-%20Creating%20your%20First%20Data%20Object.ipynb)\n* #### 1.3 - [Reading your First Dataset](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%201%20-%20Basics/Section%203%20-%20Reading%20your%20First%20Dataset.ipynb)\n* #### 1.4 - [More Comfortable with SQL?](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%201%20-%20Basics/Section%204%20-%20More%20Comfortable%20with%20SQL%3F.ipynb)\n\n## Chapter 2 - Exploring the Spark APIs:\n* #### 2.1 - Non-Trivial Data Structures in Spark\n    * ##### 2.1.1 - [Struct Types](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%201.1%20-%20Struct%20Types.ipynb) (`StructType`)\n    * ##### 2.1.2 - [Arrays and Lists](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%201.2%20-%20Arrays%20and%20Lists.ipynb) (`ArrayType`)\n    * ##### 2.1.3 - [Maps and Dictionaries](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%201.3%20-%20Maps%20and%20Dictionaries.ipynb) (`MapType`)\n    * ##### 2.1.4 - [Decimals and Why did my Decimals overflow :(](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%201.4%20-%20Decimals%20and%20Why%20did%20my%20Decimals%20Overflow.ipynb) (`DecimalType`)\n* #### 2.2 - [Performing your First 
Transformations](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202%20-%20Performing%20your%20First%20Transformations.ipynb)\n    * ##### 2.2.1  - [Looking at Your Data](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.1%20-%20Looking%20at%20Your%20Data.ipynb) (`collect`/`head`/`take`/`first`/`toPandas`/`show`)\n    * ##### 2.2.2  - [Selecting a Subset of Columns](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.2%20-%20Selecting%20a%20Subset%20of%20Columns.ipynb) (`drop`/`select`)\n    * ##### 2.2.3  - [Creating New Columns and Transforming Data](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.3%20-%20Creating%20New%20Columns%20and%20Transforming%20Data.ipynb) (`withColumn`/`withColumnRenamed`)\n    * ##### 2.2.4  - [Constant Values and Column Expressions](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.4%20-%20Constant%20Values%20and%20Column%20Expressions.ipynb) (`lit`/`col`)\n    * ##### 2.2.5  - [Casting Columns to a Different Type](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.5%20-%20Casting%20Columns%20to%20Different%20Type.ipynb) (`cast`)\n    * ##### 2.2.6  - [Filtering Data](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.6%20-%20Filtering%20Data.ipynb) (`where`/`filter`/`isin`)\n    * ##### 2.2.7  - [Equality Statements in Spark and Comparisons with 
Nulls](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.7%20-%20Equality%20Statements%20in%20Spark%20and%20Comparison%20with%20Nulls.ipynb) (`isNotNull()`/`isNull()`)\n    * ##### 2.2.8  - [Case Statements](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.8%20-%20Case%20Statements.ipynb) (`when`/`otherwise`)\n    * ##### 2.2.9  - [Filling in Null Values](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.9%20-%20Filling%20in%20Null%20Values.ipynb) (`fillna`/`coalesce`)\n    * ##### 2.2.10  - [Spark Functions aren't Enough, I Need my Own!](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.10%20-%20Spark%20Functions%20aren't%20Enough%2C%20I%20Need%20my%20Own!.ipynb) (`udf`/`pandas_udf`)\n    * ##### 2.2.11  - [Unionizing Multiple Dataframes](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.11%20%20-%20Unionizing%20Multiple%20Dataframes.ipynb) (`union`)\n    * ##### 2.2.12 - [Performing Joins (clean one)](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%202.12%20-%20Performing%20Joins%20(clean%20one).ipynb) (`join`)\n* #### 2.3 More Complex Transformations\n    * ##### 2.3.1 - [One to Many Rows](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%203.1%20-%20One%20to%20Many%20Rows.ipynb) (`explode`)\n    * ##### 2.3.2 - [Range Join Conditions (WIP)](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%202%20-%20Exploring%20the%20Spark%20APIs/Section%203.2%20-%20Range%20Join%20Conditions%20(WIP).ipynb) (`join`)\n* #### 2.4 Potential Performance Boosting Functions\n    * 
##### 2.4.1 - (`repartition`)\n    * ##### 2.4.2 - (`coalesce`)\n    * ##### 2.4.3 - (`cache`)\n    * ##### 2.4.4 - (`broadcast`)\n\n## Chapter 3 - Aggregates:\n* #### 3.1 - [Clean Aggregations](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%203%20-%20Aggregates/Section%201%20-%20Clean%20Aggregations.ipynb)\n* #### 3.2 - [Non Deterministic Behaviours](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%203%20-%20Aggregates/Section%202%20-%20Non%20Deterministic%20Ordering%20for%20GroupBys.ipynb)\n\n## Chapter 4 - Window Objects:\n* #### 4.1 - [Default Ordering on a Window Object](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%205%20-%20Window%20Objects/Section%201%20-%20Default%20Behaviour%20of%20a%20Window%20Object.ipynb)\n* #### 4.2 - [Ordering High Frequency Data with a Window Object](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%205%20-%20Window%20Objects/Section%202%20-%20Ordering%20High%20Frequency%20Data%20with%20a%20Window%20Object.ipynb)\n\n## Chapter 5 - Error Logs:\n\n## Chapter 6 - Understanding Spark Performance:\n* #### 6.1 - Primer to Understanding Your Spark Application\n    * #### 6.1.1 - [Understanding how Spark Works](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%206%20-%20Tuning%20%26%20Spark%20Parameters/Section%201.1%20-%20Understanding%20how%20Spark%20Works.md)\n    * #### 6.1.2 - Understanding the SparkUI\n    * #### 6.1.3 - Understanding how the DAG is Created\n    * #### 6.1.4 - Understanding how Memory is Allocated\n* #### 6.2 - Analyzing your Spark Application\n    * #### 6.2.1 - Looking for Skew in a Stage\n    * #### 6.2.2 - Looking for Skew in the DAG\n    * #### 6.2.3 - How to Determine the Number of Partitions to Use\n* #### 6.3 - How to Analyze the Skew of Your Data\n\n## Chapter 7 - High Performance Code:\n* #### 7.0 - The Types of Join Strategies in Spark\n  * ##### 7.0.1 - You got a Small Table? 
(`Broadcast Join`)\n  * ##### 7.0.2 - The Ideal Strategy (`BroadcastHashJoin`)\n  * ##### 7.0.3 - The Default Strategy (`SortMergeJoin`)\n* #### 7.1 - Improving Joins\n    * ##### 7.1.1 - [Filter Pushdown](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%207%20-%20High%20Performance%20Code/Section%201.1%20-%20Filter%20Pushdown.ipynb)\n    * ##### 7.1.2 - [Joining on Skewed Data (Null Keys)](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%207%20-%20High%20Performance%20Code/Section%201.2%20-%20Joins%20on%20Skewed%20Data%20(Null%20Keys).ipynb)\n    * ##### 7.1.3 - [Joining on Skewed Data (High Frequency Keys I)](https://github.com/ericxiao251/spark-syntax/blob/master/src/Chapter%207%20-%20High%20Performance%20Code/Section%201.3%20-%20Joins%20on%20Skewed%20Data%20(High%20Frequency%20Keys%20I).ipynb)\n    * ##### 7.1.4 - Joining on Skewed Data (High Frequency Keys II)\n    * ##### 7.1.5 - Join Ordering\n* #### 7.2 - Repeated Work on a Single Dataset (`caching`)\n    * ##### 7.2.1 - caching layers\n* #### 7.3 - Spark Parameters\n  * ##### 7.3.1 - Running Multiple Spark Applications at Scale (`dynamic allocation`)\n  * ##### 7.3.2 - The magical number `2001` (`partitions`)\n  * ##### 7.3.3 - Using a lot of `UDF`s? (`python memory`)\n* #### 7. - Bloom Filters :o?\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fericxiao251%2Fspark-syntax","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fericxiao251%2Fspark-syntax","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fericxiao251%2Fspark-syntax/lists"}