{"id":20271717,"url":"https://github.com/rodneyshag/spark","last_synced_at":"2025-07-25T13:04:14.261Z","repository":{"id":123421862,"uuid":"209907825","full_name":"RodneyShag/Spark","owner":"RodneyShag","description":"Spark tutorial","archived":false,"fork":false,"pushed_at":"2020-07-08T00:34:05.000Z","size":542,"stargazers_count":5,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-14T05:49:49.396Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RodneyShag.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-21T01:36:32.000Z","updated_at":"2021-08-24T05:13:01.000Z","dependencies_parsed_at":null,"dependency_job_id":"3042396d-d58e-4c02-aa64-b9566809d92d","html_url":"https://github.com/RodneyShag/Spark","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FSpark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FSpark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FSpark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FSpark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RodneyShag","download_url":"https://codeload.github.com/RodneyShag/Spark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://git
hub.com","kind":"github","repositories_count":241758964,"owners_count":20015251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T12:39:10.940Z","updated_at":"2025-03-04T00:05:32.312Z","avatar_url":"https://github.com/RodneyShag.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"images/spark_logo.png\"\u003e\n\u003c/p\u003e\n\n- [Tutorial from IntelliPath](#tutorial-from-intellipath)\n    - [Architecture](#architecture)\n    - [Resilient Distributed Dataset (RDD)](#resilient-distributed-dataset-rdd)\n        - [3 ways to create RDDs](#3-ways-to-create-rdds)\n        - [Transformations on RDDs](#transformations-on-rdds)\n        - [Actions on RDDs](#actions-on-rdds)\n    - [Creating Data Frames](#creating-data-frames)\n- [Tutorial from Coursera](#tutorial-from-coursera)\n    - [Introduction](#introduction)\n    - [Resilient Distributed Datasets (RDDs)](#resilient-distributed-datasets-rdds)\n    - [Transformations and Actions](#transformations-and-actions)\n    - [Evaluation in Spark](#evaluation-in-spark)\n    - [Reduction Operations](#reduction-operations)\n    - [Pair RDDs](#pair-rdds)\n    - [Transformations and Actions on Pair RDDs](#transformations-and-actions-on-pair-rdds)\n    - [Joins](#joins)\n    - [Shuffling](#shuffling)\n    - [Partitioning](#partitioning)\n    - [Wide vs Narrow Dependencies](#wide-vs-narrow-dependencies)\n    - [Structure and Optimization](#structure-and-optimization)\n    - [Spark SQL](#spark-sql)\n    - [Data Frames](#data-frames)\n  
  - [Datasets](#datasets)\n- [User Defined Functions (UDFs)](#user-defined-functions-udfs)\n- [SparkException: Task not serializable](#sparkexception-task-not-serializable)\n- [References](#references)\n\nThis repo is a concise summary and _replacement_ of the tutorials by IntelliPath and Coursera. Using the hyperlinks below is optional.\n\n\n# [Tutorial from IntelliPath](https://www.youtube.com/watch?v=GFC2gOL1p9k)\n\n## [Architecture](https://youtu.be/GFC2gOL1p9k?t=549)\n\nSpark is an open-source, scalable tool for parallel processing.\n\nSpark is polyglot: code can be written in Scala (most popular), Java, Python, R, and Spark SQL. It provides high-level APIs for these languages.\n\n![Spark Architecture](./images/sparkArchitecture.png)\n\n- __Driver program__ - The code you're writing behaves as a \"driver program\". The interactive shell you're writing code in is a sample driver program.\n- __Cluster manager__ - manages various jobs. Sample cluster managers: Spark Standalone Cluster, Apache Mesos, Hadoop YARN, Kubernetes\n- __Worker nodes__ - they execute a task and return its result to the \"Spark context\". 
They provide in-memory storage for cached RDDs (explained below)\n\n\n## Resilient Distributed Dataset (RDD)\n\nRDDs are the fundamental data structure of Spark.\n\n### [3 ways to create RDDs](https://youtu.be/GFC2gOL1p9k?t=794)\n\n[1) Parallelize a collection](https://youtu.be/GFC2gOL1p9k?t=817)\n\n```scala\nval myFirstRDD = sc.parallelize(List(\"spark\", \"scala\", \"hadoop\"))\n```\n\n[2) Use a data set in an external storage system](https://youtu.be/GFC2gOL1p9k?t=845)\n\n```scala\nval textRDD = sc.textFile(\"/user/cloudera/data.txt\")\n```\n\n[3) Create an RDD from already existing RDDs](https://youtu.be/GFC2gOL1p9k?t=870)\n\nUsing `textRDD` from above:\n\n```scala\nval newRDD = textRDD.filter(x =\u003e x.contains(\"spark\"))\n```\n\n\n### [Transformations on RDDs](https://youtu.be/GFC2gOL1p9k?t=908)\n\nFunction: `map`\n\n```scala\nval x = sc.parallelize(List(\"spark\", \"rdd\", \"example\", \"sample\", \"example\"))\nval y = x.map(x =\u003e (x, 1))\ny.collect\n```\n\n```scala\n// Output\nArray[(String, Int)] = Array((spark,1), (rdd,1), (example,1), (sample,1), (example,1))\n```\n\nFunction: `flatMap` - `map` returns exactly 1 element per input element, while `flatMap` can return a list of elements\n\n```scala\nsc.parallelize(List(1, 2, 3)).flatMap(x =\u003e List(x, x, x)).collect\n```\n\n```scala\n// Output\nArray(1, 1, 1, 2, 2, 2, 3, 3, 3)\n```\n\nFunction: `filter`\n\n```scala\nval numbers = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))\nnumbers.filter(_ % 2 == 0).collect\n```\n\n```scala\n// Output\nArray(2, 4, 6, 8, 10)\n```\n\nFunction: `intersection`\n\n```scala\nval parallel = sc.parallelize(1 to 9)\nval par2 = sc.parallelize(5 to 15)\nparallel.intersection(par2).collect\n```\n\n```scala\n// Output\nArray(6, 8, 7, 9, 5)\n```\n\n\n### [Actions on RDDs](https://youtu.be/GFC2gOL1p9k?t=1204)\n\n[Actions are Spark RDD operations that give non-RDD values](https://www.youtube.com/watch?v=GFC2gOL1p9k)\n\nFunction: `reduce`\n\n```scala\nval a = sc.parallelize(1 to 10)\na.reduce(_ + 
_)\n```\n\n```scala\n// Output\nInt = 55\n```\n\nFunction: `first`\n\n```scala\nval names2 = sc.parallelize(List(\"apple\", \"beatty\", \"beatrice\"))\nnames2.first\n```\n\n```scala\n// Output\nString = apple\n```\n\nFunction: `take`\n\n```scala\nval nums = sc.parallelize(List(1, 5, 3, 9, 4, 0, 2))\nnums.take(4)\n```\n\n```scala\n// Output\nArray[Int] = Array(1, 5, 3, 9)\n```\n\nFunction: `foreachPartition`\n\n```scala\nval b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3)\nb.foreachPartition(x =\u003e println(x.reduce(_ + _)))\n```\n\n```scala\n// Output\n6\n15\n24\n```\n\n\n## [Creating Data Frames](https://youtu.be/GFC2gOL1p9k?t=1750)\n\n### [Create a Data Frame from a List](https://youtu.be/GFC2gOL1p9k?t=1759)\n\n```scala\nval product = List((1, \"mobile\", 50000), (2, \"shoes\", 4500), (3, \"TV\", 70000))\n\nval productDF = product.toDF(\"pid\", \"product\", \"value\") // column names\nproductDF.show()\n```\n\n\n```scala\n// Output\n\n+---+-------+-----+\n|pid|product|value|\n+---+-------+-----+\n|  1| mobile|50000|\n|  2|  shoes| 4500|\n|  3|     TV|70000|\n+---+-------+-----+\n```\n\n\n### [Create a Data Frame from a JSON file](https://youtu.be/GFC2gOL1p9k?t=1826)\n\n\n```scala\nval df = spark.read.json(\"/student1.json\")\ndf.show()\n```\n\n```scala\n// Output\n\n+----+------+\n| age|  name|\n+----+------+\n|null|   Sam|\n|  17|  Mick|\n|  18|Jennet|\n|  19|Serena|\n+----+------+\n```\n\nLet's print the schema:\n\n```scala\ndf.printSchema()\n```\n\n```scala\n// Output\n\nroot\n |-- age: long (nullable = true)\n |-- name: string (nullable = true)\n```\n\nLet's `select` a column:\n\n```scala\ndf.select(\"name\").show()\n```\n\n```scala\n// Output\n\n+------+\n|  name|\n+------+\n|   Sam|\n|  Mick|\n|Jennet|\n|Serena|\n+------+\n```\n\nLet's `filter` for age of at least 18:\n\n```scala\ndf.filter($\"age\" \u003e= 18).show()\n```\n\n```scala\n// Output\n\n+----+------+\n| age|  name|\n+----+------+\n|  18|Jennet|\n|  19|Serena|\n+----+------+\n```\n\n\n# 
[Tutorial from Coursera](https://www.coursera.org/learn/scala-spark-big-data?specialization=scala)\n\n## Introduction\n\nSpark keeps all data __immutable__ and __in-memory__.\n\nAll operations on data are just functional transformations, like regular Scala collections.\n\nFault tolerance is achieved by replaying functional transformations over the original dataset. This makes Spark up to 100x faster than Hadoop/Map-Reduce, which uses disk writes to achieve fault tolerance.\n\n## Resilient Distributed Datasets (RDDs)\n\nMost operations on RDDs are higher-order functions:\n\n```scala\nabstract class RDD[T] {\n  def map[U](f: T =\u003e U): RDD[U] = ...\n  def flatMap[U](f: T =\u003e TraversableOnce[U]): RDD[U] = ...\n  def filter(f: T =\u003e Boolean): RDD[T] = ...\n  def reduce(f: (T, T) =\u003e T): T = ...\n  ...\n}\n```\n\n### Example: Count of a specific word in Spark\n\nGiven an `encyclopedia: RDD[String]` with one page per element, we can count how many pages mention \"EPFL\":\n\n```scala\nval result = encyclopedia.filter(page =\u003e page.contains(\"EPFL\")).count()\n```\n\n### Example: Word count in Spark\n\n```scala\nval rdd = sc.textFile(\"hdfs://...\")\n\nval count = rdd.flatMap(line =\u003e line.split(\" \")) // separate lines into words\n               .map(word =\u003e (word, 1))           // include something to count\n               .reduceByKey(_ + _)               // sum up the 1s in the pairs\n```\n\n## Transformations and Actions\n\n- _Transformations_ return new RDDs as results. Examples: `map`, `filter`, `flatMap`, `groupBy`\n- _Actions_ compute a result based on an RDD, and it's either returned or saved to an external storage system. Examples: `reduce`, `fold`, `count`\n\nTransformations are _lazy_ (delayed execution), and actions are _eager_ (immediate execution). So none of the _transformations_ happen until there is an _action_.\n\nTo know if a function is a _transformation_ or an _action_, we look at its return type. 
If the return type is an RDD, it's a _transformation_, otherwise it's an _action_.\n\n### Lazy evaluation resulting in efficiency\n\nSpark will analyze and optimize a chain of operations before executing it. This is a benefit of _lazy_ evaluation. In the code below, as soon as 10 elements of the filtered RDD have been computed, `firstLogsWithErrors` is done.\n\n```scala\nval lastYearsLogs: RDD[String] = ...\nval firstLogsWithErrors = lastYearsLogs.filter(_.contains(\"ERROR\")).take(10)\n```\n\nSpark (unlike strict Scala collections) can also combine the below `map` and `filter` so that it doesn't have to iterate through the data twice:\n\n```scala\nval lastYearsLogs: RDD[String] = ...\nval numErrors = lastYearsLogs.map(_.toLowerCase)\n                             .filter(_.contains(\"error\"))\n                             .count()\n```\n\n## Evaluation in Spark\n\n### Caching and Persistence\n\nBy default, RDDs are recomputed each time you run an action on them. This can be expensive (in time) if you need to use a dataset more than once.\nTo tell Spark to cache an RDD in memory, simply call `persist()` or `cache()` on it:\n\n```scala\nval lastYearsLogs: RDD[String] = ...\nval logsWithErrors = lastYearsLogs.filter(_.contains(\"ERROR\")).persist()\nval firstLogsWithErrors = logsWithErrors.take(10)\nval numErrors = logsWithErrors.count() // faster since we used .persist() above\n```\n\nThe `persist()` method can be customized in 5 ways in how data is persisted:\n1. in memory as regular Java objects - has a shorthand function for it: `cache()` instead of `persist()`\n1. on disk as regular Java objects\n1. in memory as serialized Java objects (more compact)\n1. on disk as serialized Java objects (more compact)\n1. both in memory and on disk (spill over to disk to avoid re-computation)\n\nScala Collections and Spark RDDs have similar-looking APIs. 
However, Spark RDDs use lazy evaluation while Scala Collections do not (by default).\n\n### Common pitfall: `println` in a cluster\n\nWhat happens in this scenario?\n\n```scala\ncase class Person(name: String, age: Int)\nval people: RDD[Person] = ...\npeople.foreach(println)\n```\n\nSince `foreach` is an action with a return type of `Unit`, the `println` calls run on the worker nodes in the cluster (instead of in the driver program), so the output goes to the workers' stdout and is never seen by the user.\n\n\n## Reduction Operations\n\n`foldLeft` and `foldRight` are not parallelizable, so they do not exist for Spark's RDDs. We use `fold`, `reduce`, and `aggregate` instead.\n\nThe `aggregate` function has a signature of `aggregate[B](z: =\u003e B)(seqop: (B, A) =\u003e B, combop: (B, B) =\u003e B): B`\n\n\n## Pair RDDs\n\nPair RDDs are just another name for distributed key-value pairs.\n\nIn distributed systems, Pair RDDs are used more often than arrays and lists.\n\n### Creating a Pair RDD from a JSON record\n\n```json\n\"definitions\": {\n  \"firstname\": \"string\",\n  \"lastname\": \"string\",\n  \"address\": {\n    \"type\": \"object\",\n    \"properties\": {\n      \"type\": \"object\",\n      \"street\": {\n        \"type\": \"string\"\n      },\n      \"city\": {\n        \"type\": \"string\"\n      },\n      \"state\": {\n        \"type\": \"string\"\n      }\n    },\n    \"required\": [\n      \"street_address\",\n      \"city\",\n      \"state\"\n    ]\n  }\n}\n```\n\nIf we only care about the \"address\" part of the above record, we can create an RDD for just that part:\n\n```scala\nRDD[(String, Property)] // String is a key representing a city, 'Property' is its corresponding value.\n\ncase class Property(street: String, city: String, state: String)\n```\n\nWe used the `city` as the key. 
This would be useful if we wanted to group these RDDs by their `city`, so we can do computations on these properties by city.\n\n### Creating a Pair RDD from an RDD\n\nIf given `val rdd: RDD[WikipediaPage]`, we can create a pair RDD:\n\n```scala\nval pairRdd = rdd.map(page =\u003e (page.title, page.text))\n```\n\nUnlike a standard RDD, when you have a Pair RDD such as `RDD[(K, V)]`, you get new methods such as:\n\n```scala\ndef groupByKey(): RDD[(K, Iterable[V])]\ndef reduceByKey(func: (V, V) =\u003e V): RDD[(K, V)]\ndef join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]\n```\n\n## Transformations and Actions on Pair RDDs\n\n### Pair RDD Transformation: groupByKey\n\n#### `groupBy` from Scala:\n\n```scala\ndef groupBy[K](f: A =\u003e K): Map[K, Traversable[A]]\n```\n\nLet's group by various ages:\n\n```scala\nval ages = List(2, 52, 44, 23, 17, 14, 12, 82, 51, 64)\nval grouped = ages.groupBy { age =\u003e\n  if (age \u003e= 18 \u0026\u0026 age \u003c 65) \"adult\"\n  else if (age \u003c 18) \"child\"\n  else \"senior\"\n}\n```\n\n```scala\n// Output\ngrouped: scala.collection.immutable.Map[String, List[Int]] =\n  Map(senior -\u003e List(82),\n      adult -\u003e List(52, 44, 23, 51, 64),\n      child -\u003e List(2, 17, 14, 12))\n```\n\n#### `groupByKey` for Pair RDDs in Spark\n\n```scala\ncase class Event(organizer: String, name: String, budget: Int)\nval eventsRdd = sc.parallelize(...) 
// \"...\" represents some data\n                  .map(event =\u003e (event.organizer, event.budget))\nval groupedRdd = eventsRdd.groupByKey()\ngroupedRdd.collect().foreach(println)\n```\n\n```scala\n// Output is something like:\n\n(Prime Sound, CompactBuffer(42000))\n(Sportorg, CompactBuffer(23000, 12000, 1400))\n```\n\n### Pair RDD Transformation: reduceByKey\n\nWe can use `reduceByKey`, which can be thought of as a combination of `groupByKey` and `reduce`-ing on all values per key.\n\n```scala\ndef reduceByKey(func: (V, V) =\u003e V): RDD[(K, V)]\n```\n\n```scala\ncase class Event(organizer: String, name: String, budget: Int)\nval eventsRdd = sc.parallelize(...) // \"...\" represents some data\n                  .map(event =\u003e (event.organizer, event.budget))\nval budgetsRdd = eventsRdd.reduceByKey(_ + _)\nbudgetsRdd.collect().foreach(println)\n```\n\n```scala\n// Output is something like:\n\n(Prime Sound, 42000)\n(Sportorg, 36400)\n```\n\n\n## Joins\n\n### Provided Sample Data\n\nData called \"abos\":\n\n```scala\n(101, (\"Ruetli\", AG)),\n(102, (\"Brelaz\", DemiTarif)),\n(103, (\"Gress\", DemiTarifVisa)),\n(104, (\"Schatten\", DemiTarif))\n```\n\nData called \"locations\":\n\n```scala\n(101, \"Bern\"),\n(101, \"Thun\"),\n(102, \"Lausanne\"),\n(102, \"Geneve\"),\n(102, \"Nyon\"),\n(103, \"Zurich\"),\n(103, \"St-Gallen\"),\n(103, \"Chur\")\n```\n\n### Join\n\n```scala\ndef join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]\n```\n\nDoing a join (also known as inner join) of \"abos\" with \"locations\" gives us the output below. Key 104 is dropped because it has no entry in \"locations\":\n\n```scala\n// Output\n\n(101, ((Ruetli, AG), Bern))\n(101, ((Ruetli, AG), Thun))\n(102, ((Brelaz, DemiTarif), Nyon))\n(102, ((Brelaz, DemiTarif), Lausanne))\n(102, ((Brelaz, DemiTarif), Geneve))\n(103, ((Gress, DemiTarifVisa), St-Gallen))\n(103, ((Gress, DemiTarifVisa), Chur))\n(103, ((Gress, DemiTarifVisa), Zurich))\n```\n\n### Left Outer Joins, Right Outer Joins\n\n```scala\ndef leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]\ndef rightOuterJoin[W](other: RDD[(K, W)]): 
RDD[(K, (Option[V], W))]\n```\n\nUsing a left outer join:\n\n```scala\nval abosWithOptionalLocations = abos.leftOuterJoin(locations)\nabosWithOptionalLocations.collect().foreach(println)\n```\n\n```scala\n// Output\n\n(101, ((Ruetli, AG), Some(Thun)))\n(101, ((Ruetli, AG), Some(Bern)))\n(102, ((Brelaz, DemiTarif), Some(Geneve)))\n(102, ((Brelaz, DemiTarif), Some(Nyon)))\n(102, ((Brelaz, DemiTarif), Some(Lausanne)))\n(103, ((Gress, DemiTarifVisa), Some(Zurich)))\n(103, ((Gress, DemiTarifVisa), Some(St-Gallen)))\n(103, ((Gress, DemiTarifVisa), Some(Chur)))\n(104, ((Schatten, DemiTarif), None))  // notice the None\n```\n\n## Shuffling\n\nShuffling is when data is moved between nodes. This can happen when we do a `groupByKey()`. Moving data around the network like this is extremely slow.\n\n```scala\n// slow\nval purchasesPerMonthSlowLarge = purchasesRddLarge.map(p =\u003e (p.customerId, p.price))\n  .groupByKey()\n  .map(p =\u003e (p._1, (p._2.size, p._2.sum)))\n  .count()\n```\n\nBy _reducing_ the data set first with `reduceByKey` (which combines values per key on each node before the shuffle), we can reduce the amount of data that's sent over the network during a shuffle.\n\n```scala\n// fast\nval purchasesPerMonthFastLarge = purchasesRddLarge.map(p =\u003e (p.customerId, (1, p.price)))\n  .reduceByKey((v1, v2) =\u003e (v1._1 + v2._1, v1._2 + v2._2))\n  .count()\n```\n\n## Partitioning\n\nPartitioning can bring substantial performance gains, especially if you can prevent or lower the number of shuffles.\n\n### Properties of partitions\n\n- The data within an RDD is split into several partitions.\n- Partitions never span multiple machines.\n- Each machine in the cluster contains 1 or more partitions.\n- The number of partitions to use is configurable. 
By default, it equals the total number of cores on all executor nodes.\n\nCustomizing partitioning is only possible when working with Pair RDDs (since partitioning is done based on keys).\n\n### Hash partitioning\n\nAttempts to spread data evenly across partitions based on the hash of the key.\n\n### Range partitioning\n\nThis is for keys that can have an ordering. Tuples with keys in the same range appear on the same machine. For example, if our numbers are 1 to 800, we can have 4 partitions of: [1, 200], [201, 400], [401, 600], [601, 800]\n\nInvoking `partitionBy` creates an RDD with a specified partitioner.\n\n```scala\nval pairs = purchasesRdd.map(p =\u003e (p.customerId, p.price))\n\n// 8 partitions. pairs will be sampled to create appropriate ranges.\nval tunedPartitioner = new RangePartitioner(8, pairs)\n\nval partitioned = pairs.partitionBy(tunedPartitioner).persist()\n```\n\nWithout `persist`, the partitioning is re-applied each time the partitioned RDD is used, resulting in unnecessary shuffling. By using `persist` we are telling Spark that once it has moved the data around the network and re-partitioned it, it should keep the data where it is, in memory. The results of `partitionBy` should always be persisted.\n\n### 2 ways partitioners are passed along with transformations\n\n#### 1) __Partitioner from parent RDD__\n\nPair RDDs that are the result of a transformation on a _partitioned_ Pair RDD are usually configured to use the hash partitioner that was used to construct the parent.\n\nOperations on Pair RDDs that hold to (and propagate) a partitioner:\n\n```\ncogroup          foldByKey\ngroupWith        combineByKey\njoin             partitionBy\nleftOuterJoin    sort\nrightOuterJoin   mapValues (if parent has a partitioner)\ngroupByKey       flatMapValues (if parent has a partitioner)\nreduceByKey      filter (if parent has a partitioner)\n```\n\nAll other operations will produce a result without a partitioner.\n\nNotice `map` and `flatMap` are not on this list. 
This is because `map` and `flatMap` can change the keys in an RDD. For this reason, use `mapValues` instead of `map` whenever possible to avoid unnecessary shuffling.\n\nOperations that may cause a shuffle: `cogroup`, `groupWith`, `join`, `leftOuterJoin`, `rightOuterJoin`, `groupByKey`, `reduceByKey`, `combineByKey`, `distinct`, `intersection`, `repartition`, `coalesce`\n\n#### 2) __Automatically-set partitioners__\n\nSome operations on RDDs automatically result in an RDD with a known partitioner, when it makes sense. Examples:\n\n- `RangePartitioner` is used when using `sortByKey`\n- `HashPartitioner` is used when using `groupByKey`\n\n\n## Wide vs Narrow Dependencies\n\n### Narrow Dependency\n\nEach partition of the parent RDD is used by at most 1 partition of the child RDD\n\n![Narrow Dependencies](./images/narrowDependencies.png)\n\nTransformations with narrow dependencies: `map`, `mapValues`, `flatMap`, `filter`, `mapPartitions`, `mapPartitionsWithIndex`\n\n### Wide Dependency\n\nEach partition of the parent RDD may be depended on by multiple child partitions\n\n![Wide Dependencies](./images/wideDependencies.png)\n\nTransformations with wide dependencies (that may cause a shuffle): `cogroup`, `groupWith`, `join`, `leftOuterJoin`, `rightOuterJoin`, `groupByKey`, `reduceByKey`, `combineByKey`, `distinct`, `intersection`, `repartition`, `coalesce`\n\n### Finding dependencies\n\n#### `dependencies()`\n\n`dependencies()` returns a sequence of `Dependency` objects, which are the dependencies used by Spark's scheduler to know how this RDD depends on other RDDs.\n\n- `dependencies()` may return:\n    - Narrow dependency objects: `OneToOneDependency`, `PruneDependency`, `RangeDependency`\n    - Wide dependency objects: `ShuffleDependency`\n\n```scala\nval wordsRdd = sc.parallelize(largeList)\nval pairs = wordsRdd.map(c =\u003e (c, 1))\n                    .groupByKey()\n                    .dependencies\n```\n\n```scala\n// Output is something like:\n\npairs: 
Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@4294a23d)\n```\n\n#### `toDebugString()`\n\n```scala\nval wordsRdd = sc.parallelize(largeList)\nval pairs = wordsRdd.map(c =\u003e (c, 1))\n                    .groupByKey()\n                    .toDebugString\n```\n\n```scala\n// Output is something like:\n\npairs: String =\n(8) ShuffledRDD[219] at groupByKey at \u003cconsole\u003e:38 []\n +-(8) MapPartitionsRDD[218] at map at \u003cconsole\u003e:37 []\n    |  ParallelCollectionRDD[217] at parallelize at \u003cconsole\u003e:36 []\n```\n\nThe indentation in the above output shows how Spark groups these operations together.\n\n\n## Structure and Optimization\n\n### Optimizing Inner Join\n\nIf we have:\n\n```scala\nval demographics = sc.textFile(...) // Pair RDD of (id, demographic)\nval finances = sc.textFile(...) // Pair RDD of (id, finances)\n```\n\n#### Solution 1: Inner Join then Filter\n\nAn inner join of `demographics` and `finances` will give us a type of: `(Int, (Demographic, Finances))`, which we then filter and count below:\n\n```scala\ndemographics.join(finances)\n            .filter { p =\u003e\n              p._2._1.country == \"Switzerland\" \u0026\u0026\n              p._2._2.hasFinancialDependents \u0026\u0026\n              p._2._2.hasDebt\n            }.count\n```\n\n#### Solution 2: Filter then Inner Join\n\n```scala\nval filtered = finances.filter(p =\u003e p._2.hasFinancialDependents \u0026\u0026 p._2.hasDebt)\n\ndemographics.filter(p =\u003e p._2.country == \"Switzerland\")\n            .join(filtered)\n            .count\n```\n\n#### Solution 3: Cartesian Product, then filters\n\n```scala\nval cartesian = demographics.cartesian(finances)\n\ncartesian.filter {\n  case (p1, p2) =\u003e p1._1 == p2._1\n}\n.filter {\n  case (p1, p2) =\u003e (p1._2.country == \"Switzerland\") \u0026\u0026\n                   (p2._2.hasFinancialDependents) \u0026\u0026\n                   (p2._2.hasDebt)\n}.count\n```\n\n#### 
Comparing our 3 methods\n\nFastest to Slowest: Solution 2, Solution 1, Solution 3\n\n- Cartesian product (Solution 3) is _extremely_ slow. Use inner join instead.\n- Filtering data first before join (Solution 2) is much faster than joining then filtering (Solution 1)\n\n\n### Types of Data\n\n- Structured: Database tables\n- Semi-Structured: JSON, XML - these types of data are self-describing. No rigid structure to them.\n- Unstructured: Log files, images\n\nFor structured data, Spark may be able to make optimizations for you (such as putting filters before inner joins). That is the whole point of _Spark SQL_. The only caveat is we've got to give up some of the freedom, flexibility, and generality of the functional collections API in order to give Spark some structure and thus more opportunities to optimize.\n\n\n## Spark SQL\n\nSpark SQL is a library implemented on top of Spark.\n\n#### Benefits\n\n- __mix SQL queries with Scala__ - sometimes it's more desirable to express a computation in SQL syntax instead of functional APIs, and vice versa.\n- __high performance__ - we get optimizations we're used to from databases, into our Spark jobs.\n- __support new data sources such as semi-structured data and external databases__\n\n\n#### 3 main APIs it adds\n- SQL literal syntax\n- DataFrames\n- Datasets\n\n#### 2 specialized backend components\n- Catalyst - a query optimizer.\n- Tungsten - off-heap serializer.\n\nMore info on all this later.\n\n### SparkSession\n\n`SparkSession` is the newer version of `SparkContext`. 
This is how to create a `SparkSession`:\n\n```scala\nimport org.apache.spark.sql.SparkSession\n\nval spark = SparkSession\n  .builder()\n  .appName(\"My App\")\n  //.config(\"spark.some.config.option\", \"some-value\")\n  .getOrCreate()\n```\n\n### Creating DataFrames\n\nA _DataFrame_ is conceptually equivalent to a table in a relational database.\n\nDataFrames are distributed collections of records, with a known schema.\n\nThere are 2 ways to create data frames:\n\n1. From an existing RDD - either with schema inference, or with an explicit schema\n1. Reading a data source from file - common structured or semi-structured formats such as JSON\n\n#### Method 1: Use an existing RDD:\n\n```scala\nval tupleRDD = ... // Assume RDD[(Int, String, String, String)]\nval tupleDF = tupleRDD.toDF(\"id\", \"name\", \"city\", \"country\") // column names\n```\n\nIf you don't pass column names to `toDF`, then Spark will assign numbers as attributes (column names).\n\nHowever, if you have an RDD containing some kind of case class instance, then Spark can infer the attributes from the case class's fields:\n\n```scala\ncase class Person(id: Int, name: String, city: String)\nval peopleRDD = ... 
// Assume RDD[Person]\nval peopleDF = peopleRDD.toDF // Attributes (column names) will be inferred\n```\n\nAnother option is to use an explicit schema, but the process is omitted here as it's complex.\n\n#### Method 2: Use a data source from file\n\n```scala\n// 'spark' represents the SparkSession object\nval df = spark.read.json(\"examples/src/main/resources/people.json\")\n```\n\nSpark SQL can directly create `DataFrame`s from the following semi-structured/structured data: `JSON`, `CSV`, `Parquet` (a serialized big data format), `JDBC`, + more using [DataFrameReader](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader)\n\n### Creating Temp Views\n\nAssuming we have a `DataFrame` called `peopleDF`, we just have to register our `DataFrame` as a temporary SQL view first:\n\n```scala\npeopleDF.createOrReplaceTempView(\"people\")\n```\n\nThis registers the `DataFrame` as an SQL temporary view. It essentially gives a name to our DataFrame in SQL so we can refer to it in an SQL `FROM` clause:\n\n```scala\nval adultsDF = spark.sql(\"SELECT * FROM people WHERE age \u003e 17\")\n```\n\nThe SQL statements available are basically what's available in HiveQL.\n\n\n## Data Frames\n\nDataFrames API is similar to SQL, in that it has `select`, `where`, `limit`, `orderBy`, `groupBy`, `join`, etc.\n\n### Spark SQL vs Data Frames API\n\nGiven:\n\n```scala\ncase class Employee(id: Int, fname: String, lname: String, age: Int, city: String)\nval employeeDF = sc.parallelize(...).toDF\n```\n\nwe can use Spark SQL as:\n\n```scala\n// assuming we have \"employees\" table registered, we can do:\nval sydneyEmployeesDF = spark.sql(\"\"\"SELECT id, lname\n                                       FROM employees\n                                      WHERE city = \"Sydney\"\n                                   ORDER BY id\"\"\")\n```\n\nor we can use the DataFrames API as:\n\n```scala\nval sydneyEmployeesDF = employeeDF.select(\"id\", \"lname\")\n          
                        .where(\"city == 'Sydney'\")\n                                  .orderBy(\"id\")\n```\n\n### Seeing our data\n\n- `show()` pretty-prints DataFrame in tabular form, showing first 20 elements\n- `printSchema()` - prints the schema of your DataFrame in tree format\n\n### 3 ways to select a column\n\n1. Use $-notation as `df.filter($\"age\" \u003e 18)`. Requires `import spark.implicits._` to use $-notation.\n1. Refer to the Dataframe: `df.filter(df(\"age\") \u003e 18)`\n1. Use SQL query string: `df.filter(\"age \u003e 18\")`\n\n### Working with missing values\n\nDropping records with unwanted values:\n\n- `drop()` drops rows that contain `null` or `NaN` values in any column and returns a new `DataFrame`\n- `drop(\"all\")` drops rows that contain `null` or `NaN` values in all columns and returns a new `DataFrame`\n- `drop(Array(\"id\", \"name\"))` drops rows that contain `null` or `NaN` values in the specified columns and returns a new `DataFrame`\n\nReplacing unwanted values:\n\n- `fill(0)` replaces all occurrences of `null` or `NaN` in __numeric columns__ with a specified value and returns a new `DataFrame`\n- `fill(Map(\"minBalance\" -\u003e 0))` replaces all occurrences of `null` or `NaN` in specified column with specified value and returns a new `DataFrame`\n- `replace(Array(\"id\"), Map(1234 -\u003e 8923))` replaces specified value (1234) in specified column (id) with specified replacement value (8923) and returns a new `DataFrame`\n\n### Common actions on DataFrames\n\nLike RDDs, DataFrames also have their own set of actions:\n\n```scala\ncollect(): Array[Row] // returns an array that contains all rows in this DataFrame\ncount(): Long // returns number of rows in DataFrame\nfirst(): Row // returns the first row in the DataFrame\nhead(): Row  // same as first()\nshow(): Unit // displays the top 20 rows of DataFrame in a tabular form\ntake(n: Int): Array[Row] // returns the first n rows in the DataFrame\n```\n\n### Joins on 
DataFrames\n\nJoins on DataFrames are similar to those on Pair RDDs, with one major difference: since DataFrames aren't key/value pairs, we must specify which columns to join on.\n\nSupported join types include `inner`, `outer`, `left_outer`, `right_outer`, and `leftsemi`:\n\n```scala\ndf1.join(df2, df1(\"id\") === df2(\"id\"))                // inner join\ndf1.join(df2, df1(\"id\") === df2(\"id\"), \"right_outer\") // right_outer join\n```\n\n### Optimizations on DataFrames: Catalyst\n\nCatalyst is Spark SQL's query optimizer. It compiles Spark SQL programs down to RDD operations.\n\n- __Reorders operations__ - for example, tries to do `filter`s as early as possible.\n- __Reduces the amount of data we must read__ - skips reading in, serializing, and sending around parts of the data that aren't needed for our computation (Example: a Scala object with many fields - Catalyst will only send around the relevant columns of the object).\n- __Prunes unneeded partitions__ - analyzes `DataFrame` and filter operations to figure out and skip partitions that aren't needed in our computation.\n\n### Optimizations on DataFrames: Tungsten\n\nTungsten provides:\n\n- __highly-specialized data encoder__ - since our data types are restricted to Spark SQL data types, Tungsten can optimize encoding by using this schema information.\n- __column-based storage__ - this is common for databases. Since most operations on tables are done on columns (instead of rows), it's more efficient to store data by grouping column data together.\n- __off-heap data encoding__ - so it's free from garbage collection overhead.\n\n### Limitations of DataFrames\n\n1. `DataFrame`s are untyped (unlike RDDs). Your code may compile, but you may get a runtime exception if you try to run a query on a column that doesn't exist.\n1. If your unstructured data cannot be reformulated to adhere to some kind of schema, it would be better to use RDDs.\n\n\n## Datasets\n\n`DataFrame`s don't have type safety. 
`Dataset`s resolve this problem.\n\n```scala\ntype DataFrame = Dataset[Row] // DataFrames are actually Datasets of type: Row\n```\n\n- Datasets can be thought of as __typed__ distributed collections of data\n- Dataset API unifies the DataFrame and RDD APIs. We can freely mix these APIs, although the function signatures may be slightly different.\n- Datasets require structured or semi-structured data.\n\n\nDatasets vs DataFrames: you get type information using Datasets. You can also use higher-order functions such as `map`, `flatMap`, and `filter`, as with RDDs.\n\nDatasets vs RDDs: you get more optimizations than with RDDs since `Catalyst` works on Datasets.\n\nMixing APIs example, assuming `listingsDS` is of type `Dataset[Listing]`:\n\n```scala\nlistingsDS.groupByKey(l =\u003e l.zip)        // from RDD API: groupByKey\n          .agg(avg($\"price\").as[Double]) // from our DataFrame API\n```\n\nThe types match up since everything is a `Dataset`.\n\n\n### Creating a Dataset\n\n#### Create `dataset` from a `DataFrame`\n\n```scala\nimport spark.implicits._\nmyDF.as[Person] // creates a typed dataset from a dataframe; assumes a case class Person matching the schema\n```\n\n#### Create `dataset` from JSON\n\nIf we define a case class whose structure, names, and types all match up with \"people.json\", then we can read this file into a dataset, perfectly typed:\n\n```scala\nval myDS = spark.read.json(\"people.json\").as[Person]\n```\n\n#### Create `dataset` from `RDD`\n\n```scala\nimport spark.implicits._\nmyRDD.toDS\n```\n\n#### Create `dataset` from Scala type\n\n```scala\nimport spark.implicits._\nList(\"yay\", \"ohnoes\", \"hooray!\").toDS\n```\n\n### Typed Columns\n\n`Dataset`s use typed columns, so the following error could happen:\n\n```scala\nfound   : org.apache.spark.sql.Column\nrequired: org.apache.spark.sql.TypedColumn[...]\n                .agg(avg($\"price\")).show\n```\n\nTo create a `TypedColumn`, we can rewrite it as `avg($\"price\").as[Double]` to give the column a specific type (of `Double`).\n\n\n#### Untyped and 
Typed Transformations\n\n- __Untyped transformations__ - exist in `DataFrame`s and `Dataset`s\n- __Typed transformations__ - exist in `Dataset`s. Typed variants of many `DataFrame` transformations, and additional transformations such as RDD-like higher-order functions like `map`, `flatMap`, etc.\n\n### Aggregators\n\n`Aggregator` (in `org.apache.spark.sql.expressions`) is a class that helps you generically aggregate data, kind of like the `aggregate` method in RDDs.\n\n```scala\nclass Aggregator[-IN, BUF, OUT]\n```\n\n- `IN` is the input type to the aggregator. When using an aggregator after `groupByKey`, this is the type that represents the value in the key/value pair.\n- `BUF` is the intermediate type during aggregation\n- `OUT` is the type of the output of the aggregation\n\nTo create an Aggregator, define the `IN`, `BUF`, `OUT` types and implement the methods below:\n\n```scala\nval myAgg = new Aggregator[IN, BUF, OUT] {\n  def zero: BUF = ...                    // The initial value.\n  def reduce(b: BUF, a: IN): BUF = ...   // Add an element to the running total.\n  def merge(b1: BUF, b2: BUF): BUF = ... // Merge intermediate values.\n  def finish(b: BUF): OUT = ...          
// Return the final result.\n}.toColumn // if we're going to pass this to an aggregation method, it must be a TypedColumn\n```\n\nExample of specific Aggregator:\n\n```scala\nval keyValues\n  = List((3, \"Me\"), (1, \"Thi\"), (2, \"Se\"), (3, \"ssa\"), (1, \"sIsA\"), (3, \"ge:\"), (3, \"-)\"), (2, \"cre\"), (2, \"t\"))\n\nval keyValuesDS = keyValues.toDS\n\nval strConcat = new Aggregator[(Int, String), String, String] {\n  def zero: String = \"\"\n  def reduce(b: String, a: (Int, String)): String = b + a._2\n  def merge(b1: String, b2: String): String = b1 + b2\n  def finish(r: String): String = r\n}.toColumn\n\n// pass the aggregator to agg\nkeyValuesDS.groupByKey(pair =\u003e pair._1)\n           .agg(strConcat.as[String]).show\n```\n\nAs written, the above solution does not compile: an `Aggregator` must also define Encoders for its buffer and output types.\n\n### Encoders\n\nEncoders convert your data between JVM objects and Spark SQL's specialized internal representation. Encoders are required by all Datasets. They generate custom bytecode for serialization and deserialization of your data.\n\nTwo ways to introduce encoders:\n1. __Automatically__ (generally the case) via implicits from a `SparkSession`. Just do `import spark.implicits._`\n1. 
__Explicitly__ via `org.apache.spark.sql.Encoders`, which contains a large selection of methods for creating `Encoder`s from Scala primitive types, `Product`s, and tuples.\n\nWe _explicitly_ add encoders to our `strConcat` aggregator above by overriding these two methods:\n\n```scala\noverride def bufferEncoder: Encoder[String] = Encoders.STRING\noverride def outputEncoder: Encoder[String] = Encoders.STRING\n```\n\n### When to use Datasets vs DataFrames vs RDDs\n\n- Use Datasets when\n    - you have structured/semi-structured data\n    - you want type safety\n    - you need to work with functional APIs\n    - you need good performance, but it doesn't have to be the best\n- Use DataFrames when\n    - you have structured or semi-structured data\n    - you want the best possible performance, automatically optimized for you\n- Use RDDs when\n    - you have unstructured data\n    - you need to fine-tune and manage low-level details of RDD computations\n    - you have complex data types that cannot be serialized with `Encoder`s\n\n# [User Defined Functions (UDFs)](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html)\n\nUser Defined Functions (UDFs) are a feature of Spark SQL for defining new [Column](https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Column.html)-based functions for transforming [Datasets](https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/Dataset.html).\n\nInstead of UDFs, use [higher-level standard Column-based functions](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions.html) whenever possible since Spark SQL performs optimizations on them. 
Spark SQL does not perform optimizations on UDFs.\n\nExample of UDF:\n\n```scala\nval dataset = Seq((0, \"hello\"), (1, \"world\")).toDF(\"id\", \"text\")\n\nval upper: String =\u003e String = _.toUpperCase // regular Scala function\n\n// Define a UDF that wraps the upper Scala function defined above.\n// You could instead define the function inside the udf but separating\n// Scala functions from Spark SQL's UDFs allows for easier testing.\nimport org.apache.spark.sql.functions.udf\nval upperUDF = udf(upper)\n\n// Apply the UDF to change the source dataset\ndataset.withColumn(\"upper\", upperUDF('text)).show\n```\n\ngives output of:\n\n```Scala\n+---+-----+-----+\n| id| text|upper|\n+---+-----+-----+\n|  0|hello|HELLO|\n|  1|world|WORLD|\n+---+-----+-----+\n```\n\nAlternatively you could have defined the UDF like this:\n\n```scala\nval upper: String =\u003e String = _.toUpperCase\nval upperUDF = udf { s: String =\u003e s.toUpperCase }\n```\n\nor like this:\n\n```scala\nval upper: String =\u003e String = _.toUpperCase\nval upperUDF = udf[String, String](_.toUpperCase)\n```\n\nYou can also register UDFs so you can use them in SQL queries:\n\n```scala\nval spark: SparkSession = ...\nspark.udf.register(\"myUpper\", (input: String) =\u003e input.toUpperCase)\n```\n\n\n# SparkException: Task not serializable\n\n`org.apache.spark.SparkException: Task not serializable` exception occurs when you use a reference to an instance of a non-serializable class inside a transformation.\n\n[Functions on RDDs (such as `map`), Dataframes, Datasets, etc. need to be serialized so they can be sent to worker nodes. 
Serialization happens for you, but if the function makes a reference to a field in another object, the entire other object must be serialized.](https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54)\n\n### Example 1\n\n```scala\nobject Example {\n  val num = 1\n  def myFunc = testRdd.map(_ + num)\n}\n```\n\nThis code fails because `num` is a field of `Example`, outside the scope of `myFunc()`. Since \"the function makes a reference to a field in another object, the entire other object must be serialized\", Spark tries to serialize `Example`, which is not serializable.\n\nThe code is fixed by adding `extends Serializable` to the object:\n\n```scala\nobject Example extends Serializable {\n  val num = 1\n  def myFunc = testRdd.map(_ + num)\n}\n```\n\n### Example 2\n\n```scala\nobject Example {\n  val num = 1\n  def myFunc = {\n    val enclosedNum = num\n    testRdd.map(_ + enclosedNum)\n  }\n}\n```\n\nThis code works without `extends Serializable` because we added `val enclosedNum = num`. The function now references only the local value `enclosedNum`, which is in the scope of `myFunc()`, so the entire object no longer needs to be serialized.\n\nHowever, if we used `lazy val enclosedNum = num` instead, it wouldn't work. When `enclosedNum` is referenced, it still requires knowledge of `num`, so Spark will still try to serialize `object Example`.\n\n\n# References\n\n#### References - Used in this Repo\n\n- YouTube: [Apache Spark Tutorial | Spark Tutorial for Beginners | Spark Big Data | Intellipaat](https://www.youtube.com/watch?v=GFC2gOL1p9k) - 0:00 to 33:20 was great. The rest was skipped since it taught very specific concepts with a mediocre explanation.\n- Coursera: [Big Data Analysis with Scala and Spark](https://www.coursera.org/learn/scala-spark-big-data?specialization=scala) - an amazing course. 
This repo is based on the course's lecture videos.\n- Article: [Spark SQL UDFs](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html) - good beginner summary of UDFs.\n- Article: [Serialization challenges with Spark and Scala](https://medium.com/onzo-tech/serialization-challenges-with-spark-and-scala-a2287cd51c54) - useful for understanding `SparkException: Task not serializable`. The 8 examples were good, but the \"What's next\" section was skipped since it got overly detailed and complicated.\n\n#### References - Deprecated\n\n- [YouTube: What is Apache Spark? | Introduction to Apache Spark | Apache Spark Certification | Edureka](https://www.youtube.com/watch?v=VSbU7bKfNkA\u0026list=PL9ooVrP1hQOGyFc60sExNX1qBWJyV5IMb) - Mediocre overview.\n- [YouTube: Intro to Apache Spark for Java and Scala Developers - Ted Malaska (Cloudera)](https://www.youtube.com/watch?v=x8xXXqvhZq8) - Too high-level and slightly off-topic.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodneyshag%2Fspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frodneyshag%2Fspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodneyshag%2Fspark/lists"}