{"id":20271740,"url":"https://github.com/rodneyshag/parallel_programming","last_synced_at":"2025-03-04T00:05:15.704Z","repository":{"id":123421849,"uuid":"210265840","full_name":"RodneyShag/Parallel_programming","owner":"RodneyShag","description":"Basics of parallel programming in Scala","archived":false,"fork":false,"pushed_at":"2019-10-14T23:13:39.000Z","size":177,"stargazers_count":8,"open_issues_count":0,"forks_count":6,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-14T05:49:47.656Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RodneyShag.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-23T04:40:01.000Z","updated_at":"2023-05-03T13:05:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"e7e29266-2b63-4327-b6e6-03503e03eeea","html_url":"https://github.com/RodneyShag/Parallel_programming","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FParallel_programming","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FParallel_programming/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FParallel_programming/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FParallel_programming/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RodneySha
g","download_url":"https://codeload.github.com/RodneyShag/Parallel_programming/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241758964,"owners_count":20015251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T12:39:15.354Z","updated_at":"2025-03-04T00:05:15.664Z","avatar_url":"https://github.com/RodneyShag.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Parallel programming\n\n- [Introduction](#introduction)\n- [Threads](#threads)\n- [Atomicity, Synchronization, Deadlock](#atomicity-synchronization-deadlock)\n- [Parallel Algorithms](#parallel-algorithms)\n- [Runtimes](#runtimes)\n- [Benchmarking](#benchmarking)\n- Implementation\n    - [Copy](#copy-implementation)\n    - [Map](#map-implementation)\n    - [Reduce](#reduce-implementation)\n- [Data-Parallel Programming](#data-parallel-programming)\n- [Scala Collections](#scala-collections)\n- [Splitters and Combiners](#splitters-and-combiners)\n- [Parallel Two-Phase Construction](#parallel-two-phase-construction)\n- [References](#references)\n\n\nThis repo is a concise summary and _replacement_ of the [Coursera: Parallel programming](https://www.coursera.org/learn/parprog1?specialization=scala) course. Using the hyperlinks below is optional.\n\n\n## Introduction\n\n__Parallel computing__ is a type of computation in which many calculations are performed at the same time.\n\n__CPU speed-up is non-linear__ - The power required to speed up a CPU starts to grow non-linearly. 
For this reason, we add more CPUs instead of trying to speed up just 1 CPU.\n\n\n## Threads\n\n__Thread memory address space__ - Threads can be started from within the same program, and they share the same memory.\n\nExample using threads:\n\n```scala\nclass HelloThread extends Thread {\n  override def run(): Unit = {\n    println(\"Hello world!\")\n  }\n}\n\nval t = new HelloThread\n\nt.start()\nt.join()\n```\n\nCalling `t.join()` will block execution on the main thread until `HelloThread` completes.\n\n\n## Atomicity, Synchronization, Deadlock\n\n__Atomicity__ - An operation is _atomic_ if it appears as if it occurred instantaneously from the point of view of other threads.\n\nA `synchronized` block is used to achieve atomicity: only 1 thread at a time can hold the lock on a given object.\n\nExample of deadlock with 2 synchronized threads:\n\n```scala\nclass Account(private var amount: Int = 0) {\n  def transfer(target: Account, n: Int) =\n    this.synchronized {\n      target.synchronized {\n        this.amount -= n\n        target.amount += n\n      }\n    }\n}\n```\n\n```scala\ndef startThread(a: Account, b: Account, n: Int) = {\n  val t = new Thread {\n    override def run(): Unit = {\n      for (i \u003c- 0 until n) {\n        a.transfer(b, 1)\n      }\n    }\n  }\n  t.start()\n  t\n}\n```\n\n```scala\nval a1 = new Account(500000)\nval a2 = new Account(700000)\n\nval t = startThread(a1, a2, 150000)\nval s = startThread(a2, a1, 150000)\nt.join()\ns.join()\n```\n\nDeadlock can occur because thread `t` holds the lock on `a1` while waiting for the lock on `a2`, and thread `s` holds the lock on `a2` while waiting for the lock on `a1`.\n\nThis deadlock can be fixed by always acquiring the locks in the same order: the account with the smaller `uid` is locked before the account with the larger `uid`.\n\n```scala\nclass Account(private var amount: Int = 0) {\n  val uid = getUniqueUid()\n  private def lockAndTransfer(target: Account, n: Int) =\n    this.synchronized {\n      target.synchronized {\n        this.amount -= n\n        target.amount += n\n      }\n    }\n\n  def transfer(target: 
Account, n: Int) =\n    if (this.uid \u003c target.uid) this.lockAndTransfer(target, n)\n    else target.lockAndTransfer(this, -n)\n}\n```\n\n\n## Parallel Algorithms\n\nThe `parallel` function runs 2 computations in parallel and returns both results as a pair. It is defined as:\n\n```scala\ndef parallel[A, B](taskA: =\u003e A, taskB: =\u003e B): (A, B) = { ... }\n```\n\nScala \"by-name\" parameters (`=\u003e`) are used to achieve lazy evaluation of `taskA` and `taskB`, so that they can be evaluated in parallel inside `parallel` instead of being evaluated eagerly when the function is called.\n\nHere is a recursive algorithm for an unbounded number of threads. It calculates the _p-norm_ of a vector:\n\n```scala\ndef pNormRec(a: Array[Int], p: Double): Int =\n  power(segmentRec(a, p, 0, a.length), 1/p)\n\n// A recursive method needs an explicit result type\ndef segmentRec(a: Array[Int], p: Double, s: Int, t: Int): Int = {\n  if (t - s \u003c threshold) {\n    sumSegment(a, p, s, t) // small segment: do it sequentially\n  } else {\n    val m = (s + t) / 2\n    val (sum1, sum2) = parallel(segmentRec(a, p, s, m), segmentRec(a, p, m, t))\n    sum1 + sum2\n  }\n}\n```\n\nHere is an example of 4 threads in parallel:\n\n```scala\nval ((part1, part2), (part3, part4)) =\n  parallel(parallel(sumSegment(a, p, 0, mid1),\n                    sumSegment(a, p, mid1, mid2)),\n           parallel(sumSegment(a, p, mid2, mid3),\n                    sumSegment(a, p, mid3, a.length)))\npower(part1 + part2 + part3 + part4, 1/p)\n```\n\n\n## Runtimes\n\nIf we sum all integers in an array of length `n`, our runtime is `O(n)` if it's done sequentially.\n\nAssuming enough CPUs are available for full parallelization, the computation takes the form of a tree, and the runtime is proportional to the height of the tree: `O(log(n))`.\n\n```\n                      sum(0 to 8)\n      sum(0 to 4)                   sum(5 to 8)           // both done in parallel\nsum(0 to 2)  sum(3 to 4)      sum(5 to 6)  sum(7 to 8)    // all 4 done in parallel\n```\n\nIf we didn't have enough CPUs for full parallelization, the runtime would be `O(n)`.\n\n\n## 
Benchmarking\n\nA naive approach for benchmarking is to do:\n\n```scala\nval xs = List(1, 2, 3)\nval startTime = System.nanoTime\nxs.reverse\nprintln((System.nanoTime - startTime) / 1000000) // elapsed time in milliseconds\n```\n\nThe above method can be improved by:\n- Doing multiple repetitions\n- Statistical treatment: computing mean and variance\n- Eliminating outliers\n- Ensuring steady state (warm-up). This can be achieved by using a tool called _ScalaMeter_\n- Preventing anomalies (garbage collection, just-in-time compilation)\n\n\n## Copy: Implementation\n\n```scala\ndef copy(src: Array[Int], target: Array[Int], from: Int, until: Int, depth: Int): Unit = {\n  if (depth == maxDepth) {\n    Array.copy(src, from, target, from, until - from)\n  } else {\n    val mid = (from + until) / 2\n    parallel(\n      copy(src, target, mid, until, depth + 1),\n      copy(src, target, from, mid, depth + 1)\n    )\n  }\n}\n```\n\n\n## Map: Implementation\n\n- __List__: Not good for parallel implementations because\n    - they are difficult to split in half (need to search for the middle)\n    - concatenating lists takes linear time\n- __Arrays__ and __trees__ work well.\n\n#### Sequential Map\n\n```scala\ndef mapASegSeq[A, B](inp: Array[A], left: Int, right: Int, f : A =\u003e B,\n                     out: Array[B]) = {\n  var i = left\n  while (i \u003c right) {\n    out(i) = f(inp(i))\n    i = i + 1\n  }\n}\n```\n\nTesting it gives:\n\n```scala\n// Input\nval in = Array(2, 3, 4, 5, 6)\nval out = Array(0, 0, 0, 0, 0)\nval f = (x: Int) =\u003e x * x\nmapASegSeq(in, 1, 3, f, out)\nout\n\n// Output\nres1: Array[Int] = Array(0, 9, 16, 0, 0)\n```\n\n#### Parallel Map\n\n- The base case is a sequential map.\n- The recursive case is a parallel recursive map.\n\n```scala\n// A recursive method needs an explicit result type\ndef mapASegPar[A, B](inp: Array[A], left: Int, right: Int, f : A =\u003e B,\n                     out: Array[B]): Unit = {\n  if (right - left \u003c threshold) {\n    mapASegSeq(inp, left, right, f, out)\n  } else {\n    val mid = (left + 
right) / 2\n    parallel(mapASegPar(inp, left, mid, f, out),\n             mapASegPar(inp, mid, right, f, out))\n  }\n}\n```\n\n#### Parallel Map on Tree\n\n- The base case (a `Leaf`) is a sequential map over the leaf's array.\n- The recursive case (a `Node`) is a parallel recursive map.\n\n```scala\ndef mapTreePar[A:Manifest, B:Manifest](t: Tree[A], f: A =\u003e B) : Tree[B] =\n  t match {\n    case Leaf(a) =\u003e {\n      val len = a.length\n      val b = new Array[B](len)\n      var i = 0\n      while (i \u003c len) {\n        b(i) = f(a(i))\n        i = i + 1\n      }\n      Leaf(b)\n    }\n    case Node(l, r) =\u003e {\n      val (lb, rb) = parallel(mapTreePar(l, f), mapTreePar(r, f))\n      Node(lb, rb)\n    }\n  }\n```\n\n\n## Reduce: Implementation\n\nSubtraction is not associative, so the following results are different:\n\n```scala\nList(1, 3, 8).foldLeft(100)((s, x) =\u003e s - x) == ((100 - 1) - 3) - 8 == 88\nList(1, 3, 8).foldRight(100)((s, x) =\u003e s - x) == 1 - (3 - (8 - 100)) == -94\nList(1, 3, 8).reduceLeft((s, x) =\u003e s - x) == (1 - 3) - 8 == -10\nList(1, 3, 8).reduceRight((s, x) =\u003e s - x) == 1 - (3 - 8) == 6\n```\n\n#### Sequential Reduce on Tree\n\n```scala\ndef reduce[A](t: Tree[A], f : (A, A) =\u003e A): A = t match {\n  case Leaf(v) =\u003e v\n  case Node(l, r) =\u003e f(reduce[A](l, f), reduce[A](r, f))\n}\n```\n\n#### Parallel Reduce on Tree\n\n```scala\ndef reduce[A](t: Tree[A], f : (A, A) =\u003e A): A = t match {\n  case Leaf(v) =\u003e v\n  case Node(l, r) =\u003e {\n    val (lV, rV) = parallel(reduce[A](l, f), reduce[A](r, f))\n    f(lV, rV)\n  }\n}\n```\n\n#### Parallel Reduce on Array\n\n```scala\ndef reduceSeg[A](inp: Array[A], left: Int, right: Int, f: (A, A) =\u003e A): A = {\n  if (right - left \u003c threshold) {\n    var res = inp(left)\n    var i = left + 1\n    while (i \u003c right) {\n      res = f(res, inp(i))\n      i = i + 1\n    }\n    res // result of the sequential reduction of this segment\n  } else {\n    val mid = (left + right) / 2\n    val (a1, a2) = parallel(reduceSeg(inp, left, mid, 
f),\n                            reduceSeg(inp, mid, right, f))\n    f(a1, a2)\n  }\n}\n\ndef reduce[A](inp: Array[A], f: (A, A) =\u003e A): A =\n  reduceSeg(inp, 0, inp.length, f)\n```\n\n\n## Data-Parallel Programming\n\nWe use the `.par` method to convert a range to a parallel range. Iterations of the parallel loop may be executed on different processors. A parallel `for` loop does not return a value. It can only interact with the rest of the program by performing a side effect, such as writing to an array. This is only correct if iterations of the `for` loop write to separate memory locations, or use some form of synchronization.\n\nThe following code is correct:\n\n```scala\ndef initializeArray(xs: Array[Int])(v: Int): Unit = {\n  for (i \u003c- (0 until xs.length).par) {\n    xs(i) = v\n  }\n}\n```\n\nBut if we had changed `xs(i) = v` to `xs(0) = i`, the code would be incorrect since multiple iterations of the `for` loop would write to the same array entry without synchronization.\n\nScala collections can be converted to parallel collections by invoking the `.par` method:\n\n```scala\n(1 until 1000).par\n  .filter(n =\u003e n % 3 == 0)\n  .count(n =\u003e n.toString == n.toString.reverse)\n```\n\nImplementation of `sum` in parallel:\n\n```scala\ndef sum(xs: Array[Int]): Int = {\n  xs.par.fold(0)(_ + _)\n}\n```\n\nImplementation of `max` in parallel:\n\n```scala\ndef max(xs: Array[Int]): Int = {\n  xs.par.fold(Int.MinValue)(math.max) // or rewrite math.max as: (x, y) =\u003e if (x \u003e y) x else y\n}\n```\n\nFor the previous 2 examples, `fold` worked out for us since the functions we provided (`+` and `math.max`) were associative and the initial values (`0` and `Int.MinValue`) were their neutral elements. The benefit of `fold` (as compared to `foldLeft` or `foldRight`) is that `fold` can run in parallel.\n\n\n#### Counting Vowels\n\n`def foldLeft[B](z: B)(f: (B, A) =\u003e B): B` can count vowels but must run sequentially, while `def fold(z: A)(f: (A, A) =\u003e A): A` can run in parallel but cannot count vowels in a collection of characters, since its result type must match the element type. 
We can use `aggregate` instead, which is like a combination of the 2 functions.\n\n```scala\ndef aggregate[B](z: B)(f: (B, A) =\u003e B, g: (B, B) =\u003e B): B\n```\n\n```scala\nArray('E', 'P', 'F', 'L').par.aggregate(0)(\n  (count, c) =\u003e if (isVowel(c)) count + 1 else count,\n  _ + _\n)\n```\n\n![Aggregate function](./images/aggregate.png)\n\nEach processor will use `f: (B, A) =\u003e B` to do its calculation. The resulting calculations from the processors are combined using `g: (B, B) =\u003e B`. `aggregate` is a very general operation.\n\n\n## Scala Collections\n\n![Collection Hierarchy](./images/collectionHierarchy.png)\n\n#### Scala Collections Hierarchy\n\n- `Traversable[T]` - collection of elements with type `T`, with operations implemented using `foreach`\n- `Iterable[T]` - collection of elements with type `T`, with operations implemented using `iterator`. Subtype of `Traversable[T]`, containing more collection methods.\n- `Seq[T]` - An ordered sequence of elements with type `T`, where every element is assigned to an index. 
Subtype of `Iterable[T]`.\n- `Set[T]` - A set of elements with type `T` (no duplicates)\n- `Map[K, V]` - A map of keys with type `K` associated with values of type `V` (no duplicate keys)\n\n#### Parallel Collections\n\n- `ParIterable[T]`, `ParSeq[T]`, `ParSet[T]`, `ParMap[K, V]` - parallel counterparts of `Iterable[T]`, `Seq[T]`, `Set[T]`, `Map[K, V]`\n- `ParArray[T]` - parallel array of objects, counterpart of `Array` and `ArrayBuffer`\n- `ParRange` - parallel range of integers, counterpart of `Range`\n- `ParVector[T]` - parallel vector, counterpart of immutable `Vector`\n- `immutable.ParHashSet[T]` - counterpart of `immutable.HashSet`\n- `immutable.ParHashMap[K, V]` - counterpart of `immutable.HashMap`\n- `mutable.ParHashSet[T]` - counterpart of `mutable.HashSet`\n- `mutable.ParHashMap[K, V]` - counterpart of `mutable.HashMap`\n- `ParTrieMap[K, V]` - thread-safe parallel map with atomic snapshots, counterpart of `TrieMap`\n\n#### Generic Collections\n\nThese are supertypes of both the sequential and the parallel collections, so code written against them can be executed either sequentially or in parallel: `GenIterable[T]`, `GenSeq[T]`, `GenSet[T]`, `GenMap[K, V]`\n\n`15251` is a sample palindrome. 
`largestPalindrome` searches for the largest palindrome in a sequence:\n\n```scala\ndef largestPalindrome(xs: GenSeq[Int]): Int = {\n  xs.aggregate(Int.MinValue)(\n    (largest, n) =\u003e\n      if (n \u003e largest \u0026\u0026 n.toString == n.toString.reverse) n else largest,\n    math.max\n  )\n}\nval array = (0 until 1000000).toArray\n\nlargestPalindrome(array) // invoke sequentially\nlargestPalindrome(array.par) // invoke in parallel\n```\n\n#### Accessing same memory locations\n\nRule: Avoid mutations to the same memory locations without proper synchronization.\n\nThe following code won't work correctly when `a` is a parallel set, since `+=` on `mutable.Set[Int]` is not thread-safe and concurrent iterations may modify the same memory locations.\n\n```scala\ndef intersection(a: GenSet[Int], b: GenSet[Int]): Set[Int] = {\n  val result = mutable.Set[Int]()\n  for (x \u003c- a) if (b contains x) result += x\n  result\n}\nintersection((0 until 1000).toSet, (0 until 1000 by 4).toSet) // sequential\nintersection((0 until 1000).par.toSet, (0 until 1000 by 4).par.toSet) // parallel: result may be wrong\n```\n\nThe program can be fixed by replacing `mutable.Set[Int]` with Java's thread-safe `new ConcurrentSkipListSet[Int]()`.\n\nA smarter way to solve the problem is to rewrite the method to use `filter` instead of mutating a shared `Set`.\n\n```scala\ndef intersection(a: GenSet[Int], b: GenSet[Int]): GenSet[Int] = {\n  if (a.size \u003c b.size) a.filter(b(_))\n  else b.filter(a(_))\n}\n```\n\nAnother rule: Never modify a parallel collection on which a data-parallel operation is in progress.\n\n\n## Splitters and Combiners\n\n#### Splitter\n\nA splitter is an iterator that can be split into more splitters that traverse disjoint subsets of elements.\n\n```scala\ntrait Splitter[A] extends Iterator[A] {\n  def split: Seq[Splitter[A]]\n  def remaining: Int\n}\n```\n\nEvery parallel collection has its own `Splitter` implementation. 
Splitting is done multiple times during execution of a parallel operation, so `split` should have `O(log n)` or better runtime for us to benefit from parallelization.\n\n#### Combiner\n\n```scala\ntrait Combiner[A, Repr] extends Builder[A, Repr] {\n  def combine(that: Combiner[A, Repr]): Combiner[A, Repr]\n}\n```\n\n- `combine` merges 2 combiner objects into 1 combiner\n- It should have `O(log n + log m)` or better runtime, where `n` and `m` are the sizes of the 2 combiners\n- When the collection is a set or map, `combine` represents a _union_\n- When the collection is a sequence, `combine` represents _concatenation_\n\n\n## Parallel Two-Phase Construction\n\nLet us discuss two-phase construction for arrays.\n\n#### Combiner: Requirements\n\nAchieving faster than `O(n)` runtime is not possible for `combine` if we use a standard array, since combining 2 arrays requires copying their elements into a new array.\n\nIn two-phase construction, the combiner has an intermediate data structure as its internal representation.\n\n1. For `+=`, the intermediate data structure should have an efficient runtime.\n1. For `combine`, the intermediate data structure should have `O(log n + log m)` or better runtime.\n1. For the intermediate data structure, it must be possible to convert to the resulting data structure in `O(n/P)` time. That is, the conversion must be parallelizable.\n\n#### Combiner: Implementation\n\nHere is my mediocre explanation of the process:\n\nLet `P` be the number of processors. Use an array of `P` arrays. Think of it as an array of pointers to arrays.\n\n1. `+=` is `O(1)` amortized time to add 1 element to the end of 1 of the nested arrays.\n1. `combine` is `O(P)` runtime to copy references to arrays, instead of copying arrays.\n1. Converting to the resulting data structure can be done in parallel in `O(n/P)` time since we are writing to `P` disjoint parts of the array in parallel, each holding about `n/P` elements.\n\n\n## References\n\n[Coursera: Parallel programming](https://www.coursera.org/learn/parprog1?specialization=scala) - Notes are based on this tutorial.\n\n- Week 1 - Good videos\n- Week 2 - Videos 1-3 were good. Videos 4-5 were too mathematical. 
Video 6: `scan` example was too advanced for this tutorial.\n- Week 3 - Good videos\n- Week 4 - Videos 1-2 were good. Videos 3-5: Conc-trees were too advanced for this tutorial.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodneyshag%2Fparallel_programming","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frodneyshag%2Fparallel_programming","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodneyshag%2Fparallel_programming/lists"}